DTBuild email parsing

For each email in an input folder, DTBuild loads the header and body as a single input stream.

The header contains one line for each of the following fields:

 Field          Field starts with
 subject        "Subject:"
 sender         "From:"
 recipient      "To:"
 date / time    "Date:"

The header and body are separated by a blank line, per the Internet Message Format specification (RFC 5322).
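
For readers who want to see this layout outside DTBuild, here is a minimal Python sketch of splitting a message at the first blank line and picking out the four header fields by their starting text. The function name split_email and the dictionary keys are assumptions made for illustration; this is not how DTBuild itself is implemented, and folded (multi-line) header fields are not handled.

```python
def split_email(raw):
    """Split a raw message into (header_fields, body) at the first blank line."""
    # Normalise CRLF line endings so the blank-line test stays simple.
    text = raw.replace("\r\n", "\n")
    header_text, _, body = text.partition("\n\n")

    # Field name -> text the header line starts with (from the table above).
    prefixes = {
        "subject": "Subject:",
        "sender": "From:",
        "recipient": "To:",
        "date": "Date:",
    }
    fields = {}
    for line in header_text.split("\n"):
        for name, prefix in prefixes.items():
            if line.startswith(prefix):
                fields[name] = line[len(prefix):].strip()
    return fields, body
```

Anything after the first blank line is returned untouched as the body, even if it happens to contain lines that look like header fields.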

The techniques described in the DTBuild tutorial can be used to parse emails.

Email parsing example

screen shot: eBay end-of-auction email parser

Notice that the header and the body are parsed separately. While the header is being parsed, the start node is "Header". Once "EndHeader" (the first blank line) is recognized, the start node is set to "Body" and remains there for the rest of the email. The start node must therefore be reset to "Header" in the pre-stream actions, so that each new email begins in the header. This practice is advisable for emails whose bodies may contain header-like text ("Subject:", "From:", "To:", etc.), for example in replies or forwarded messages.
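
The switching of the start node can be pictured as a small two-state machine. The Python sketch below is only an analogy under stated assumptions: the class EmailParser and its methods pre_stream and feed are hypothetical names, not DTBuild features. It shows why the reset matters when several emails are processed one after another.

```python
class EmailParser:
    """Two-state sketch: 'Header' until the first blank line, then 'Body'."""

    def __init__(self):
        self.start_node = "Header"
        self.subjects = []

    def pre_stream(self):
        # Analogue of the pre-stream action: every new email starts in the header.
        self.start_node = "Header"

    def feed(self, lines):
        """Process one email, given as a list of lines without line endings."""
        self.pre_stream()
        for line in lines:
            if self.start_node == "Header":
                if line.strip() == "":
                    # Analogue of "EndHeader": switch to body parsing.
                    self.start_node = "Body"
                elif line.startswith("Subject:"):
                    self.subjects.append(line[len("Subject:"):].strip())
            # In the "Body" state, header-like lines are treated as ordinary text.
```

Without the call to pre_stream, the second email would be read entirely in the "Body" state and its subject would be missed.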

Note the use of a null node, "EndMessage".   The null node is recognized at the end of the input stream (email body).   The values collected from the email are sent to the database at that point by action group "NewEMail".
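
As a rough illustration of writing the collected values once the end of the stream is reached, here is a Python sketch using the standard sqlite3 module. The column names (subject, item_name), the label "Item name:" as a stored field, and both function names are assumptions for this example only; the real "eauction" table is defined by the SQL statement shipped with the sample, and DTBuild uses an ODBC connection rather than SQLite.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory table with a hypothetical two-column layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE eauction (subject TEXT, item_name TEXT)")
    return conn


def parse_and_store(lines, conn):
    """Collect values while reading, then write one row at end of stream."""
    values = {"subject": None, "item_name": None}
    for line in lines:
        if line.startswith("Subject:") and values["subject"] is None:
            values["subject"] = line[len("Subject:"):].strip()
        elif line.startswith("Item name:"):
            values["item_name"] = line[len("Item name:"):].strip()
    # Reaching this point plays the role of the "EndMessage" null node:
    # the whole email has been read, so the row is written exactly once.
    conn.execute(
        "INSERT INTO eauction (subject, item_name) VALUES (?, ?)",
        (values["subject"], values["item_name"]),
    )
    conn.commit()
```

Writing the row only after the loop ends means a partially read email never produces a partial record.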

See sample EMail-To-Database.dxd.

In the bodies of these emails the general format of the data we're interested in is :

   label ... spaces ... data ... end-of-line.

A label is descriptive text, typically followed by a colon, for example:  "Item name:".
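
The "label ... spaces ... data ... end-of-line" shape can be expressed as a regular expression. The following Python sketch is an equivalent written for illustration, not DTBuild's own pattern syntax; the function name extract_field and the sample labels in the usage note are assumptions.

```python
import re


def extract_field(body, label):
    """Return the data that follows `label` on its line, or None if absent."""
    # label, optional blanks/tabs, then everything up to the end of the line.
    pattern = re.compile(r"^" + re.escape(label) + r"[ \t]*(.*)$", re.MULTILINE)
    match = pattern.search(body)
    # strip() removes trailing blanks (and a stray carriage return, if any).
    return match.group(1).strip() if match else None
```

For a body containing the line "Item name:   Vintage lamp", calling extract_field(body, "Item name:") gives "Vintage lamp".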

This definition makes use of the following patterns :

 *  - zero or more of any character  
 Num    - one or more numeric digits  
 WS0+    - zero or more whitespace characters (blank, tab, etc.) 
 EndHeader    - a blank line (two line feeds)  
 EndMessage    - null pattern indicating end of email body  

It also makes use of string sets :

 SubjectFilter   - text that begins the subject field in the email header
 TextFields   - a list of text fields expected in the email body, and their destinations in the database
 NumericFields   - a list of numeric field labels and their respective database destinations
 CurrencyFields   - a list of currency field labels and their respective database destinations

and node groups :

 Name-Address   - extracts the name and email address (RName + RMail) from the from/to fields in the email header 
 Decimal-Number   - gets a decimal number: digits + decimal point + digits
 Text   - gets text from the input, strips leading / trailing blanks, stops at end of line 
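
To make the three node groups more concrete, here are rough Python counterparts. They are sketches of the described behavior only: the function names are hypothetical, parseaddr comes from Python's standard email.utils module, and the exact rules DTBuild applies (for example to unusual address formats) may differ.

```python
import re
from email.utils import parseaddr


def name_address(field_value):
    """Like Name-Address: return (name, email address), i.e. RName and RMail."""
    return parseaddr(field_value)


def decimal_number(text):
    """Like Decimal-Number: digits + decimal point + digits, or None."""
    match = re.search(r"\d+\.\d+", text)
    return float(match.group()) if match else None


def text_value(text):
    """Like Text: take input up to end of line, strip leading/trailing blanks."""
    return text.split("\n", 1)[0].strip()
```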

Here's the definition of the TextFields string set :

screen shot: TextFields string set, maps email fields to database
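
Conceptually, a string set such as TextFields pairs each label with a database destination. A Python sketch of that idea is shown below; the labels and column names in TEXT_FIELDS are invented for illustration and are not taken from the sample definition, and collect_record is a hypothetical helper.

```python
# Hypothetical label -> database column mapping (not the sample's real contents).
TEXT_FIELDS = {
    "Item name:": "item_name",
    "Seller:": "seller",
}


def collect_record(body, field_map):
    """Build a {column: value} record from body lines that start with a known label."""
    record = {}
    for line in body.splitlines():
        for label, column in field_map.items():
            if line.startswith(label):
                record[column] = line[len(label):].strip()
    return record
```

Keeping the labels in a table like this means a new field can be supported by adding one entry rather than changing the parsing logic.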

This definition also makes use of a database / ODBC connection. The file "SQL.txt", included with the installation, contains an SQL statement for creating the "eauction" table used by this sample.