Guide to the sample definitions accompanying the DTBuild installation


EMail-Search-Text.dxd

Finds messages containing any of the text strings listed in the "SearchText" string set.   User must specify input folders and "SearchText" items.   Produces HTML output.

Makes use of post-processor File-HTML-Generate-Line-Breaks.dxd, which restores the line-by-line appearance of the original emails that is lost in the initial conversion to HTML format.

Note:  Prior to running the sample email parsers it may be advisable to set the message profile in Options | Message.


EMail-Search-Word-Pairs.dxd

Finds messages with proximate text strings, i.e. two strings near each other.   User must specify input folders and the "Word1/2" lists.   Produces HTML output.

Makes use of 3 node groups :

HTML-Strip   - converts HTML tags to text so they will display as-is in the HTML output
EMail-SearchWordPairsInner    - does the actual formatting of an email that has been determined to have word pairs near each other
HTML-Add-Line-Breaks    - restores line breaks in HTML, similar to the File-HTML-Generate-Line-Breaks post-processor

This example locates emails with the words "e-mail" OR "email" and variations on the word "parse" somewhere near each other in the message body, "near" being defined as within 200 characters.   See definition of Pattern "(near)".
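
The proximity test can be sketched outside DTBuild as follows. This is only an illustration in Python, not the definition's actual logic: the two regular expressions are hypothetical stand-ins for the "Word1/2" lists, and "near" is assumed to mean that the two matches start within 200 characters of each other, in either order.

```python
import re

# Hypothetical stand-ins for the "Word1" and "Word2" lists.
WORD1 = re.compile(r"e-?mail", re.IGNORECASE)
WORD2 = re.compile(r"pars(?:e[rsd]?|ing)", re.IGNORECASE)
NEAR = 200  # assumed meaning of "near": start offsets at most 200 characters apart


def has_near_pair(body: str, near: int = NEAR) -> bool:
    """Return True if a WORD1 match and a WORD2 match start within `near` characters."""
    starts2 = [m.start() for m in WORD2.finditer(body)]
    for m1 in WORD1.finditer(body):
        for s2 in starts2:
            # Compare start offsets; the pair may appear in either order.
            if abs(s2 - m1.start()) <= near:
                return True
    return False
```

A message body for which has_near_pair returns True would be passed on for formatting; all others would be skipped.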


EMail-To-Database.dxd

Sample email parser that transmits extracted fields to a database.   This example parses eBay end-of-auction notification messages into database table "eauction".   Can be customized for other generated email formats by modifying the string sets :

SubjectFilter    - text that begins the subject field in the email header
TextFields    - a list of text field labels and their respective database destinations
NumericFields    - a list of numeric field labels and their respective database destinations
CurrencyFields    - a list of currency field labels and their respective database destinations

Makes use of 3 node groups :

Name-Address    - extracts the name and email address (RName + RMail) from the from/to fields in the email header 
Decimal-Number    - gets a decimal number: digits + decimal point + digits
Text    - gets text from the input, strips leading / trailing blanks, stops at end of line
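
For readers who want to see what the three node groups do, here is a rough Python equivalent. The function names are hypothetical, and the exact matching rules of the node groups may differ; this only mirrors the one-line descriptions above.

```python
import re
from email.utils import parseaddr

DECIMAL = re.compile(r"\d+\.\d+")  # digits + decimal point + digits


def parse_name_address(header_value: str):
    """Split a From/To header value into (name, email address)."""
    return parseaddr(header_value)


def get_decimal(text: str):
    """Return the first decimal number found in text, or None."""
    match = DECIMAL.search(text)
    return match.group(0) if match else None


def get_text(text: str) -> str:
    """Return the first line of text with leading and trailing blanks removed."""
    lines = text.splitlines()
    return lines[0].strip(" ") if lines else ""
```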

A single action group, "NewEMail", transmits the parsed fields to the database.  

Also requires a database / ODBC connection.   File SQL.txt accompanies the installation:  it contains an SQL statement for creating the "eauction" table used by EMail-To-Database.dxd.


File-Count-Keywords.dxd

Counts words and keywords in a group of HTML files.   The keywords in this example relate to "email" and "parsing";  see the "Keywords" string set definition.   User must specify input files and keywords of interest.   Produces HTML and text output.

Note:  The link from the Start ("*") node to the "Word" node must always have the largest number (i.e. be the last link in the sequence).   Links are tried in sequence, so a keyword node whose link number is greater than that of the general-case "Word" node will never be reached:  the "Word" node matches the keyword first.
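
The ordering rule is the familiar "first match wins" behaviour. A small Python sketch, assuming links are tried in ascending order and using hypothetical rule names and keywords, shows the effect of putting the general case too early:

```python
KEYWORDS = {"email", "parsing"}  # hypothetical keyword set


def classify(word, rules):
    """Try each (name, predicate) rule in order and return the first match."""
    for name, predicate in rules:
        if predicate(word):
            return name
    return None


keyword_rule = ("Keyword", lambda w: w.lower() in KEYWORDS)
word_rule = ("Word", lambda w: w.isalpha())  # general case: any word

good_order = [keyword_rule, word_rule]  # general case last
bad_order = [word_rule, keyword_rule]   # general case first hides the keywords
```

With good_order, "email" is classified as a keyword; with bad_order it is counted only as an ordinary word.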

Uses node group HTML-element.dxg to skip over HTML tags.


File-Count-Lines.dxd

Determines the number of lines (new line characters) in a group of text files.   User must specify input files.   Produces HTML output.
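
As a point of comparison, the same count can be obtained in a few lines of Python. This is only a sketch: the function names are hypothetical, and it counts line feed bytes, which is assumed to match what the sample treats as a new line character.

```python
from pathlib import Path


def count_newlines(data: bytes) -> int:
    """Count line feed characters in a block of bytes."""
    return data.count(b"\n")


def count_lines_in_files(paths):
    """Return a mapping of file name to its line feed count."""
    return {str(p): count_newlines(Path(p).read_bytes()) for p in paths}
```

Note that a final line without a terminating line feed is not counted under this definition.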


File-CRLF-LF.dxd

Replaces carriage return / line feed (CRLF) sequences with single line feed characters (LF).   User must specify input files and an existing output directory.
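
The conversion itself is a single substitution. The Python sketch below uses hypothetical function names and, like the sample, expects the output directory to exist already; it is an illustration rather than a description of how the definition is built.

```python
from pathlib import Path


def crlf_to_lf(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF; lone CR or LF bytes are untouched."""
    return data.replace(b"\r\n", b"\n")


def convert_file(src, out_dir):
    """Write a converted copy of src into out_dir, which must already exist."""
    out_dir = Path(out_dir)
    if not out_dir.is_dir():
        raise NotADirectoryError(f"output directory does not exist: {out_dir}")
    dest = out_dir / Path(src).name
    dest.write_bytes(crlf_to_lf(Path(src).read_bytes()))
    return dest
```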


File-Filter-Unprintables.dxd

Extracts printable characters (ASCII 32-126) from the input, discards everything else.   User must specify input files.   Produces "cleaned up" output with newlines where the unprintable characters were.
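
A rough Python equivalent is shown below. Two assumptions are made: the printable range is taken to be the conventional one, codes 32 (space) through 126 (tilde), and each run of unprintable bytes is collapsed to a single newline. Adjust either if the sample behaves differently.

```python
import re

# One or more bytes outside the assumed printable range 32-126.
UNPRINTABLE_RUN = re.compile(rb"[^\x20-\x7e]+")


def filter_unprintables(data: bytes) -> bytes:
    """Replace each run of unprintable bytes with a single newline."""
    return UNPRINTABLE_RUN.sub(b"\n", data)
```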


File-Generate-Site-Index.dxd

Generates a website index from a group of HTML files.   Extracts the content of the <TITLE> tag and the "description" META tag, and generates a single HTML file, siteindex.htm.   User must specify the input files and may have to modify the hard-wired META tag search string :

<META name="description" content=
... depending on how those tags are coded in the input files.
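
The extraction step can be approximated in Python with the standard html.parser module, which matches tag and attribute names case-insensitively and so avoids the hard-wired search string. The class name, function name and the layout of each index entry are hypothetical; the real siteindex.htm may be formatted differently.

```python
from html import escape
from html.parser import HTMLParser


class PageInfo(HTMLParser):
    """Collect the <TITLE> text and the "description" META content of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def index_entry(filename: str, html_text: str) -> str:
    """Build one index entry (assumed layout) linking to filename."""
    info = PageInfo()
    info.feed(html_text)
    info.close()
    return '<p><a href="{}">{}</a><br>{}</p>'.format(
        escape(filename, quote=True),
        escape(info.title.strip()),
        escape(info.description),
    )
```

Concatenating index_entry results for every input file, between a header and a footer, would give a single index page.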


File-HTML-Generate-Line-Breaks.dxd

Transforms line breaks (carriage return / line feed pairs) to HTML <BR> tags.   Used to post-process the output of sample EMail-Search-Text.dxd (above).

There are two nodes in this definition to handle the possibility that the input contains a mixture of CRLF and LF newlines.   It looks for CRLF first and LF second, and converts both to HTML line breaks.


File-LF-CRLF.dxd

Replaces line feed characters with carriage return / line feed sequences.   User must specify input files and an existing output directory.
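
One detail worth checking when doing this conversion by other means is input that already contains CRLF pairs. The Python sketch below (hypothetical function name) assumes existing pairs should be left alone, so that only bare line feeds gain a carriage return; whether the sample makes the same choice is not stated here.

```python
import re

# A line feed that is not already preceded by a carriage return.
BARE_LF = re.compile(rb"(?<!\r)\n")


def lf_to_crlf(data: bytes) -> bytes:
    """Turn bare LF bytes into CRLF pairs without doubling existing carriage returns."""
    return BARE_LF.sub(b"\r\n", data)
```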


File-Search.dxd

Searches for one or more text strings in the input files.   User must specify input files and the "search items" string set.   Specify the search text in the "Text" column of "search items".

Running File-Search.dxd produces :


File-Search-Replace.dxd

Searches for and replaces one or more text strings in the input files.   Specify the search text in the "Text" column of the "new text" string set;   specify the replacement text in the "Other text" column.
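
A table-driven search and replace of this kind can be sketched in Python as follows. The dictionary plays the role of the "new text" string set (key = "Text" column, value = "Other text" column). Two design choices here are assumptions and not necessarily what the sample does: all strings are replaced in a single pass, so replacement text is never searched again, and longer search strings are preferred over shorter ones that match at the same position.

```python
import re


def replace_all(text: str, table: dict) -> str:
    """Replace every key of table found in text with its value, in one pass."""
    if not table:
        return text
    # Longest keys first so "e-mail" is preferred over "e" at the same position.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: table[m.group(0)], text)
```

Because the work is done in one pass, a table that swaps two strings behaves as expected instead of undoing its own replacements.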


Web-Extract-Number.dxd

This is an HTML parser that extracts two items from each web page in the input URL list :

Specify the label by pressing the Define Label button.   Press Run and wait for scanning to complete, then press View Results.

The output is an HTML file with a table containing, for each input URL :

The HTML header for the output file is contained in text file Web-Extract-Number-Header.txt (file variable HTMLHeader).   You can modify this text file to change the output's appearance (font, color, etc.).

This example extracts the content of an <h1> tag and a decimal number.   It can be modified to extract from another identifying tag (<title>, for example), and to extract other data formats (numbers without decimal points, text, etc.).


Web-Extract-Title-Header.dxd

This HTML parser extracts two items from each web page in the input URL list :

The output is a web page (HTML file) containing a brief listing for each input URL :

The HTML header for the output file is contained in text file Web-Extract-Title-Header.txt (file variable HTMLHeader).   You can modify this text file to change the output's appearance (font, color, etc.).