Guide to the sample definitions accompanying the DTBuild installation
- EMail-Search
- EMail-Search-Word-Pairs
- EMail-To-Database
- File-Count-Keywords
- File-Count-Lines
- File-CRLF-LF
- File-Filter-Unprintables
- File-Generate-Site-Index
- File-HTML-Generate-Line-Breaks
- File-LF-CRLF
- File-Search
- File-Search-Replace
- Web-Extract-Number
- Web-Extract-Title-Header
EMail-Search.dxd
Finds messages containing any of the text strings listed in the "SearchText" string set. User must specify input folders and "SearchText" items. Produces HTML output.
Makes use of post-processor File-HTML-Generate-Line-Breaks.dxd, which restores the line-by-line appearance of the original emails that is lost in the initial conversion to HTML format.
- View the final results in output file "Results" (View | Results).
- View the intermediate results without the line breaks in output file "TempFile" (View | TempFile).
Note: Prior to running the sample email parsers it may be advisable to set the message profile in Options | Message.
EMail-Search-Word-Pairs.dxd
Finds messages with proximate text strings, i.e. two strings near each other. User must specify input folders and the "Word1/2" lists. Produces HTML output.
Makes use of 3 node groups :
| HTML-Strip | converts HTML tags to text so they will display as-is in the HTML output |
| EMail-SearchWordPairsInner | does the actual formatting of an email that has been determined to have word pairs near each other |
| HTML-Add-Line-Breaks | restores line breaks in HTML, similar to the File-HTML-Generate-Line-Breaks post-processor |
This example locates emails with the words "e-mail" OR "email" and variations on the word "parse" somewhere near each other in the message body, "near" being defined as within 200 characters. See definition of Pattern "(near)".
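The proximity test described above can be approximated outside DTBuild with a regular expression. The following is a minimal sketch, not the sample's actual Pattern "(near)": the function name `has_near_pair` is hypothetical, and it assumes that "within 200 characters" means the gap between the end of one string and the start of the other, in either order.

```python
import re

def has_near_pair(text, words1, words2, max_gap=200):
    """Return True if any string from words1 lies within max_gap
    characters of any string from words2, in either order.
    (Assumption: the gap is measured between the two matches.)"""
    alt1 = "|".join(re.escape(w) for w in words1)
    alt2 = "|".join(re.escape(w) for w in words2)
    gap = rf".{{0,{max_gap}}}?"  # up to max_gap characters, as few as possible
    pattern = re.compile(
        rf"(?:{alt1}){gap}(?:{alt2})|(?:{alt2}){gap}(?:{alt1})",
        re.IGNORECASE | re.DOTALL,  # DOTALL lets the gap span line breaks
    )
    return pattern.search(text) is not None

word1 = ["e-mail", "email"]
word2 = ["parse", "parsing", "parser"]
print(has_near_pair("This tool can parse your email.", word1, word2))
```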
EMail-To-Database.dxd
Sample email parser that transmits extracted fields to a database. This example parses eBay end-of-auction notification messages into database table "eauction". Can be customized for other generated email formats by modifying the string sets :
| SubjectFilter | text that begins the subject field in the email header |
| TextFields | a list of text field labels and their respective database destinations |
| NumericFields | a list of numeric field labels and their respective database destinations |
| CurrencyFields | a list of currency field labels and their respective database destinations |
Makes use of 3 node groups :
| Name-Address | extracts the name and email address (RName + RMail) from the from/to fields in the email header |
| Decimal-Number | gets a decimal number: digits + decimal point + digits |
| Text | gets text from the input, strips leading / trailing blanks, stops at end of line |
A single action group, "NewEMail", transmits the parsed fields to the database.
Also requires a database / ODBC connection. File SQL.txt accompanies the installation: it contains an SQL statement for creating the "eauction" table used by EMail-To-Database.dxd.
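For readers who want to see what the "Decimal-Number" and "Text" node groups amount to, here is a rough equivalent using regular expressions. It is a sketch based only on the one-line descriptions in the table above; the function names are hypothetical and the real node groups may treat edge cases differently.

```python
import re

# Digits + decimal point + digits, as described for "Decimal-Number".
DECIMAL_NUMBER = re.compile(r"\d+\.\d+")

def get_decimal_number(text):
    """Return the first decimal number in text, or None if there is none."""
    match = DECIMAL_NUMBER.search(text)
    return match.group(0) if match else None

def get_text(text):
    """Return text up to the end of the first line, with leading and
    trailing blanks removed, as described for "Text"."""
    first_line = text.split("\n", 1)[0]
    return first_line.strip(" \t\r")

print(get_decimal_number("Final price: $12.50"))
print(get_text("   Antique clock   \nnext line"))
```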
File-Count-Keywords.dxd
Counts words and keywords in a group of HTML files (keywords relating to "email" and "parsing" in this example; see the "Keywords" string set definition). User must specify input files and keywords of interest. Produces HTML and text output.
Note: The link from the Start ("*") node to the "Word" node must always have the largest number (i.e. be the last link in the sequence). Links are tried in numeric order, so a keyword node whose link number is greater than that of the general-case "Word" node will never be reached: the "Word" node matches the keyword as an ordinary word first.
Uses node group HTML-element.dxg to skip over HTML tags.
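The link-ordering note above is easier to see with a small model. The sketch below is not DTBuild code: it assumes only that links leaving a node are tried in sequence and that the first match wins, and it uses hypothetical names (`first_match`, `keyword_first`, `word_first`) with regular expressions standing in for nodes.

```python
import re

def first_match(text, links):
    """Try each (name, pattern) link in order and return the name of
    the first one that matches at the start of text."""
    for name, pattern in links:
        if re.match(pattern, text):
            return name
    return None

# Keyword link numbered before the general-case "Word" link: found.
keyword_first = [("Keyword", r"email\b"), ("Word", r"[A-Za-z]+")]
# Keyword link numbered after "Word": "Word" always wins first.
word_first = [("Word", r"[A-Za-z]+"), ("Keyword", r"email\b")]

print(first_match("email parsing", keyword_first))
print(first_match("email parsing", word_first))
```

With the general-case link first, the keyword is consumed as an ordinary word and is never counted as a keyword.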
File-Count-Lines.dxd
Determines the number of lines (new line characters) in a group of text files. User must specify input files. Produces HTML output.
File-CRLF-LF.dxd
Replaces carriage return / line feed (CRLF) sequences with single line feed characters (LF). User must specify input files and an existing output directory.
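The same conversion can be sketched in a few lines outside DTBuild. This is an illustration only; `crlf_to_lf` and `convert_file` are hypothetical names, and the check for an existing output directory simply mirrors the requirement stated above.

```python
from pathlib import Path

def crlf_to_lf(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")

def convert_file(src: Path, out_dir: Path) -> Path:
    """Write a converted copy of src into out_dir, which must already exist."""
    if not out_dir.is_dir():
        raise FileNotFoundError(f"output directory does not exist: {out_dir}")
    dest = out_dir / src.name
    dest.write_bytes(crlf_to_lf(src.read_bytes()))
    return dest
```

Working on bytes rather than text avoids any newline translation by the runtime.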
File-Filter-Unprintables.dxd
Extracts printable characters (ASCII 32-126) from the input and discards everything else. User must specify input files. Produces "cleaned up" output with newlines where the unprintable characters were.
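A rough equivalent of this filter is shown below. It is a sketch under two assumptions that the description does not settle: that the printable range is the conventional space-through-tilde range (byte values 32 to 126), and that each run of unprintable bytes becomes a single newline. The function name is hypothetical.

```python
import re

# Any run of bytes outside space (0x20) through tilde (0x7e).
UNPRINTABLE_RUN = re.compile(rb"[^\x20-\x7e]+")

def filter_unprintables(data: bytes) -> bytes:
    """Keep printable ASCII; replace each run of other bytes with one newline.
    (Assumption: one newline per run, not one per byte.)"""
    return UNPRINTABLE_RUN.sub(b"\n", data)

print(filter_unprintables(b"abc\x00\x01def\x7fghi"))
```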
File-Generate-Site-Index.dxd
Generates a website index from a group of HTML files. Extracts the content of the <TITLE> tag and the "description" META tag, and generates a single HTML file, siteindex.htm. User must specify the input files and may have to modify the hard-wired META tag search string :
<META name="description" content=
... depending on how those tags are coded in the input files.
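To show why the META search string may need adjusting, here is a small sketch of the same extraction with regular expressions. It is not the sample's logic: `extract_index_fields` is a hypothetical name, and the META pattern assumes the attribute order and double quotes of the search string shown above, so a page that writes the tag differently would need a different pattern.

```python
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# Mirrors the hard-wired form: <META name="description" content="...">
META_RE = re.compile(
    r'<meta\s+name="description"\s+content="([^"]*)"', re.IGNORECASE
)

def extract_index_fields(html_text):
    """Return (title, description); either is None when not found."""
    title = TITLE_RE.search(html_text)
    description = META_RE.search(html_text)
    return (
        title.group(1).strip() if title else None,
        description.group(1).strip() if description else None,
    )

page = (
    "<html><head><TITLE>Home Page</TITLE>\n"
    '<META name="description" content="Welcome to the site"></head></html>'
)
print(extract_index_fields(page))
```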
File-HTML-Generate-Line-Breaks.dxd
Transforms line breaks (carriage return / line feed pairs) to HTML <BR> tags. Used to post-process the output of sample EMail-Search.dxd (above).
There are two nodes in this definition to handle the possibility that the input contains a mixture of CRLF and LF newlines. It looks for CRLF first, then LF, and converts both to HTML line breaks.
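The two-node approach corresponds to an ordered alternation: try CRLF first so that the pair yields one line break rather than two. A minimal sketch follows; the function name is hypothetical, and whether the sample keeps a newline after each <BR> is not stated, so none is kept here.

```python
import re

# CRLF is listed before LF so a CRLF pair yields one <BR>, not two.
NEWLINE = re.compile(r"\r\n|\n")

def add_line_breaks(text):
    """Convert CRLF and LF newlines to HTML <BR> tags."""
    return NEWLINE.sub("<BR>", text)

print(add_line_breaks("first\r\nsecond\nthird"))
```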
File-LF-CRLF.dxd
Replaces line feed characters with carriage return / line feed sequences. User must specify input files and an existing output directory.
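One detail worth considering with this conversion is input that already contains some CRLF pairs. The sketch below skips any LF that is already preceded by a CR; this guard is an assumption added for illustration, and the sample itself may simply replace every LF. The function name is hypothetical.

```python
import re

# An LF that is not already preceded by a CR.
BARE_LF = re.compile(rb"(?<!\r)\n")

def lf_to_crlf(data: bytes) -> bytes:
    """Replace bare LF with CRLF, leaving existing CRLF pairs unchanged."""
    return BARE_LF.sub(b"\r\n", data)

print(lf_to_crlf(b"a\nb\r\nc\n"))
```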
File-Search.dxd
Searches for one or more text strings in the input files. User must specify input files and the "search items" string set. Specify the search text in the "Text" column of "search items".
Running File-Search.dxd produces :
- an HTML output file
- a text output file
- a list of the input files that contain "hits"
- a list of "stats" (totals, etc.)
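The hit list and stats outputs listed above can be pictured with a short sketch. It is an illustration only: `search_files` is a hypothetical name, the input is a dictionary of file names to text rather than real files, matching is case-sensitive, and the particular totals reported are assumptions.

```python
def search_files(files, search_items):
    """files maps a file name to its text. Returns (hits, stats), where
    hits maps each file containing a match to its number of matches."""
    hits = {}
    for name, text in files.items():
        count = sum(text.count(item) for item in search_items)
        if count:
            hits[name] = count
    stats = {
        "files_scanned": len(files),
        "files_with_hits": len(hits),
        "total_hits": sum(hits.values()),
    }
    return hits, stats

sample_files = {
    "a.txt": "parse the email, then parse the header",
    "b.txt": "nothing of interest",
}
print(search_files(sample_files, ["parse", "email"]))
```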
File-Search-Replace.dxd
Searches for and replaces one or more text strings in the input files. Specify the search text in the "Text" column of the "new text" string set; specify the replacement text in the "Other text" column.
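The two-column "new text" string set behaves like a lookup table from search text to replacement text. The sketch below applies such a table in a single pass over the input, so that replaced text is not itself searched again; whether the sample behaves this way in every case is an assumption, and the function name is hypothetical.

```python
import re

def search_replace(text, pairs):
    """pairs is a list of (search text, replacement text) tuples.
    All searches are applied in one left-to-right pass."""
    table = dict(pairs)
    if not table:
        return text
    pattern = re.compile("|".join(re.escape(s) for s in table))
    return pattern.sub(lambda m: table[m.group(0)], text)

print(search_replace("cat and dog", [("cat", "dog"), ("dog", "cat")]))
```

A chain of sequential replacements would turn both words into "cat" here; the single pass swaps them instead.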
Web-Extract-Number.dxd
This is an HTML parser that extracts two items from each web page in the input URL list :
- the content of the first <h1> tag
- the decimal number following the specified label
Specify the label by pressing the Define Label button. Press Run and wait for scanning to complete, then press View Results.
The output is an HTML file with a table containing, for each input URL :
- the extracted items
- the source URL
- the date and time the web page was fetched
The HTML header for the output file is contained in text file Web-Extract-Number-Header.txt (file variable HTMLHeader). You can modify this text file to change the output's appearance (font, color, etc.).
This example extracts the content of an <h1> tag and a decimal number. It can be modified to extract from another identifying tag (<title>, for example), and to extract other data formats (numbers without decimal points, text, etc.).
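The two extractions can be sketched on an HTML string as follows. This is not the sample's parser: `extract_items` is a hypothetical name, fetching the page is left out, and the sketch assumes that only non-digit characters separate the label from the number.

```python
import re

H1_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)

def extract_items(html_text, label):
    """Return (content of first <h1>, decimal number after label).
    Either item is None when it cannot be found."""
    h1 = H1_RE.search(html_text)
    # Skip non-digits after the label, then digits + point + digits.
    number = re.search(re.escape(label) + r"\D*?(\d+\.\d+)", html_text)
    return (
        h1.group(1).strip() if h1 else None,
        number.group(1) if number else None,
    )

page = "<html><body><h1> Widget </h1><p>Price: $19.99</p><h1>Other</h1></body></html>"
print(extract_items(page, "Price:"))
```

Changing `H1_RE` to match <title> or changing the number pattern are the kinds of modification described above.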
Web-Extract-Title-Header.dxd
This HTML parser extracts two items from each web page in the input URL list :
- the content of the web page's <title> tag
- the content of the first header tag as defined in string set HeaderTag
The output is a web page (HTML file) containing a brief listing for each input URL :
- the extracted title tag
- the extracted header tag
- the source URL
- the date and time the web page was fetched
The HTML header for the output file is contained in text file Web-Extract-Title-Header.txt (file variable HTMLHeader). You can modify this text file to change the output's appearance (font, color, etc.).