DT command line program

Overview
Requirements
Running DT
DSN configuration
HTML parsing
Options
Scanning web page sequences with NextURL
DT samples

Overview

DT is a Windows command line program that runs DTBuild.   There are two command formats :

dt  <dxd file name>
dt  <data source name>  <task table name>

In the first, 1-argument format the specified definition (DXD) file controls DT.   The definition must contain a list of input files, emails or URLs.   If multiple input lists are specified one will be chosen; see the Run topic for more information.   The transformation defined in the DXD file will be run on the chosen input list.

In the second, 2-argument format a task table in a database controls DT.   The first argument is the ODBC Data Source Name (DSN) of the database.   The second argument is the name of a task table in that database.   The remainder of this document describes how to configure and run DT under the control of a database task table.

DT database control

Under database control DT is driven by 3 interrelated database table types :

The table names are user-defined, but each table type has a specific format.   The fields indicated in the table descriptions below must be present, but the user can add fields to the tables as needed.

The 2 command line arguments identify a task table within a database :

dt  <DSN>  <task table>
  1. the ODBC Data Source Name (DSN),
  2. the name of a task table in that database.

A database can contain more than one task table, but a DT run is controlled by a single task table.

For example, this command runs the census table scraper sample :

"C:\Program Files\DTUtilities\DTBuild\DT"  DTUtilities-Census-01  Tasks

"DTUtilities-Census-01" is the DSN, "Tasks" is the task table name.   See the Running DT topic for more information.

The task table in turn refers to source and SQL tables, as described below.


The task table

A single task table controls the process, instructing DT to :

 Task table 
 Field:   Description: 
 N   The action index  (1..N) 
 Exec   Execute this task  (yes/no)? 
 Action   The type of task to perform:  "Scan" or "ExecSQL" 
 Parameter   Data to be used by the task (a source or SQL table) 

DT scans the task table executing the user-defined actions, which can be :

Scan - scan the URLs in the Source table whose name is specified in the parameter field
ExecSQL - execute the statements in the SQL table whose name is specified in the parameter field

The Scan and ExecSQL tasks can be run in any order.   For example, you might want to execute SQL before and after a URL scan.  

Example

The following task table :

  1. runs the SQL in table PreSQL,
  2. scans the URL list in table WebSources,
  3. runs the SQL in table PostSQL.
 N   Exec   Action   Parameter 
 1   X   ExecSQL   PreSQL 
 2   X   Scan   WebSources 
 3   X   ExecSQL   PostSQL 

The Exec field allows the user to switch tasks on and off without having to add / remove them from the table.

This design is flexible enough to allow multiple passes, where, for example, a list of URLs generated by one step is scanned by the next step.   Data from multi-level websites can be extracted using this feature.


Source tables

The URLs and associated data are specified in "source" database tables containing the following information :

 Source table 
 Field:  Description: 
 Scan   Scan this URL (yes/no)? 
 Source   User's description of the URL; passed to DTBuild variable "Source" 
 URL   The Uniform Resource Locator 
 Parser   The DTBuild .DXD file to be used to parse the URL's data 
 Options   Option string 
 UserName   The user ID (if any) required to access the URL 
 Password   The password (if any) required to access the URL 
 Notes   User notes, free-format, ignored by DT 

DT scans Source tables from beginning to end.   Each URL is parsed per the associated DTBuild parser (.DXD file).

Example

The following Source table instructs DT to fetch and parse data from 3 URLs :

 Scan   Source   URL   Parser 
 X   Con Ed   http://finance.yahoo.com/q?s=ED   Quotes01 
 X   Borland   http://finance.yahoo.com/q?s=BORL   Quotes01 
 X   GMOT   http://finance.yahoo.com/q?s=GMA   Quotes01 

All three URLs are parsed with the Quotes01.DXD parser.   The Options, UserName and Password fields, which are not typically used, are not shown in this example.


SQL tables

The task table can trigger the execution of SQL (database Structured Query Language) statements in tables containing the following information :

 SQL table 
 Field:  Description: 
 N   An index (1..N) controlling order of statement execution 
 Exec   Execute this SQL statement  (yes/no)? 
 Command   An SQL statement  (must be valid!) 

SQL tables can be used to prepare data in the Source tables, for example, or to clear an output table prior to a scan.

Examples

The following SQL table instructs DT to execute an SQL statement to clear the Output table prior to a scan :

 N   Exec   Command 
 1   X   delete from Output 

The following SQL table instructs DT to execute 2 SQL statements to prepare the ContributionsSources table for a URL scan :

 N   Exec   Command 
 1   X   update ContributionsSources set Parser='Contributions' 
 2   X   update ContributionsSources set Source='Contributions - '+zip 

In this case the ContributionsSources table has an extra field, "zip".   Fields can be added to the DT tables if necessary to facilitate SQL manipulations.

Once again, an Exec field allows the user to switch tasks on and off without having to add / remove them from the table.

The Command field must contain a valid SQL statement.


Running DT

DT is run from the command line with 1 or 2 arguments :

dt  <dxd file name>
dt  <DSN>  <task table>

In the first, single-argument format, DT runs under the control of the specified definition (DXD) file.

In the second, 2-argument format, DT runs under database control using :

  1. the database identified by the ODBC Data Source Name (DSN),
  2. the specified task table within that database.

Enter these command lines at the Windows command prompt, or other places where command lines are used, such as :

For example, to run the census table scraper sample at the command prompt, enter :

"C:\Program Files\DTUtilities\DTBuild\DT"  DTUtilities-Census-01  Tasks

DT is installed with DTBuild.


DSN configuration

DT uses ODBC, "Open Database Connectivity", to interact with user-defined databases.   Under ODBC databases are assigned labels, or Data Source Names (DSNs).   The DSNs for the sample databases are automatically configured during installation.

To use another database the user must configure a DSN for the database.   DSN configuration means, essentially, applying a system-wide label to an existing database.   Use the "Configure data sources" shortcut created during installation to start the ODBC Data Source Administrator, then :

Visit dtutilities.com/database.html for more information about databases, DSNs and ODBC.


HTML parsing

Web pages are written using HTML, Hypertext Markup Language.   DT parsers are, therefore, HTML parsers, so configuring DT requires some knowledge of HTML.   For example, parsing an HTML table requires an understanding of the TABLE, TR, TH, and TD tags, among others.

In simple cases HTML tags can be ignored.   For more sophisticated parsing the user should be familiar with the basics of HTML:  many good books are available on the subject.

Most web browsers allow the user to view the HTML source of a web page (View / Source in Microsoft Internet Explorer 8, View / Page Source in Mozilla Firefox 3).   The source is displayed with tags and content in different colors, making it easier to see the structure of the web page.   DT supports a "noparse" option that allows saving the URL content to a local file.

DTBuild's design allows creation of reusable parsing components:  node groups.   For example, the following HTML parsing node groups are installed with DT Census table scraper sample :

HTML-element.dxg  - Generic HTML tag parser
HTML-entity.dxg  - Generic HTML entity parser
TR.dxg - HTML "TR" (table row) parser
TD.dxg - HTML "TD" (table data) parser
Extract-Digits.dxg  - digit ('0' .. '9') extractor

Node groups greatly simplify top-level .DXD definitions.   See DTBuild Help for more information.


DT options

DT supports several options that can be specified on a per-URL basis, in the Source table "options" field :

 Option:   Purpose:
 noparse   Writes the URL's content directly to a file without parsing it ¹  
 trace   Increases the amount of diagnostic output produced
 retries   Sets the maximum number of retries if an HTTP redirect occurs ²
 decode   Parses the URL using the AfxParseURLEx ICU_DECODE option ³ 
 noencode   " AfxParseURLEx ( ICU_NO_ENCODE )
 nometa   " AfxParseURLEx ( ICU_NO_META )
 encodespacesonly   " AfxParseURLEx ( ICU_ENCODE_SPACES_ONLY )
 browsermode   " AfxParseURLEx ( ICU_BROWSER_MODE )

¹ The "noparse" option enables you to write the fetched URL's content directly to a local file, without modification.   The output file path is specified in the sources table parser field.   This option provides the unmodified results of the HTTP Get command.   This can be very useful for parser development.

² The maximum number of HTTP redirects defaults to 3.

³ DT uses a Microsoft Foundation Class (MFC) function called "AfxParseURLEx" to parse the URL prior to fetching the data.   Visit http://msdn.microsoft.com for more information on AfxParseURLEx and its options.

Example

 Scan   Source   URL   Parser   Options 
 X   Sample data   http://dtutilities.com/sample.html   Census-01   retries=0 
 X   Sample data   http://dtutilities.com/sample.html   c:\urldump\sample.html   noparse 
 X   Sample data   http://dtutilities.com/sample.html   Census-01   trace noencode 

The 1st example sets the number of HTTP redirects to zero.

The 2nd example writes the fetched URL's content, unmodified, to file c:\urldump\sample.html.

The 3rd example illustrates the combination of two options, separated by a blank.


Scanning web page sequences with NextURL

DT can scan a sequence of web pages if those web pages contain links to "next" pages.   The DTBuild parser must define a string variable called "NextURL" and set it appropriately.   DT will continue fetching pages as long as "NextURL" contains a valid URL.   The NextURL feature allows an unlimited sequence of web pages to be fetched for a single URL entry in the Sources table.

The parser will typically clear NextURL at first, then set it only if a "next" link is recognized.   This is a powerful feature, but care must be taken to avoid infinite loops!


DT requirements

DT is installed with the DTBuild installation package.


DT samples

DT sample databases are available at the DTBuild website :

The sample installers automatically configure their ODBC DSNs.   The installers run the DSNConfig program, which use the *-DSN.txt files to configure the DSNs.   The sample uninstallers remove the DSNs created at installation.


DT sample :  Census table scraper

This DT sample extracts data from an HTML table at http://factfinder.census.gov.

It uses 4 database tables :

  1. Tasks
  2. Sources
  3. Output
  4. SQL1

Keep in mind that the word "table" is used two ways here :

Two tasks are configured in the Tasks table :

 N   Exec   Action   Parameter 
 1   X   ExecSQL   SQL1 
 2   X   Scan   Sources 

The SQL1 database table contains a single SQL statement :

delete from Output

This clears the Output table prior to the scan.

The Sources database table contains the URL / parser list :

 Scan   Source   URL   Parser 
 X   U.S. Census Bureau   http://factfinder.census.gov/ ...    Census-01.dxd 

The Census-01.DXD parser starts by looking for a "TableLocator" (specified as an entry in a DTBuild string set), in order to skip over page content preceding the HTML table to be scraped.   The entries in the TableLocator string set contain keywords that distinguish the desired table from other tables that may precede it in the web page being parsed :

screen shot: HTML table parser

Sample output

 Name   Population   Area   Grabbed 
 California   33871648   163695.57   10/25/2009 12:03:28 AM 
 Texas   20851820   268580.82   10/25/2009 12:03:28 AM 
 New York   18976457   54556   10/25/2009 12:03:28 AM 
 ...   ...   ...   ... 

Questions, comments?   Contact DTBuild support.