| Overview |
| Requirements |
| Running DT |
| DSN configuration |
| HTML parsing |
| Options |
| Scanning web page sequences with NextURL |
| DT samples |
DT is a Windows command line program that runs DTBuild. There are two command formats :
dt <dxd file name>
dt <data source name> <task table name>
In the first, 1-argument format the specified definition (DXD) file controls DT. The definition must contain a list of input files, emails or URLs. If multiple input lists are specified one will be chosen; see the Run topic for more information. The transformation defined in the DXD file will be run on the chosen input list.
In the second, 2-argument format a task table in a database controls DT. The first argument is the ODBC Data Source Name (DSN) of the database. The second argument is the name of a task table in that database. The remainder of this document describes how to configure and run DT under the control of a database task table.
Under database control DT is driven by 3 interrelated database table types :
The table names are user-defined, but each table type has a specific format. The fields indicated in the table descriptions below must be present, but the user can add fields to the tables as needed.
The 2 command line arguments identify a task table within a database :
dt <DSN> <task table>
A database can contain more than one task table, but a DT run is controlled by a single task table.
For example, this command runs the census table scraper sample :
"C:\Program Files\DTUtilities\DTBuild\DT" DTUtilities-Census-01 Tasks
"DTUtilities-Census-01" is the DSN, "Tasks" is the task table name. See the Running DT topic for more information.
The task table in turn refers to source and SQL tables, as described below.
A single task table controls the process, instructing DT to :
| Task table | |
|---|---|
| Field: | Description: |
| N | The action index (1..N) |
| Exec | Execute this task (yes/no)? |
| Action | The type of task to perform: "Scan" or "ExecSQL" |
| Parameter | Data to be used by the task (a source or SQL table) |
DT scans the task table executing the user-defined actions, which can be :
| Scan | - scan the URLs in the Source table whose name is specified in the parameter field |
|---|---|
| ExecSQL | - execute the statements in the SQL table whose name is specified in the parameter field |
The Scan and ExecSQL tasks can be run in any order. For example, you might want to execute SQL before and after a URL scan.
The following task table :
| N | Exec | Action | Parameter |
|---|---|---|---|
| 1 | X | ExecSQL | PreSQL |
| 2 | X | Scan | WebSources |
| 3 | X | ExecSQL | PostSQL |
The Exec field allows the user to switch tasks on and off without having to add / remove them from the table.
This design is flexible enough to allow multiple passes, where, for example, a list of URLs generated by one step is scanned by the next step. Data from multi-level websites can be extracted using this feature.
The URLs and associated data are specified in "source" database tables containing the following information :
| Source table | |
|---|---|
| Field: | Description: |
| Scan | Scan this URL (yes/no)? |
| Source | User's description of the URL; passed to DTBuild variable "Source" |
| URL | The Uniform Resource Locator |
| Parser | The DTBuild .DXD file to be used to parse the URL's data |
| Options | Option string |
| UserName | The user ID (if any) required to access the URL |
| Password | The password (if any) required to access the URL |
| Notes | User notes, free-format, ignored by DT |
DT scans Source tables from beginning to end. Each URL is parsed per the associated DTBuild parser (.DXD file).
The following Source table instructs DT to fetch and parse data from 3 URLs :
| Scan | Source | URL | Parser |
|---|---|---|---|
| X | Con Ed | http://finance.yahoo.com/q?s=ED | Quotes01 |
| X | Borland | http://finance.yahoo.com/q?s=BORL | Quotes01 |
| X | GMOT | http://finance.yahoo.com/q?s=GMA | Quotes01 |
All three URLs are parsed with the Quotes01.DXD parser. The Options, UserName and Password fields, which are not typically used, are not shown in this example.
The task table can trigger the execution of SQL (database Structured Query Language) statements in tables containing the following information :
| SQL table | |
|---|---|
| Field: | Description: |
| N | An index (1..N) controlling order of statement execution |
| Exec | Execute this SQL statement (yes/no)? |
| Command | An SQL statement (must be valid!) |
SQL tables can be used to prepare data in the Source tables, for example, or to clear an output table prior to a scan.
The following SQL table instructs DT to execute an SQL statement to clear the Output table prior to a scan :
| N | Exec | Command |
|---|---|---|
| 1 | X | delete from Output |
The following SQL table instructs DT to execute 2 SQL statements to prepare the ContributionsSources table for a URL scan :
| N | Exec | Command |
|---|---|---|
| 1 | X | update ContributionsSources set Parser='Contributions' |
| 2 | X | update ContributionsSources set Source='Contributions - '+zip |
In this case the ContributionsSources table has an extra field, "zip". Fields can be added to the DT tables if necessary to facilitate SQL manipulations.
Once again, an Exec field allows the user to switch tasks on and off without having to add / remove them from the table.
The Command field must contain a valid SQL statement.
DT is run from the command line with 1 or 2 arguments :
dt <dxd file name>
dt <DSN> <task table>
In the first, single-argument format, DT runs under the control of the specified definition (DXD) file.
In the second, 2-argument format, DT runs under database control using :
Enter these command lines at the Windows command prompt, or other places where command lines are used, such as :
For example, to run the census table scraper sample at the command prompt, enter :
"C:\Program Files\DTUtilities\DTBuild\DT" DTUtilities-Census-01 Tasks
DT is installed with DTBuild.
DT uses ODBC, "Open Database Connectivity", to interact with user-defined databases. Under ODBC databases are assigned labels, or Data Source Names (DSNs). The DSNs for the sample databases are automatically configured during installation.
To use another database the user must configure a DSN for the database. DSN configuration means, essentially, applying a system-wide label to an existing database. Use the "Configure data sources" shortcut created during installation to start the ODBC Data Source Administrator, then :
Visit dtutilities.com/database.html for more information about databases, DSNs and ODBC.
Web pages are written using HTML, Hypertext Markup Language. DT parsers are, therefore, HTML parsers, so configuring DT requires some knowledge of HTML. For example, parsing an HTML table requires an understanding of the TABLE, TR, TH, and TD tags, among others.
In simple cases HTML tags can be ignored. For more sophisticated parsing the user should be familiar with the basics of HTML: many good books are available on the subject.
Most web browsers allow the user to view the HTML source of a web page (View / Source in Microsoft Internet Explorer 8, View / Page Source in Mozilla Firefox 3). The source is displayed with tags and content in different colors, making it easier to see the structure of the web page. DT supports a "noparse" option that allows saving the URL content to a local file.
DTBuild's design allows creation of reusable parsing components: node groups. For example, the following HTML parsing node groups are installed with DT Census table scraper sample :
| HTML-element.dxg | - Generic HTML tag parser |
| HTML-entity.dxg | - Generic HTML entity parser |
| TR.dxg | - HTML "TR" (table row) parser |
| TD.dxg | - HTML "TD" (table data) parser |
| Extract-Digits.dxg | - digit ('0' .. '9') extractor |
Node groups greatly simplify top-level .DXD definitions. See DTBuild Help for more information.
DT supports several options that can be specified on a per-URL basis, in the Source table "options" field :
| Option: | Purpose: |
|---|---|
| noparse | Writes the URL's content directly to a file without
parsing it ¹ |
| trace | Increases the amount of diagnostic output produced |
| retries | Sets the maximum number of retries if an HTTP redirect occurs ² |
| decode | Parses the URL using the AfxParseURLEx ICU_DECODE option ³ |
| noencode | " AfxParseURLEx ( ICU_NO_ENCODE ) |
| nometa | " AfxParseURLEx ( ICU_NO_META ) |
| encodespacesonly | " AfxParseURLEx ( ICU_ENCODE_SPACES_ONLY ) |
| browsermode | " AfxParseURLEx ( ICU_BROWSER_MODE ) |
¹ The "noparse" option enables you to write the fetched URL's content directly to a local file, without modification. The output file path is specified in the sources table parser field. This option provides the unmodified results of the HTTP Get command. This can be very useful for parser development.
² The maximum number of HTTP redirects defaults to 3.
³ DT uses a Microsoft Foundation Class (MFC) function called "AfxParseURLEx" to parse the URL prior to fetching the data. Visit http://msdn.microsoft.com for more information on AfxParseURLEx and its options.
| Scan | Source | URL | Parser | Options |
|---|---|---|---|---|
| X | Sample data | http://dtutilities.com/sample.html | Census-01 | retries=0 |
| X | Sample data | http://dtutilities.com/sample.html | c:\urldump\sample.html | noparse |
| X | Sample data | http://dtutilities.com/sample.html | Census-01 | trace noencode |
The 1st example sets the number of HTTP redirects to zero.
The 2nd example writes the fetched URL's content, unmodified, to file c:\urldump\sample.html.
The 3rd example illustrates the combination of two options, separated by a blank.
DT can scan a sequence of web pages if those web pages contain links to "next" pages. The DTBuild parser must define a string variable called "NextURL" and set it appropriately. DT will continue fetching pages as long as "NextURL" contains a valid URL. The NextURL feature allows an unlimited sequence of web pages to be fetched for a single URL entry in the Sources table.
The parser will typically clear NextURL at first, then set it only if a "next" link is recognized. This is a powerful feature, but care must be taken to avoid infinite loops!
DT is installed with the DTBuild installation package.
DT sample databases are available at the DTBuild website :
The sample installers automatically configure their ODBC DSNs. The installers run the DSNConfig program, which use the *-DSN.txt files to configure the DSNs. The sample uninstallers remove the DSNs created at installation.
This DT sample extracts data from an HTML table at http://factfinder.census.gov.
It uses 4 database tables :
Keep in mind that the word "table" is used two ways here :
Two tasks are configured in the Tasks table :
| N | Exec | Action | Parameter |
|---|---|---|---|
| 1 | X | ExecSQL | SQL1 |
| 2 | X | Scan | Sources |
The SQL1 database table contains a single SQL statement :
delete from Output
This clears the Output table prior to the scan.
The Sources database table contains the URL / parser list :
| Scan | Source | URL | Parser |
|---|---|---|---|
| X | U.S. Census Bureau | http://factfinder.census.gov/ ... | Census-01.dxd |
The Census-01.DXD parser starts by looking for a "TableLocator" (specified as an entry in a DTBuild string set), in order to skip over page content preceding the HTML table to be scraped. The entries in the TableLocator string set contain keywords that distinguish the desired table from other tables that may precede it in the web page being parsed :
| Name | Population | Area | Grabbed |
|---|---|---|---|
| California | 33871648 | 163695.57 | 10/25/2009 12:03:28 AM |
| Texas | 20851820 | 268580.82 | 10/25/2009 12:03:28 AM |
| New York | 18976457 | 54556 | 10/25/2009 12:03:28 AM |
| ... | ... | ... | ... |
Questions, comments? Contact DTBuild support.