| Home | |
| Download | |
| Register | |
| Tutorial | |
| Help | |
| Site index | |
| Contact info |
| What's the idea? |
| What's parsing? |
| State machines |
| Regular expressions |
DTBuild's design is based on a patented invention: Configurable Pattern Recognition and Filtering Tool.
The fundamental building block is called a subpattern. It consists of :
Examples of sets :
The minimum can be zero or more occurrences; the maximum is greater than or equal to the minimum.
Examples of subpatterns :
These subpatterns can be linked together in any order. One of the subpatterns is designated as the "start node". As the input is scanned the machine moves from one subpattern to the next, deciding at certain points that subpatterns have been recognized in the input. When this recognition occurs actions can be performed.
Examples of actions :
This scheme can perform a wide variety of useful data transformation tasks.
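The subpattern idea described above can be sketched in a few lines of Python. This is only an illustration, not DTBuild's actual implementation or configuration format: the `Subpattern` class, the `run` function, and the sample chain are hypothetical names, the matching is greedy, and the subpatterns are linked in a simple straight line rather than in an arbitrary graph.

```python
from dataclasses import dataclass
import string


@dataclass(frozen=True)
class Subpattern:
    """A set of allowed values plus a minimum and maximum repeat count."""
    allowed: frozenset
    minimum: int
    maximum: int

    def __post_init__(self):
        # The minimum is zero or more; the maximum is at least the minimum.
        if self.minimum < 0 or self.maximum < self.minimum:
            raise ValueError("need 0 <= minimum <= maximum")

    def match(self, text, pos=0):
        """Return the end index of a greedy match starting at pos, or None."""
        end = pos
        while (end < len(text) and end - pos < self.maximum
               and text[end] in self.allowed):
            end += 1
        return end if end - pos >= self.minimum else None


def run(chain, text):
    """Match subpatterns in order; call each action on the text it recognized."""
    pos = 0
    for subpattern, action in chain:
        end = subpattern.match(text, pos)
        if end is None:
            return False
        action(text[pos:end])
        pos = end
    return True


# Hypothetical sets and chain, chosen only for this example.
DIGITS = frozenset(string.digits)
LETTERS = frozenset(string.ascii_letters)

recognized = []
chain = [
    (Subpattern(LETTERS, 1, 8), recognized.append),         # 1 to 8 letters
    (Subpattern(frozenset("-"), 0, 1), recognized.append),  # optional hyphen
    (Subpattern(DIGITS, 2, 4), recognized.append),          # 2 to 4 digits
]
ok = run(chain, "AB-1234")
```

Here the action attached to each subpattern simply records the recognized text; in a real transformation it could just as well write output, count matches, or discard the data.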
DTBuild's design is based on the idea that many data transformation tasks (searches, conversions, extractions, parsing, and so on) involve the same fundamental repetitive process :
DTBuild is very general. It knows about sets and patterns, states and transitions, and views the input as a stream of values (e.g. bytes or characters). It has no internal knowledge of XML, HTML, RTF, or even text files; it can be configured to work with all of them. The details of the transformation task are specified in the configuration (definition).
Another way to look at it: the task-specific logic is contained in the configuration instead of the program itself. The user has several options :
Of course, there are limits to what DTBuild can do. Developers can extend DTBuild's capability with user-defined functions in a custom DLL. See DTBuild help and the development tools topic for more information.
Parsing a stream of data means breaking it down into component parts according to a set of rules.
Parsing programs typically check each character in a data stream and group the characters into units known as tokens. What constitutes a token can differ from one program to the next, or from one set of grammatical rules to the next. With DTBuild the tokens are entirely user-defined.
In a web page, for example, the tokens would typically be the HTML tags (such as <TABLE>) and the data between the tags.
In an email the tokens are typically labels ("Subject:", for example) and their associated data.
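As a rough illustration of character-by-character tokenizing, the Python sketch below groups a stream into tag tokens and data tokens, following the web page example. The `tokenize` function and its token names are hypothetical, and the sketch ignores complications of real HTML such as comments and quoted attribute values.

```python
def tokenize(stream):
    """Group characters into ('tag', text) and ('data', text) tokens."""
    tokens = []
    current = []
    in_tag = False
    for ch in stream:
        if ch == "<" and not in_tag:
            # A tag starts: emit any data collected so far.
            if current:
                tokens.append(("data", "".join(current)))
            current = [ch]
            in_tag = True
        elif ch == ">" and in_tag:
            # The tag ends: emit it as one token.
            current.append(ch)
            tokens.append(("tag", "".join(current)))
            current = []
            in_tag = False
        else:
            current.append(ch)
    if current:
        # Leftover characters; an unterminated tag is still labeled 'tag'.
        tokens.append(("tag" if in_tag else "data", "".join(current)))
    return tokens


tokens = tokenize("<TABLE>cell text</TABLE>")
```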
Many programming systems use regular expressions to parse data. Here's a regular expression for parsing an email address :
'^[a-zA-Z0-9_\.\-]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-\.]+$'
In a nutshell, this means: a run of permitted characters, an at sign (@), then more permitted characters containing at least one period. This part :
a-zA-Z0-9
means alphanumeric characters: A-Z upper- and lower-case, plus digits.
With DTBuild you would define the alphanumeric set one time, calling it "alphanum" or something similar.
DTBuild has no special characters to worry about. With regular expressions some characters have special meanings, so if you need to handle these special characters in your input you have to "escape" them with the backslash ('\') character. The period is a special character; it can also appear in email addresses - that's why you see it, escaped :
\.
three times in the above regular expression. DTBuild's design avoids the special-character issue entirely.
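To make the contrast concrete, here is a hedged Python sketch that runs the regular expression above with the standard `re` module and then performs a similar check using named character sets defined once. The set names (`ALPHANUM`, `LOCAL`, and so on) and the `looks_like_email` function are hypothetical; this is not DTBuild syntax, only an illustration of defining a set one time and reusing it without escaping.

```python
import re
import string

# The regular expression from the text, unchanged.
EMAIL_RE = re.compile(r'^[a-zA-Z0-9_\.\-]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-\.]+$')

# Named sets, defined once. No character needs escaping here.
ALPHANUM = frozenset(string.ascii_letters + string.digits)
LOCAL = ALPHANUM | frozenset("_.-")
DOMAIN_FIRST = ALPHANUM | frozenset("-")
DOMAIN_REST = ALPHANUM | frozenset("-.")


def looks_like_email(text):
    """Check the same three-part shape using the named sets."""
    local, at, domain = text.partition("@")
    first, dot, rest = domain.partition(".")
    return (
        bool(at) and bool(dot)
        and len(local) >= 1 and all(c in LOCAL for c in local)
        and len(first) >= 1 and all(c in DOMAIN_FIRST for c in first)
        and len(rest) >= 1 and all(c in DOMAIN_REST for c in rest)
    )


samples = [
    "jane.doe@example.com",
    "jane.doe@example",
    "no-at-sign.example.com",
    "a@@b.com",
]
agreement = [bool(EMAIL_RE.match(s)) == looks_like_email(s) for s in samples]
```

On these sample strings the two checks give the same answers; neither is a full validator for real-world email addresses.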
DTBuild provides an alternative to regular expressions. As such, support for regular expressions is not planned for any future release.
This is a little technical, but it is helpful to understand the general idea: A state machine can consider the overall structure of an "input stream", for example a file or an email message :
A state machine -
It knows where it's been.
Consider email messages. In general, an email is composed of a header followed by a body. So the first state entered when scanning an email can be the "header" state, followed by the "body" state. Within the header state there can be a state for each component: the "subject" state, the "from" state, the "date" state, and so on.
To illustrate the importance of state-awareness, consider a message format with "From" and "To" addresses :
From:
Address:
…
To:
Address:
Simply looking for "Address:" isn't enough; you have to know which address you're dealing with, i.e. whether it's "From" or "To". DTBuild's state-aware design can handle this type of format easily. Some parsing utilities can't handle this situation or have to be specially "rigged" to do so.
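The From/To example can be sketched as a tiny state machine in Python. This is an assumption-laden illustration, not how DTBuild is configured: the `parse_addresses` function, the state names, and the sample addresses are all hypothetical.

```python
def parse_addresses(lines):
    """Record each address under the section ('From' or 'To') it appeared in."""
    state = "start"  # no section seen yet
    result = {}
    for line in lines:
        line = line.strip()
        if line == "From:":
            state = "From"
        elif line == "To:":
            state = "To"
        elif line.startswith("Address:") and state in ("From", "To"):
            # The current state says which address this is.
            result[state] = line[len("Address:"):].strip()
    return result


# Hypothetical message in the format shown above.
message = [
    "From:",
    "Address: alice@example.com",
    "To:",
    "Address: bob@example.com",
]
addresses = parse_addresses(message)
```

The two "Address:" lines look identical to a simple search; only the remembered state distinguishes them.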
DTBuild allows configuration of a state machine. All computer programs are themselves state machines, but only a few parsing utilities (Yacc, for example) allow programming of overall state machine behavior. DTBuild is unique in that it provides (requires!) a pictorial representation of the state machine that does the job.