A SYSTEM AND METHOD FOR PARSING DATA Inventor: John Fairweather
BACKGROUND OF THE INVENTION
The analysis and parsing of textual information is a well-developed field of study, falling primarily within what is commonly referred to as 'compiler theory'. At its most basic, a compiler requires three components: a lexical analyzer, which breaks the text stream up into known tokens; a parser, which interprets streams of tokens according to a language definition specified via a meta-language such as Backus-Naur Form (BNF); and a code generator/interpreter. The creation of compilers is conventionally a lengthy and off-line process, although certain industry-standard tools exist to facilitate this process, such as LEX and YACC from the Unix world. There are a large number of textbooks available on the theory of predictive parsers, and any person skilled in this art would have basic familiarity with this body of theory.
Parsers come in two basic forms, "top-down" and "bottom-up". Top-down parsers build the parse tree from the top (root) to the bottom (leaves); bottom-up parsers build the tree from the leaves to the root. For our purposes, we will consider only the top-down parsing strategy known as a predictive parser, since this most easily lends itself to a table-driven (rather than code-driven) approach and is thus the natural choice for any attempt to create a configurable and adaptive parser. In general, predictive parsers can handle a set of possible grammars referred to as LL(1), which is a subset of those potentially handled by LR parsers (LL(1) stands for 'Left-to-right, using Leftmost derivations, using at most 1 token look-ahead'). Another reason that a top-down algorithm is preferred is the ease of specifying these parsers directly in BNF form, which makes them easy to understand by most programmers. Compiler generators such as LEX and YACC generally use far more complex specification methods, including the generation of C code which must then be compiled, and are thus neither adaptive nor dynamic. For this reason, bottom-up table-driven techniques such as LR parsing (as used by YACC) are not considered suitable.
What is needed is a process that can rapidly (i.e., within seconds) generate a complete compiler from scratch and then apply that compiler in an adaptive manner to new input, the ultimate goal being the creation of an adaptive compiler, i.e., one that can alter itself in response to new input patterns in order to 'learn' to parse new patterns appearing in the input and to perform useful work as a result, without the need to add any new compiled code. This adaptive behavior is further described in Appendix 1 with respect to a lexical analyzer (referred to in the claims as the "claimed lexical analyzer"). The present invention provides a method for achieving the same rapid, flexible, and extensible generation in the corresponding parser.
SUMMARY OF INVENTION
The present invention discloses a parser that is totally customizable via BNF language specifications as well as registered functions, as described below. There are two principal routines: (a) PS_MakeDB(), which is a predictive parser generator algorithm, and (b) PS_Parse(), which is a generic predictive parser that operates on the tables produced by PS_MakeDB(). The parser generator PS_MakeDB() operates on a description of a language grammar and constructs the predictive parser tables that are passed to PS_Parse() in order to parse the grammar correctly. There are many algorithms that may be used by PS_MakeDB() to generate the predictive parser tables, as described in many books on compiler theory. The process consists essentially of computing the FIRST and FOLLOW sets of all grammar symbols (defined below) and then using these to create a predictive parser table. In order to perform useful action in response to inputs, this invention extends the BNF language to allow the specification of reverse-polish plug-in operation specifiers by enclosing such extended symbols between '<' and '>' delimiters. A registration API is provided that allows arbitrary plug-in functions to be registered with the parser and subsequently invoked as appropriate in response to a reverse-polish operator appearing on the top of the parser stack. The basic components of a complete parser/interpreter in this methodology are as follows (a usage sketch follows the list):
• The routine PS_Parse() itself (described below)
• The language BNF and LEX specifications.
• A plug-in 'resolver 400' function, called by PS_Parse() to resolve new input (described below)
• One or more numbered plug-in functions used to interpret the embedded reverse-polish operators.
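For illustration purposes only, the following fragment sketches how these components might be assembled by an application. The registration signatures shown for PS_SetResolver() and PS_SetPlugIn(), the LX_MakeDB() arguments, and the disposal call are assumptions made for the purpose of this example; the actual calls are discussed further below and in Appendix A.

    // Illustrative sketch only; registration signatures are assumed.
    ET_LexHdl   lex;
    ET_ParseHdl parser;

    lex    = LX_MakeDB(lexSpec);                // build target language lexical analyzer (args assumed)
    parser = PS_MakeDB(bnfText, lex, 0, 0, 0);  // build predictive parser tables from the BNF
    if ( parser )
    {
        PS_SetResolver(parser, myResolver);     // register the 'resolver' for variable tokens
        PS_SetPlugIn(parser, 1, myPlugInOne);   // register reverse-polish plug-in number 1
        if ( !PS_Parse(parser, sourceText) )    // R:0 success, -1 fail, 1 if aborted
            /* semantic work was performed by the plug-ins */ ;
        PS_KillDB(parser);                      // dispose of the parser when done (name assumed)
    }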
The 'langLex' parameter to PS_Parse() allows you to pass in the lexical analyzer database (created using LX_MakeDB()) to be used to recognize the target language. There
are a number of restrictions on the token numbers that can be returned by this lexical analyzer when used in conjunction with the parser. These are as follows:
1) The parser generator has its own internal lexical analyzer which reserves token numbers 59..63 for recognizing certain BNF symbols (described below); therefore, these token numbers cannot be used by the target language recognizer. Token numbers from 1..63 are reserved by the lexical analyzer to represent 'accepting' states in the 'catRange' token recognizer table; these token numbers are therefore not normally used by a lexical analyzer 'OneCat' token recognizer. What this means is that instead of having capacity for 63 variable-content tokens (e.g., names, numbers, symbols, etc.) in your target language, you are restricted to a maximum of 58 when using the parser.
2) If there are multiple names for a given symbol, then the multiplicity should be restricted to the lexical analyzer description; only one of the alternatives should be used in the parser tables.
3) In order to construct predictive parser tables, it is necessary to build up a 2-dimensional array where one axis is the target language token number and the other axis is the non-terminal symbols of the BNF grammar. The parser generator is limited to grammars having no more than 256 non-terminal grammar symbols; however, in order to avoid requiring massive amounts of memory and time to compute the parsing table, the number of terminal symbols (i.e., those recognized by the lexical analyzer passed in 'langLex') should be limited to 256 also. This means that the lexical analyzer should never return any token number that is greater than 'kMaxTerminalSym'. For example, token numbers 1..59 are available for use as accepting states for the 'catRange' recognizer, while tokens 64..255 are available for use with the 'OneCat' recognizer.
The invention also provides a solution for applications in which a language has token numbers that use the full 32 bits provided by LEX. Immediately after calling the 'langLex' lexical analyzer to fetch the next token in the input stream, PS_Parse() calls the registered 'resolver 400' function with a 'no action' parameter (normally no action is exactly what is required), but this also provides an opportunity for the plug-in code to alter the token number (and token size etc.) to a value that is within the permitted range.
There are also many other aspects of the invention that allow the parser to accept or process languages that are considerably more complex than LL(1). For example, suppose a recognizer is programmed to recognize the names of people (for which there are far more than 256 possibilities); when the 'no-action' call is initiated, the function PS_SetCurrToken() could be used to alter the token number to, say, 58. Then, in your BNF grammar, you specify a token number of 58 (e.g., <58:Person Name>) wherever you expect to process a name. The token string will be available to the plug-in and resolver 400 functions on subsequent calls, which could easily reconstitute the original token number; the plug-in code could also be programmed to call 'langLex' using PS_LangLex(). Other applications and improvements are also disclosed and claimed in this application, as described in further detail below.
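By way of a hedged illustration, such a resolver case might be sketched as follows; the resolver calling convention and the 'no action' selector name are assumptions for this example, since only PS_SetCurrToken() is specified above.

    // Hypothetical resolver fragment: fold person names onto token 58.
    // The calling convention and kResolverNoAction are assumed.
    static int32 myResolver ( ET_ParseHdl aParseDB, int32 action, int32 tokenNum )
    {
        if ( action == kResolverNoAction )      // called after each token fetch
        {
            if ( IsPersonName(tokenNum) )       // hypothetical application test
                PS_SetCurrToken(aParseDB, 58);  // grammar sees <58:Person Name>
        }
        return 0;                               // accept the (possibly altered) token
    }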
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 provides a sample BNF specification;
Figure 2 is a block diagram illustrating a set of operations as performed by the parser of the present invention;
Figure 3 provides a sample code fragment for a predefined plug-in that can work in conjunction with the parser of the present invention; and
Figure 4 provides sample code for a resolver of the present invention.
Appendix A provides code for a sample Application Programming Interface (API) for the parser of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
As described above, the parser of this invention utilizes the lexical analyzer described in Appendix 1, and the reader may refer to this incorporated patent application for a more detailed explanation of some of the terms used herein. For illustration purposes, many of the processes described in this application are accompanied by samples of the computer code that could be used to perform such functions. It would be clear to one skilled in the art that these code samples are for illustration purposes only and should not be interpreted as a limitation on the claimed inventions.
The present invention discloses a parser that is totally customizable via BNF language specifications as well as registered functions, as described below. There are two principal routines: (a) PS_MakeDB(), which is a predictive parser generator algorithm, and (b) PS_Parse(), which is a generic predictive parser that operates on the tables produced by PS_MakeDB(). The parser generator PS_MakeDB() operates on a description of a language grammar and constructs the predictive parser tables that are passed to PS_Parse() in order to parse the grammar correctly. PS_MakeDB() has the following function prototype:
    ET_ParseHdl PS_MakeDB (             // Make a predictive parser for PS_Parse()
        charPtr     bnf,                // I:C string specifying grammar's BNF
        ET_LexHdl   langLex,            // I:Target language lex (from LX_MakeDB)
        int32       options,            // I:Various configuration options
        int32       parseStackSize,     // I:Max. depth of parser stack, 0=default
        int32       evalStackSize       // I:Max. depth of evaluation stack, 0=default
    );                                  // R:handle to created DB
The 'bnf' parameter to PS_MakeDB() contains a series of lines that specify the BNF for the grammar in the form:
    nonterminal ::= production_1 <or> production_2 <or> ...
Where production_1 and production_2 consist of any sequence of terminal symbols (described in the lexical analyzer 'langLex' passed in to PS_MakeDB(), provided that such symbols are greater than or equal to 64) or non-terminal symbols. Productions may continue onto the next line if required, but any time a non-blank character is encountered in the first position of a line, it is assumed to be the start of a new production list. The grammar supplied must be unambiguous and LL(1).
The parser generator uses the symbols ::=, <or>, and <null> to represent BNF productions. The symbols <opnd>, <bkup>, and the variable ('catRange') symbols <@nn:mm[:hint text]> and <nn:arbitrary text> also have special meaning and are recognized by the built-in parser-generator lexical analyzer. The parser generator will interpret any sequence of upper or lower case letters (a..z) or numbers (0..9) or the underscore character '_', that begins with a letter or underscore, and which is not recognized by, or which is assigned a token number in the range 1-63 by, the lexical analyzer passed in 'langLex', as a non-terminal grammar symbol (e.g., program, expression, if_statement, etc.). These symbols are added to the parser generator's grammar symbol list (maximum of 256 symbols) and define the set of non-terminals that make up the grammar. There is no need to specify this set; it is deduced from the BNF supplied. One thing that is very important, however, is that the first such symbol encountered in the BNF becomes the root non-terminal of the grammar (e.g., program). This symbol is given special meaning by the parser and thus it must appear on the left hand side of the first production specified in the BNF. The <endf> symbol is used to indicate where the expected end of the input string will occur and its specification cannot be omitted from the BNF. Normally, as in the example below, <endf> occurs at the end of the root non-terminal production.
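For example, a minimal grammar in this notation might read as follows (an illustrative fragment only, not taken from Figure 1); 'program' is the root non-terminal because it appears first, and <endf> marks the expected end of input:

    program        ::= statement_list <endf>
    statement_list ::= <null> <or> statement statement_list
    statement      ::= <1:Identifier> = expression ;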
Referring now to Figure 1, a sample BNF specification is provided. This BNF gives a relatively complete description of the C language expression syntax together with enforcement of all operator precedence specified by ANSI and is sufficient to create a program to recognize and interpret C expressions. As Figure 1 demonstrates, the precedence order may be specified simply by choosing the order in which one production leads to
another with the lowest precedence grammar constructs/operators being refined through a series of productions into the higher precedence ones. Note also that many productions lead directly to themselves (e.g., more_statements ::= <null> <or> statement more_statements); this is the mechanism used to represent the fact that a list of similar constructs is permitted at this point.
The syntax for any computer language can be described either as syntax diagrams or as a series of grammar productions similar to those above (ignoring the weird '@' BNF symbols for now). Using this syntax, the code illustrated in Figure 1 could easily be modified to parse programs in any number of different computer languages simply by entering the grammar productions as they appear in the language's specification. The way of specifying a grammar as illustrated in Figure 1 is a custom variant of the Backus-Naur Form (or BNF). It is the oldest and easiest to understand means of describing a computer language. The symbols enclosed between '<' '>' pairs plus the '::=' symbol are referred to as "meta-symbols". These are symbols that are not part of the language but are part of the language specification. A production of the form (nonterminal ::= production_1 <or> production_2) means that there are two alternative constructs that 'nonterminal' can be comprised of; they are 'production_1' or 'production_2'.
The grammar for many programming languages may contain hundreds of these productions; for example, the definition of Algol 60 contains 117. An LL(1) parser must be able to tell at any given time which production out of a series of productions is the right one simply by looking at the current token in the input stream and the non-terminal that it currently has on the top of its parsing stack. This means, effectively, that the sets of all possible first tokens for each production appearing on the right hand side of any grammar production must not overlap. The parser must be able to look at the token in the input stream and tell which production on the right hand side is the 'right one'. The set of all tokens that might start any given non-terminal symbol in the grammar is known as the FIRST set of that non-terminal. When designing a language to be processed by this package, it is important to ensure that these FIRST sets do not overlap. In order to understand how to write productions for an LL(1) parser, it is important to understand recursion in a grammar, and the difference between left and right recursion in particular.
Recursion is usually used in grammars to express a list of things separated by some separator symbol (e.g., comma). This can be expressed either as "<A> ::= <A> , <B>" or "<A> ::= <B> , <A>". The first form is left recursive; the second form is known as right recursive. The production "more_statements ::= <null> <or> statement more_statements" above is an example of a right recursive production. Left recursive productions are not permitted because of the risk of looping during parsing. For example, if the parser tries to use a production of the form '<A> ::= <A> anything' then it will fall into an infinite loop trying to expand <A>. This is known as left recursion. Left recursion may be more subtle, as in the pair of productions '<S> ::= <X> a <or> b' and '<X> ::= <S> c <or> d'. Here the recursion is indirect; that is, the parser expands '<S>' into '<X> a', then it subsequently expands '<X>' into '<S> c', which gets it back to trying to expand '<S>', thereby creating an infinite loop. This is known as indirect left recursion. All left recursion of this type must be eliminated from the grammar before it is processed by the parser. A simple method for accomplishing this proceeds as follows: replace all productions of the form '<A> ::= <A> anything' (or indirect equivalents) by a set of productions of the form "<A> ::= t1 more_t1 <or> ... <or> tn more_tn" where t1..tn are the language tokens (or non-terminal grammar symbols) that start the various different forms of '<A>'.
A second problem with top-down parsers in general is that the order of the alternative productions is important in determining if the parser will accept the complete language or not. One way to avoid this problem is to require that the FIRST sets of all productions on the right hand side be non-overlapping. Thus, in conventional BNF, it is permissible to write:
expression ::= element <or> element + expression <or> element * expression
To meet the requirements of PS_MakeDB() and of an LL(1) parser, this BNF statement may be reformulated into a pair of statements viz:
    expression ::= element rest_of_expression
    rest_of_expression ::= <null> <or> + expression <or> * expression
As can be seen, the 'element' token has been factored out of the two alternatives (a process known as left-factoring) in order to avoid a multiply-defined FIRST set. In addition, this process has added a new symbol to the BNF meta-language, the <null> symbol. A <null> symbol is used to indicate to the parser generator that a particular grammar non-terminal is nullable, that is, it may not in fact be present at all in certain input streams. There are a large number of examples of the use of this technique in the BNF grammar illustrated in Figure 1, such as statement 100.
The discussion above describes the manner in which LL(1) grammars may be created and used. LL(1) grammars, however, can be somewhat restrictive, and the parser of the present invention is capable of accepting a much larger set by the use of deliberate ambiguity. Consider the grammar:
operand ::= expression <or> ( address_register )
This might commonly occur when specifying assembly language syntax. The problem is that this is not LL(1), since expression may itself start with a '(' token, or it may not; thus when processing operand, the parser may under certain circumstances need to look not at the first, but at the second token in the input stream to determine which alternative to take. Such a parser would be an LL(2) parser. The problem cannot be solved by factoring out the '(' token as in the expression example above, because expressions do not have to start with a '('. Thus, without extending the language beyond LL(1), the normal parser would be unable to handle this situation. Consider however the modified grammar fragment:
    operand ::= ... <or> ( expr_or_indir <or> expression
    expr_or_indir ::= Aregister ) <or> expression )
Here we have a production for operand which is deliberately ambiguous, because it has a multiply-defined FIRST set: '(' is in FIRST of both of the last two alternatives. The modified fragment arranges the order of the alternatives such that the parser will take the "( expr_or_indir" production first and, should it fail to find an address register following the initial '(' token, the parser will then take the second production, which correctly processes "expression )" since expression itself need not begin with a '(' token. If this case were permitted, the parser would have the equivalent of a two token look-ahead; hence the language it can accept is now LL(2).
Alternatively, an options parameter 'kIgnoreAmbiguities' could be passed to PS_MakeDB() to cause it to accept grammars containing such FIRST set ambiguities. One problem with this approach, however, is that the parser generator can no longer verify the correctness of the grammar, meaning that the user must ensure that the first production can always be reduced to the second production when such a grammatical trick is used. As such, this parameter should only be used when the grammar is well understood.
Grammars can get considerably nastier than LL(2). Consider the problem of parsing the complete set of 68K assembly language addressing modes, or more particularly the absolute, indirect, pre-decrement and post-increment addressing modes. The absolute and indirect syntax was presented above; however, the pre-decrement addressing mode adds the form "-( Aregister )", while the post-increment adds the form "( Aregister )+". An LL(3) parser would be needed to handle the pre-decrement mode, since the parser cannot positively identify the pre-decrement mode until it has consumed both the leading '-' and '(' tokens in the input stream. An LL(4) parser is necessary to recognize the post-increment form. One option is to just left-factor out the "( Aregister )" for the post-increment form. This approach would work if the only requirement was recognition of a valid assembly syntax. To the extent that the parser is being used to perform some useful function, however, this approach will not work. Instead, this can be accomplished by inserting a reverse-polish plug-in operator call of the form <@n:m[:hint text]> into the grammar. Whenever the parser exposes such an operator on the top of the parsing stack, it calls the operator in order to accomplish some sort of semantic action or processing. Assuming a different plug-in is called in order to handle each of the different 68K addressing modes, it is important to know what addressing mode is presented in order to ensure that the proper plug-in is called. In order to do this, the present invention extends the parser language set to be LL(n) where 'n' could be quite large.
The parser of the present invention extends the parser language in this fashion by providing explicit control of limited parser back-up capabilities. One way to provide these capabilities is by adding the <bkup> meta-symbol. Backing up a parser is complex, since the parsing stack must be repaired and the lexical analyzer backed up to an earlier point in the token stream in order to try an alternative production. Nonetheless, the PS_Parse() parser is capable of limited backup within a single input line by use of the <bkup> flag. Consider the modified grammar fragment:
    operand ::= ... <or> ( Aregister <bkup> areg_indirect <or> abs_or_displ <or> ...
    abs_or_displ ::= - ( ARegister <bkup> ) <@1:1> <or> expression <@1:2>
    areg_indirect ::= ) opt_postinc
    opt_postinc ::= <@1:3> <or> + <@1:4>
A limited backup is provided through the following methodology. Let us assume that <@1:1> is the handler for the pre-decrement mode, <@1:2> for the absolute mode, <@1:3> for the indirect mode, and <@1:4> for the post-increment mode. When the parser encounters a '(' token it will push on the "( Aregister <bkup> areg_indirect" production. Whenever the parser notices the presence of the <bkup> symbol in the production being pushed, however, it saves its own state as well as that of the input lexical analyzer. Parsing continues and the '(' is accepted. Now let us assume instead that the input was actually an expression, so when the parser tries to match the 'ARegister' terminal that is now on the top of its parsing stack, it fails. Without the backup flag, this is considered a syntax error and the parser aborts. Because the parser has a saved state, however, the parser restores the backup of the parser and lexical analyzer state to that which existed at the time it first encountered the '(' symbol. This time around, the parser causes the production that immediately follows the one containing the <bkup> flag to be selected in preference to the original. Since the lexical analyzer has also been backed up, the first token processed is once again '(' and parsing proceeds normally through "abs_or_displ" to "expression" and finally to invocation of plug-in <@1:2> as appropriate for the absolute mode.
Note that a similar but slightly different sequence is caused by the <bkup> flag in the first production for "abs_or_displ" and that in all cases, the plug-in that is appropriate to the addressing mode encountered will be invoked and no other. Thus, by using explicit ambiguity plus controlled parser backup, the present invention provides a parser capable of recognizing languages from a set of grammars that are considerably larger than those normally associated with predictive parsing techniques. Indeed, the set is sufficiently large that it can probably handle practically any computer programming language. By judicious use of the plug-in and resolver 400 architectures described below, this language set can be further extended to include grammars that are not context-free (e.g., English) and that cannot be handled by conventional predictive parsers.
In order to build grammars for this parser, it is also important to understand the concept of a FOLLOW set. For any non-terminal grammar symbol X, FOLLOW(X) is the set of terminal symbols that can appear immediately to the right of X in some sentential form. In other words, it is the set of things that may come immediately after that grammar symbol. To build a predictive parser table, PS_MakeDB() must compute not only the FIRST set of all non-terminals (which determines what to PUSH onto the parsing stack), but also the FOLLOW sets (which determine when to POP the parsing stack and move to a higher level production). If the FOLLOW sets are not correct, the parser will never pop its stack and eventually will fail. For this reason, unlike for FIRST sets, ambiguity in the FOLLOW sets is not permitted. What this means is that for any situation in a grammar, the parser must be able to tell when it is done with a production by looking at the next token in the input stream (i.e., the first token of the next production). PS_MakeDB() will reject any grammar containing ambiguous FOLLOW sets.
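By way of a short worked example using the 'expression' pair from above, the relevant sets would be:

    expression         ::= element rest_of_expression
    rest_of_expression ::= <null> <or> + expression <or> * expression

    FIRST(rest_of_expression)  = { + , * } plus <null>
    FOLLOW(rest_of_expression) = FOLLOW(expression), e.g., { ) , <endf> }

Because rest_of_expression is nullable, the parser pops it from the parsing stack whenever the current input token is in its FOLLOW set rather than in { + , * }; if '+' or '*' also appeared in the FOLLOW set, the parser could not make this decision and the grammar would be rejected.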
Before illustrating how the parser of the present invention can be used to accomplish specific tasks, it is important to understand how PS_Parse() 205 actually accomplishes the parsing operation. Referring now to Figure 2, the parsing function of the present invention is shown. PS_Parse() 205 maintains two stacks: the first is called the parsing stack 210 and contains encoded versions of the grammar productions specified in the BNF. The second stack is called the evaluation stack 215. Every time the parser accepts/consumes a token in the input stream in the range 1..59, it pushes a record onto this evaluation stack 215. Records on this stack 215 can have values that are either integer, real, pointer or symbolic. When the record is first pushed onto the stack 215, the value is always 'symbolic', since the parser itself does not know how to interpret symbols returned by the lexical analyzer 250 that lie in this range. A symbolic table entry 220 contains the token number recognized by the 'langLex' lexical analyzer 250, together with the token string. In the language defined in Figure 1, the token number for an identifier is 1 (i.e., line 110) while that for a decimal integer is 3 (i.e., line 115); thus if the parser 205 were to encounter the token stream "A + 10", it would add two symbol records to the evaluation stack 215. The first would have token number 1 and token string "A" and the second would have token number 3 and token string "10". At the time the parser 205 processes an additive expression such as "A + 10", its parsing (not evaluation) stack 210 would appear as "mult_expr + mult_expr <@0:15>" where the symbol on the left is at the top of the parsing stack 210. As the parser 205 encounters the 'A' in the string "A + 10", it resolves mult_expr until it eventually accepts the 'A' token, pops it off the parsing stack 210, and pushes a record onto the evaluation stack 215. So now the parsing stack 210 looks like "+ mult_expr <@0:15>" and the evaluation stack 215 contains just one element "[token=1, string='A']". The parser 205 then matches the '+' operator on the stack with the one in the input and pops the parsing stack 210 to obtain "mult_expr <@0:15>". Parsing continues with the input token now pointing at the 10 until it too is accepted. This process yields a parsing stack 210 of "<@0:15>" and an evaluation stack 215 of "[token=3, string='10'][token=1, string='A']" where the left hand record is considered to be the top of the stack.
At this point, the parser 205 recognizes that it has exposed a reverse-polish plug-in operator on the top of its parsing stack 210 and pops it, and then calls the appropriate plug-in, which, in this case, is the built-in add operation provided by PS_Evaluate() 260, a predefined plug-in called plug-in zero 260. When the parser 205 calls plug-in zero 260, the parser 205 passes the value 15 to the plug-in 260. In this specific case, 15 means add the top two elements of the evaluation stack 215, pop the stack by one, and put the result into the new top of stack. This behavior is exactly analogous to that performed by any reverse-polish calculator. This means that the top of the evaluation stack 215 now contains the value A+10, and the parser 205 has actually been used to interpret and execute a fragment of C code. Since there is provision for up to 63 application-defined plug-in functions, this mechanism can be used to perform any arbitrary processing as the language is parsed. Since the stack 215 is processed in reverse-polish manner, grammar constructs may be nested to arbitrary depth without causing confusion, since the parser 205 will already have collapsed any embedded expressions passed to a higher construct. Hence, whenever a plug-in is called, the evaluation stack 215 will contain the operands to that plug-in in the expected positions.
To illustrate how a plug-in might look, Figure 3 provides a sample code fragment from a predefined plug-in that handles the '+' operator (TOF_STACK is defined as 0, NXT_STACK as 1). As Figure 3 illustrates, this plug-in first evaluates 305 the values of the top two elements of the stack by calling PS_EvalIdent(). This function invokes the registered 'resolver 400' function in order to convert a symbolic evaluation stack record to a numeric value (see below for a description of resolver 400). Next, the plug-in must determine 310 the types of the two evaluation stack elements (are they real or integer?). This information is used in a case statement to ensure that C performs the necessary type conversions on the values before they are used in a computation. After selecting the correct case block for the types of the two operands, the function calls PS_SetiValue() or PS_SetfValue() 315 as appropriate to set the numeric value of the NXT_STACK element of the evaluation stack 215 to the result of adding the two top stack elements. Finally, at the end of the routine, the evaluation stack 215 is popped 320 to move the new top of the stack to what was the NXT_STACK element. This is all it takes to write a reverse-polish plug-in operator. This aspect of the invention permits a virtually unlimited number of support routines that could be developed to allow plug-ins to manipulate the evaluation stack 215 in this manner.
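A hedged sketch of such a plug-in, corresponding to the Figure 3 fragment, might read as follows. The plug-in calling convention and the type-constant name 'kIntegerType' are assumptions for this example; the stack-access routines are those described in the API section below.

    // Sketch of a reverse-polish '+' plug-in (cf. Figure 3).
    // Calling convention and kIntegerType are assumed for illustration.
    static int32 myAddPlugIn ( ET_ParseHdl aParseDB )
    {
        PS_EvalIdent(aParseDB, TOF_STACK);      // resolve symbolic records to
        PS_EvalIdent(aParseDB, NXT_STACK);      // values via the resolver

        if ( PS_StackType(aParseDB, TOF_STACK) == kIntegerType &&
             PS_StackType(aParseDB, NXT_STACK) == kIntegerType )
            PS_SetiValue(aParseDB, NXT_STACK,   // integer + integer
                PS_GetIntegerStackValue(aParseDB, NXT_STACK) +
                PS_GetIntegerStackValue(aParseDB, TOF_STACK));
        else                                    // mixed or real operands
            PS_SetfValue(aParseDB, NXT_STACK,
                PS_GetRealStackValue(aParseDB, NXT_STACK) +
                PS_GetRealStackValue(aParseDB, TOF_STACK));

        PS_Pop(aParseDB);                       // result becomes new top of stack
        return 0;
    }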
Another problem that has been addressed with the plug-in architecture of the present invention is the problem of having the plug-in function determine the number of parameters that were passed to it; for instance, a plug-in would need to know the number of parameters in order to process the C printf() function (which takes a variable number of arguments). If a grammar does not force the number of arguments (as in the example BNF above for the production "<opnd> ( parameter_list ) <@1:1>"), then an <opnd> meta-symbol can be added at the point where the operand list begins. The parser 205 uses this symbol to determine how many operands were passed to a plug-in in response to a call requesting this information. Other than this purpose, the <opnd> meta-symbol is ignored during parsing. The <opnd> meta-symbol should always start the right hand side (RHS) of a production in order to ensure correct operand counting. For example, the production:
    primary ::= <9:Function> <opnd> ( parameter_list ) <@1:1>
will result in an erroneous operand count at run time, while the production pair below will not:
    primary ::= <9:Function> restof_fn_call <@1:1>
    restof_fn_call ::= <opnd> ( parameter_list )
The last issue is how to actually get the value of symbols into the parser 205. This is what the symbols in the BNF of the form "<n:text string>" are for. The numeric value of 'n' must lie between 1 and 59, and it refers to the terminal symbol returned by the lexical analyzer 250 passed in via 'langLex' to PS_MakeDB(). It is assumed that all symbols in the range 1..59 represent 'variable tokens' in the target language; that is, tokens whose exact content may vary (normally recognized by a LEX catRange table) in such a way that the string of characters within the token carries additional meaning that allows a 'value' to be assigned to that token. Examples of such variable tokens are identifiers, integers, real numbers, etc. A routine known as a 'resolver 400' will be called whenever the value of one of these tokens is required or as each token is first recognized. In the BNF illustrated in Figure 1, the lexical analyzer 250 supplied returns token numbers 3, 7, 8, 9, 10 or 11 for various types of C integer numeric input; 4, 5, and 6 for various C real number formats; 1 for a C identifier (i.e., non-reserved word); and 2 for a character constant.
Referring now to Figure 4, a simple resolver 400 which converts these tokens into the numeric values required by the parser 205 (assuming that identifiers are limited to single character values from A..Z or a..z) is shown. As Figure 4 illustrates, when called to evaluate a symbol, the resolver 400 determines which type of symbol is involved from the lexical analyzer token returned. It then calls whatever routine is appropriate to convert the contents of the token string to a numeric value. In the example above, this is trivial because the lexical analyzer 250 has been arranged to recognize C language constructs; hence we can call the C I/O library routines to make the conversion. Once the value has been obtained, the resolver 400 calls the applicable routine and the value is assigned to the designated evaluation stack 215 entry. The resolver 400 is also called whenever a plug-in wishes to assign a value to a symbolic evaluation stack 215 entry by running the 'kResolverAssign' case block code. In this case, the value is passed in via the function parameters and the resolver 400 uses the token string in the target evaluation stack 215 entry to determine how and where to store the value.
The final purpose of the resolver function 400 is to examine and possibly edit the incoming token stream in order to effectively provide unlimited grammar complexity. For example, consider the problem of a generalized query language that uses the parser. It must define a separate sub-language for each different container type that may be encountered in a query. In such a case, a resolver function 400 could be provided that recognizes the beginning of such a sub-language sequence (for example a SQL statement) and modifies the token returned to consume the entire sequence. The parser 205 itself would then not have to know the syntax of SQL but would simply pass the entire SQL statement to the selected plug-in as the token string for the symbol returned by the recognizer. By using this approach, an application using PS_Parse() that is capable of processing virtually any grammar can be built.
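A hedged sketch of such a resolver case follows; the token descriptor structure and its field names are assumptions based on the PS_GetTokenState()/PS_SetTokenState() description below, and the helper routine shown is hypothetical.

    // Hypothetical resolver fragment: consume an embedded SQL statement
    // as a single token. Structure and field names are assumed.
    if ( strncmp(tokenPtr, "SELECT", 6) == 0 )
    {
        ET_TokenState ts;                       // descriptor type assumed
        PS_GetTokenState(aParseDB, &ts);        // current token descriptor
        ts.tokenNumber = kSqlToken;             // hypothetical reserved token number
        ts.tokenSize   = SqlStatementLength(ts.tokenPtr);  // span whole statement
        PS_SetTokenState(aParseDB, &ts);        // parser now sees one opaque token
    }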
The basic Application Programming Interface (API) to the parser 205 of this invention is given below. The discussion that follows describes the basic purpose of these various API calls. Sample code for many of these functions is provided in Appendix A.
PS_SetParserTag(), PS_GetParserTag(). These functions get and permit modification of a number of numeric tag values associated with a parser 205. These values are not used by internal parser 205 code and are available for custom purposes. This is often essential when building custom parsing applications upon this API.
PS_Pop(), PS_Push(). These functions pop or push the parser 205 evaluation stack 215 and are generally called by plug-ins.
PS_PushParserState(), PS_PopParserState(). These functions push/pop the entire internal parser 205 state. This capability can be used to implement loops, procedure calls or other similar interpreted language constructs. These functions may be called within a parser plug-in in order to cause a non-local transfer of the parser state. The entire parser state, including as a minimum the evaluation stack 215, parser stack 210, and input line buffer, must be saved/restored.
PS_ParseStackElem(). This function returns the current value of the specified parsing stack 210 element (usually the top of the stack). This stack should not be confused with the evaluation stack 215 to which most other stack access functions in this API refer. As described above, the parser stack 210 is used internally by the parser 205 for predictive parsing purposes. Values below 64 are used for internal purposes and to recognize complex tokens such as identifiers or numbers; values above 64 tend to be either terminal symbols in the language being parsed, or non-terminals that are part of the grammar syntax definition (>= 32256). Plug-ins have no direct control of the parsing stack 210; however, they may accomplish certain language tricks by knowing the current top of stack and altering the input stream perceived by the parser 205 as desired.
PS_PopTopOfParseStack(), PS_PushTopOfParseStack(). PS_PopTopOfParseStack() pops and discards the top of the parsing stack 210 (see PS_TopOfParseStack). This is not needed under normal circumstances; however, this technique can be used to discard unwanted terminal symbols off the stack 210 in cases where the language allows these to be optional under certain circumstances too complex to describe by syntax.
PS_WillPopParseStack(). In certain circumstances, it may be necessary for a parser recognizer function to determine if the current token will cause the existing parser stack 210 to be popped, that is, "is the token in the FOLLOW set of the current top of the parse stack?" This information can be used to terminate specialized modes where the recognizer loops through a set of input tokens returning -3, which causes the parser 205 to bulk consume input. A parameter is also provided that allows the caller to determine where in the parsing stack 210 the search can begin; normally this would be the top of the stack, i.e., parameter = 0.
PS_IsLegalToken(). This function can be used to determine if a specific terminal token is a legal starting point for a production from the specified non-terminal symbol. Among other things, this function may be used within resolver 400 functions to determine if a specific token number will cause a parsing error if returned given the current state of the parsing stack. This ability allows resolver 400 functions to adjust the tokens they return based on what the parse state is.
PS_GetProduction(). This function obtains the parser production that would replace the specified non-terminal on the stack 210 if the specified terminal were encountered in the input. This information can be used to examine future parser 205 behavior given the current parser 205 state and input. The [0] element of each element of the production returned contains the terminal or non-terminal symbol concerned and can be examined using routines like PS_IsPostFixOperator().
PS_IsPostFixOperator() determines if the specified parse stack element corresponds to the postfix operator specified.
PS_MakeDB(). This function creates a complete predictive parsing database for use with PS_Parse(). If successful, it returns a handle to the created DB; otherwise it returns zero. The algorithm utilized by this function to construct a predictive parser 205 table can be found in any good reference on compiler theory. The parser 205 utilizes a supplied lexical analyzer as described in Appendix 1. When no longer required, the parser 205 can be disposed of using PS_KillDB().
PS_DisgardToken(). This function can be called from a resolver 400 or plug-in to cause the current token to be discarded. In the case of a resolver 400, the normal method to achieve this effect is to return -3 as the resolver 400 result; however, calling this function is an alternative. In the case of a plug-in, a call to this function will cause an immediate call to the resolver 400 in order to acquire a new token.
PS_RegisterParser(), PS_DeRegisterParser(), PS_ResolveParser(), PS_CloneDB(). These routines are all associated with maintaining a cache of recently constructed parsers so that subsequent invocations of parsers for identical languages can be met instantaneously. The details of this cache are not pertinent to this invention.
PS_LoadBNF(), PS_LoadBlock(), PS_ListLanguages(). These routines are all associated with obtaining the BNF specification for a parser 205 from a text file containing a number of such specifications. The details of this process are not pertinent to this invention.
PS_StackCopy(). This function copies one element of a parser stack 210 to another.
PS_SetStack() sets an element of a parsing stack 210 to the designated type and value.
PS_CallBuiltInLex(). This function causes the parser to move to the next token in the input stream. In some situations, a resolver 400 function may wish to call its own lexical analyzer prior to calling the standard one, as for example when processing a programming language where the majority of tokens appearing in the input stream will be symbol table references. By calling its own analyzer first and only calling this function if it fails to recognize a token, a resolver 400 can save a considerable amount of time on extremely large input files.
PS_GetLineCount(). This function returns the current line count for the parse. It is only meaningful from within the parse itself (i.e., in a plug-in or a resolver 400 function).
PS_GetStackDepth(). This function returns the current depth of the parsing evaluation stack. This may be useful in cases where you do not want to pay strict attention to the popping of the stack during a parse, but wish to ensure that it does not overflow by restoring it to a prior depth (by successive PS_Pop()'s) from a plug-in at some convenient synchronizing grammatical construct.
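For illustration, a plug-in at such a synchronizing construct might restore a previously recorded depth as sketched below; 'savedDepth' is assumed to have been captured earlier via PS_GetStackDepth().

    // Restore the evaluation stack to a previously recorded depth.
    while ( PS_GetStackDepth(aParseDB) > savedDepth )
        if ( !PS_Pop(aParseDB) )                // FALSE means stack bottom reached
            break;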
PS_SetOptions(), PS_ClrOptions(), PS_GetOptions(). The function PS_SetOptions() may be used to modify the options for a parse DB (possibly while it is in progress). One application of such a function is to turn on full parse tracing (from within a plug-in or resolver 400) when the line count reaches a line at which you know the parse will fail. PS_ClrOptions() performs the converse operation; that is, it clears the parsing options bits specified. The function PS_GetOptions() returns the current options settings.
PS_FlagError(). In addition to invoking an underlying error logging facility if something goes wrong in a plug-in or resolver 400, this routine can be called to force the parser to abort. If this routine is not called, the parse will continue (which may be appropriate if the erroneous condition has been repaired).
PS_ForceReStart(). This function causes the parser to re-start the parse from scratch. It is normally used when plug-ins or resolver 400 functions have altered the source text as a result of the parsing process, and wish the parser to re-scan in order to force a new behavior. This function does not alter the current lexical analyzer position (i.e., it continues from where it left off). If you wish to alter this also, you must call PS_SetTokenState().
PS_StackType(). This function gets the contents type of a parser stack element and returns the stack element type. PS_GetOpCount() gets the number of operands that apply to the specified stack element, which should be a plug-in reverse-polish operator; it returns the number of operands passed to the plug-in, or -1 if no operand list is found. PS_GetValue() gets the current value of a parser stack element and returns a pointer to the token string, or NULL if not available.
PS_SetElemFlags(), PS_ClrElemFlags(), PS_GetElemFlags(). The first two routines set or clear flag bits in the stack element flag word. PS_GetElemFlags() returns the whole flags word. These flags may be used by resolver 400 and plug-in functions to maintain state information associated with elements on the evaluation stack 215.
PS_SetiValue(), PS_SetfValue(), PS_SetpValue(), PS_SetsValue(). These routines set the current value and type of a parser stack element to the value supplied, where:

PS_SetiValue() — sets the element to a 64-bit integer
PS_SetfValue() — sets the element to a double
PS_SetpValue() — sets the element to a pointer value
PS_SetsValue() — sets the element to a symbol number
PS_GetToken(). This function gets the original token string for a parsing stack element. If the stack element no longer corresponds to an original token (e.g., it is the result of evaluating an expression), then this routine will return NULL; otherwise it will return the pointer to the token string.
PS_AssignIdent(). This routine invokes the registered identifier resolver 400 to assign a value of the specified type to that identifier; it is normally called by plug-ins in the course of their operation.
PS_EvalIdent(). This routine invokes the registered identifier resolver 400 to evaluate the specified identifier, and assign the resulting value to the corresponding parser stack element (replacing the original identifier record); it is normally called by plug-ins in the course of their operation. Unlike all other assignments to parser stack elements, the assignment performed by the resolver 400 when called from this routine does not destroy the original value of the token string, which is still available for use in other plug-in calls. If a resolver 400 wishes to preserve some kind of token number in the record, it should do so in the tag field, which is preserved under most conditions.
PS_SetResolver(), PS_SetPlugIn(). These two functions allow the registration of custom resolver 400 and plug-in functions as described above. Note that when calling a plug-in, the value of 'pluginHint' will be whatever string followed the plug-in specifier in the BNF language syntax (e.g., <@1:2:Arbitrary string>). If this optional string parameter is not specified OR if the 'kPreserveBNFsymbols' option is not specified when creating the parser, 'pluginHint' will be NULL. This capability is very useful when a single plug-in variant is to be used for multiple purposes, each distinguished by the value of 'pluginHint' from the BNF. One special and very powerful form of this, which will be explored in later patents, is for the 'pluginHint' text to be the source for interpretation by an embedded parser that is executed by the plug-in itself.
PS_SetLineFinder(). This function sets the line-finder function for a given parser database. Line-finder functions are only required when a language may contain embedded end-of-line characters in string or character constants; otherwise the default line-finder algorithm is sufficient.
PS_SetContextID(), PS_GetContextID(). The set function may be called just once for a given parser database and sets the value for the 'aContextID' parameter that will be passed to all subsequent resolver 400 and plug-in calls, and which is returned by the function PS_GetContextID(). The context ID value may be used by the parser application for whatever purpose it requires; it effectively serves as a global common to all calls related to a particular instance of the parser. Obviously, an application may choose to use this value as a pointer to additional storage.
PS_AbortParse(). This function can be called from a resolver 400 or plug-in to abort a parse that is in progress.
PS_GetSourceContext(). This function can be used to obtain the original source string base address as well as the offset within that string corresponding to the current token pointer. This capability may be useful in cases where parser 205 recognizers or plug-ins need to see multiple lines of source text in order to operate.
PS_GetTokenState(), PS_SetTokenState(). These routines are provided to allow a resolver 400 function to alter the sequence of tokens appearing at the input stream of the parser 205. This technique is very powerful in that it allows the grammar to be extended in arbitrary and non-context-free ways. Callers of these functions should make sure that they set all three token descriptor fields to the correct values to accomplish the behavior they require. Note also that if resolver 400 functions are going to actually edit the input text (via the token pointer), they should be sure that the source string passed to PS_Parse() 205 is not pointing to a constant string but is actually in a handle for which source modification is permissible. The judicious use of token modification in this manner is key to the present invention's ability to extend the language set that can be handled far beyond LL(1).
PS_SetFlags(), PS_ClrFlags(), PS_GetFlags(). The first two routines set or clear flag bits in the parser's flag word. PS_GetFlags() returns the whole flags word. These flags may be used by resolver 400 and plug-in functions to maintain state information.
PS_GetIntegerStackValue(), PS_GetRealStackValue(). These functions obtain an integer or real value from the parse evaluation stack 215.
PS_Sprintf(). This function implements a standard C library sprintf() capability within a parser 205 for use by embedded languages where the arguments to PS_Sprintf() are obtained from the parser evaluation stack 215. This function is simply provided as a convenience for implementing this common feature.
PS_Parse(). This function parses an input string according to the grammar provided, as set forth above. Sample code illustrating one embodiment of this function is also provided in Appendix A.
The foregoing description of the preferred embodiments of the invention has been presented for the purposes of illustration and description. For example, the term "parser" throughout this description is addressed as it is currently used in the computer arts related to compiling. This term should not be narrowly construed to only apply to compilers or related technology, however, as the method and system could be used to enhance any sort of data management system. The descriptions of the header structures should also not be limited to the embodiments described. While the sample code provides examples of the code that may be used, the plurality of implementations that could in fact be developed is nearly limitless. For these reasons, this description is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.
APPENDIX A

    void PS_SetParserTag (              // Sets the tag value for a parser
        ET_ParseHdl aParseDB,           // IO:handle to parser DB
        unsInt64    tag,                // I:Tag value
        int32       tagNum              // I:Tag number (0..7)
    );                                  // R:void

    unsInt64 PS_GetParserTag (          // Gets the tag value for a parser
        ET_ParseHdl aParseDB,           // IO:handle to parser DB
        int32       tagNum              // I:Tag number (0..7)
    );                                  // R:current tag value

    Boolean PS_Pop (                    // Pop the parsing stack
        ET_ParseHdl aParseDB            // IO:handle to parser DB
    );                                  // R:FALSE if no pop (TOS reached)

    Boolean PS_Push (                   // Push the parsing stack
        ET_ParseHdl aParseDB            // IO:handle to parser DB
    );                                  // R:TRUE for success, FALSE otherwise

    int32 PS_PushParserState (          // Push current parser state
        ET_ParseHdl aParseDB            // I:handle to parser DB
    );                                  // R:stack depth after push

    int32 PS_PopParserState (           // Pop current parser state
        ET_ParseHdl aParseDB,           // I:handle to parser DB
        Boolean     disgard             // I:TRUE to discard the value popped
    );                                  // R:stack depth after pop

    int32 PS_ParseStackElem (           // Get contents of parser stack element
        ET_ParseHdl aParseDB,           // I:handle to parser DB
        int32       element             // I:Element required (0 = TOS,1...)
    );                                  // R:top of parser stack

    void PS_PopTopOfParseStack (        // Pop parser stack
        ET_ParseHdl aParseDB,           // I:handle to parser DB
        ET_PStackElem* popped           // IO:If !NULL holds popped elements
    );                                  // R:void

    void PS_PushTopOfParseStack (       // Push parser stack
        ET_ParseHdl aParseDB,           // I:handle to parser DB
        ET_PStackElem* pushed           // I:Element(s) to be pushed
    );                                  // R:void

    Boolean PS_WillPopParseStack (      // Would token 'pop' the parser stack?
        ET_ParseHdl aParseDB,           // I:handle to parser DB
        int32       element,            // I:Stack element (0 = TOS,1...etc.)
        int32       terminal            // I:The terminal to be tested
    );                                  // R:TRUE if token will pop TOS

    Boolean PS_IsLegalToken (           // Token in FIRST(production)?
        ET_ParseHdl aParseDB,           // I:handle to parser DB
        int32       nonTerminal,        // I:The non-terminal to be considered
        int32       terminal            // I:The terminal to be tested
    );                                  // R:TRUE if it is, FALSE if not

    int32 PS_GetProduction (            // Get a production
        ET_ParseHdl aParseDB,           // I:handle to parser DB
        int32       nonTerminal,        // I:The non-terminal to be considered
        int32       terminal,           // I:The terminal to be tested
        ET_PStackElem** production      // O:Holds production on exit (if any)
    );                                  // R:# of elements in 'production'

    Boolean PS_IsPostFixOperator (      // Match postfix operator?
        ET_ParseHdl aParseDB,           // I:handle to parser DB
        int32       element,            // I:Stack element (0 = TOS,1 etc.)
        charPtr     aPostfixOperator,   // I:Postfix operator string to match
        charPtr     aHint               // IO:holds hint string if present
    );                                  // R:TRUE for a match, FALSE otherwise
    ET_ParseHdl PS_MakeDB (             // Make a predictive parser database
        charPtr     bnf,                // I:C string specifying grammar's BNF
        ET_LexHdl   langLex,            // I:Target language lex (from LX_MakeDB)
        int32       options,            // I:Various configuration options
        int32       parseStackSize,     // I:Max. depth of parser stack, 0=default
        int32       evalStackSize       // I:Max. depth of evaluation stack, 0=default
    );                                  // R:handle to created DB, 0 on failure
    int32 PS_Parse (                    // Parse an input string according to grammar
        ET_ParseHdl aParseDB,           // IO:handle to parser DB from PS_MakeDB
        charPtr     aString             // I:String to be parsed
    )                                   // R:0 success, -1 fail, 1 if aborted
    {
        int32           tok, size, i, x;
        unsigned short  tmp;
        Boolean         accept, wasStartL, ateSome;
        charPtr         tString;
        charHdl         c;
        unsChar         subs;
        unsInt16        subsI;
        ET_ParseTblPtr  ptr;
        short           jmp;
        ET_LexHdl       lH;

        tString = 0;
        initialize fields 'nptr','str','baseString' of parser to aString
        lH = lexical analyzer associated with parser
    ps_re_start:
        allocate parseStack and evalStack if not already allocated
        while ( PS_Pop(aParseDB) ) ;    // loop till evaluation stack is empty
        initialize parseStack to kRootNonTerminal;
        if ( kRestartParser flag !set )
        {
            lineCount = 0;              // on a parser re-start we do not alter lineptr; if a plug-in
            tokCount  = 0;              // wants to force a new line pointer it must do it with
            if ( !tString )             // PS_SetTokenState
            {
                allocate tString;
                lineBuffer = tString;
            }
            tString[0] = '\0';
            lineptr   = tString;
            tokptr    = tString;
            bkLineptr = tString;
        }
        clear kerrDetect+kbkFlag flags  // clear the flags that might have caused a restart
        toksize = 0;
        bkPtr   = 0;
        bkTop   = 0;
        loop ( till end of string, error or aborted )
        {
            if ( kRestartParser flag !set )
            {
                if ( lineFinder function present )  // a custom line finder?
                    tString = call it
                else
                    tString = extract one line from string
                reset tokPtr, linePtr, tokSize
                lineCount++;
            }
            clear kRestartParser flag
            tok = 1;
            wasStartL = YES;
            while ( tok && !(kerrDetect or kAbortTheParse flags) )
            {                           // repeat till end of line or error
                ...
                while ( !accept && !(kerrDetect+kAbortTheParse set) )
                {                       // now keep going till we accept the token or die
                    tmp = parseStack[top]           // examine top of the parser stack
                    if ( tmp >= kMaxTerminalSym )   // it's a non-terminal symbol so
                    {                               // keep resolving productions
                        subs = (tmp & 0xFF);        // lop off MS byte for rapid testing
                        ptr  = search for appropriate parsing record
                        if ( ptr->tofStack == subs )    // got a record for this token/TOF stack
                        {
                            subsI = ptr->substIndex;
                            if ( !subsI )
                                pop parsing stack (but don't accept token)
                            else if ( subsI == 0xFFFF )
                                pop parsing stack & do accept token
                            else
                                push new prodn onto stack
                            if ( PS_IsPlugIn(parseStack[top]) )
                                goto eat_plugins;   // we exposed a plugin, munch
                        }                           // it BEFORE we fetch the next token!
                        else if ( flags & kbkFlag ) // alternative production so back
                        {                           // up parser and lexical analyzer
                            PS_Restore();           // and re-start down alternative parse
                            goto ps_re_try;         // tree
                        }
                        else
                            report syntax error
                    }
                    else if ( PS_IsPlugIn(parseStack[top]) )
                    {   // we hit a plug-in reverse polish operator so do it
    eat_plugins:
                        jmp = PS_EatPlugins(aParseDB,tString,...);
                        if      ( jmp == kJumpReTry )    goto ps_re_try;
                        else if ( jmp == kJumpReStart )  goto ps_re_start;
                        else if ( jmp == kJumpPopState ) goto ps_pop_state;
                    }
                    else if ( tok == tmp )      // top of stack terminal & matches
                    {                           // input so pop and accept
                        accept = PS_Accept(aParseDB,tok);
                        if ( PS_IsPlugIn(parseStack[top]) )
                            goto eat_plugins;   // we exposed a plugin, munch it!
                    }
                    else if ( flags & kbkFlag ) // parser back-up alternative available
                    {
                        PS_Restore();           // and re-start alternative parse tree
                        goto ps_re_try;         // NB: backup is within one line
                    }
                    else
                        report syntax error
                }
                if ( tok == kEndOfLineToken )   // we forced an <eoln> token so now get
                    tok = 0;                    // rid of it, zero is true end of line
            }
        }
        set kReachedEndOfString flag;
        if ( flags & kerrDetect )       // a lower level error occured which we
        {                               // didn't report yet so force a PS_Parse
            report internal error       // error so he can see what line failed etc.
        }
        else if ( flags & kAbortTheParse )
            ...
                                        // reached EOF so force <endf> token and continue
        do                              // parse until stack exposes a matching <endf>
        {                               // same logic as above but no backups,
            toknum = tok = kEndOfFileToken;     // pop states etc...
            if ( resolver )
                ...
APPENDIX 1

SYSTEM AND METHOD FOR ANALYZING DATA
Inventor: John Fairweather
BACKGROUND OF THE INVENTION
Lexical analyzers are generally used to scan sequentially through a sequence or "stream" of characters that is received as input and return a series of language tokens to the parser. A token is simply one of a small number of values that tells the parser what kind of language element was encountered next in the input stream. Some tokens have associated semantic values, such as the name of an identifier or the value of an integer. For example, if the input stream were:
dst = src + dst->moveFrom
After passing through the lexical analyzer, the stream of tokens presented to the parser might be:
(tok=1, string="dst")      - i.e., 1 is the token for identifier
(tok=100, string="=")
(tok=1, string="src")
(tok=101, string="+")
(tok=1, string="dst")
(tok=102, string="->")
(tok=1, string="moveFrom")
To implement a lexical analyzer, one must first construct a Deterministic Finite Automaton (DFA) from the set of tokens to be recognized in the language. The DFA is a kind of state machine that tells the lexical analyzer, given its current state and the current input character in the stream, what new state to move to. A finite state automaton is deterministic if it has no transitions on input ε (epsilon) and, for each state S and symbol A, there is at most one edge labeled A leaving S. In the present art, a DFA is constructed by first constructing a Non-deterministic Finite Automaton (NFA). Following construction of the NFA, the NFA is converted into a corresponding DFA. This process is covered in more detail in most books on compiler theory.
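For purposes of illustration only, the following self-contained C fragment sketches the table-driven transition idea that underlies any DFA-based analyzer. The dense two-dimensional table, the state numbering, and the tiny automaton for the keyword "dog" are illustrative assumptions, not part of the invention:

    #include <stdio.h>

    #define N_STATES 4
    #define N_CHARS  256
    #define REJECT   -1

    /* dfa[state][ch] holds the next state, or REJECT if no edge exists.  */
    /* A tiny DFA that accepts only "dog": 0 -d-> 1 -o-> 2 -g-> 3.        */
    static int dfa[N_STATES][N_CHARS];
    static int accepting[N_STATES] = { 0, 0, 0, 1 };

    static int run_dfa ( const char *s )
    {
        int state = 0;
        for ( ; *s && state != REJECT; s++ )     /* one transition per char */
            state = dfa[state][(unsigned char)*s];
        return state != REJECT && accepting[state];
    }

    int main ( void )
    {
        int i, j;
        for ( i = 0; i < N_STATES; i++ )         /* default: no transition  */
            for ( j = 0; j < N_CHARS; j++ )
                dfa[i][j] = REJECT;
        dfa[0]['d'] = 1; dfa[1]['o'] = 2; dfa[2]['g'] = 3;
        printf("dog    -> %d\n", run_dfa("dog"));    /* prints 1 */
        printf("doctor -> %d\n", run_dfa("doctor")); /* prints 0 */
        return 0;
    }

Note that the table is almost entirely REJECT entries; it is exactly this sparseness, multiplied across a real language, that motivates the two-table approach described below.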
In Figure 1, a state machine that has been programmed to scan all incoming text for any occurrence of the keywords "dog", "cat", and "camel", while passing all other words through unchanged, is shown. The NFA begins at the initial state (0). If the next character in the stream is 'd', the state moves to 7, which is a non-accepting state. A non-accepting state is one in which only part of the token has been recognized, while an accepting state represents the situation in which a complete token has been recognized. In Figure 1, accepting states are denoted by the double border. From state 7, if the next character is 'o', the state moves to 8. This process then repeats for each successive character in the stream. If the lexical analyzer is in an accepting state when either the next character in the stream does not match or the input stream terminates, then the token for that accepting state is returned. Note that since "cat" and "camel" both start with "ca", the analyzer state is "shared" for both possible lexemes. By sharing the state in this manner, the lexical analyzer does not need to examine each complete string for a match against all possible tokens, thereby reducing the search space by roughly a factor of 26 (the number of letters in the alphabet) as each character of the input is processed. If at any point the next input character does not match any of the possible transitions from a given state, the analyzer reverts to state 10, which will accept any other word (represented by the dotted lines in the figure). For example, if the input word were "doctor", the state would reach 8 and then there would be no valid transition for the 'c' character, resulting in taking the dotted-line path (i.e., any other character) to state 10. As will be noted from the definition above, this state machine is an NFA, not a DFA. This is because from state 0, for the characters 'c' and 'd', there are two possible paths: one directly to state 10, and the others to the beginnings of "dog" and "cat". Thus we violate the requirement that there be one and only one transition for each state-character pair in a DFA.
Implementation of the state diagram set forth in Figure 1 in software would be very inefficient. This is in part because, for any non-trivial language, the analyzer table would need to be very large in order to accommodate all the "dotted line" transitions. A standard algorithm, often called 'subset construction', is used to convert an NFA to a corresponding DFA. One of the problems with this algorithm is that, in the worst-case scenario, the number of states in the resulting DFA can be exponential in the number of NFA states. For these reasons, the ability to construct languages and parsers for complex languages on the fly is needed. Additionally, because lexical analysis occurs so pervasively, and often on many systems, lexical analyzer generation and operation needs to be more efficient.
SUMMARY OF INVENTION
The following system and method provides the ability to construct lexical analyzers on the fly in an efficient and pervasive manner. Rather than using a single DFA table and a single method for lexical analysis, the present invention splits the table describing the automata into two distinct tables and splits the lexical analyzer into two phases, one for each table. The two phases consist of a single transition algorithm and a range transition algorithm, both of which are table driven and, by eliminating the need for NFA to DFA conversion, permit the dynamic modification of those tables during operation. A third 'entry point' table may also be used to speed up the process of finding the first table element from state 0 for any given input character (i.e., states 1 and 7 in Figure 1). This third table is merely an optimization and is not essential to the algorithm. The two tables are referred to as the 'onecat' table and the 'catrange' table. The onecat table includes records, of type "ET_onecat", that include a flag field, a catalyst field, and an offset field. The catalyst field of an ET_onecat record specifies the input stream character to which the record relates. The offset field contains the positive (possibly scaled) offset to the next record to be processed as part of recognizing the stream. Thus the 'state' of the lexical analyzer in this implementation is actually represented by the current 'onecat' table index. The 'catrange' table consists of an ordered series of records of type ET_CatRange, with each record having the fields 'lstat' (representing the lower bound of starting states), 'hstat' (representing the upper bound of starting states), 'lcat' (representing the lower bound of catalyst character), 'hcat' (representing the upper bound of catalyst character) and 'estat' (representing the ending state if the transition is made).
The method of the present invention begins when the analyzer first loops through the 'onecat' table until it reaches a record with a catalyst character of 0, at which time the 'offset' field holds the token number recognized. If this is not the final state after the loop, the lexical analyzer has failed to recognize a token using the 'onecat' table and must now re-process the input stream using the 'catrange' table. The lexical analyzer loops, re-scanning the 'catrange' table from the beginning for each input character, looking for a transition where the initial analyzer state lies between the 'lstat' and 'hstat' bounds and the input character lies between the 'lcat' and 'hcat' bounds. If such a transition is found, the analyzer moves to the new state specified by 'estat'. If the table runs out (denoted by a record with 'lstat' set to 255) or the input string runs out, the loop exits.
The invention also provides a built-in lexical analyzer generator to create the catrange and onecat tables. By using a two-table approach, the generation phase is extremely fast but more importantly, it can be incremental, meaning that new symbols can be added to the analyzer while it is running. This is a key difference over conventional approaches because it opens up the use of the lexical analyzer for a variety of other purposes that would not normally be possible. The two-phase approach of the present invention also provides significant advantages over standard techniques in terms of performance and flexibility when implemented in software, however, more interesting applications exist when one considers the possibility of a hardware implementation. As further described below, this invention may be implemented in hardware, software, or both.
BRIEF DESCRIPTION OF THE FIGURES
Figure 1 illustrates a sample non-deterministic finite automaton.
Figure 2 illustrates a sample ET_onecat record using the C programming language.
Figure 3 illustrates a sample ET_catrange record using the C programming language.
Figure 4 illustrates a state diagram representing a directory tree.
Figure 5 illustrates a sample structure for a recognizer DB.
Figure 6 illustrates a sample implementation of the Single Transition Module.
Figure 7 illustrates the operation of the Single Transition Module.
Figure 8 illustrates a logical representation of a Single Transition Module implementation.
Figure 9 illustrates a sample implementation of the Range Transition Module.
Figure 10 illustrates a complete hardware implementation of the Single Transition Module and the Range Transition Module.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
The following description of the invention references various C programming code examples that are intended to clarify the operation of the method and system. This is not intended to limit the invention as any number of programming languages or implementations may be used.
The present invention provides an improved method and system for performing lexical analysis on a given stream of input. The present invention comprises two distinct tables that describe the automata and splits the lexical analyzer into two phases, one for each table. The two phases consist of a single transition algorithm and a range transition algorithm. A third 'entry point' table may also be used to speed up the process of finding the first table element from state 0 for any given input character (i.e., states 1 and 7 in Figure 1). This third table is merely an optimization and is not essential to the algorithm. The two tables are referred to as the 'onecat' table and the 'catrange' table.
Referring now to Figure 2, programming code illustrating a sample ET_onecat record 200 is provided. The 'onecat' table is a true DFA and describes single-character transitions via a series of records of type ET_onecat 200. A variety of specialized flag definitions exist for the flags field 210 but, for the purposes of clarity, only 'kLexJump' and 'kNeedDelim' will be considered. The catalyst field 205 of an ET_onecat record 200 specifies the input stream character to which the record relates. The offset field 215 contains the positive (possibly scaled) offset to the next record to be processed as part of recognizing the stream. Thus the 'state' of the lexical analyzer in this implementation is actually represented by the current 'onecat' table index. For efficiency, the various 'onecat' records may be organized so that, for any given starting state, all possible transition states are ordered alphabetically by catalyst character.
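Figure 2 itself is not reproduced here. Purely for illustration, the three fields just described might be rendered in C as follows; the field widths, the flag bit values, and the int32 typedef are assumptions, not part of the figure:

    typedef int           int32;     /* as used elsewhere in this document    */
    typedef unsigned char unsChar;

    #define kLexJump   0x01          /* assumed bit: a jump alternative exists */
    #define kNeedDelim 0x02          /* assumed bit: token must end at a delim */

    typedef struct ET_onecat         /* one single-character transition (200)  */
    {
        unsChar flags;               /* flag field 210 (kLexJump, kNeedDelim)  */
        char    catalyst;            /* catalyst character 205; 0 = accepting  */
        int32   offset;              /* offset field 215: jump distance, or    */
    } ET_onecat;                     /* token number in an accepting record    */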
The basic algorithm for the first phase of the lexical analyzer, also called the onecat algorithm, is provided. The algorithm begins by looping through the 'onecat' table (not shown) until it reaches a record with a catalyst character of 0, at which time the 'offset' field 215 holds the token number recognized. If this is not the final state after the loop, the algorithm has failed to recognize a token using the 'onecat' table and the lexical analyzer must now re-process the input stream from the initial point using the 'catrange' table.
    ch  = *ptr;                          // 'ptr' points at the input stream
    tbl = &onecat[entryPoint[ch]];       // initialize using 3rd (entry point) table
    for ( done = NO ;; )
    {
        tch   = tbl->catalyst;
        state = tbl->flags;
        if ( !*ptr ) done = YES;         // oops! the source string ran out!
        if ( tch == ch )                 // if 'ch' matches catalyst char
        {                                // match found, increment to next
            if ( done ) break;           // exit if past the terminating NULL
            tbl++;                       // increment table pointer if char accepted
            ptr++;                       // in the input stream
            ch = *ptr;
        }
        else if ( tbl->flags & kLexJump )
            tbl += tbl->offset;          // there is a jump alternative available
        else
            break;                       // no more records, terminate loop
    }
    match = !tch && (*ptr is a delimiter || !(state & (kNeedDelim+kLexJump)));
    if ( match )
        return tbl->offset;              // on success, offset field holds token#
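The fragment above is abbreviated pseudocode. For purposes of illustration only, the following self-contained program shows the same phase-one loop running against a small hand-built table that recognizes "cat" (token 1) and "cow" (token 2). The table contents, the flag bit values, and the delimiter test are illustrative assumptions:

    #include <stdio.h>

    #define kLexJump   0x01
    #define kNeedDelim 0x02

    typedef struct { unsigned char flags; char catalyst; int offset; } ET_onecat;

    static ET_onecat onecat[] = {
        /*0*/ { 0,          'c', 0 },
        /*1*/ { kLexJump,   'a', 3 },   /* no 'a'? jump +3 to the 'o' branch */
        /*2*/ { 0,          't', 0 },
        /*3*/ { kNeedDelim,  0,  1 },   /* accepting record: token 1 = "cat" */
        /*4*/ { 0,          'o', 0 },
        /*5*/ { 0,          'w', 0 },
        /*6*/ { kNeedDelim,  0,  2 },   /* accepting record: token 2 = "cow" */
        /*7*/ { 0,          1,   0 },   /* sentinel: fails for any input     */
    };
    static int entryPoint[256];         /* the optional third table          */

    static int is_delim ( char c ) { return c == 0 || c == ' ' || c == '.'; }

    static int lex_onecat ( const char *ptr )
    {
        const ET_onecat *tbl = &onecat[entryPoint[(unsigned char)*ptr]];
        char ch = *ptr, tch = 0;
        unsigned char state = 0;
        int done = 0, match;

        for ( ;; )
        {
            tch   = tbl->catalyst;
            state = tbl->flags;
            if ( !*ptr ) done = 1;               /* source string ran out    */
            if ( tch == ch )
            {
                if ( done ) break;               /* past the terminating NUL */
                tbl++; ptr++; ch = *ptr;         /* accept the character     */
            }
            else if ( tbl->flags & kLexJump )
                tbl += tbl->offset;              /* try the alternative      */
            else
                break;                           /* no more records          */
        }
        match = !tch && (is_delim(*ptr) || !(state & (kNeedDelim + kLexJump)));
        return match ? tbl->offset : -1;         /* offset holds the token#  */
    }

    int main ( void )
    {
        int i;
        for ( i = 0; i < 256; i++ ) entryPoint[i] = 7;   /* default: fail    */
        entryPoint['c'] = 0;
        printf("cat  -> %d\n", lex_onecat("cat"));       /* prints 1         */
        printf("cow. -> %d\n", lex_onecat("cow."));      /* prints 2         */
        printf("cab  -> %d\n", lex_onecat("cab"));       /* prints -1        */
        return 0;
    }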
Referring now to Figure 3, sample programming code for creating an ET_CatRange record 300 is shown. The 'catrange' table (not shown) consists of an ordered series of records of type ET_CatRange 300. In this implementation, records of type ET_CatRange 300 include the fields 'lstat' 305 (representing the lower bound of starting states), 'hstat' 310 (representing the upper bound of starting states), 'lcat' 315 (representing the lower bound of catalyst character), 'hcat' 320 (representing the upper bound of catalyst character) and 'estat' 325 (representing the ending state if the transition is made). These are the minimum fields required but, as described above, any number of additional fields or flags may be incorporated.
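Figure 3 is likewise not reproduced here. An illustrative C rendering of the five fields just described follows; the field order and widths are assumptions:

    typedef struct ET_CatRange   /* one range transition (300)                 */
    {
        unsigned char lstat;     /* 305: lower bound of starting states        */
                                 /*      (255 marks the end of the table)      */
        unsigned char hstat;     /* 310: upper bound of starting states        */
        char          lcat;      /* 315: lower bound of catalyst character     */
        char          hcat;      /* 320: upper bound of catalyst character     */
        unsigned char estat;     /* 325: ending state if transition is made    */
    } ET_CatRange;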
A sample code implementation of the second phase of the lexical analyzer algorithm, also called the catrange algorithm, is set forth below.
    tab = tab1 = &catRange[0];
    state = 0;
    ch = *ptr;
    for (;;)
    {                                     // 'lstat' byte = 255 ends table
        if ( tab->lstat == 255 )
            break;
        else if ( ( tab->lstat <= state && state <= tab->hstat ) &&
                  ( tab->lcat  <= ch    && ch    <= tab->hcat  ) )
        {                                 // state in range & input char a valid catalyst
            state = tab->estat;           // move to final state specified
            ptr++;                        // accept character
            ch = *ptr;
            if ( !ch ) break;             // whoops! the input string ran out
            tab = tab1;                   // start again at beginning of table
        }
        else
            tab++;                        // move to next record if not at end
    }
    if ( state > maxAccState || (*ptr not a delimiter && *(ptr-1) not a delimiter) )
        return bad token error
    return state
As the code above illustrates, the process begins by looping and re-scanning the 'catrange' table from the beginning for each input character, looking for a transition where the initial analyzer state lies between the 'lstat' 305 and 'hstat' 310 bounds and the input character lies between the 'lcat' 315 and 'hcat' 320 bounds. If such a transition is found, the analyzer moves to the new state specified by 'estat' 325. If the table runs out (denoted by a record with 'lstat' set to 255) or the input string runs out, the loop exits. In the preferred embodiment, a small number of tokens will be handled by the 'catrange' table (such as numbers, identifiers, strings, etc.) since the reserved words of the language to be tokenized will be tokenized by the 'onecat' phase. Thus, the lower state values (i.e., <64) could be reserved as accepting while states above that would be considered non-accepting. This boundary line is specified for a given analyzer by the value of 'maxAccState' (not shown).
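For illustration only, the following self-contained sketch runs this phase-two loop over two rows taken from the C-language table set forth below; the rejection of state 0, the delimiter set, and the use of the final state directly as the token number are assumptions made to keep the sketch short:

    #include <stdio.h>

    typedef struct {
        unsigned char lstat, hstat;      /* starting-state range; 255 ends table */
        char          lcat,  hcat;       /* catalyst character range             */
        unsigned char estat;             /* ending state if transition is made   */
    } ET_CatRange;

    /* two rows from the table below: digits 1-9 start a decimal number        */
    /* (state/token 3) and digits 0-9 continue it                              */
    static ET_CatRange catRange[] = {
        {   0, 0, '1', '9', 3 },
        {   3, 3, '0', '9', 3 },
        { 255, 0,  0,   0,  0 },
    };

    enum { maxAccState = 63 };           /* states <= 63 are accepting (assumed) */

    static int is_delim ( char c ) { return c == 0 || c == ' '; }

    static int lex_catrange ( const char *ptr )
    {
        const ET_CatRange *tab = catRange;
        unsigned char state = 0;
        char ch = *ptr;

        for (;;)
        {
            if ( tab->lstat == 255 ) break;              /* end of table        */
            if ( tab->lstat <= state && state <= tab->hstat &&
                 tab->lcat  <= ch    && ch    <= tab->hcat )
            {
                state = tab->estat;                      /* take the transition */
                ptr++; ch = *ptr;                        /* accept character    */
                if ( !ch ) break;                        /* input ran out       */
                tab = catRange;                          /* re-scan from start  */
            }
            else
                tab++;
        }
        if ( !state || state > maxAccState || !is_delim(*ptr) )
            return -1;                                   /* no token recognized */
        return state;                                    /* here state = token# */
    }

    int main ( void )
    {
        printf("123 -> %d\n", lex_catrange("123"));  /* 3 = decimal number      */
        printf("0x1 -> %d\n", lex_catrange("0x1"));  /* -1 with this tiny table */
        return 0;
    }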
To illustrate the approach, the table specification below is sufficient to recognize all required 'catrange' symbols for the C programming language:
0   1   1    a  z   <eol>  1 = identifier
0   1   1           <eol>  more identifier
1   1   1    0  9   <eol>  more identifier
0   0   100  '  '   <eol>  ' begins character constant
100 100 101  \  \   <eol>  a \ begins character escape sequence
101 102 102  0  7   <eol>  numeric character escape sequence
101 101 103  x  x   <eol>  hexadecimal numeric character escape sequence
103 103 103  a  f   <eol>  more hexadecimal escape sequence
103 103 103  0  9   <eol>  more hexadecimal escape sequence
100 100 2    '  '   <eol>  ' terminates the character sequence
102 103 2    '  '   <eol>  you can have multiple char constants
100 103 100         <eol>  2 = character constant
0   0   10   0  0   <eol>  10 = octal constant
10  10  10   0  7   <eol>  more octal constant
0   0   3    1  9   <eol>  3 = decimal number
3   3   3    0  9   <eol>  more decimal number
0   0   110  .  .   <eol>  start of fp number
3   3   4    .  .   <eol>  4 = floating point number
10  10  4    .  .   <eol>  change octal constant to fp #
4   4   4    0  9   <eol>  more fp number
110 110 4    0  9   <eol>  more fp number
3   4   111  e  e   <eol>  5 = fp number with exponent
10  10  111  e  e   <eol>  change octal constant to fp #
111 111 5    0  9   <eol>  more exponent
111 111 112  +  +   <eol>  more exponent
0   0   0    \  \   <eol>  continuation that does not belong to anything
111 111 112  -  -   <eol>  more exponent
112 112 5    0  9   <eol>  more exponent
5   5   5    0  9   <eol>  more exponent
4   5   6    f  f   <eol>  6 = fp number with optional float marker
4   5   6    l  l   <eol>  more float marker
10  10  120  x  x   <eol>  beginning hex number
120 120 7    0  9   <eol>  7 = hexadecimal number
120 120 7    a  f   <eol>  more hexadecimal
7   7   7    0  9   <eol>  more hexadecimal
7   7   7    a  f   <eol>  more hexadecimal
7   7   8    l  l   <eol>  8 = hex number with L or U specifier
7   7   8    u  u   <eol>
3   3   9    l  l   <eol>  9 = decimal number with L or U specifier
3   3   9    u  u   <eol>
10  10  11   l  l   <eol>  11 = octal constant with L or U specifier
10  10  11   u  u   <eol>
0   0   130  "  "   <eol>  begin string constant...
130 130 12   "  "   <eol>  12 = string constant
130 130 13   \  \   <eol>  13 = string const with line continuation '\'
13  13  131  0  7   <eol>  numeric character escape sequence
131 131 131  0  7   <eol>  numeric character escape sequence
13  13  132  x  x   <eol>  hexadecimal numeric character escape sequence
131 132 12   "  "   <eol>  end of string
13  13  130         <eol>  anything else must be char or escape char
132 132 132  a  f   <eol>  more hexadecimal escape sequence
132 132 132  0  9   <eol>  more hexadecimal escape sequence
130 132 130         <eol>  anything else is part of the string
In this example, the 'catrange' algorithm would return token numbers 1 through 13 to signify recognition of various C language tokens. In the listing above (which is actually valid input to the associated lexical analyzer generator), the five fields correspond to the 'lstat' 305, 'hstat' 310, 'estat' 325, 'lcat' 315 and 'hcat' 320 fields of the ET_CatRange record 300. This is a very compact and efficient representation of what would otherwise be a huge number of transitions in a conventional DFA table. The use of ranges in both state and input character allows us to represent large numbers of transitions by a single table entry. For example, the input "0x1f" is recognized as token 7 (hexadecimal number): the leading '0' moves state 0 to state 10, the 'x' moves state 10 to state 120, and each following hexadecimal digit moves to (and stays in) accepting state 7. The fact that the table is re-scanned from the beginning each time is important for ensuring that correct recognition occurs by arranging the table elements appropriately. By using this two-pass approach, we have trivially implemented all the dotted-line transitions shown in the initial state machine diagram as well as eliminated the need to perform the NFA to DFA transformation. Additionally, since the 'onecat' table can ignore the possibility of multiple transitions, it can be optimized for speed to a level not attainable with the conventional NFA-to-DFA approach.
The present invention also provides a built-in lexical analyzer generator to create the tables described. 'Catrange' tables are specified in the format provided in Figure 3, while 'onecat' tables may be specified via application programming interface ("API") calls or simply by specifying a series of lines of the form provided below.
[ token# ] tokenString [ . ]
As shown above, in the preferred embodiment, a first field is used to specify the token number to be returned if the symbol is recognized. This field is optional, however, and other default rules may be used. For example, if this field is omitted, the last token number + 1 may be used instead. The next field is the token string itself, which may be any sequence of characters, including whitespace. Finally, if the trailing period is present, this indicates that the 'kNeedDelim' flag (the flags word bit for "needs delimiter", as illustrated in Figure 2) is false; otherwise it is true.
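For illustration only, the following generator input (the token numbers and strings are arbitrary examples, not part of the invention) would define two keywords that must be followed by a delimiter and two operator symbols, marked with the trailing period, that need not be:

    1 if
    2 while
    100 ->   .
    101 ++   .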
Because of the two-table approach, this generation phase is extremely fast. More importantly, however, the two-table approach can be incremental; that is, new symbols can be added to the analyzer while it is running. This is a key difference over conventional approaches because it opens up the use of the lexical analyzer for a variety of other purposes that would not normally be possible. For example, in many situations there is a need for a symbolic registration database wherein other programming code can register items identified by a unique 'name'. In the preferred embodiment, such registries are implemented by dynamically adding the symbol to a 'onecat' table, and then using the token number to refer back to whatever was registered along with the symbol, normally via a pointer. The advantage of this approach is the speed with which both the insertion and the lookup can occur. Search time in the registry is also dramatically improved over standard searching techniques (e.g., binary search). Specifically, search time efficiency (the "Big O" efficiency) to look up a given word is proportional to the log (base N) of the number of characters in the token, where 'N' is the number of different ASCII codes that exist in significant proportions in the input stream. This is considerably better than standard search techniques. Additionally, the trivial nature of the code needed to implement a lookup registry, and the fact that no structure or code needs to be designed for insertion, removal and lookup, make this approach very convenient.
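A minimal sketch of such a registry follows, for illustration only; the LX_Add() and LX_Lex() parameter lists shown are assumptions, since only the routine names appear in this description:

    static void *registry[4096];          /* token# -> registered item         */
    static int32 nextToken = 1;

    void RG_Register ( ET_LexHdl theDB, charPtr name, void *item )
    {
        int32 tok = nextToken++;
        LX_Add(theDB, name, tok);         /* add symbol to the 'onecat' table  */
        registry[tok] = item;             /* token# refers back to the item    */
    }

    void *RG_Lookup ( ET_LexHdl theDB, charPtr name )
    {
        int32 tok = LX_Lex(theDB, name);  /* recognize the symbol              */
        return ( tok > 0 ) ? registry[tok] : 0;
    }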
In addition to its use in connection with flat registries, this invention may also be used to represent, look up, and navigate through hierarchical data. For example, it may be desirable to 'flatten' a complete directory tree listing with all files within it for transmission to another machine. This could be easily accomplished by iterating through all files and directories in the tree and adding the full file path to the lexical analyzer database of the present invention. The output of such a process would be a table in which all entries in the table were unique and all entries would be automatically ordered and accessible as a hierarchy.
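Continuing the illustration (and again assuming the LX_Add() signature used in the sketch above), flattening the directory tree described below with reference to Figure 4 reduces to adding each full path as a symbol:

    static charPtr paths[] = { "A:", "A:B:", "A:C:", "A:C:F1", "A:C:F3",
                               "A:F1", "A:F2", 0 };

    void loadTree ( ET_LexHdl theDB )
    {
        int32 i;
        for ( i = 0; paths[i]; i++ )       /* add each full file/folder path;  */
            LX_Add(theDB, paths[i], i+1);  /* the DB keeps entries unique and  */
    }                                      /* automatically ordered            */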
Referring now to Figure 4, a state diagram representing a directory tree is shown. The directory tree consists of a directory A containing sub-directories B and C and files F1 and F2, while sub-directory C contains files F1 and F3. A function, LX_List(), is provided to allow alphabetized listing of all entries in the recognizer database. When called successively for the state diagram provided in Figure 4, it will produce the sequence:
"A:", "A:B:", "A:C:", "A:C:F1", "A:C:F3", "A:F1", "A:F2"
Furthermore, additional routines may be used to support arbitrary navigation of the tree. For example, routines could be provided to prune the list (LX_PruneList()), to save the list context (LX_SaveListContext()) and to restore the list context (LX_RestoreListContext()). The routine LX_PruneList() is used to "prune" the list when a recognizer database is being navigated or treated as a hierarchical data structure. In one embodiment, the routine LX_PruneList() consists of nothing more than decrementing the internal token size used during successive calls to LX_List(). The effect of a call to LX_PruneList() is to remove all descendant tokens of the currently listed token from the list sequence. To illustrate the point, assume that the contents of the recognizer DB represent the file/folder tree on a disk and that any token ending in ':' is a folder while those ending otherwise are files. A program could easily be developed to enumerate all files within the folder "Disk:MyFiles:" but not any files contained within lower level folders. For example, the following code demonstrates how the LX_PruneList() routine is used to "prune" any lower level folders as desired:
    tokSize = 256;                                       // set max file path length
    prefix  = "Disk:MyFiles:";
    toknum  = LX_List(theDB,0,&tokSize,0,prefix);        // initialize to start folder path
    while ( toknum != -1 )                               // repeat for all files
    {
        toknum = LX_List(theDB,fName,&tokSize,0,prefix); // list next file name
        if ( toknum != -1 )                              // is it a file or a folder?
            if ( fName[tokSize-1] == ':' )               // it is a folder
                LX_PruneList(theDB);                     // prune it and all its children
            else                                         // it is a file...
                -- process the file somehow
    }
In a similar manner, the routines LX_SaveListContext() and LX_RestoreListContext() may be used to save and restore the internal state of the listing process, as manipulated by successive calls to LX_List(), in order to permit nested/recursive calls to LX_List() as part of processing a hierarchy. These functions are also applicable to other non-recursive situations where a return to a previous position in the listing/navigation process is desired. Taking the recognizer DB of the prior example (which represents the file/folder tree on a disk), the folder tree could be walked non-recursively, processing the files within each folder at every level, simply by handling tokens containing partial folder paths. If a more direct approach is desired, recursion may be used instead. The following code illustrates one direct and simple process for recursing a tree:
    void myFunc ( charPtr folderPath )
    {
        tokSize = 256;                                     // set max file path length
        toknum  = LX_List(theDB,0,&tokSize,0,folderPath);  // initialize to start folder
        while ( toknum != -1 )                             // repeat for all files
        {
            toknum = LX_List(theDB,fName,&tokSize,0,folderPath); // list next file name
            if ( toknum != -1 )                            // is it a file or a folder?
                if ( fName[tokSize-1] == ':' )             // it is a folder
                {
                    sprintf(nuPath,"%s%s",folderPath,fName); // create new folder path
                    tmp = LX_SaveListContext(theDB);       // prepare for recursive listing
                    myFunc(nuPath);                        // recurse!
                    LX_RestoreListContext(theDB,tmp);      // restore listing context
                }
                else                                       // it is a file...
                    -- process the file somehow
        }
    }
These routines are only a few of the routines that could be used in conjunction with the present invention. Those skilled in the art will appreciate that any number of additional routines could be provided to permit manipulation of the DB and lexical analyzer. For example, the following non-exclusive list of additional routines is basic to lexical analyzer use but will not be described in detail, since their implementation may be easily deduced from the basic data structures described above:
LX_Add() - Adds a new symbol to a recognizer table. The implementation of this routine is similar to LX_Lex() except that, when the algorithm reaches a point where the input token does not match, it enters a second loop to append additional blocks to the recognizer table that will cause recognition of the new token.
LX_Sub() - Subtracts a symbol from a recognizer table. This consists of removing or altering table elements in order to prevent recognition of a previously entered symbol.
LX_Set() - Alters the token value for a given symbol. Basically equivalent to a call to LX_Lex() followed by assignment to the table token value at the point where the symbol was recognized.
LX_Init() - Creates a new empty recognizer DB.
LX_KillDB() - Disposes of a recognizer DB.
LX_FindToken() - Converts a token number to the corresponding token string using LX_List().
In addition to the above routines, additional routines and structures within a recognizer DB may be used to handle certain aspects of punctuation and white space that may vary between languages to be recognized. This is particularly true if a non-Roman script system is involved, as is the case for many non-European languages. In order to distinguish between delimiter characters (i.e., punctuation, etc.) and non-delimiters (i.e., alphanumeric characters), the invention may also include the routines LX_AddDelimiter() and LX_SubDelimiter(). When a recognizer DB is first created by LX_Init(), the default delimiters are set to match those used by the English language. This set can then be selectively modified by adding or subtracting the ASCII codes of interest. Whether an ASCII character is a delimiter or not is determined by whether the corresponding bit is set in a bit-array 'Dels' associated with the recognizer DB, and it is this array that is altered by calls to add or subtract an ASCII code. In a similar manner, determining whether a character is white-space is crucial to determining whether a given token should be recognized, particularly where a longer token with the same prefix exists (e.g., "Smith" and "Smithsonian"). For this reason, a second array 'whitespace' is associated with the recognizer DB and is used to add new whitespace characters. For example, an Arabic space character has the ASCII value of the English space plus 128. This array is accessed via the LX_AddDelimiter() and LX_SubDelimiter() functions.
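Purely by way of illustration (the single-character parameter shown is an assumption about the LX_AddDelimiter()/LX_SubDelimiter() interfaces, which are named but not prototyped in this description), configuring a recognizer DB for a script such as Arabic might look like:

    void configureForArabic ( void )
    {
        ET_LexHdl theDB = LX_Init();        /* default 'Dels' matches English  */
        LX_AddDelimiter(theDB, ' ' + 128);  /* Arabic space = English space+128 */
        LX_SubDelimiter(theDB, '.');        /* e.g. stop treating '.' as a     */
    }                                       /* delimiter, if so desired        */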
A sample structure for a recognizer DB 500 is set forth in Figure 5. The elements of the structure 500 are as follows: onecatmax 501 (storing the number of elements in 'onecat'), catrangemax 502 (storing the number of elements in 'catrange'), lexFlags 503 (storing behavior configuration options), maxToken 504 (representing the highest token number in the table), nSymbols 505 (storing the number of symbols in the table), name 506 (the name of the lexical recognizer DB 500), Dels 507 (holding the delimiter characters for the DB), MaxAccState 508 (the highest accepting state for 'catrange'), whitespace 509 (storing additional whitespace characters), entry 510 (storing entry points for each character), onecat 511 (a table storing single state transitions using record type ET_onecat 200) and catrange 512 (a table storing range transitions using record type ET_CatRange 300).
As the above description makes clear, the two-phase approach to lexical analysis provides significant advantages over standard techniques in terms of performance and flexibility when implemented in software. Additional applications are enabled when the invention is implemented in hardware.
Referring now to Figure 6, a sample implementation of a hardware device based on the 'onecat' algorithm (henceforth referred to as a Single Transition Module 600 or STM 600) is shown. The STM module 600 is preferably implemented as a single chip containing a large amount of recognizer memory 605 combined with a simple bit-slice execution unit 610, such as a 2910 sequencer standard module, and a control input 645. In operation, the STM 600 would behave as follows:
1) The system processor on which the user program resides (not shown) would load a recognizer DB 500 into the recognizer memory 605 using the port 615, formatted as records of type ET_onecat 200.
2) The system processor would initialize the source of the text input stream to be scanned. The simplest external interface for text stream processing might be to tie the 'Next' signal 625 to an incrementing address generator 1020 such that each pulse on the 'Next' line 625 is output by the STM 600 and requests the system processor to send the next byte of text to the port 630. The contents of the next external memory location (previously loaded with the text to be scanned) would then be presented to the text port 630. The incrementing address generator 1020 would be reset to address zero at the same time the STM 600 is reset by the system processor.
Referring now to Figure 7, another illustration of the operation of the STM 600 is shown. As the figure illustrates, once the 'Reset' line 620 is released, the STM 600 fetches successive input bytes by clocking based on the 'Next' line 625, which causes external circuitry to present the new byte to input port 630. The execution unit 610 (as shown in Figure 6) then performs the 'onecat' lexical analyzer algorithm described above. Other hardware implementations, via a sequencer or otherwise, are possible and would be obvious to those skilled in the art. In the simple case, where a single word is to be recognized, the algorithm drives the 'Break' line 640 high, at which time the state of the 'Match' line 635 determines how the external processor/circuitry 710 should interpret the contents of the table address presented by the port 615. The 'Break' signal 640 going high signifies that the recognizer (not shown) has completed an attempt to recognize a token within the text 720. In the case of a match, the contents presented by the port 615 may be used to determine the token number. The 'Break' line 640 is fed back internally within the Lexical Analyzer Module or 'LAM' (see Figure 10) to cause the recognition algorithm to re-start at state zero when the next character after the one that completed the cycle is presented.
Referring now to Figure 8, a logical representation of an internal STM implementation is shown. The fields/memory described by the ET_onecat 200 structure are now represented by three registers 1110, 1120, 1130, two of 8 bits 1110, 1120 and one of at least 32 bits 1130, which are connected logically as shown. The 'Break' signal 640 going high signifies that the STM 600 has completed an attempt to recognize a token within the text stream. At this point, external circuitry or software can examine the state of the 'Match' line 635 in order to decide between the following actions:
1) If the 'Match' line 635 is high, the external system can determine the token number recognized simply by examining recognizer memory 605 at the address presented via the register 1145.
2) If the 'Match' line 635 is low, then the STM 600 failed to recognize a legal token and the external system may either ignore the result, reset the STM 600 to try for a new match, or alternatively execute the range transition algorithm 500 starting from the original text point in order to determine if a token represented by a range transition exists. The choice of which option makes sense at this point is a function of the application to which the STM 600 is being applied.
The "=?" block 1150, "0?" blocks 1155, 1160, and "Add" block 1170 in Figure 11 could be implemented using standard hardware gates and circuits. Implementation of the "delim?" block 1165 would require the external CPU to load up a 256*1 memory block with 1 bits for all delimiter characters and 0 bits for all others. Once loaded, the "delim?" block
1165 would simply address this memory with the 8-bit text character 1161 and the memory output (0 or 1) would indicate whether the corresponding character was or was not a delimiter. The same approach can be used to identify white-space characters and in practice a 256*8 memory would be used thus allowing up to 8 such determinations to be made simultaneously for any given character. Handling case insensitive operation is possible via lookup in a separate 256*8 memory block.
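A software model of this 256*8 classification memory, given for illustration only (the particular bit assignments are arbitrary), makes the idea concrete:

    static unsigned char classMem[256];   /* one byte of class bits per character */

    #define kIsDelim 0x01                 /* bit 0: character is a delimiter      */
    #define kIsWhite 0x02                 /* bit 1: character is white-space      */

    int isDelim ( unsigned char ch ) { return classMem[ch] & kIsDelim; }
    int isWhite ( unsigned char ch ) { return classMem[ch] & kIsWhite; }

Because all eight class bits are fetched in a single memory access, up to eight independent per-character determinations cost no more than one.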
In the preferred implementation, the circuitry associated with the 'onecat' recognition algorithm is segregated from the circuitry/software associated with the 'catrange' recognition algorithm. The reason for this segregation is to preserve the full power and flexibility of the distinct software algorithms while allowing the 'onecat' algorithm to be executed in hardware at far greater speeds and with no load on the main system processor. This is exactly the balance needed to speed up the kinds of CAM and text processing applications that are described in further detail below. This separation and implementation in hardware has the added advantage of permitting arrangements whereby a large number of STM modules (Figures 6 and 7) can be operated in parallel, permitting the scanning of huge volumes of text while allowing the system processor to simply coordinate the results of each STM module 600. This supports the development of massive and scalable scanning bandwidth.
Referring now to Figure 9, a sample hardware implementation for the 'catrange' algorithm 500 is shown. The preferred embodiment is a second analyzer module similar to the STM 600, which shall be referred to as the Range Transition Module or RTM 1200. The RTM module 1200 is preferably implemented as a single chip containing a small amount of range table memory 1210 combined with a simple bit-slice execution unit 1220, such as a 2910 sequencer standard module. In operation, the RTM would behave as follows:
1) The system processor (on which the user program resides) would load a range table into the range table memory 1210 via the port 1225, wherein the range table is formatted as described above with reference to ET_CatRange 300.
2) Initialization and external connections, such as the control/reset line 1230, next line 1235, match line 1240 and break line 1245, are similar to those for the STM 600.
3) Once the 'Reset' line 1230 is released, the RTM 1200 fetches successive input bytes by clocking based on the 'Next' line 1235, which causes external circuitry to present the new byte to port 1250. The execution unit 1220 then performs the 'catrange' algorithm 500. Other implementations, via a sequencer or otherwise, are obviously possible.
In a complete hardware implementation of the two-phase lexical analyzer algorithm, the STM and RTM are combined into a single circuit component known as the Lexical Analyzer Module or LAM 1400. Referring now to Figure 10, a sample LAM 1400 is shown. The LAM 1400 presents a similar external interface to either the STM 600 or RTM 1200 but contains both modules internally together with additional circuitry and logic 1410 to allow both modules 600, 1200 to be run in parallel on the incoming text stream and their results to be combined. The combination logic 1410 provides the following basic functions in cases where both modules are involved in a particular application (either may be inhibited):
1) The clocking of successive characters from the text stream 1460 via the sub-module 'Next' signals 625, 1235 must be synchronized so that either module waits for the other before proceeding to process the next text character.
2) The external LAM 'Match' signals 1425 and 'Break' signals 1430 are coordinated so that, if the STM module 600 fails to recognize a token but the RTM module 1200 is still processing characters, the RTM 1200 is allowed to continue until it completes. Conversely, if the RTM 1200 completes but the STM 600 is still in progress, the STM 600 is allowed to continue until it completes. If the STM 600 completes and recognizes a token, further RTM 1200 processing is inhibited.
3) An additional output signal "S/R token" 1435 allows external circuitry/software to determine which of the two sub-modules 600, 1200 recognized the token and if appropriate allows the retrieval of the token value for the RTM 1200 via a dedicated location on port 1440. Alternately, this function may be achieved by driving the address latch to a dedicated value used to pass RTM 1200 results. A control line 1450 is also provided.
The final stage in implementing very high performance hardware systems based on this technology is to implement the LAM as a standard module within a large programmable gate array, which can thus contain a number of LAM modules, all of which can operate on the incoming text stream in parallel. On a large circuit card, multiple gate arrays of this type can be combined. In this configuration, the table memory for all LAMs can be loaded by external software and then each individual LAM is dynamically 'tied' to a particular block of this memory, much in the same manner that the ET_LexHdl structure (described above) achieves in software. Once again, combination logic similar to the combination logic 1410 utilized between the STM 600 and RTM 1200 within a given LAM 1400 can be configured to allow a set of LAM modules 1400 to operate on a single text stream in parallel. This allows external software to configure the circuitry so that multiple different recognizers, each of which may relate to a particular recognition domain, can be run in parallel. This implementation permits the development and execution of applications that require separate but simultaneous scanning of text streams for a number of distinct purposes. The external software architecture necessary to support this is not difficult to imagine, nor are the kinds of sophisticated applications, especially for intelligence purposes, for which this capability might find use.
Once implemented in hardware and preferably as a LAM module 1400, loaded and configured from software, the following applications (not exhaustive) can be created:
1) Content-addressable memory (CAM). In a CAM system, storage is addressed by name, not by a physical storage address derived by some other means. In other words, in a CAM one would reference and obtain the information on "John Smith" simply using the name, rather than by somehow looking up the name in order to obtain a physical memory reference to the corresponding data record. This significantly speeds and simplifies the software involved in the process. One application area for such a system is in ultra-high performance database search systems, such as network routing (i.e., the rapid translation of domains and IP addresses that occurs during all internet protocol routing), advanced computing architectures (i.e., non-Von Neumann systems), object oriented database systems, and similar high performance database search systems.
2) Fast Text Search Engine. In extremely high performance text search applications, such as intelligence applications, there is a need for a massively parallel, fast text search engine that can be configured and controlled from software. The present invention is ideally suited to this problem domain, especially those applications where a text stream is being searched for key words in order to route interesting portions of the text to other software for in-depth analysis. High performance text search applications can also be used on foreign scripts by using one or more character encoding systems, such as those developed by Unicode, and specifically UTF-8, which allow multi-byte Unicode characters to be treated as one or more single-byte encodings.
3) Language Translation. To rapidly translate one language to another, the first stage is a fast and flexible dictionary lookup process. In addition to simple one- to-one mappings, it is important that such a system flexibly and transparently handle the translation of phrases and key word sequences to the corresponding phrases. The present invention is ideally suited to this task.
4) Other Applications. A variety of other applications based on a hardware implementation of the lexical analysis algorithm described are possible, including (but not limited to): routing hierarchical text-based address strings, sorting applications, searching for repetitive patterns, and similar applications.
The foregoing description of the preferred embodiment of the invention has been presented for the purposes of illustration and description. Any number of other basic features, functions, or extensions of the foregoing method and systems would be obvious to those skilled in the art in light of the above teaching. For example, other basic features that would be provided by the lexical analyzer, but that are not described in detail herein, include case insensitivity, delimiter customization, white space customization, line-end and line-start sensitive tokens, symbol flags and tagging, analyzer backup, and other features of lexical analyzers that are well-known in the prior art. For these reasons, this description is not intended to be exhaustive or to limit the invention to the precise forms disclosed. It is intended that the scope of the invention be limited not by this detailed description but rather by the claims appended hereto.