WO2005111824A2 - Method and system for processing of text content - Google Patents

Method and system for processing of text content

Info

Publication number
WO2005111824A2
Authority
WO
WIPO (PCT)
Prior art keywords
processing
directive
text
schema
semantic
Prior art date
Application number
PCT/IL2005/000521
Other languages
French (fr)
Other versions
WO2005111824A3 (en)
Inventor
Eyal Maor
Gideon Kaempfer
Eliezer Gur
Original Assignee
Silverkite Inc.
Priority date
Filing date
Publication date
Application filed by Silverkite Inc. filed Critical Silverkite Inc.
Publication of WO2005111824A2 publication Critical patent/WO2005111824A2/en
Publication of WO2005111824A3 publication Critical patent/WO2005111824A3/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81 Indexing, e.g. XML tags; Data structures therefor; Storage structures

Definitions

  • the present invention relates to systems and methods for processing of text files encoded in dialects of data representation languages.
  • Data representation languages such as SGML, HTTP headers, XML, and EDI are well known in the art, and are useful both for representing data and for exchanging data in a generic format between applications.
  • XML allows the definition of new elements and data structures and the exchange of such definitions between devices. In addition, it remains readable to humans and appropriate for data representation.
  • any data element in XML can be defined by a developer of a document and understood by any device that receives it.
  • XML enables better, open, and standardized business-to-business transactions.
  • XML is expected to become the dominant format for electronic data interchange.
  • Web Services are emerging to facilitate standard XML based message exchange in which different services are described.
  • XML processing, including XML queries, Web Services and XML dialect transformation, comprises load intensive tasks. This load intensive nature is at least partly due to the fact that XML is readable by humans, such that information is carried in a very heavy format. Failures or delays may cause major problems for any distributed heterogeneous real time system that must work in a highly reliable data distribution and integration framework that can deliver information to end clients with low latency in the presence of both node and link failures.
  • XML processing is memory and CPU intensive. Since XML is a markup language, it is by nature more resource intensive when processed on a general processor server using standard methods such as DOM and SAX as defined by the W3C, or other proprietary software programs. Servers can easily be overloaded, and memory consumption can reach its maximum capacity when processing XML data. These methods require large memory usage since the XML that is processed is ultimately text, not byte code as in a software compiler. CPU usage therefore becomes a limiting factor as more processes are added in parallel.
  • XML parsers are not adapted for or specifically customized in accordance with the characteristics of the file or group of files to be parsed.
  • One potential solution is to employ hand-coded dialect-specific tools rather than generic XML tools such as DOM or SAX parsers. These dialect-specific tools are created in accordance with a specific corpus of one or more files with common features associated with the specific dialect. Unfortunately, while hand-optimization techniques can lead to more efficient tools for accelerated processing of content, the cost of creating these tools can be prohibitive.
  • a method of text file processing including providing a schema associated with a dialect of a data representation language and processing the schema to generate at least one look-up table encoding a plurality of directives for text processing in accordance with the schema.
  • the provided schema is a schema which defines the dialect of the data representation language.
  • the schema is provided electronically, and the stage of generating includes only electronic processing of the electronically provided schema.
  • the generating of one or more look-up tables includes effecting a compiling of the schema.
  • the compiling includes an electronic identification of at least one production rule of a grammar of the schema. According to some embodiments, the compiling includes an electronic identification of at least one semantic directive associated with a said production rule of said grammar.
  • the compiling includes a compiling of identified semantic directives into one or more lookup tables.
  • Exemplary semantic directives include but are not limited to classification semantic directives, such as a directive to semantically analyze text and classify analyzed text according to its semantic structure, and an action semantic directive whereby an action with semantic meaning is performed to processed text.
  • classification semantic directives such as a directive to semantically analyze text and classify analyzed text according to its semantic structure
  • action semantic directive whereby an action with semantic meaning is performed to processed text.
  • Exemplary semantic directives include but are not limited to validation directives and transformation directives.
  • the compiling includes a compiling of semantic meaning associated with the schema.
  • the compiling includes a compiling of semantic classification directives associated with the schema. According to some embodiments, the compiling includes a compiling of semantic analysis directives associated with the schema.
  • the compiling includes a compiling of validation directives of the schema into one or more of the lookup tables.
  • the compiling includes a compiling of transformation directives of the schema into one or more of the lookup tables.
  • files encoded in the data representation language are representable as a tree structure.
  • the data representation language is a tag delimited language. According to some embodiments, the data representation language is selected from the group consisting of XML and EDI.
  • the dialect is an XML dialect selected from the group consisting of FIX, OFX, SwiftML, SOAP, WSDL, HL7, EDI, and AccordXML.
  • the schema is provided in a format selected from the group consisting of a DTD format and an XSD format.
  • the schema is a transformation schema.
  • Exemplary formats in which a transformation schema may be provided include but are not limited to XSL.
  • the generating of at least one lookup table includes at least partially implementing at least one lookup table in hardware.
  • the hardware includes at least one re-programmable logic component.
  • Appropriate logic components include but are not limited to FPGA components, ASIC components, and gated components.
  • a plurality of lookup tables are generated and the processing includes determining a first subset of the plurality of lookup tables to be implemented in software and a second subset of the plurality of lookup tables to be implemented in hardware.
  • the first subset of the plurality of lookup tables includes at least one lookup table that encodes at least one directive to perform a relatively simple operation.
  • Appropriate relatively simple operations include but are not limited to type checking operations, range checking operations, and fixed increment operations.
  • a first subset of the plurality of lookup tables includes at least one lookup table that encodes at least one directive to perform a relatively complex operation.
  • Appropriate relatively complex operations include but are not limited to sorting operations, transactional operations, transactional operations requiring communication with a remote entity, and cryptographic operations.
  • At least one lookup table is a semantic lookup table encoding at least one semantic directive including but not limited to a directive to effect a semantic analysis, a directive to effect a semantic classification, and a directive to carry out a semantic action such as a transformation and a validation.
  • a directive to effect a semantic analysis includes an operation to be performed upon identification of at least one token, at least one production rule, or a combination thereof.
  • Exemplary operations to be performed upon identification of a token include but are not limited to a storing of a position of the token in an input stream, a storing of a length of the token, a storing of a prefix of the token, a calculation and storing of a numerical value corresponding to the token, a storing of a token type of the token, a discarding of the token, a counting of the token, and a storing of a pointer to a semantic rule.
  • Exemplary operations to be performed upon identification of a production rule include but are not limited to a storing of a position of a token associated with the production rule, a storing of a number of tokens associated with the production rule, a storing of a character prefix associated with the production rule, a calculation and a storing of a numerical value associated with a token of the production rule, a calculation and storing of an abbreviated representation of the production rule, a calculation and storing of an abbreviated representation of a hierarchy of rules including the production rule, a storing of a rule type of the production rule, a discarding of the production rule, a counting of the production rule, a storing of an index of the production rule in a rule type table, a storing of an index of at least one sub-element of the production rule, a storing of a value of a counter associated with at least one sub-element of the production rule, and an inheriting of
  • Exemplary semantic rules include but are not limited to validation rules and transformation rules.
  • Exemplary operations to be performed upon identification of a combination of a token and a production rule include but are not limited to a storing of a pointer to a specific semantic rule and a counting of the production rule.
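  • A minimal Python sketch of such a semantic lookup table follows, assuming integer token and production-rule identifiers; the operation codes, the SemanticEntry record and the wildcard convention are illustrative assumptions, not structures taken from the disclosure.

```python
from dataclasses import dataclass

# Illustrative operation codes mirroring the exemplary directives above.
OP_STORE_POSITION = 1   # store the token's position in the input stream
OP_STORE_LENGTH   = 2   # store the token's length
OP_STORE_PREFIX   = 3   # store a prefix of the token
OP_STORE_HASH     = 4   # calculate and store a value derived from the token
OP_COUNT          = 5   # count occurrences of the token or production rule
OP_DISCARD        = 6   # discard the token

@dataclass
class SemanticEntry:
    ops: list[int]        # operations to apply on identification
    rule_index: int = -1  # optional pointer to a validation/transformation rule

# Maps (production_rule_id, token_id) to directives; token_id -1 is a wildcard
# standing for "any token identified within this production rule".
semantic_lut: dict[tuple[int, int], SemanticEntry] = {
    (10, 3):  SemanticEntry(ops=[OP_STORE_POSITION, OP_STORE_HASH]),
    (10, -1): SemanticEntry(ops=[OP_COUNT]),
    (11, 7):  SemanticEntry(ops=[OP_DISCARD], rule_index=42),
}

def lookup(rule_id: int, token_id: int) -> SemanticEntry | None:
    """Return directives for the exact pair, else the rule's wildcard entry."""
    return semantic_lut.get((rule_id, token_id),
                            semantic_lut.get((rule_id, -1)))

print(lookup(10, 3))   # position + hash directives
print(lookup(10, 9))   # falls back to the wildcard COUNT entry
```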
  • at least one semantic lookup table is a stateless lookup table.
  • At least one semantic directive is selected from the group consisting of a validation directive and a transformation directive.
  • Exemplary validation directives include but are not limited to value comparison directives, simple value comparison directives, multiple value comparison directives, range checking directives, type checking directives, integer range checking directives, fraction range checking directives, value length directives, and value length checking directives.
  • At least one validation directive is selected from the group consisting of a path context validation and a context specific range checking.
  • at least one validation directive is a syntax directive selected from the group consisting of tracking how many elements of a certain type are permitted, effecting of a complex choice of possible parameter or attribute combinations, and enforcement of a specific element sequence constraint.
  • transformation directives include but are not limited to directives to mark a structure for transformation of a given type, such as a type specified by a transformation identification code, directives to remove a token, directives to remove a complex structure, directives to truncate a token, directives to truncate a complex structure, directives to replace a first token with a second token, numerical conversion directives, and directives to update a given field in an output table.
  • the updating of the given field allows altering of one or more pointers by some predefined value.
  • the altering is for a namespace change in XML.
  • At least one transformation directive is conditional upon at least one validation result.
  • Appropriate range checking directives include but are not limited to directives for context specific range checking.
  • an array of at least one transformation directive encodes a transformation between the dialect and a different dialect of the data representation language.
  • an array of at least one transformation directive encodes a transformation between the data representation language and a different data representation language.
  • an array of at least one transformation directive encodes a transformation between the data representation language and a language other than a data representation language.
  • at least one semantic directive is a path validation directive.
  • a method of text processing including receiving a plurality of text files encoded in a data representation language, for at least one text file determining a dialect of the data representation language, generating from a definition of the dialect at least one look-up table encoding directives for text processing in accordance with the determined dialect, and using at least one look-up table, processing the respective text file in accordance with the determined dialect.
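  • The following Python sketch models this flow under stated assumptions: determine_dialect, compile_schema_to_luts and process_with_luts are hypothetical stand-ins (not names from the disclosure), and the dialect heuristic is purely illustrative.

```python
# A minimal sketch of the disclosed method: determine the dialect of each file,
# generate look-up tables from the dialect definition once, then process the
# file with the dialect-specific tables.

def determine_dialect(data: bytes) -> str:
    """Placeholder heuristic: inspect content for a known root element.
    A real implementation might also use the file name, type, or source."""
    return "emp" if b"<employees" in data else "generic-xml"

def compile_schema_to_luts(dialect: str) -> dict:
    """Stand-in for schema compilation: returns per-dialect look-up tables."""
    return {"lexical": {}, "syntax": {}, "semantic": {}, "dialect": dialect}

def process_with_luts(data: bytes, luts: dict) -> str:
    """Stand-in for the table-driven processing pass."""
    return f"processed {len(data)} bytes as {luts['dialect']}"

def process_files(files: list[bytes]) -> list[str]:
    lut_cache: dict[str, dict] = {}      # tables are generated once per dialect
    out = []
    for data in files:
        dialect = determine_dialect(data)             # first pass / metadata
        if dialect not in lut_cache:                  # generate LUTs on demand
            lut_cache[dialect] = compile_schema_to_luts(dialect)
        out.append(process_with_luts(data, lut_cache[dialect]))  # second pass
    return out

print(process_files([b"<employees><employee/></employees>", b"<other/>"]))
```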
  • At least one said look-up table is implemented at least in part in hardware.
  • a plurality of said text files are subjected to a processing including determining a set of common or heavy operations, determining a set of uncommon or light operations; and, for a subset of the plurality of text files, performing the set of common or heavy operations and performing on the subset of the text files the set of uncommon or light operations.
  • the determining is effected at least in part using a first hardware module and the processing is effected at least in part using a second hardware module.
  • the first and second hardware modules are configured to effect a pipeline processing.
  • data associated with at least one look-up table is cached, and the stage of processing includes retrieving at least some of the cached data.
  • the caching includes caching at a plurality of locations within a network.
  • the definition of the dialect includes a schema, and the generating of at least one look-up table includes electronic processing of the schema.
  • processing of said schema includes effecting a compiling of the schema.
  • a hardware update of a said lookup table is performed before commencing processing of at least one text file.
  • the determining of the dialect includes identifying a string selected from the group consisting of a file name of a text file and a file type of a text file.
  • the determining of the dialect includes identifying a file source of one or more of the text files. According to some embodiments, the determining of the dialect includes parsing at least some of one or more of the text files.
  • the determining of the dialect includes effecting a first pass over a respective text file, and the processing using at least one look-up table includes effecting a second pass over the respective text file. According to some embodiments, the determining, generating and processing are performed iteratively more than once on a said text file.
  • more than one text file is subjected to processing in parallel.
  • At least one look-up table encodes a production rule of a grammar.
  • the processing of the text file includes effecting, in accordance with at least one look-up table, at least one grammatical analysis of the text file.
  • exemplary grammatical analyses include but are not limited to syntactic analyses and semantic analyses.
  • a syntactic analysis includes recording at least one value selected from the group consisting of a production rule identifier, a parsing record index, a beginning of a record in an input stream, a length of a record in the input stream, a beginning of a specified sub-element of a production rule, a length of a specified sub-element of a production rule, a context value, a value stored earlier in a parsing process, a record prefix, a prefix of a specified token in a production rule, a value associated with a specific token in a production rule, a hash value associated with at least one token associated with a production rule, a number of combined hash values, and a number of counters incremented.
  • the grammatical analysis includes determining a validity of a production rule.
  • the determining of the validity of the production rule includes determining a validity of at least one parent production rule, wherein invalidity of the parent production rule is indicative of invalidity of the production rule.
  • the semantic analysis includes recording at least one value selected from the group consisting of a validation rule to be applied to a parsing record, a result of a comparison between two calculated values, and a transformation rule to be applied to a parsing record.
  • Exemplary reference nodes include but are not limited to a root node of the tree and a node that is a fixed distance from a root node of the tree.
  • the determining of the path and the storing of the abbreviated representation is carried out for at least one node element which is a descendant of the reference node.
  • the determining of the path and the storing of the abbreviated representation is carried out only for the node elements which are descendants of said reference node.
  • determining and storing is effected for all nodes having one or more predefined depths within the tree.
  • the abbreviated representation of the path is mapped to a representation of data associated with the node.
  • the method includes mapping an abbreviated representation of a path between the node element and an ancestor of the node element to a representation of data associated with the node.
  • a system for accelerating text file processing including a schema input for receiving a schema defining a dialect structure of a data representation language and a look-up table generator for processing the schema to generate at least one look-up table encoding directives for text processing in accordance with the schema.
  • the look-up data generator includes a schema compiler for effecting a compiling of said schema.
  • the look-up data generator is operative to implement at least one said look up table in hardware such as re-programmable logic.
  • the implementing includes configuring re-programmable logic.
  • a system for text processing including an input for receiving at least one text file encoded in a data representation language and a text processing engine including at least one look-up table encoding a plurality of text-processing directives in accordance with a schema of the data representation language, the text processing engine for processing at least one received text file.
  • the system further comprises a dialect determiner for determining a said dialect of a said text file.
  • at least one look-up table is implemented at least in part in hardware.
  • At least one look-up table encodes at least one semantic directive, such as a directive to effect a semantic analysis, a directive to effect a semantic classification, and a directive to effect a semantic operation.
  • exemplary semantic operations include but are not limited to validation operations and transformation operations.
  • the text processing engine is distributed within a computer network.
  • the presently disclosed system further comprises an exception handling module for handling exceptions generated by the processing of at least one text file.
  • the text processing engine is operative to reconfigure at least one said look-up table while effecting processing of a received text file.
  • the text processing engine includes a hardware character normalization engine.
  • the processing of at least one received text file includes generating a character stream from the byte stream using the hardware character normalization engine.
  • the text processing engine includes a hardware token identification engine.
  • the processing of the received text file includes identifying tokens within a character stream representative of said text file using said hardware token identification engine.
  • the text processing engine includes a hardware parsing engine.
  • the processing of one or more received text files includes receiving a stream of tokens representative of the text file and producing a stream of parsing records using the hardware parsing engine.
  • the hardware parsing engine uses a look-up table encoding at least one syntactic text-processing directive of the dialect.
  • the text processing engine further includes at least one hardware semantic analysis engine.
  • the hardware semantic analysis engine is selected from the group consisting of a hardware validation engine and a hardware transformation rule engine.
  • a system for generating data useful for fast text processing including an input for receiving a text file representable as a tree having a plurality of node elements, a path determiner for determining a path within said tree between said respective node element and a reference element, a path representor for deriving an abbreviated representation of said determined path and a storage for said abbreviated representation.
  • the reference element is a root element of said tree.
  • the path representor is operative to derive a hash value as an abbreviated representation of a determined path corresponding to a node.
  • the storage is operative to store a map between a representation of a said node and an abbreviated representation of a said path corresponding to said represented node.
  • It is now disclosed for the first time a method of processing a corpus of at least one text file encoded in a data representation language. The presently disclosed method includes processing at least one text file of the corpus using directives associated with the data representation language, from results of the processing determining a schema of a dialect associated with the processed files, modifying a set of at least one lookup table encoding a plurality of directives for processing of text files of the corpus, and processing at least one text file in accordance with the modified lookup tables.
  • the presently disclosed method includes processing at least one text file of the corpus using directives associated with the data representation language, from results of the processing, determining schema data of a dialect associated with the processed files, using the determined schema data, modifying at least one lookup table encoding at least one directive for text file processing, and processing at least one text file in accordance with a modified lookup table.
  • the determining, modifying, and processing with the modified lookup tables are repeated at least once.
  • the modifying of at least one lookup table includes at least one of generating at least one lookup table and updating at least one lookup table.
  • the modifying of at least one lookup table includes updating hardware associated with the lookup table.
  • At least one lookup table encodes at least one semantic directive.
  • a computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for providing a schema associated with a dialect of a data representation language and processing the schema to generate at least one look-up table encoding a plurality of directives for text processing in accordance with the schema.
  • a computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for receiving a plurality of text files encoded in a data representation language, for at least one said text file, determining a dialect of the data representation language, generating from a definition of the dialect at least one look-up table encoding directives for text processing in accordance with the determined dialect, and using at least one look-up table, processing the respective text file in accordance with the determined dialect.
  • a computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for providing a text file representable as a tree having a plurality of node elements, for at least one node element, determining a path within the tree between the respective node element and a selected reference element, and storing an abbreviated representation of the path.
  • a method of accelerating processing of HTTP headers includes receiving a plurality of HTTP headers; determining common patterns among said plurality of the HTTP headers and generating from the determined common patterns at least one look-up table encoding directives for text processing in accordance with the determined common patterns.
  • At least one HTTP header is processed using one generated lookup table.
  • At least one look-up table is implemented at least in part in hardware. It is now disclosed for the first time a computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for receiving a plurality of HTTP headers, determining common patterns among said plurality of the HTTP headers, and generating from the determined common patterns at least one look-up table encoding directives for text processing in accordance with the determined common patterns.
  • FIG. 1 provides a listing for an exemplary DTD file.
  • FIG. 2 provides a listing for an exemplary XSL file.
  • FIGS. 3A-B provide a listing for an exemplary XML file.
  • FIG. 4 provides a flow chart describing generation of lookup tables in accordance with exemplary embodiments of the present invention.
  • FIG. 5 provides an exemplary lookup table encoding semantic directives.
  • FIG. 6 provides a flow chart describing content processing in accordance with exemplary embodiments of the present invention.
  • FIG. 7 provides a flow chart describing pre-processing of data representation language content in accordance with exemplary embodiments of the present invention.
  • FIG. 8 provides a flow chart describing exemplary processing of data representation language content in accordance with some embodiments of the present invention.
  • FIG. 9 provides a flow chart describing a process wherein different aspects of a dialect associated with a data representation language are iteratively compiled and used for content processing.
  • FIG. 10 provides a flow chart describing a process wherein hash values of certain paths in a tree representation of a text file are computed and stored.
  • TABLE 1 includes information gathered for the production rule related to XML elements for the example input files of FIGS. 1-3.
  • TABLE 2 includes information gathered for the production rule related to XML attributes for the example input files of FIGS. 1-3.
  • TABLE 3 includes flags and counters generated in the processing stage for the example input files of FIGS. 1-3.
  • FIG. 11 provides a listing of an HTML file that is the product of transforming the XML file of FIG. 3 in accordance with the XSL directives provided in FIG. 2.
  • Embodiments of the present invention provide methods, apparatus and computer readable code for efficient processing of content in accordance with data representation languages.
  • Exemplary text processing in accordance with some embodiments of the present invention includes but is not limited to parsing, validation, searching, extracting data and transformation to other formats.
  • the presently disclosed methods and systems can be implemented in software, hardware or any combination thereof. Not wishing to be bound by any particular theory, it is noted that certain hardware implementations are useful for accelerating the processing of text files in accordance with presently disclosed techniques. Furthermore, it is noted that the data processing of the present invention can be applied to a variety of different types of data in various computer systems and appliances. In accordance with certain embodiments of the present invention, it has been discovered that processing of text files encoded in data representation languages can be accelerated by generating a series of lookup tables in accordance with the data representation language and/or a specific dialect of the data representation language.
  • storing abbreviated representations, such as hash values, of paths between certain nodes within the tree is useful for accelerating a subsequent process where it is necessary to quickly locate paths between nodes, such as a path between a given node and a root node of the tree.
  • abbreviated representations of paths, such as hash values, are used in search expressions, such as for example XPath and XQuery for XML files, as well as in transformation expressions, such as for example those encoded by XSL.
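  • A small self-contained sketch of this idea follows, assuming Python's xml.etree parser and a 64-bit FNV-1a hash; the hash choice and helper names are assumptions. Each root-to-node path is reduced to a single hash so that a later path query needs one hash and one table look-up instead of a tree traversal.

```python
import xml.etree.ElementTree as ET

def fnv1a_64(s: str) -> int:
    """64-bit FNV-1a hash, used here as the abbreviated path representation."""
    h = 0xcbf29ce484222325
    for b in s.encode("utf-8"):
        h = ((h ^ b) * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

def hash_paths(xml_text: str) -> dict[int, list[str]]:
    """Map hash(root-to-node path) -> text values of nodes on that path."""
    root = ET.fromstring(xml_text)
    table: dict[int, list[str]] = {}
    def walk(node, path):
        table.setdefault(fnv1a_64(path), []).append((node.text or "").strip())
        for child in node:
            walk(child, path + "/" + child.tag)
    walk(root, "/" + root.tag)
    return table

t = hash_paths("<employees><employee><name>Jane</name></employee></employees>")
# Resolving "/employees/employee/name" is now a single hashed look-up:
print(t[fnv1a_64("/employees/employee/name")])   # ['Jane']
```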
  • semantic directives include but are not limited to validation directives and transformation directives.
  • the file "emp.dtd" in Figure 1 defines that the input text files should be XML files that have a root element "employees". This element should contain one or more sub-elements, each called "employee". Each "employee" element should have an attribute "serial" of type "ID", and the following sub-elements (in this order): "name", "position", "address1", "address2" (optional), "city", "state", "zip", "phone" (optional), and "email" (optional).
  • the "name" element should have the following attributes: "age", "sex", "race" (optional; when absent a default value is implied), and "m_status".
  • Each of the other elements is then defined to be a string (character data).
  • the above XSL file of FIG. 2 has conversion and presentation instructions. When applied to the input XML file, the expected result of the transformation is in HTML format. The resulting HTML file depends on the content of the actual input XML file, and is presented below for the specific input XML file of FIG. 3 (see FIG. 11). It is noted that although "emp.xml" includes no explicit reference to a schema file for validation, the "emp.xml" file conforms to the "emp.dtd" file described above, and that the transformation schema "emp.xsl" may be applied to "emp.xml".
  • the XML file explicitly references an XML style sheet file "emp.xsl" that includes transformation rules and formatting instructions. Having the "emp.xsl" file preprocessed is a prerequisite for transforming the input "emp.xml" file. It will be appreciated that no specific characteristics of the aforementioned input files are intended as a limitation of the scope of the present invention. Thus, although these files are associated with the specific data representation language XML, any data representation language currently known in the art or defined in the future is appropriate for the present invention. Similarly, "Document Type Definition" is just one appropriate format for schema defining specific dialects, and any other validation schema format currently known or to be defined in the future is appropriate.
  • XML Style-sheet is merely one exemplary format for a transformation schema, and any other transformation schema format currently known or to be defined in the future, including both transformation schema associated with XML and transformation schema associated with other data representation languages, is appropriate.
  • drawings which are used to illustrate the inventive concepts, are not mutually exclusive. Rather, each one has been tailored to illustrate a specific concept discussed. In some cases, the elements or steps shown in a particular drawing co-exist with others shown in a different drawing, but only certain elements or steps are shown for clarity.
  • a "data representation language” or a “markup language” such as, for example, XML or EDI is a language with a well defined syntax and semantics for providing agreed upon standards for message or data exchange and/or for data storage. This in contrast with programming languages or computer code languages, such as, for example, C, C++, Java, C#, and Fortran, containing specific computer instructions compilable into machine executable code.
  • the data represented in the data representation language is a structured message.
  • the data representation language and/or the dialect has strict syntax, and rigid singular semantics.
  • ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇
  • transformation schema including but not limited to XSL schema defining transformation rules may be associated with a schema of a dialect of a data representation language.
  • a data representation language and/or a dialect or sub-language defines a message protocol or messaging standard.
  • One exemplary message protocol is the Financial Information eXchange (FIX) protocol developed specifically for the real-time electronic exchange of securities transactions.
  • One example of a data representation language that functions as a message protocol is the hypertext transport protocol (HTTP), used for conveying messages over a network between various electronic devices or appliances.
  • Another exemplary message protocol is the Simple Object Access Protocol (SOAP).
  • FIG. 4 provides a flow chart illustrating generation of one or more lookup tables according to certain exemplary embodiments of the present invention.
  • FIG. 4 illustrates a process for the generation of five different types of lookup tables, namely one or more lexical lookup tables, syntax lookup tables, semantic lookup tables, validation lookup tables, and transformation lookup tables.
  • the definition of the dialect and/or of the schema associated with the dialect is expressed through at least one of:
  • a set of tokens, each definable as a regular expression over a standard character set,
  • a context free grammar defined by a set of production rules based on the above set of tokens, and
  • a set of semantic operations to be performed for any given token or production rule.
  • There is no specific limitation on the format in which a schema associated with or defining a dialect of a data representation language is provided. Exemplary formats for schema defining a dialect include but are not limited to DTD and XSD formats. Exemplary formats for transformation schema associated with specific data representation languages and/or schemas include but are not limited to XSL. It is noted that the specific schema formats for defining a dialect and/or transformation schema mentioned herein are merely appropriate examples, and it is understood that other schema formats, including those not disclosed to date, are within the spirit and scope of the present invention.
  • the schema is not provided independently of text files to be subsequently processed, but is derived from processing a number of text files, and determining patterns within the text files. In some embodiments, these patterns are determined using statistical methods known in the art.
  • the allowed lexical and syntactical rules, including tokens and grammar rules, are well defined. Additional tokens and grammar rules may be derived from the validation and transformation schemas. The tokens and the grammar are then used for generating the state-machines required for fast processing of files encoded in the data representation language. This information may be saved as DFA (Deterministic Finite Automata) state-machines, which in turn may be translated into lookup tables (or LUTs) that may be used by hardware and software for the actual processing of the data representation language content.
  • a token may be defined as a regular expression (see, for example, the definition of GNU regex library) over any standard character set (e.g. ASCII, UTF-8 or UTF-16). There is no inherent limitation on the token definitions.
  • the tokens are compiled, by means of a Regular Expression Compiler such as GNU Flex or other appropriate compilers, into a deterministic state machine 106 (also termed a Deterministic Finite Automaton - DFA) such as a software encoded DFA.
  • This state machine is then converted into one or more memory mapped Lexical lookup table(s) 108 to be loaded into the hardware memory device used for lexical analysis.
  • this conversion process is performed by a dedicated software converter, though it will be appreciated that this is not a specific limitation.
  • Each lexical lookup table 108 is in essence a depiction of the state machine where each memory record corresponds to a state including the current output of the state machine (e.g. if a token was recognized) and the method of calculating the next state according to the next character to be read.
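  • A software model of such a table may be sketched as follows; this is illustrative only, as the character classes, state numbering and record layout are assumptions, and real tables would be generated from the compiled token definitions.

```python
# Toy lexical look-up table: two token types, NAME (letters) and NUMBER
# (digits). Character classes keep the table narrow.

def char_class(c: str) -> int:
    if c.isalpha():
        return 0   # letter
    if c.isdigit():
        return 1   # digit
    return 2       # other

# lexical_lut[state][char_class] = next state; -1 means "no transition",
# i.e. the machine halts and the current state's output token is emitted.
lexical_lut = [
    [1, 2, -1],    # state 0: start
    [1, -1, -1],   # state 1: inside a NAME token
    [-1, 2, -1],   # state 2: inside a NUMBER token
]
token_of_state = {1: "NAME", 2: "NUMBER"}   # output part of each record

def next_token(text: str, pos: int) -> tuple[str, int]:
    """Run the table-driven DFA from pos; return (token type, new position)."""
    state = 0
    while pos < len(text):
        nxt = lexical_lut[state][char_class(text[pos])]
        if nxt == -1:
            break
        state, pos = nxt, pos + 1
    return token_of_state.get(state, "ERROR"), pos

print(next_token("emp42", 0))   # ('NAME', 3)
print(next_token("emp42", 3))   # ('NUMBER', 5)
```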
  • a context free grammar is defined through a set of production rules beginning from a single root variable for the specific data representation language dialect (different dialects may have different root variables).
  • the format of describing the grammar is similar to the grammar description allowed by GNU Bison or Yacc.
  • the grammar is compiled by means of a Grammar Compiler (SWGC), similar in essence to GNU Bison or Yacc, into a software encoded deterministic state machine. Traversal from one state to the other is defined by the current token received from the input (coming from a lexical analyzer), the current state and the top of a stack that accompanies the state machine during processing.
  • the construction of such a state machine with stacks is well defined and published in classic computer science literature (see for instance "Compilers: Principles, Techniques and Tools" by A. Aho, R. Sethi and J. Ullman, Addison-Wesley, 1986).
  • the state machine is converted into a memory mapped Syntax Lookup Table 114 to be loaded into the hardware memory device used for parsing (syntax analysis). In some embodiments, this is performed by a dedicated software converter (SWDFA2SYLUT) though this is not a specific limitation.
  • the Syntax Lookup Table 114 is in essence a depiction of the state machine where each memory record corresponds to the state including the current output of the state machine (i.e. action to be performed, production rule completed, statement identified or syntax error) and the method of calculating the next state according to the next token to be read and the content of the top of the stack.
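  • The following toy Python model illustrates table-driven parsing with a stack, for a minimal grammar of nested elements; the table layout, states, actions and token names are assumptions for the sketch, not the disclosed encoding.

```python
# syntax_lut[(state, token, stack_top)] = (action, next_state)
# actions: "push" (shift), "pop" (production rule completed),
#          "accept", or implicit "error" when no entry exists.
syntax_lut = {
    (0, "OPEN", "$"):  ("push", 0),    # first element start
    (0, "OPEN", "E"):  ("push", 0),    # nested element start
    (0, "CLOSE", "E"): ("pop", 0),     # an "element" production completes
    (0, "END", "$"):   ("accept", 0),
}

def parse(tokens: list[str]) -> bool:
    """Drive the pushdown machine from the table; True if input is well formed."""
    state, stack = 0, ["$"]            # stack initialized with end-of-stack mark
    for tok in tokens + ["END"]:
        action, state = syntax_lut.get((state, tok, stack[-1]), ("error", state))
        if action == "push":
            stack.append("E")
        elif action == "pop":
            stack.pop()                # a parsing record would be emitted here
        elif action == "accept":
            return True
        else:
            return False               # syntax error
    return False

print(parse(["OPEN", "OPEN", "CLOSE", "CLOSE"]))  # True  (nested elements)
print(parse(["OPEN", "CLOSE", "CLOSE"]))          # False (unbalanced)
```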
  • lookup tables such as the Token Lexical Lookup Tables 108 and the Syntax Lookup Tables 114 encode operations or directives having semantic meaning.
  • Exemplary operations for tokens in the Lexical Lookup table include but are not limited to one or more operations enabled by the hardware implementation as follows:
  • token value, e.g. the integer value of a string of digits.
  • hash function of the token value e.g. a 64-bit value calculated as a function of the string.
  • Exemplary operations in the Syntax Lookup Table include but are not limited to one or more operations enabled by the hardware implementation as follows:
  • index e.g. a pointer to a specific validation or transformation rule
  • the processing mechanisms implementing the lexical and syntax analysis according to the defined tokens and grammar inherently validate that the processed file is well formed in the sense that it is constructed out of a set of well defined "words" (tokens) that are put together into “meaningful sentences” (grammar).
  • the tokens and grammar production rules may be augmented with a set of semantic operation identifiers to be performed when a given token or production rule is identified (followed).
  • semantic operation identifiers may also be associated with combinations of production rules and specific token values.
  • a semantic lookup table encodes directives for semantic processing of a text file encoded in a data representation language.
  • semantic directives include semantic classification or semantic analysis directives, which are directives to determine or classify text based upon semantic characteristics, and semantic action directives for performing an action other than semantic classification in accordance with semantic structures. It is noted that one particular example of semantic classification or semantic analysis is validation, wherein text is analyzed and semantic properties of the analyzed text are determined.
  • semantic action directives are transformation directives wherein output text is generated in accordance with semantic properties of input text. It is noted that these directives allow for accelerated processing of the text file.
  • Exemplary semantic processing directives include but are not limited to transformation directives and validation directives.
  • Exemplary transformation directives include but are not limited to a directive to transform a particular document encoded in a dialect of data representation language into a document encoded in the same dialect of the same data representation language, into a document encoded in a different well-defined dialect of the same data representation language, and/or into a document encoded in a different data representation language.
  • the operations for combinations of production rules and token values are stored in a dedicated Semantic Lookup Table 120. These operations include but are not limited to: 1. Store an index (e.g. a pointer to a specific validation or transformation rule).
  • VALIDATION LOOKUP TABLE(S) 126: In addition to the definition of the data representation language dialect to be processed, specific validation rules may be defined that allow the system to verify that the messages being processed meet given criteria.
  • the validation rules for XML files may be written in the Document Type Definition (DTD) language (which is part of the XML definition; for a formal definition see http://www.w3.org/TR/xml11/).
  • the validation rules may be written in standard XML Schema Definition (XSD) language (for a formal definition see http://www.w3.org/TR/xmlschema-0/, http://www.w3.org/TR/xmlschema-1/, http://www.w3.org/TR/xmlschema-2/, http://www.w3.org/TR/xmlschema-formal/) or in a similar type of validation language.
  • XSD allows verifying various aspects of XML files, but may be generalized to other data formats as well, such as higher level XML dialects (e.g. FIX, OFX, HL7, SwiftML, SOAP, WSDL) and other structured message formats (e.g. EDI, HTTP headers, CSV).
  • validation schemas are compiled by a Validation Schema Compiler (SWVSC) into two sets of validation rules: rules to be verified by the Hardware Validation Engine (HWVE) and rules to be verified by a Software Validation Module (SWVM).
  • XSD is supported by the SWVM.
  • the validation rules to be verified by the HWVE are encoded in a Validation Lookup Table (VDLUT). Exemplary validation rules include:
  • Type checking (e.g. integer, string).
  • Fraction range checking, e.g. the number of digits before and after the decimal point.
  • Value length (e.g. the number of characters in the corresponding value field).
  • Pattern matching (patterns will be defined as validation tokens, matched during lexical analysis and included in the appropriate grammar).
  • Syntax directives, e.g. how many elements of a certain type are allowed, complex choices of possible parameter/attribute combinations, specific element sequence constraints.
  • Path context validation together with simple value comparisons, e.g. XPATH predicates.
  • Context specific range checking.
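  • These rule kinds can be modeled as records in a validation lookup table, as in the hedged Python sketch below; the ValidationRule record, rule kinds and indices are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ValidationRule:
    kind: str      # "type" | "int_range" | "fraction" | "length"
    params: tuple  # rule-specific parameters

def check(rule: ValidationRule, value: str) -> bool:
    if rule.kind == "type":        # type checking, e.g. integer vs string
        t, = rule.params
        return value.isdigit() if t == "integer" else True
    if rule.kind == "int_range":   # integer range checking
        lo, hi = rule.params
        return value.isdigit() and lo <= int(value) <= hi
    if rule.kind == "fraction":    # digits before / after the decimal point
        before, after = rule.params
        whole, _, frac = value.partition(".")
        return len(whole) <= before and len(frac) <= after
    if rule.kind == "length":      # value length checking
        max_len, = rule.params
        return len(value) <= max_len
    return False

# Validation look-up table: rule index -> rule record (e.g. for an "age" field).
vdlut = {0: ValidationRule("type", ("integer",)),
         1: ValidationRule("int_range", (18, 120)),
         2: ValidationRule("length", (3,))}

print(all(check(vdlut[i], "42") for i in vdlut))   # True
```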
  • XSL and specifically XSL Transformations (XSLT) (for a formal definition see http://www.w3.org/TR/xslt), is a standard way to define required transformations of XML files, but may be generalized to other data formats as well, such as higher level XML dialects (e.g. FIX, OFX, HL7, SwiftML, SOAP, WSDL) and other structured message formats (e.g. EDI, HTTP headers, CSV).
  • transformation schemas are compiled by a Transformation Schema Compiler (SWTSC) into two sets of transformation rules: rules to be applied by the Hardware Transformation Engine (HWTRE) and rules to be applied by a Software Transformation Module (SWTM).
  • the transformation type may be a general type (e.g. "convert string to uppercase") or specified by a Transformation Identification Code (TIC) for a specific operation the SWTM will recognize.
  • the transformation rules may appear in the context of tokens as well as grammar rules. In addition, they may optionally be conditional on the validation results.
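  • A minimal Python sketch of such a transformation rule record follows; the TIC values, field names and the validation-result interface are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

TIC_UPPERCASE = 1   # general type: "convert string to uppercase"
TIC_TRUNCATE  = 2   # truncate a token to a fixed length

@dataclass
class TransformationRule:
    tic: int                             # Transformation Identification Code
    requires_valid: Optional[int] = None  # index of a validation rule, if any

def apply(rule: TransformationRule, token: str,
          validation_results: dict[int, bool]) -> str:
    # A conditional rule is skipped when its validation prerequisite failed.
    if (rule.requires_valid is not None
            and not validation_results.get(rule.requires_valid, False)):
        return token
    if rule.tic == TIC_UPPERCASE:
        return token.upper()
    if rule.tic == TIC_TRUNCATE:
        return token[:8]
    return token

rule = TransformationRule(tic=TIC_UPPERCASE, requires_valid=0)
print(apply(rule, "jane", {0: True}))    # 'JANE'
print(apply(rule, "jane", {0: False}))   # 'jane' (validation failed, no change)
```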
  • lookup tables described in the previous section are useful in the framework of a general text file processing method illustrated in FIG. 6 and described herein for the first time by the present inventor.
  • actual text files are processed in accordance with the results of the preprocessing 204A.
  • the text files are subjected to a post processing 206 described below.
  • some or all required software and hardware elements for maximal processing performance are prepared prior to the reception of some or all of the data representation language content to be processed.
  • these preparations are performed by software elements, using information that is known about the structure of the actual text files that are to be processed.
  • the resultant tables and state-machines encoded in hardware and/or in software can be used in the processing stage 204A. In some embodiments, this processing is performed by a combination of tightly coupled hardware and software elements.
  • the text files are subjected to a post processing 206, described below, in which information regarding the processed file(s) is accumulated and information is cached for possible future processing performance enhancements.
  • the pre-processing stage is a one-time operation that is associated with configuration of the system (for example, when the input content is expected to conform to information known at configuration time).
  • the pre-processing may be done when new data is encountered but the relevant information is not yet ready for its analysis (for example, new files encoded in a data representation language arrive with indications of additional schemas to which the data representation language files should conform).
  • the new content is not necessarily parsed immediately, but waits until the appropriate information is prepared. Note that this enables support for future data formats and for schema updates.
  • pre-processing 202A may be used for any incoming data that conforms to the appropriate schema or template. It is noted that the processing of 202A prepares the system for the arrival of data representation language files in such a way that they are processed with minimal effort and time. Nevertheless, some of the steps of preprocessing 202A may be performed well in advance of the arrival of the data representation language content to be processed, while others may be performed only after the arrival of some or all of one or more data representation language files to be processed. From an algorithmic perspective, the time of arrival of the data representation language file has no impact whatsoever. However, in some embodiments it is desired to reduce the pre-processing required after the arrival of the target data representation language file to a minimum in order to reduce file-handling latency.
  • the preprocessing stage prepares helpful information for optimizing the processing of the content.
  • this information is based on what is known about the data, such as its validation schema and its transformation template.
  • FIG. 7 provides a flow chart of an exemplary pre-processing in accordance with a schema associated with a dialect of the data representation language. It is understood that in some embodiments, certain steps are performed without others.
  • one or more state machines are prepared 310.
  • a validation schema compilation is effected 312 of schema-specific validation rules identified in accordance with the identified schema tokens and/or grammar.
  • a transformation schema compilation is effected 314 of schema-specific transformation rules identified in accordance with the identified schema tokens and/or grammar.
  • the compilation of schema to generate lookup tables 304 may be quite extensive.
  • the preparations include compilation of several complex directives, including regular expressions, grammars, automata, validation schemas, transformation schemas and additional rules (e.g. rules defined by IT or business policies).
  • a large portion of the preprocessing is expected to be common to most of the files handled at a given location in a network. Therefore, the results of the above mentioned operations may be stored temporarily or permanently in a local cache. The stored information may then be retrieved at a later time instead of performing the actual processing operations repetitively.
  • the cache may be managed using any caching strategy, such as most recently used, most frequently used or other well-known techniques.
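  • As a sketch of this caching idea, assuming a SHA-256 digest of the schema text as cache key and a most-recently-used retention policy; compile_schema is a hypothetical stand-in for the expensive compilation step.

```python
import hashlib
from collections import OrderedDict

class LutCache:
    """Cache of compiled look-up tables keyed by a digest of the schema text."""
    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.entries = OrderedDict()   # digest -> compiled tables

    def get_or_compile(self, schema_text: str, compile_schema) -> dict:
        key = hashlib.sha256(schema_text.encode("utf-8")).hexdigest()
        if key in self.entries:
            self.entries.move_to_end(key)        # mark as most recently used
            return self.entries[key]
        tables = compile_schema(schema_text)     # the expensive step, done once
        self.entries[key] = tables
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict least recently used
        return tables

cache = LutCache()
luts = cache.get_or_compile("<!ELEMENT employees (employee+)>",
                            lambda s: {"lexical": {}, "syntax": {}})
print(luts)
```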
  • one or more hardware engines are updated 308 with the results of earlier pre-processing steps before processing of a text file. In some embodiments, this is carried out by configuring re-programmable logic components. This is only required for results that were not previously loaded into the hardware tables.
  • the integrated circuit includes memory, parsing circuitry and an interface.
  • the parsing circuitry is configured to parse the building blocks that comprise the data. For example, for XML data, the element tag and attributes are the building blocks. Once these building blocks are parsed, further processing may be performed. Using a special binary format and a processing methodology, content and data processing may be significantly improved, while reducing the memory and CPU requirements, as well as the time required for the processing.
  • One feature provided by many embodiments of the present invention is the ability to implement some parts of it (mostly the processing stage 204A) in hardware components such as reprogrammable logic components while maintaining the flexibility required for supporting a broad range of dialects and schemas of the data representation language or markup language.
  • reprogrammable logic components include but are not limited to FPGA and ASIC components
  • the actual processing of files encoded in the data representation language is performed by a combination of tightly coupled hardware and software elements, and is carried out after some or all of the requisite pre-processing operations have been at least partially completed and some or all of the software and/or hardware engines have been updated accordingly.
  • FIG. 8 provides a flow chart of an exemplary processing of text files in accordance with previously generated lookup table 304. It is noted that not every step depicted in FIG. 8 is required in every embodiment, and that while some embodiments provide that the processing is carried out according to the order illustrated in FIG. 8, the specific order provided is not a limitation of the present invention.
  • the optional first step of processing prerequisites verification 502 is useful for identification of the prerequisites for processing the file and verification that these have been completed.
  • a character set normalization 504 is carried out, including a transformation of the original file character set into a single common character set.
  • a lexical analysis 506 including identification of tokens within the character stream is carried out.
  • upon identification of these tokens, it is possible to identify specific grammatical structures within the token stream in the context of a syntactical analysis 508.
  • This, in turn, is followed by a semantic analysis 510 which includes identification of certain semantic meanings of the parsed data.
  • the initial validation 512 includes execution of limited hardware based validation.
  • the initial transformation 514 includes execution of limited hardware based transformation.
  • the subsequent final validation 516 includes the completion of the validation process in software, while the final transformation 518 includes the completion of the transformation process in software.
  • processing prerequisites include but are not limited to:
  • the source of the file: this may be defined by the Layer 2 or Layer 3 addresses, system interface IDs or other addressing information available (e.g. HTTP headers).
  • the file type: this may be defined by the file name (e.g. a file name suffix) or by an identification tag at the beginning of the file (e.g. a magic number).
  • the file content: in some cases, the processing prerequisites will be identified within the file itself in a well-defined manner (e.g. XML files may mention the XML schemas to be used for their own validation).
  • the file may be processed twice. Initially, it will be processed as a generic file according to its type as identified by one of the first two methods. During this processing phase, only the required prerequisite information is extracted. Then the full pre-processing may be completed and only then is a second, full processing phase performed. Such a recursive procedure may take place more than once if, during the second processing phase, additional prerequisites are identified. This recursive procedure is described in the flow chart of FIG. 9. First, one or more text files are processed 204B generically using directives appropriate for the data representation language. Based on this processing, a set of previously unknown dialect-specific characteristics of previously processed text files is determined 202B.
  • appropriate hardware and/or look-up tables are updated 210.
  • the text files are once more reprocessed 204B using the updated hardware and/or look-up tables.
  • additional dialect specific characteristics are detected 202B, and thus the stages of updating hardware and/or look-up tables 210 and processing text files using updated hardware and/or look-up tables 204B are repeated.
  • HARDWARE CHARACTER NORMALIZATION ENGINE (HWCNE)
  • the HWCNE is a simple engine that transforms a byte stream into a character stream.
  • the characters are identified according to the appropriate expected character set definition.
  • a character may be defined by a single byte or by a small number of consecutive bytes.
  • each character is transformed into a single 16-bit word constituting its UTF-16 representation.
  • for UTF-16 encoded files, no transformation is required and the HWCNE operation becomes transparent.
  • the output of the HWCNE includes at least a stream of normalized characters (e.g. their UTF-16 representations).
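  • A software analogue of this normalization step may be sketched as follows, assuming the input character set is known in advance; the function name and big-endian word layout are illustrative choices, not the disclosed hardware behavior.

```python
def normalize_to_utf16_words(data: bytes, charset: str) -> list[int]:
    """Transform a byte stream into a stream of 16-bit UTF-16 code units."""
    # A character may span one or several input bytes; decoding resolves that.
    text = data.decode(charset)        # e.g. "ascii", "utf-8", "utf-16"
    units = text.encode("utf-16-be")   # each char -> one 16-bit word
                                       # (supplementary chars -> surrogate pair)
    return [int.from_bytes(units[i:i + 2], "big")
            for i in range(0, len(units), 2)]

print(normalize_to_utf16_words("abc".encode("utf-8"), "utf-8"))  # [97, 98, 99]
```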
  • the Hardware Token Identification Engine receives a stream of characters and transforms it into a stream of tokens and accompanying semantic information.
  • the HWTIE uses the Lexical lookup table 108 in order to identify the tokens. It is initialized at an initial tokenization state before the first character and after each token is identified. For each character and current state, the HWTIE calculates the memory address to be accessed in the Lexical lookup table 108 and reads the corresponding record. This record may contain information regarding the tokenization result (e.g.
  • the HWTIE calculates at least one of the following values during the tokenization process:
  • the token identifier (as defined by the TKLUT when the token is identified).
  • a prefix of the token e.g. the first N characters of the string representing the token.
  • the hash of the token value (e.g. an N-bit value representing the token value).
  • the hash function is computed on the normalized character representation (and not on the original byte stream representing the character).
  • the numeric value of the token (e.g. the integer represented by the token string). This may only be applicable for a subset of the tokens.
  • HWTIE may process one character every clock cycle, which in some embodiments is beyond 1 Gigabit per second of throughput (dependent on the character encoding) while requiring in some embodiments two external memory devices at the most (and possibly a combination of one off chip memory device and one on chip memory block).
  • Lexical lookup table 108: The number of states that can be supported by the Lexical lookup table 108, if implemented in standard SDRAM technology, may reach tens of thousands of states and beyond. This is expected to be more than sufficient for any typical structured file format and certainly more than required for most common programming languages, including complex languages such as C or Java.
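  • The sketch below mimics this kind of table-driven tokenization in software: a (state, character class) lookup drives the state machine, and each emitted token carries semantic information such as its value and an N-bit hash. The states, character classes and table entries are invented toy values; in the described system the Lexical lookup table 108 would be compiled from the dialect's schema, and Python's built-in hash merely stands in for the configurable hash function.

```python
from typing import Iterator, Tuple

START, NAME = "START", "NAME"        # tokenization states (toy values)
PUNCT = set('<>/="')

def classify(c: str) -> str:
    return c if c in PUNCT else ("ws" if c.isspace() else "name")

# Lexical lookup table: (state, character class) -> (next state, emit pending name?)
LEXTAB = {(START, "name"): (NAME, False), (NAME, "name"): (NAME, False),
          (START, "ws"): (START, False), (NAME, "ws"): (START, True)}
for p in PUNCT:
    LEXTAB[(START, p)] = (START, False)
    LEXTAB[(NAME, p)] = (START, True)

def tokenize(text: str) -> Iterator[Tuple[str, str, int]]:
    """Yield (token id, value, 16-bit hash) triples, mimicking a token stream
    with accompanying semantic information."""
    state, start = START, 0
    for i, c in enumerate(text + " "):           # trailing space flushes a final name
        state, emit = LEXTAB[(state, classify(c))]
        if emit:
            value = text[start:i]
            yield ("NAME", value, hash(value) & 0xFFFF)
        if c in PUNCT:
            yield (c, c, hash(c) & 0xFFFF)
        if state == START:
            start = i + 1

print(list(tokenize('<emp sex="male">')))
```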
  • the Hardware Parsing Engine receives a stream of tokens and produces a stream of parsing records.
  • a parsing record may be produced for every input token or for a stream of input tokens. In some cases, more than one parsing record may be produced per token (e.g. in the case that a single token has a complex meaning that may be expressed as multiple tokens or if a token requires a specific transformation action but would otherwise not be represented by a parsing record by itself).
  • the HWPE uses the Syntax Lookup Table 114 and an auxiliary Parsing Stack (PSTK) in order to identify the production rules of a given grammar used to produce the token stream.
  • PSTK auxiliary Parsing Stack
  • the HWPE is initialized at an initial parsing state before the first token of the file is received.
  • the PSTK is initialized to contain an end of stack identifier. For each token, current state and head of stack content, the HWPE calculates the address in the Syntax Lookup Table SYLUT 114 to be accessed and reads the corresponding record. This record may contain information regarding the parsing results (e.g. if a production rule has been identified or a syntax error has occurred), action to be taken (e.g.
  • tokens may be accumulated in a FIFO memory prior to being processed by the HWPE to allow for variable processing delays.
  • the HWPE may store several temporary values as part of the general context or within the PSTK. These values are required for various calculations performed by the HWPE.
  • the HWPE may calculate at least one of the following values to be included in the parsing records as output:
  • Production rule identifier, e.g. the type of statement being parsed.
  • Parsing record index (may be allocated before the record is sent to the output and hence may be lower than the index of records sent out earlier). May be based on one of several stored counter values (e.g. to enable separate indexing for different rule types).
  • Beginning of the record in the input stream, e.g. the beginning of a token within a production rule (not necessarily the first token).
  • Length of the record in the input stream, e.g. the sum of lengths of a specified subset of tokens [or grammar variables] constituting a production rule.
  • Context value(s), e.g. a value stored earlier in the parsing process, such as a namespace identifier for XML files.
  • Prefix, e.g. the prefix of a specified token in the production rule.
  • Value, e.g. the value of a specified token included in the production rule.
  • a number of hash values (the hash value of one or more of the tokens constituting the rule).
  • a number of combined hash values, calculated as the combined hash value of the record with one of a number of hash values stored in the record of the parent production rule (residing on the top of the PSTK). If the hash values of the parent production rule are invalid, the result of the combination will be invalid.
  • a number of counters incremented based on the counter values of previously processed production rules (i.e. rule children).
  • a validation rule to be applied to the parsing record.
  • the result of a comparison between two specified calculated values (e.g. two hash values).
  • HWPE may, in some embodiments, process up to one token every clock cycle.
  • a typical token of some embodiments may constitute between one and tens of characters or beyond.
  • this may support beyond 4 Gigabits per second of throughput (dependent on the character encoding) while requiring two external memory devices at the most (and possibly a combination of one off chip memory device and one on chip memory block).
  • the throughput of the HWPE may be much higher than that of the HWTIE. This allows traversing multiple production rules without reading additional tokens and without exhausting a possible FIFO memory preceding the HWPE.
  • Syntax Lookup Table 114: The number of states that can be supported by the Syntax Lookup Table 114, if implemented in standard SDRAM technology, may reach tens of thousands of states and beyond. This is expected to be more than sufficient for any typical structured file format and certainly more than required for most common programming languages, including complex languages such as C or Java. This ensures that many grammars may be stored together in the Syntax Lookup Table 114 and activated according to the processing requirements of any given input file.
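  • A toy software analogue of this table-driven parsing step, using a hypothetical syntax table and parsing stack for a trivial nested-bracket grammar; the real Syntax Lookup Table 114 would encode the production rules of the dialect's grammar and richer per-record actions.

```python
# (state, lookahead token, top of PSTK) -> (action, next state); toy entries.
SYLUT = {
    ("q", "(", None): ("push", "q"),
    ("q", "(", "("):  ("push", "q"),
    ("q", ")", "("):  ("reduce", "q"),    # production rule identified: S -> ( S )
    ("q", "$", None): ("accept", "q"),
}

def parse(tokens: list[str]) -> list[tuple[str, int]]:
    """Consume a token stream and produce parsing records (rule id, depth)."""
    pstk: list[str] = []                  # auxiliary Parsing Stack
    records, state = [], "q"
    for tok in tokens + ["$"]:            # "$" marks end of input
        top = pstk[-1] if pstk else None
        action, state = SYLUT.get((state, tok, top), ("error", state))
        if action == "push":
            pstk.append(tok)
        elif action == "reduce":          # one parsing record per identified rule
            pstk.pop()
            records.append(("S -> ( S )", len(pstk)))
        elif action == "accept":
            return records
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

print(parse(list("(()())")))
```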
  • each row represents a single parsing record for the "emp.xml" input example that was introduced earlier.
  • these tables are generated as a single stream of parsing records with separate running record indexes for XML elements and XML attributes, effectively partitioning the records into two tables.
  • these tables are presented separately as tables 1A-1B and 2A-2B. In general an arbitrary number of tables may be generated in this fashion. Note also that additional records may be generated internally but discarded later on. It is stressed that the fields and elements in the tables are provided merely as an illustrative example, and it will be appreciated that tables that include extra fields or rows or that lack certain fields or rows are still within the scope of the invention.
  • Each instance in the element table includes information for an element in the XML file.
  • This information includes a unique identifier for the element, length of the element, length of the value of the element, hash values (detailed below), index to the first attribute (i.e. its position in the attributes table below), index to the parent element, index to last child of the element, index to the previous sibling, and sibling index.
  • the HWPE outputs the records one at a time, adding indices to mark the type of record (for example, whether it is an element or an attribute) and the parsing record index (PRI, the record index in the table). Note that when using random access memory (RAM), the records may be placed directly in order using the PRI, even when they are output in a different order.
  • RAM random access memory
  • the hash values are for the tag of the element, the path from the root element to the element, the path from the parent element to the element, the path from the grandparent to the element and the path from the great-grandparent to the element (obviously, this can be further optimized to save more [or fewer] paths so that the compromise between space and performance is best for the applications at hand).
  • the following pointers into the XML file, i.e. the offset from the start of the file to the first character of the relevant object:
  • the start of the element tag and the start of the value of the element.
  • Table 1 includes the information gathered for the production rule related to XML elements.
  • PRI is the parsing record index attached to the element
  • type is the type of production rule that generated the record (this table includes all records that were generated by the "element” production rule)
  • V type is the type of the value of the element
  • V(8) is the 8 byte prefix of the value of the element
  • V(N) is the numeric value of the element (where a value may be computed)
  • 1st is the index of the first attribute of the element in the attributes table
  • # is the number of attributes the element has
  • P is the index of the processing record of the parent element (which is also within this table)
  • C# is the number of the children elements of the element
  • LC is the last child element of the element
  • PS is the previous sibling element of the element
  • S# is the sibling index for the element within the parent element
  • h(V) is the hash of the value of the element
  • h(T) is the hash of the tag of the element
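  • Restating the field glossary above as a record structure (purely illustrative; field widths and exact semantics follow the description of Table 1):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ElementRecord:
    pri: int                      # PRI: parsing record index
    rule_type: str                # type: production rule that generated the record
    v_type: str                   # V type: type of the element's value
    v8: str                       # V(8): 8 byte prefix of the value
    v_num: Optional[float]        # V(N): numeric value, where one may be computed
    first_attr: Optional[int]     # 1st: index of the first attribute in the attribute table
    attr_count: int               # #: number of attributes the element has
    parent: Optional[int]         # P: index of the parent element's record
    child_count: int              # C#: number of child elements
    last_child: Optional[int]     # LC: last child element of the element
    prev_sibling: Optional[int]   # PS: previous sibling element
    sibling_index: int            # S#: sibling index within the parent element
    h_value: int                  # h(V): hash of the element's value
    h_tag: int                    # h(T): hash of the element's tag
```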
  • Each instance in the attribute table includes information for an attribute in the XML file. This information includes a unique identifier for the attribute, an index to its element, index to the next attribute (i.e. its position in the attributes table below), length of the attribute, length of the value of the attribute, and hash values (detailed below).
  • the hash values are for the attribute, the path from the root element to the element and to the attribute, the path from the parent element to the element and to the attribute, the path from the grandparent to the element and to the attribute and the path from the great-grandparent to the element and to the attribute (obviously, this can be further optimized to save more [or fewer] paths so that the compromise between space and performance is best for the applications at hand). Additional hash values may be calculated in order to accelerate the processing of specific dialects of data representation languages.
  • at least one of the following pointers into the XML file, i.e. the offset from the start of the file to the first character of the relevant object:
  • the start of the attribute and the start of the value of the attribute.
  • VR is the validation rule to be applied for the attribute
  • FSME is the list of fields that are relevant for semantic analysis.
  • HWSME Hardware Semantic Analysis Engine
  • the HWSME may be a stateless engine (in contrast to the HWTIE and HWPE) and may perform the semantic analysis of the parsing records one record at a time regardless of the history of records previously processed by it. Using a subset of the record fields (such as the type and the hash value of the record, as indicated by the HWPE or HWTIE) the HWSME reads an entry from the Semantic Lookup Table 120. This entry specifies which validation and transformation rules may need to be applied to the parsing record. These rules will be applied later by the hardware and software validation and transformation engines (see below).
  • hash values: when hash values are used, two approaches may be taken.
  • the first approach assumes that the probability of a hash collision is negligible (for example, 10^-48) and thus the hash value identifies the compared object uniquely.
  • a second approach is to verify that the value of the compared object is identical to the one expected when the hash value is identical to the expected hash value.
  • the latter may be slower in implementation, as it requires value comparison (for example, of strings) in each case (though the hash value used may be shorter in this case), but it is accurate in all cases.
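  • Both approaches, sketched in software; SHA-1 serves here only as an example of a wide (160-bit) hash computed over the normalized characters, and the expected hash and value are assumed to come from tables built at pre-processing time.

```python
import hashlib

def h(value: str) -> bytes:
    # Hash of the normalized (UTF-16) character representation, not raw input bytes.
    return hashlib.sha1(value.encode("utf-16-le")).digest()

def matches_fast(value: str, expected_hash: bytes) -> bool:
    """Approach 1: trust a wide hash; a collision is assumed negligibly likely."""
    return h(value) == expected_hash

def matches_safe(value: str, expected_hash: bytes, expected_value: str) -> bool:
    """Approach 2: use the (possibly shorter) hash only as a filter, then verify
    by comparing actual values; slower, but accurate in all cases."""
    return h(value) == expected_hash and value == expected_value

print(matches_fast("male", h("male")), matches_safe("male", h("male"), "male"))
```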
  • the output of the HWSME includes the original parsing record received from the HWPE with possible additional information including but not limited to:
  • a set of transformation rule identifiers to be applied to the record.
  • in addition, the HWSME may store statistics related to certain semantic rules, such as the number of instances of a certain record type with a given value.
  • for each parsing record, one or two memory access cycles may need to be completed.
  • multiple record streams may be analyzed in parallel.
  • a single record stream may be broken into multiple streams (since the HWSME may be stateless, this decomposition is relatively simple).
  • the HWSME may analyze up to one parsing record every clock cycle.
  • the HWSME throughput is limited by the HWPE throughput and does not require any input buffering. Only one or two external memory devices are required by the HWSME (and possibly a combination of one off chip memory device and one on chip memory block).
  • Semantic Lookup Table 120: The number of semantic rules that can be supported by the Semantic Lookup Table 120, if implemented in standard SDRAM technology, may reach many tens of thousands and beyond. This is expected to be more than sufficient for any typical structured file format.
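  • A minimal sketch of this stateless augmentation step: each record is looked up by a subset of its fields and annotated with the rule identifiers to apply downstream. The table keys and rule numbers are hypothetical.

```python
# Semantic Lookup Table: (record type, tag hash) -> rule identifiers (toy entry).
SEMLUT = {("element", 0x1A2B): {"validate": [3], "transform": [7]}}

def augment(record: dict) -> dict:
    entry = SEMLUT.get((record["type"], record["h_tag"]), {})
    # The original parsing record passes through unchanged, annotated with the
    # validation/transformation rules the later engines should apply.
    return {**record,
            "validation_rules": entry.get("validate", []),
            "transformation_rules": entry.get("transform", [])}

print(augment({"type": "element", "h_tag": 0x1A2B, "value": "Bob"}))
```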
  • a stream of semantically augmented parsing records is formed on which validation and transformation rules may be applied.
  • the HWVE performs the initial step of validation resulting in a validated stream of parsing records. Typically, for every parsing record received one validated parsing record is produced.
  • the HWVE uses the Validation Lookup Table 126 to deduce which fields of the parsing records require validation and which validation actions need to be performed on them.
  • the HWVE may be a stateless engine (in contrast to the HWTIE and HWPE) and may perform the validation actions of the parsing records one record at a time regardless of the history of records previously processed by it. Using the validation rule identifiers supplied by the previous processing steps, the HWVE reads the validation parameters (such as range boundaries and validation functions) from the Validation Lookup Table 126. For more information on the supported validation actions see section "Validation schema compilation" above.
  • the output of the HWVE includes the original parsing record received from the HWSME with optional additional information including at least one of:
  • a validation error indicator (e.g. range exception, type mismatch).
  • multiple record streams may be validated in parallel.
  • a single record stream may be broken into multiple streams (since the HWVE is stateless, this decomposition is relatively simple).
  • the HWVE may validate up to one parsing record every clock cycle.
  • the HWVE throughput is limited by the HWSME throughput and does not require any input buffering. Only one or two external memory devices are required by the HWVE (and possibly a combination of one off chip memory device and one on chip memory block).
  • the number of validation records that can be supported by the Validation Lookup Table 126, if implemented in standard SDRAM technology may reach many tens of thousands and beyond. This is expected to be more than sufficient for any typical structured file format.
  • the records described earlier in Tables 1-2 are scanned for the relevant fields (such as name and position) so that the validation information is available; the values are then tested and verified against the rules defined by the pre-processing stage.
  • the "sex" attribute is verified to be either "male” or "female”.
  • the HWVE sets appropriate flags, which will later on trigger alerts and notifications for the control application. Note that the example file “emp.xml” conforms to the sample validation file “emp.dtd”, so its validation is successful.
  • Table 3 shows the flags that were generated during the processing stage as described above. Note that the flags may also refer to information searched for within the data description language file. For example, it is possible to note whether the term "confidential" is found within the data description language file (or even in a specific location, for example in the root element of an XML file).
  • Table 3 includes the results of the processing and the list of errors that were encountered. Table 3 holds variables that may be used by the application to tell whether the XML is well formed, passes validation, etc. Note that this is only a sample of the information that may be saved for the processed XML.
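  • A sketch of the enumeration check and flag generation just described; the rule table and flag names are illustrative stand-ins for compiled Validation Lookup Table entries.

```python
# Per-name validation rules (toy table): "sex" must be "male" or "female".
VALIDATION_RULES = {"sex": lambda v: v in ("male", "female")}

def validate_records(records: list[dict]) -> dict:
    flags = {"valid": True, "errors": []}   # later read by the control application
    for rec in records:
        rule = VALIDATION_RULES.get(rec["name"])
        if rule is not None and not rule(rec["value"]):
            flags["valid"] = False
            flags["errors"].append((rec["name"], rec["value"]))
    return flags

# emp.xml conforms to emp.dtd, so its validation succeeds.
print(validate_records([{"name": "sex", "value": "male"}]))
```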
  • i) INITIAL TRANSFORMATION 506: Following the initial validation, a stream of validated parsing records is produced on which initial transformation rules may be applied.
  • the HWTRE performs the initial step of transformation resulting in a partially transformed stream of parsing records. Typically, for every parsing record received one transformed or untouched parsing record is produced.
  • the HWTRE uses the Transformation Lookup Table 128 to deduce which fields of the parsing records require transformation and which transformation actions need to be performed on them.
  • the HWTRE may be a stateless engine (similar to the HWVE) and may perform the transformation actions of the parsing records one record at a time regardless of the history of records previously processed by it. Using the transformation rule identifiers supplied by the previous processing steps, the HWTRE reads the transformation parameters (such as transformation functions) from the transformation lookup table TRLUT 128. For more information on the supported transformation actions see section "TRANSFORMATION SCHEMA COMPILATION TO GENERATE ONE OR MORE TRANSFORMATION LOOKUP TABLE(S) 128" above.
  • the output of the HWTRE includes the original parsing record received from the HWVE with possible additional information including at least one of: 1. An indication whether the parsing record requires transformation. 2. The required transformation type.
  • for each parsing record, one or two memory access cycles may need to be completed.
  • multiple record streams may be transformed in parallel.
  • a single record stream may be broken into multiple streams (since the HWTRE is stateless, this decomposition is relatively simple).
  • the HWTRE in some embodiments may transform up to one parsing record every clock cycle.
  • the HWTRE throughput is limited by the HWVE throughput and does not require any input buffering. Only one or two external memory devices are required by the HWTRE (and possibly a combination of one off chip memory device and one on chip memory block).
  • Transformation Lookup Table 128: The number of transformation records that can be supported by the Transformation Lookup Table 128, if implemented in standard SDRAM technology, may reach many tens of thousands and beyond. This is expected to be more than sufficient for any typical structured file format.
  • the transformed and validated parsing records produced by the HWTRE are sent back to the software processing modules for completion of the process.
  • a transformation instructions table is created.
  • This table includes a list of operations that are required in order to generate the required HTML output.
  • the processing that remains is generating the actual output by following these instructions, which are essentially offsets and lengths used to copy strings from the input file, as well as strings prepared in the pre-processing stage.
  • a typical implementation may have the relevant strings (that is, the ones that may be required for the output) from the pre-processing stage stored in a smaller file and then reference this file.
  • the records described earlier in Tables 1-2 are scanned in order to look for the relevant fields (like name and position) in order to have the transformation information available. For example, the "name" element is recognized and then appropriate entries are prepared for building the appropriate output.
  • the result of the processing for the "emp.xml” file is in Table 4 (for simplicity, the offset value in this example refers to strings in the original "emp.xsl” file).
  • Each line in the table has an instruction which builds an additional part of the required output file.
  • the references to the "emp.xsl” file are actually constant strings known at the preprocessing stage, while the references to the "emp.xml” file are actually pointers to strings within the input file.
  • "TRI" is the transformation record index
  • "OP" is the operation required
  • the first software module to receive the transformed and validated parsing records is the SWVM.
  • the SWVM is responsible for completing all the required validation directives that the hardware engines did not perform (either because they were not capable of performing them or because they were not instructed to perform them).
  • the SWVM uses indications received from the hardware engines pointing at the records that require further validation.
  • the actual validation is performed by executing the required validation directives as defined in the pre-processing phase. This is done using standard software validation techniques.
  • One important result of the final validation is a decision whether the input data representation language file is valid or not. In case the file fails validation, the output records will in this example also indicate which validation rules were violated.
  • the SWTM is responsible for completing all the required transformation directives that the hardware engines did not perform (either because they were not capable of performing them or because they were not instructed to perform them).
  • the SWTM uses indications received from the hardware engines pointing at the records that require further transformation.
  • the actual transformation is performed by executing the required transformation directives as defined in the pre-processing phase. This is done using standard software transformation techniques.
  • exceptions may be the result of malformed data representation language files or implementation restrictions.
  • the file may be partially processed (e.g. if a simple lexical error is found) or it may be totally incomprehensible.
  • processing is delegated to standard software tools for functional completeness at the price of reduced performance.
  • tables may be created "on the fly". Note also that creating the tables requires only a simple stack whose depth depends on the depth of the XML structure and not on the amount of information in the XML file.
  • This implementation is specifically optimized for hardware acceleration, as the sequential access to the file and low memory requirements allow for one-pass processing that generates the required tables and allows further information to be easily deduced from the tables.
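  • A sketch of this one-pass, stack-based construction; the input is assumed to be an already-tokenized stream of open/close events so the example stays small, and only a few of the table fields are filled in.

```python
def build_element_table(events: list[tuple[str, str]]) -> list[dict]:
    table: list[dict] = []
    stack: list[int] = []          # indices of currently open elements: depth-bounded
    for kind, tag in events:
        if kind == "open":
            parent = stack[-1] if stack else None
            table.append({"tag": tag, "parent": parent, "children": 0})
            if parent is not None:
                table[parent]["children"] += 1
            stack.append(len(table) - 1)
        else:                      # "close": the element on top of the stack is done
            stack.pop()
    return table

events = [("open", "emp"), ("open", "name"), ("close", "name"), ("close", "emp")]
print(build_element_table(events))
```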
  • Another optimization could be elimination of tables that are not required for a specific application. For example, in cases where the application is interested only in elements but not in the attributes, the list of attributes may be removed.
  • tables may be unified into a single table where this simplifies the implementation, where it would result in better performance for application reasons, or for any other reason.
  • a feature provided by certain embodiments of the present invention is the ability to use hashes and compare hashes instead of complete strings.
  • Abbreviated representations such as hash values are calculated for one or more data items of the data representation language (e.g. XML elements, attributes, values) and the abbreviated representation (e.g. hash value) rather than the full representation is used in comparisons. Therefore it can efficiently be decided whether the data conforms to a schema or whether a transformation should occur for some part of the data.
  • Hash values are simple and fast to process (by hardware as well as by software) and thus it is possible to perform very fast and efficient processing of data representation language content.
  • the hash values may be used for search expressions (such as those used by XPath and XQuery for XML files) as well as for transformation expressions (such as those used by XSL for XML files) and actions.
  • hash values are not unique, and thus it is possible for two different expressions to have the same hash value (this is called a "hash collision", and such expressions are called "hash synonyms").
  • the algorithm described may take different approaches to address this issue: The first would be to use a hash function that generates large values (for example, 160-bit values) and assume that the probability of a collision is negligible (for example, 10^-48). This approach results in very efficient processing, with the risk that a collision goes undetected. The other approach would be to verify the expression (for example, by comparing the strings) when the hash value for it signals that it is a synonym of the expected expression. Obviously, the latter approach requires more processing and thus would yield lower throughput for the data processing.
  • One usage of abbreviated representations or hash values specific to markup languages, or to data representation languages providing tree representations, is described in FIG. 10.
  • a tree representation of part or all of the text file is provided or derived 402.
  • paths are calculated between a selected reference node 404 and one or more selected target nodes 406.
  • a representation of the path (e.g. a hash value) is stored.
  • this process of deriving a path 410 and storing the representation 412 is also carried out for additional nodes with a fixed relation to a selected target node, such as an ancestor of the target node (e.g. a parent or grandparent of the target node).
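  • A sketch of this procedure with the root as the reference node, storing for every node a hash of the path from the root as well as from its parent and grandparent; the tree encoding and choice of hash are illustrative.

```python
import hashlib

def path_hash(names: tuple[str, ...]) -> bytes:
    return hashlib.sha1("/".join(names).encode()).digest()

def index_tree(node, ancestors=(), table=None):
    """node is (name, children). Returns {path: hashes from root/parent/grandparent}."""
    table = {} if table is None else table
    name, children = node
    path = (*ancestors, name)
    table["/".join(path)] = {
        "from_root": path_hash(path),
        "from_parent": path_hash(path[-2:]),        # near the root these degenerate
        "from_grandparent": path_hash(path[-3:]),   # to shorter paths
    }
    for child in children:
        index_tree(child, path, table)
    return table

tree = ("emp", [("name", []), ("address", [("city", [])])])
print(sorted(index_tree(tree)))
```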
  • each of the verbs "comprise", "include" and "have", and conjugates thereof, is used to indicate that the object or objects of the verb are not necessarily a complete listing of members, components, elements or parts of the subject or subjects of the verb.

Abstract

Methods, apparatus and computer readable code for processing of text files in accordance with data representation languages and specific dialects of data representation languages (102) are disclosed. In some embodiments, one or more lookup tables are derived from processing or compiling a schema of a data representation language dialect. Exemplary lookup tables include but are not limited to lexical lookup tables (108), syntactic lookup tables (114), semantic lookup tables (120), transformation lookup tables (128) and validation lookup tables (126). Optionally, one or more lookup tables are implemented at least in part in hardware using, for example, reprogrammable logic components. In some embodiments, hash values or other abbreviated representations of specific data items are used in order to accelerate content processing. In a particular embodiment, hash values representing paths between specific nodes of a tree representing a text file are stored.

Description

METHOD AND SYSTEM FOR PROCESSING OF TEXT CONTENT
FIELD OF THE INVENTION
The present invention relates to systems and methods for processing of text files encoded in dialects of data representation languages.
BACKGROUND OF THE INVENTION
Data representation languages such as SGML, HTTP headers, XML, and EDI are well known in the art, and are useful both for representing data as well as for exchange of data in a generic format between applications .
The advantage of XML is mainly in its extensibility, in contrast to HTML, which is a closed standard. XML allows the definition of new elements and data structures and the exchange of such definitions between devices. In addition it remains readable to humans and appropriate for representation. Thus, any data element in XML can be defined by a developer of a document and understood by any device that receives it. By providing a common standard for identifying data, XML enables better, open, and standardized business-to-business transactions. XML is expected to become the dominant format for electronic data interchange. When moving from static browser based usage to machine-to-machine data exchange XML has moved to a different domain and has become an architecture building block for enterprise IT application infrastructures. Built on top of XML, Web Services are emerging to facilitate standard XML based message exchange in which different services are described.
The rapid adoption of XML and the importance of the information it encodes require infrastructure software and hardware that is both flexible and robust. Unfortunately, XML processing, including XML queries, Web Services and XML dialect transformation, comprises load intensive tasks. This load intensive nature is at least partly due to the fact that XML is readable by humans, such that information is carried in a very heavy format. Failures or delays may cause major problems to any distributed heterogeneous real time system that must work in a highly reliable data distribution and integration framework that can deliver information to end clients with low latency in the presence of both node and link failures. Once XML has moved to the front line of application infrastructure it should adhere to the Quality of Service such applications demand, namely, fast response and low processing overhead. It therefore needs to be processed at wire speed and should not increase traffic latency, especially in mission critical, real time applications or when a customer may be waiting for a transaction to be executed. Low latency can be crucial for certain data that is extremely time sensitive. For example, real-time trading systems rely upon the timely arrival of current security prices, air-traffic control systems require up-to-the-second data on aircraft position and status, and gaps or delays in live network video and audio feeds can be distracting. In such environments, even a sub-second pause in a data feed while a delivery network retransmits or reconfigures may be unacceptable
(e.g. in banking, trading, air traffic control systems). In other applications such an unwanted pause may cause errors or delays, performance levels may drop, or systems may not react adequately.
Since the amount and size of XML traffic is growing there is a growing need for XML processing from application servers, web servers and database servers.
One of the factors hindering the rapid adoption of XML is that XML processing is memory and CPU intensive. Since XML is a markup language it is by nature more resource intensive when processed on a general processor server using standard methods such as DOM and SAX as defined by the W3C or other proprietary software programs. Servers can easily be overloaded and memory consumption can reach its maximum capacity when processing XML data. These methods require large memory usage since the XML is ultimately text that is being processed and not byte code as in a software compiler. CPU usage therefore becomes a limiting factor as more processes are added in parallel.
The demand for XML today is moving toward more functional code rather than only a typical markup objective, and XML is used as a language for transaction and functional software program operations. Software compilers and parsers in conventional servers are provided to read XML data and provide access to its structure and content to extract the meaningful data the transaction requires.
All these parsers or processors require several times the amount of processing resources used to process many other types of data. Processing resources can be scarce particularly in systems such as handheld devices or mobile phones.
For all the above reasons, a method is required that can accelerate the processing of XML in dedicated hardware enabling faster processing without creating additional latency in the network. One reason that XML processing is so resource intensive is the generic nature of the
XML language. Although it is precisely this property of XML that makes this data representation language so powerful, one unfortunate consequence is that XML parsers are not adapted for or specifically customized in accordance with the characteristics of the file or group of files to be parsed. One potential solution is to employ hand-coded dialect-specific tools rather than generic
XML tools such as DOM or SAX parsers. These dialect specific tools are created in accordance with a specific corpus of one or more files with common features associated with the specific dialect. Unfortunately, while hand-optimization techniques can lead to more efficient tools for accelerated processing of content, the cost of creating these tools can be prohibitive.
There is an ongoing need for improved tools for processing content in accordance with specific data representation languages such as XML, EDI and HTTP headers. Furthermore, there is an ongoing need for improved tools for processing content in accordance with specific dialects of data representation languages.
SUMMARY OF THE INVENTION
The aforementioned needs are satisfied by several aspects of the present invention.
It is now disclosed for the first time a method of text file processing including providing a schema associated with a dialect of a data representation language and processing the schema to generate at least one look-up table encoding a plurality of directives for text processing in accordance with the schema.
According to some embodiments, the provided schema is a schema which defines the dialect of the data representation language.
According to some embodiments, the schema is provided electronically, and the stage of generating includes only electronic processing of the electronically provided schema.
According to some embodiments, the generating of one or more look-up table includes effecting a compiling of the schema.
According to some embodiments, the compiling includes an electronic identification of at least one production rule of a grammar of the schema. According to some embodiments, the compiling includes an electronic identification of at least one semantic directive associated with a said production rule of said grammar.
According to some embodiments, the compiling includes a compiling of identified semantic directives into one or more lookup tables.
Exemplary semantic directives include but are not limited to classification semantic directives, such as a directive to semantically analyze text and classify analyzed text according to its semantic structure, and an action semantic directive whereby an action with semantic meaning is performed to processed text.
Exemplary semantic directives include but are not limited to validation directives and transformation directives. According to some embodiments, the compiling includes a compiling of semantic meaning associated with the schema.
According to some embodiments, the compiling includes a compiling of semantic classification directives associated with the schema. According to some embodiments, the compiling includes a compiling of semantic analysis directives associated with the schema.
According to some embodiments, the compiling includes a compiling of validation directives of the schema into one or more of the lookup tables.
According to some embodiments, the compiling includes a compiling of transformation directives of the schema into one or more of the lookup tables.
According to some embodiments, files encoded in the data representation language are representable as a tree structure.
According to some embodiments, the data representation language is a tag delimited language. According to some embodiments, the data representation language is selected from the group consisting of XML and EDI.
According to some embodiments, the dialect is an XML dialect selected from the group consisting of FIX, OFX, SwiftML, SOAP, WSDL, HL7, EDI and AccordXML.
According to some embodiments, the schema is provided in a format selected from the group consisting of a DTD format and an XSD format.
According to some embodiments, the schema is a transformation schema. Exemplary formats in which a transformation schema may be provided include but are not limited to XSL.
According to some embodiments, the generating of at least one lookup table includes at least partially implementing at least one lookup table in hardware.
According to some embodiments, the hardware includes at least one re-programmable logic component. Appropriate logic components include but are not limited to FPGA components, ASIC components, and gated components.
According to some embodiments, a plurality of lookup tables are generated and the processing includes determining a first subset of the plurality of lookup tables to be implemented in software and a second subset of the plurality of lookup tables to be implemented in hardware.
According to some embodiments, the first subset of the plurality of lookup tables includes at least one lookup table that encodes at least one directive to perform a relatively simple operation. Appropriate relatively simple operations include but are not limited to type checking operations, range checking operations, and fixed increment operations.
According to some embodiments, a first subset of the plurality of lookup tables includes at least one lookup table that encodes at least one directive to perform a relatively complex operation.
Appropriate relatively complex operations include but are not limited to sorting operations, transactional operations, transactional operations requiring communication with a remote entity, and cryptographic operations.
According to some embodiments, at least one lookup table is a semantic lookup table encoding at least one semantic directive including but not limited to a directive to effect a semantic analysis, a directive to effect a semantic classification, and a directive to carry out a semantic action such as a transformation and a validation.
According to some embodiments, a directive to effect a semantic analysis includes an operation to be performed upon identification of at least one token, at least one production rule, or a combination thereof.
Exemplary operations to be performed upon identification of a token include but are not limited to a storing of a position of the token in an input stream, a storing of a length of the token, a storing of a prefix of the token, a calculation and storing of a numerical value corresponding to the token, a storing of a token type of the token, a discarding of the token, a counting of the token, and a storing of a pointer to a semantic rule.
Exemplary operations to be performed upon identification of a production rule include but are not limited to a storing of a position of a token associated with the production rule, a storing of a number of tokens associated with the production rule, a storing of a character prefix associated with the production rule, a calculation and a storing of a numerical value associated with a token of the production rule, a calculation and storing of an abbreviated representation of the production rule, a calculation and storing of an abbreviated representation of a hierarchy of rules including the production rule, a storing of a rule type of the production rule, a discarding of the production rule, a counting of the production rule, a storing of an index of the production rule in a rule type table, a storing of an index of at least one sub-element of the production rule, a storing of a value of a counter associated with at least one sub-element of the production rule, an inheriting of an index of the production rule to at least one sub-element of the production rule, a storing of at least one index inherited from a parent rule of the production rule, a comparison of a specific calculated value with a stored value associated with the production rule, a verifying that a plurality of tokens in the production rule have the same semantic meaning, a storing of a pointer to a specific semantic rule, and a storing of an indication of fields to be used to deduce the required production rule identifiers.
Exemplary semantic rules include but are not limited to validation rules and transformation rules. Exemplary operations to be performed upon identification of a combination of a token and a production rule include but are not limited to a storing of a pointer to a specific semantic rule and a counting of the production rule.
Exemplary semantic rules include but are not limited to validation rules and transformation rules. According to some embodiments, at least one semantic lookup table is a stateless lookup table.
According to some embodiments, at least one semantic directive is selected from the group consisting of a validation directive and a transformation directive.
Exemplary validation directives include but are not limited to value comparison directives, simple value comparison directives, multiple value comparison directives, range checking directives, type checking directives, integer range checking directives, fraction range checking directives, value length directives, and value length checking directives.
According to some embodiments, at least one validation directive is selected from the group consisting of a path context validation and a context specific range checking. According to some embodiments, at least one validation directive is a syntax directive selected from the group consisting of tracking how many elements of a certain type are permitted, effecting of a complex choice of possible parameter or attribute combinations, and enforcement of a specific element sequence constraint.
Appropriate transformation directives include but are not limited to directives to mark a structure for transformation of a given type such as a type specified by a transformation identification code, directives to remove a token, directives to remove a complex structure, directives to truncate a token, directives to truncate a complex structure, directives to replace a first token with a second token, numerical conversion directives and directives to update a given field in an output table.
According to some embodiments, the updating of the given field allows altering of one or more pointers by some predefined value. According to some embodiments, the altering is for a namespace change in XML.
According to some embodiments, at least one transformation directive is conditional upon at least one validation result.
Appropriate range checking directives include but are not limited to directives for context specific range checking.
According to some embodiments, an array of at least one transformation directive encodes a transformation between the dialect and a different dialect of the data representation language.
According to some embodiments, an array of at least one transformation directive encodes a transformation between the data representation language and a different data representation language.
According to some embodiments, an array of at least one transformation directive encodes a transformation between the data representation language and a language other than a data representation language. According to some embodiments, at least one semantic directive is a path validation directive.
It is now disclosed for the first time a method of text processing including receiving a plurality of text files encoded in a data representation language, for at least one text file determining a dialect of the data representation language, generating from a definition of the dialect at least one look-up table encoding directives for text processing in accordance with the determined dialect, and using at least one look-up table, processing the respective text file in accordance with the determined dialect.
According to some embodiments, at least a part of one said look-up table is implemented at least in part in hardware. According to some embodiments, a plurality of said text files are subjected to a processing including determining a set of common or heavy operations, determining a set of uncommon or light operations; for a subset of the plurality of text files, performing the set of common or heavy operations and performing on the subset of the text files the set of uncommon or light operations. According to some embodiments, the determining is effected at least in part using a first hardware module and the processing is effected at least in part using a second hardware module.
According to some embodiments, the first and second hardware modules are configured to effect a pipeline processing. According to some embodiments, data associated with at least one look-up table is cached, and the stage of processing includes retrieving at least some of cached data.
According to some embodiments, the caching includes caching at a plurality of locations within a network. According to some embodiments, the definition of the dialect includes a schema, and the generating of at least one look-up table includes electronic processing of the schema.
According to some embodiments, processing of said schema includes effecting a compiling of the schema.
According to some embodiments, before commencing processing of at least one text file a hardware update of a said lookup table is performed.
According to some embodiments, the determining of the dialect includes identifying a string selected from the group consisting of a file name of a text file and a file type of a text file.
According to some embodiments, the determining of the dialect includes identifying a file source of one or more of the text files. According to some embodiments, the determining of the dialect includes parsing at least some of one or more of the text files.
According to some embodiments, the determining of the dialect includes effecting a first pass over a respective text file, and the processing using at least one look-up table includes effecting a second pass over the respective text file. According to some embodiments, the determining, generating and processing are performed iteratively more than once on a said text file.
According to some embodiments, more than one text file is subjected to processing in parallel.
According to some embodiments, at least one look-up table encodes a production rule of a grammar.
According to some embodiments, the processing of the text file includes effecting, in accordance with at least one look-up table, at least one grammatical analysis of the text file. Exemplary grammatical analyses include but are not limited to syntactic analyses and semantic analyses. According to some embodiments, a syntactic analysis includes recording at least one value selected from the group consisting of a production rule identifier, a parsing record index, a beginning of a record in an input stream, a length of a record in an input stream, a beginning of a specified sub-element of a production rule, a length of a specified sub-element of a production rule, a context value, a value stored earlier in a parsing process, a record prefix, a prefix of a specified token in a production rule, a value associated with a specific token in a production rule, a hash value associated with at least one token associated with a production rule, a number of combined hash values, a number of counters incremented based on counter values of previously processed production rules, a number of record indexes of previously processed production rules and a syntax error indication.
According to some embodiments, the grammatical analysis includes determining a validity of a production rule.
According to some embodiments, the determining of the validity of the production rule includes determining a validity of at least one parent production rule, wherein invalidity of the parent production rule is indicative of invalidity of the production rule.
According to some embodiments, the semantic analysis includes recording at least one value selected from the group consisting of a validation rule to be applied to a parsing record, a result of a comparison between two calculated values, and a transformation rule to be applied to a parsing record.
It is now disclosed for the first time a method of generating data useful for fast text processing. The presently disclosed method includes providing a text file representable as a tree having a plurality of node elements, selecting a reference element for at least one said node element, determining a path within the tree between a respective node element and the reference element, and storing an abbreviated representation of the path. According to some embodiments, the abbreviated representation is a hash value.
Exemplary reference nodes include but are not limited to a root node of the tree and a node that is a fixed distance from a root node of the tree.
According to some embodiments, the determining of the path and the storing of the abbreviated representation is carried out for at least one node element which is a descendant of the reference node.
According to some embodiments, the determining of the path and the storing of the abbreviated representation is carried out only for the node elements which are descendants of said reference node.
According to some embodiments, determining and storing is effected for all nodes having one or more predefined depths within the tree.
According to some embodiments, the abbreviated representation of the path is mapped to a representation of data associated with the node. According to some embodiments, for at least some respective node elements, the method includes mapping an abbreviated representation of a path between an ancestor of the node element to a representation of data associated with the node.
It is now disclosed for the first time a system for accelerating text file processing including a schema input for receiving a schema defining a dialect structure of a data representation language and a look-up table generator for processing the schema to generate at least one look-up table encoding directives for text processing in accordance with the schema.
According to some embodiments, the look-up data generator includes a schema compiler for effecting a compiling of said schema. According to some embodiments, the look-up data generator is operative to implement at least one said look-up table in hardware such as re-programmable logic. According to some embodiments, the implementing includes configuring re-programmable logic.
It is now disclosed for the first time a system for text processing, the system including an input for receiving at least one text file encoded in a data representation language and a text processing engine including at least one look-up table encoding a plurality of text-processing directives in accordance with a schema of the data representation language, the text processing engine for processing at least one received text file.
According to some embodiments, the system further comprises a dialect determiner for determining a said dialect of a said text file. According to some embodiments, at least one look-up table is implemented at least in part in hardware.
According to some embodiments, at least one look-up table encodes at least one semantic directive such as a directive to effect a semantic analysis, a directive to effect a semantic classification, and a directive to effect a semantic operation. Exemplary semantic operations include but are not limited to validation operations and transformation operations.
According to some embodiments, the text processing engine is distributed within a computer network.
According to some embodiments, the presently disclosed system further comprises an exception handling module for handling exceptions generated by the processing of at least one text file.
According to some embodiments, the text processing engine is operative to reconfigure at least one said look-up table while effecting processing at a received text file.
According to some embodiments, the text processing engine includes a hardware character normalization engine, and the processing of at least one received text file includes generating a character stream from the byte stream using the hardware character normalization engine.
According to some embodiments, the text processing engine includes a hardware token identification engine, and the processing of the received text file includes identifying tokens within a character stream representative of said text file using said hardware token identification engine.
According to some embodiments, the text processing engine includes a hardware parsing engine, and the processing of one or more received text files includes receiving a stream of tokens representative of the text file and producing a stream of parsing records using the hardware parsing engine.
According to some embodiments, the hardware parsing engine uses a look-up table encoding at least one syntactic text-processing directive of the dialect.
According to some embodiments, the text processing engine further includes at least one hardware semantic analysis engine. According to some embodiments, the hardware semantic analysis engine is selected from the group consisting of a hardware validation engine and a hardware transformation rule engine.
It is now disclosed for the first time a system for generating data useful for fast text processing including an input for receiving a text file representable as a tree having a plurality of node elements, a path determiner for determining a path within said tree between said respective node element and a reference element, a path representor for deriving an abbreviated representation of said determined path and a storage for said abbreviated representation.
According to some embodiments, the reference element is a root element of said tree.
According to some embodiments, the path representor is operative to derive a hash value as an abbreviated representation of a determined path corresponding to a node. According to some embodiments, the storage is operative to store a map between a representation of a said node and an abbreviated representation of a said path corresponding to said represented node.
It is now disclosed for the first time a method of processing a corpus of at least one text file encoded in a data representation language. The presently disclosed method includes processing at least one text file of the corpus using directives associated with the data representation language, from results of the processing determining a schema of a dialect associated with the processed files, modifying a set of at least one lookup table encoding a plurality of directives for processing of text files of the corpus, and processing at least one text file in accordance with the modified lookup tables.
It is now disclosed for the first time a method of processing a corpus of at least one text file encoded in a data representation language. The presently disclosed method includes processing at least one text file of the corpus using directives associated with the data representation language, from results of the processing determining schema data of a dialect associated with the processed files, using the determined schema data to modify at least one lookup table encoding at least one directive for text file processing, and processing at least one text file in accordance with a modified lookup table.
According to some embodiments, the determining, modifying, and processing with the modified lookup tables are repeated at least once. According to some embodiments, the modifying of at least one lookup table includes at least one of generating at least one lookup table and updating at least one lookup table.
According to some embodiments, the modifying of at least one lookup table includes updating hardware associated with the lookup table.
According to some embodiments, at least one lookup table encodes at least one semantic directive.
It is now disclosed for the first time a computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for providing a schema associated with a dialect of a data representation language and processing the schema to generate at least one look-up table encoding a plurality of directives for text processing in accordance with the schema.
It is now disclosed for the first time a computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for receiving a plurality of text files encoded in a data representation language, for at least one said text file, determining a dialect of the data representation language, generating from a definition of the dialect at least one look-up table encoding directives for text processing in accordance with the determined dialect, and using at least one look-up table, processing the respective text file in accordance with the determined dialect.
It is now disclosed for the first time a computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for providing a text file representable as a tree having a plurality of node elements, for at least one node element, determining a path within the tree between the respective node element and a selected reference element, and storing an abbreviated representation of the path.
It is now disclosed for the first time a method of accelerating processing of HTTP headers. The presently disclosed method includes receiving a plurality of HTTP headers; determining common patterns among said plurality of the HTTP headers and generating from the determined common patterns at least one look-up table encoding directives for text processing in accordance with the determined common patterns.
In some embodiments, at least one HTTP header is processed using one generated lookup table.
In some embodiments, at least one look-up table is implemented at least in part in hardware.
It is now disclosed for the first time a computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for receiving a plurality of HTTP headers; determining common patterns among said plurality of the HTTP headers and generating from the determined common patterns at least one look-up table encoding directives for text processing in accordance with the determined common patterns.
These and further embodiments will be apparent from the detailed description and examples that follow.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 provides a listing for an exemplary DTD file.
FIG. 2 provides a listing for an exemplary XSL file.
FIGS. 3A-B provide a listing for an exemplary XML file.
FIG. 4 provides a flow chart describing generation of lookup tables in accordance with exemplary embodiments of the present invention.
FIG. 5 provides an exemplary lookup table encoding semantic directives.
FIG. 6 provides a flow chart describing content processing in accordance with exemplary embodiments of the present invention.
FIG. 7 provides a flow chart describing pre-processing of data representation language content in accordance with exemplary embodiments of the present invention.
FIG. 8 provides a flow chart describing exemplary processing of data representation language content in accordance with some embodiments of the present invention.
FIG. 9 provides a flow chart describing a process wherein different aspects of a dialect associated with a data representation language are iteratively compiled and used for content processing.
FIG. 10 provides a flow chart describing a process wherein hash values of certain paths in a tree representation of a text file are computed and stored.
TABLE 1 includes information gathered for the production rule related to XML elements for the example input files of FIGS. 1-3.
TABLE 2 includes information gathered for the production rule related to XML attributes for the example input files of FIGS. 1-3.
TABLE 3 includes flags and counters generated in the processing stage for the example input files of FIGS. 1-3.
TABLE 4 provides transformation records for the example input files of FIGS. 1-3.
FIG. 11 provides a listing of an HTML file that is the product of the XML file of FIG. 3 transformed in accordance with the XSL directives provided in FIG. 2.
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention provide methods, apparatus and computer readable code for efficient processing of content in accordance with data representation languages. Exemplary text processing in accordance with some embodiments of the present invention includes but is not limited to parsing, validation, searching, data extraction, and transformation to other formats.
The presently disclosed methods and systems can be implemented in software, hardware or any combination thereof. Not wishing to be bound by any particular theory, it is noted that certain hardware implementations are useful for accelerating the processing of text files in accordance with presently disclosed techniques. Furthermore, it is noted that the data processing of the present invention can be applied to a variety of different types of data in various computer systems and appliances.
In accordance with certain embodiments of the present invention, it has been discovered that processing of text files encoded in data representation languages can be accelerated by generating a series of lookup tables in accordance with the data representation language and/or a specific dialect of the data representation language.
Furthermore, for the case of text files representable as a tree having a plurality of node elements, the present inventor has discovered that deriving and storing abbreviated representations, such as hash values, of paths between certain nodes within the tree is useful for accelerating a subsequent process where it is necessary to quickly locate paths between nodes, such as a path between a given node and a root node of the tree. Optionally, the abbreviated representations of paths, such as hash values, are used in search expressions, such as for example XPath and XQuery for XML files, as well as in transformation expressions, such as for example those encoded by XSL.
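By way of illustration only, the following Python sketch (with invented names; the choice of hash function is an assumption rather than part of the disclosure) shows how root-to-node paths may be abbreviated as hash values and stored in a map, so that a path expression can later be resolved with a single table lookup instead of a tree traversal:

    # Illustrative sketch: abbreviate root-to-node paths as 64-bit hash values.
    import hashlib

    def path_hash(path_elements):
        # Derive a fixed-size abbreviation of a path such as employees/employee/name.
        key = "/".join(path_elements).encode("utf-8")
        return hashlib.md5(key).digest()[:8]   # e.g. a 64-bit abbreviated representation

    def index_paths(node, ancestors=(), table=None):
        # Walk a tree of {"tag": ..., "children": [...]} nodes, mapping each
        # node's path hash to the node itself.
        table = {} if table is None else table
        path = ancestors + (node["tag"],)
        table.setdefault(path_hash(path), []).append(node)
        for child in node.get("children", []):
            index_paths(child, path, table)
        return table

    tree = {"tag": "employees", "children": [
        {"tag": "employee", "children": [{"tag": "name"}, {"tag": "city"}]}]}
    table = index_paths(tree)
    # A search expression such as the XPath /employees/employee/name now
    # reduces to a single lookup:
    matches = table[path_hash(("employees", "employee", "name"))]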
Furthermore, it has been discovered that hardware-implemented lookup tables for encoding semantic directives associated with a data representation language are useful for accelerating processing of text files encoded in the data representation language. Exemplary semantic directives include but are not limited to validation directives and transformation directives.
Therefore, in accordance with some embodiments of the present invention, it can efficiently be decided if the input data conforms to a schema or if a transformation should occur for some part of the input data.
The present invention will now be described in terms of specific, example embodiments.
It is to be understood that the invention is not limited to the example embodiments disclosed. It should also be understood that not every feature of the methods, apparatus and computer readable code for processing of text files associated with a data representation language described is necessary to implement the invention as claimed in any particular one of the appended claims. Various elements and features of devices are described to fully enable the invention. It should also be understood that throughout this disclosure, where a process or method is shown or described, the steps of the method may be performed in any order or simultaneously, unless it is clear from the context that one step depends on another being performed first.
I. EXEMPLARY XML FILES
Specific embodiments of the present invention will be described in terms of exemplary input files taken from the book "XML and Web Services Unleashed (ISBN #0-672-323419)" by Ron Schmelzer. The example uses the following files:
1. "emp.dtd" - a Document Type Definition file provided in FIG. 1.
2. "emp.xsl" - an XML Style-sheet file provided in FIG. 2, which describes the transformation and formatting operations to be applied to the XML input file to generate the required output files.
3. "emp.xml" - an actual XML input file provided in FIGS. 3A-3B.
The file "emp.dtd" in FIG. 1 defines that the input text files should be XML files that have a root element "employees". This element should contain one or more sub-elements, each called "employee". Each "employee" element should have an attribute "serial" of type "ID", and the following sub-elements (in this order): "name", "position", "address1", "address2" (optional), "city", "state", "zip", "phone" (optional), and "email" (optional). The "name" element should have the following attributes: "age", "sex", "race" (optional; when absent a default value is implied), and "m_status". Each of the other elements is then defined to be a string (character data).
The XSL file of FIG. 2 has conversion and presentation instructions. When applied to the input XML file, the expected result of the transformation is in HTML format. The resulting HTML file depends on the content of the actual input XML file, and is presented, for the specific input XML file of FIG. 3, in FIG. 11. It is noted that although "emp.xml" includes no explicit reference to a schema file for validation, the "emp.xml" file conforms to the "emp.dtd" file described above, and that the transformation schema "emp.xsl" may be applied to "emp.xml".
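As a purely illustrative aid (and not a reproduction of the actual listing of FIG. 1), the following sketch reconstructs a DTD of the kind just described from the prose above and validates a small document against it. It assumes the third-party lxml library; the attribute types, the #REQUIRED designations, and the default value for "race" are assumptions where the prose is not explicit:

    from io import StringIO
    from lxml import etree

    # DTD text reconstructed from the description of "emp.dtd"; details such
    # as the default value "unknown" for "race" are assumed for illustration.
    EMP_DTD = StringIO("""
    <!ELEMENT employees (employee+)>
    <!ELEMENT employee (name, position, address1, address2?, city, state, zip, phone?, email?)>
    <!ATTLIST employee serial ID #REQUIRED>
    <!ATTLIST name age CDATA #REQUIRED sex CDATA #REQUIRED
              race CDATA "unknown" m_status CDATA #REQUIRED>
    <!ELEMENT name (#PCDATA)> <!ELEMENT position (#PCDATA)>
    <!ELEMENT address1 (#PCDATA)> <!ELEMENT address2 (#PCDATA)>
    <!ELEMENT city (#PCDATA)> <!ELEMENT state (#PCDATA)>
    <!ELEMENT zip (#PCDATA)> <!ELEMENT phone (#PCDATA)>
    <!ELEMENT email (#PCDATA)>
    """)

    dtd = etree.DTD(EMP_DTD)
    doc = etree.fromstring(
        '<employees><employee serial="e1">'
        '<name age="30" sex="F" m_status="single">Jane Doe</name>'
        '<position>Engineer</position><address1>1 Main St</address1>'
        '<city>Springfield</city><state>IL</state><zip>62704</zip>'
        '</employee></employees>')
    assert dtd.validate(doc), dtd.error_log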
The XML file explicitly references an XML style sheet file "emp.xsl" that includes transformation rules and formatting instructions. Having the "emp.xsl" file preprocessed is a prerequisite for transforming the input "emp.xml" file.
It will be appreciated that no specific characteristics of the aforementioned input files are intended as a limitation of the scope of the present invention. Thus, although these files are associated with the specific data representation language XML, any data representation language currently known in the art or defined in the future is appropriate for the present invention. Similarly, "Document Type Definition" is just one appropriate format for schema defining specific dialects, and any other validation schema format currently known or to be defined in the future is appropriate. Similarly, "XML Style-sheet" is merely one exemplary format for a transformation schema, and any other transformation schema format currently known or to be defined in the future, including both transformation schemas associated with XML and transformation schemas associated with other data representation languages, is appropriate.
At least some of the drawings, which are used to illustrate the inventive concepts, are not mutually exclusive. Rather, each one has been tailored to illustrate a specific concept discussed. In some cases, the elements or steps shown in a particular drawing co-exist with others shown in a different drawing, but only certain elements or steps are shown for clarity.
For convenience, certain terms employed in the specification, examples, and appended claims are collected here.
As used herein, a "data representation language" or a "markup language" such as, for example, XML or EDI is a language with a well-defined syntax and semantics for providing agreed-upon standards for message or data exchange and/or for data storage. This is in contrast with programming languages or computer code languages, such as, for example, C, C++, Java, C#, and Fortran, containing specific computer instructions compilable into machine executable code. In some embodiments, the data represented in the data representation language is a structured message.
It is noted that in some embodiments, the data representation language and/or the dialect has strict syntax, and rigid singular semantics.
One salient feature of many data representation languages is that specific dialects or sublanguages of the data representation language may be defined, where the definition may be expressed using a schema. Schemas define the specific structure of the dialect and may be available, for example, as a schema definition such as an XML schema definition, as a DTD section, or as a context free grammar with a set of semantic operations. One particular example of a dialect or sub-language of XML is Open Financial Exchange (OFX), a specification for the electronic exchange of financial data between financial institutions which is XML compliant and thus a dialect of XML. Other examples of relevant XML dialects include but are not limited to ebXML (electronic business XML) and SwiftML.
Furthermore, it is noted that transformation schema including but not limited to XSL schema defining transformation rules may be associated with a schema of a dialect of a data representation language.
In some embodiments, a data representation language and/or a dialect or sub-language defines a message protocol or messaging standard. One exemplary message protocol is the Financial Information eXchange (FIX) protocol developed specifically for the real-time electronic exchange of securities transactions. One example of a data representation language that functions as a message protocol is the hypertext transport protocol (HTTP), used for conveying messages over a network between various electronic devices or appliances. One example of a lightweight protocol for exchanging messages between computer applications is Simple Object Access Protocol (SOAP), a dialect of XML that encodes a communications envelope for exchange of XML messages.
II. GENERATION OF LOOKUP-TABLES FROM A DIALECT OF A DATA REPRESENTATION LANGUAGE
Exemplary methods, devices and computer readable code for converting a definition of a data representation language into a plurality of lookup tables will now be described. FIG. 4 provides a flow chart illustrating generation of one or more lookup tables according to certain exemplary embodiments of the present invention. Although FIG. 4 illustrates a process for the generation of five different types of lookup tables, namely one or more Lexical lookup tables
108, one or more Syntax Lookup Tables 114, one or more Semantic Lookup Tables 120, one or more Validation Lookup Tables 126, and one or more Transformation Lookup Tables 128, this should not be construed as a limitation, and it is understood that some embodiments of the invention provide for generation of some of the types of lookup tables and not other types.
Given a well-defined dialect of a data representation language and/or a well-defined validation or transformation schema associated with the dialect, it is possible to generate look-up tables that are useful for accelerated processing of text files in accordance with the defined dialect.
In some embodiments, the definition of the dialect and/or of the schema associated with the dialect is expressed through at least one of:
1. A set of tokens defined by regular expressions.
2. A context free grammar defined by a set of production rules based on the above set of tokens.
3. A set of semantic operations to be performed for any given token or production rule.
There is no specific limitation on the format in which a schema associated with or defining a dialect of a data representation language may be expressed. Exemplary formats for schema defining a dialect include but are not limited to DTD and XSD formats. Exemplary formats for transformation schema associated with specific data representation languages and/or schemas include but are not limited to XSL. It is noted that the specific schema formats for defining a dialect and/or transformation schema mentioned herein are merely appropriate examples, and it is understood that other schema formats, including those not disclosed to date, are within the spirit and scope of the present invention.
In some embodiments, the schema is not provided independently of text files to be subsequently processed, but is derived from processing a number of text files and determining patterns within the text files. In some embodiments, these patterns are determined using statistical methods known in the art.
For data definition languages such as XML and EDI, the allowed lexical and syntactical rules, including tokens and grammar rules, are well defined. Additional tokens and grammar rules may be derived from the validation and transformation schemas. The tokens and the grammar are then used for generating the state-machines required for fast processing of files encoded in the data definition language. This information may be saved as DFA (Deterministic Finite Automata) state-machines, which in turn may be translated into lookup tables (or LUTs) that may be used by hardware and software for the actual processing of the data representation language content.
a) GENERATION OF LEXICAL LOOKUP TABLE(S) 108 ENCODING LEXICAL TOKENIZATION OPERATIONS
A token may be defined as a regular expression (see, for example, the definition of GNU regex library) over any standard character set (e.g. ASCII, UTF-8 or UTF-16). There is no inherent limitation on the token definitions.
The tokens are compiled, by means of a Regular Expression Compiler such as GNU Flex or another appropriate compiler, into a deterministic state machine 106 (also termed a Deterministic Finite Automaton - DFA) such as a software encoded DFA. This state machine is then converted into one or more memory mapped Lexical lookup table(s) 108 to be loaded into the hardware memory device used for lexical analysis. In exemplary embodiments, this conversion process is performed by a dedicated software converter, though it will be appreciated that this is not a specific limitation.
Each lexical lookup table 108 is in essence a depiction of the state machine where each memory record corresponds to a state including the current output of the state machine (e.g. if a token was recognized) and the method of calculating the next state according to the next character to be read.
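For illustration, a token DFA flattened into such a lookup table may be pictured with the following Python sketch; the two token classes and the record layout are invented for the example and do not depict the actual table format:

    # lut[state][character class] -> next state; -1 denotes "no transition",
    # i.e. a token boundary at which the state's token (if any) is emitted.
    def char_class(c):
        if c.isalpha(): return 0
        if c.isdigit(): return 1
        return 2                      # delimiter / anything else

    LUT  = [[1, 2, 0],                # state 0: start
            [1, -1, -1],              # state 1: inside a NAME token
            [-1, 2, -1]]              # state 2: inside a NUMBER token
    EMIT = {1: "NAME", 2: "NUMBER"}

    def tokenize(text):
        state, start = 0, 0
        for i, c in enumerate(text):
            nxt = LUT[state][char_class(c)]
            if nxt == -1:             # boundary: emit token, restart machine
                yield (EMIT[state], text[start:i], start)
                state, start = LUT[0][char_class(c)], i
            else:
                if state == 0 and nxt != 0:
                    start = i         # first character of a new token
                state = nxt
        if state in EMIT:
            yield (EMIT[state], text[start:], start)

    print(list(tokenize("emp 42")))   # [('NAME','emp',0), ('NUMBER','42',4)]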
b) GENERATION OF SYNTAX LOOKUP TABLE(S) 114 ENCODING PARSING OPERATIONS
A context free grammar is defined through a set of production rules beginning from a single root variable for the specific data representation language dialect (different dialects may have different root variables). The format of describing the grammar is similar to the grammar description allowed by GNU Bison or Yacc.
The grammar is compiled by means of a Grammar Compiler (SWGC), similar in essence to GNU Bison or Yacc, into a software encoded deterministic state machine. Traversal from one state to the other is defined by the current token received from the input (coming from a lexical analyzer), the current state and the top of a stack that accompanies the state machine during processing. The construction of such a state machine with stacks is well defined and published in classic computer science literature (see for instance "Compilers: Principles, Techniques and Tools" by A. Aho, R. Sethi and J. Ullman, Addison-Wesley, 1986). Finally, the state machine is converted into a memory mapped Syntax Lookup Table 114 to be loaded into the hardware memory device used for parsing (syntax analysis). In some embodiments, this is performed by a dedicated software converter (SWDFA2SYLUT), though this is not a specific limitation. The Syntax Lookup Table 114 is in essence a depiction of the state machine where each memory record corresponds to the state, including the current output of the state machine (i.e. action to be performed, production rule completed, statement identified or syntax error) and the method of calculating the next state according to the next token to be read and the content of the top of the stack.
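The stack-accompanied, table-driven operation just described may be pictured with a deliberately simplified Python sketch. Here a toy action table merely checks that open and close element tokens nest properly and emits a production record per completed element; the actual Syntax Lookup Table 114 encodes full shift/reduce behavior as described above:

    ACTIONS = {"OPEN": "push", "TEXT": "skip", "CLOSE": "reduce"}

    def parse(tokens):
        stack, records = ["$end"], []
        for kind, value in tokens:
            action = ACTIONS.get(kind)
            if action == "push":                 # shift: remember the open tag
                stack.append(value)
            elif action == "reduce":             # complete a production rule
                if stack[-1] != value:
                    raise SyntaxError("mismatched tag " + value)
                records.append(("element", stack.pop()))
            # "skip": the token has no structural role in this toy grammar
        if stack != ["$end"]:
            raise SyntaxError("unclosed tags: " + ", ".join(stack[1:]))
        return records

    toks = [("OPEN", "employees"), ("OPEN", "employee"), ("OPEN", "name"),
            ("TEXT", "Jane"), ("CLOSE", "name"), ("CLOSE", "employee"),
            ("CLOSE", "employees")]
    print(parse(toks))    # production records, innermost elements first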
c) GENERATION OF LOOKUP TABLES ENCODING OPERATIONS WITH SEMANTIC MEANING
In some embodiments, lookup tables such as the Token Lexical Lookup Tables 108 and the Syntax Lookup Tables 114 encode operations or directives having semantic meaning. Exemplary operations for tokens in the Lexical Lookup Table include but are not limited to one or more operations enabled by the hardware implementation, as follows (a sketch of a corresponding table record follows this list):
1. Store token position in input stream.
2. Store token length.
3. Store first N characters of the token (e.g. N = 8).
4. Calculate and store token value (e.g. integer value of a string of digits).
5. Calculate and store a hash function of the token value (e.g. a 64-bit value calculated as a function of the string).
6. Store the type of token identified (e.g. specific reserved word, comment, white space).
7. Discard token (i.e. don't send to parser).
8. Count token with counter #N (statistics for final processing report).
9. Store an index (e.g. a pointer to a specific validation or transformation rule).
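One possible (hypothetical) encoding of the token-level operations above as fields of a lexical lookup table record is sketched below; all field names are illustrative:

    from dataclasses import dataclass

    @dataclass
    class TokenLutRecord:
        token_type:   int           # operation 6: type of token identified
        store_pos:    bool = True   # operation 1: position in input stream
        store_len:    bool = True   # operation 2: token length
        prefix_chars: int  = 8      # operation 3: keep first N characters
        compute_val:  bool = False  # operation 4: integer value of digit strings
        compute_hash: bool = True   # operation 5: 64-bit hash of token value
        discard:      bool = False  # operation 7: do not forward to the parser
        counter_id:   int  = -1     # operation 8: statistics counter (-1: none)
        rule_index:   int  = -1     # operation 9: validation/transformation rule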
Exemplary operations in the Syntax Lookup Table include but are not limited to one or more operations enabled by the hardware implementation as follows:
1. Store rule position in input stream (i.e. the position of the first token of the rule).
2. Store rule length (i.e. the total length of all tokens constituting the rule).
3. Store first N characters of the rule (e.g. N = 8).
4. Calculate and store rule value (e.g. integer value of one of the tokens).
5. Calculate and store a hash function of the rule value (e.g. the hash value of one of the tokens constituting the rule).
6. Calculate and store a combined hash function of the rule value and an inherited hash value (e.g. for computing the hash function of a hierarchy; see the sketch following this list).
7. Store the type of rule identified (e.g. statement, mathematical expression, complex structure).
8. Discard rule (i.e. don't send to validation).
9. Count rule with counter #N (statistics for final processing report).
10. Store value of counter #N (e.g. an index of the rule in a specific rule type table).
11. Store the index in the output stream of one of the rule sub elements (may be stored in one of several index fields associated with the rule).
12. Store the value of a counter associated with one of the rule sub elements (e.g. for counting the number of attributes an element may have).
13. Inherit an index of the current rule in the output stream to all sub elements (e.g. for storing hierarchy relationships).
14. Store an index inherited from a parent rule (multiple indexes may be inherited).
15. Compare a specific calculated value (e.g. hash function output) to a stored value associated with the production rule (e.g. for validating that two tokens in a production rule have the same semantic meaning).
16. Store an index (e.g. a pointer to a specific validation or transformation rule).
17. Store an indication regarding the fields to be used to deduce the required validation or transformation rule identifiers.
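The hierarchical hash combination of operation 6 above may be sketched as follows; the particular mixing scheme is an assumption:

    import hashlib

    def combined_hash(value, inherited=b""):
        # Combine the inherited (parent) hash with the current rule value so
        # that a single value identifies a whole hierarchy such as the path
        # employees/employee/name.
        return hashlib.md5(inherited + b"/" + value).digest()[:8]

    root  = combined_hash(b"employees")
    child = combined_hash(b"employee", inherited=root)
    leaf  = combined_hash(b"name", inherited=child)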
d) OPTIONAL GENERATION OF SEMANTIC LOOKUP TABLE(S) 120
The processing mechanisms implementing the lexical and syntax analysis according to the defined tokens and grammar inherently validate that the processed file is well formed, in the sense that it is constructed out of a set of well-defined "words" (tokens) that are put together into "meaningful sentences" (grammar). The tokens and grammar production rules may be augmented with a set of semantic operation identifiers to be performed when a given token or production rule is identified (followed). In addition, semantic operation identifiers may also be associated with combinations of production rules and specific token values.
As used herein, a semantic lookup table encodes directives for semantic processing of a text file encoded in a data representation language. In general, it is noted that semantic directives include semantic classification or semantic analysis directives, which are directives to determine or classify text based upon semantic characteristics, and semantic action directives for performing an action other than semantic classification in accordance with semantic structures. It is noted that one particular example of semantic classification or semantic analysis is validation, wherein text is analyzed and semantic properties of the analyzed text are determined.
One particular example of a semantic action directive is a transformation directive, wherein output text is generated in accordance with semantic properties of input text. It is noted that these directives allow for accelerated processing of the text file.
Exemplary semantic processing directives include but are not limited to transformation directives and validation directives. Exemplary transformation directives include but are not limited to a directive to transform a particular document encoded in a dialect of data representation language into a document encoded in the same dialect of the same data representation language, into a document encoded in a different well-defined dialect of the same data representation language, and/or into a document encoded in a different data representation language.
In some embodiments, the operations for combinations of production rules and token values are stored in a dedicated Semantic Lookup Table 120. These operations include but are not limited to the following (a sketch of such a table follows the list):
1. Store an index (e.g. a pointer to a specific validation or transformation rule).
2. Count rule with counter #N (e.g. for statistics for final processing report).
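For illustration only, such a table, keyed by a production rule identifier together with a token value hash, might be pictured as follows; the identifiers and rule names are invented for the example:

    ELEMENT_RULE, ATTR_RULE = 1, 2     # assumed production-rule identifiers

    SEMANTIC_LUT = {
        # (production rule, value hash) -> (validation rule, transformation rule, counter #)
        (ELEMENT_RULE, 0x9AF2): ("check_name_attributes", "to_uppercase", 0),
        (ATTR_RULE,    0x11C4): ("check_serial_id",       None,           1),
    }

    def semantic_ops(rule_id, value_hash):
        # Stateless lookup: which rules, if any, apply to this record.
        return SEMANTIC_LUT.get((rule_id, value_hash))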
e) OPTIONAL VALIDATION SCHEMA COMPILATION TO GENERATE ONE OR MORE VALIDATION LOOKUP TABLE(S) 126
In addition to the definition of the data representation language dialect to be processed, specific validation rules may be defined that allow the system to verify that the messages being processed meet given criteria.
The validation rules for XML files may be written in the Document Type Definition (DTD) language (which is part of the XML definition; for a formal definition see http://www.w3.org/TR/xml11/). Alternatively, the validation rules may be written in the standard XML Schema Definition (XSD) language (for a formal definition see http://www.w3.org/TR/xmlschema-0/, http://www.w3.org/TR/xmlschema-1/, http://www.w3.org/TR/xmlschema-2/, http://www.w3.org/TR/xmlschema-formal/) or in a similar type of validation language. XSD allows verifying various aspects of XML files, but may be generalized to other data formats as well, such as higher level XML dialects (e.g. FIX, OFX, HL7, SwiftML, SOAP, WSDL) and other structured message formats (e.g. EDI, HTTP headers, CSV).
An example of a DTD file is shown in FIG. 1. In some embodiments, validation schemas are compiled by a Validation Schema Compiler (SWVSC) into two sets of validation rules: rules to be verified by the Hardware Validation Engine (HWVE) and rules to be verified by a Software Validation Module (SWVM). For example, in some embodiments the full scope of XSD is supported by the SWVM. In addition, for performance acceleration, some or all of the following simple validation primitives are supported by the HWVE and encoded in the Validation Lookup Table (VDLUT) 126:
1. Simple value comparisons (equal, not equal, greater than, less than).
2. Multiple value comparisons up to N different values (e.g. for enumerated types).
3. Type checking (e.g. integer, string).
4. Integer range checking.
5. Fraction range checking (e.g. number of digits before and after decimal point).
6. Value length (e.g. the number of characters in the corresponding value field).
7. Value length checking (equal, not equal, greater than, less than).
In addition, several complex validation directives are supported by the hardware via inclusion within the tokens and grammar as mentioned above. These mechanisms allow many additional primitives including but not limited to:
1. Pattern matching (patterns will be defined as validation tokens, matched during lexical analysis and included in the appropriate grammar).
2. Syntax directives (e.g. how many elements of a certain type are allowed, complex choices of possible parameter/attribute combinations, specific element sequence constraints).
Finally, a combination of simple and complex validation primitives may be used to perform complex validation (a software sketch follows the list) such as:
1. Path context validation together with simple value comparisons (e.g. XPath predicates).
2. Context specific range checking.
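A software analogue of the simple validation primitives listed above, encoded as entries of a validation lookup table, might look as follows; the particular predicates and rule identifiers are assumptions:

    VALIDATION_LUT = {
        1: lambda v: v.isdigit() and 0 <= int(v) <= 120,      # integer range check
        2: lambda v: v in {"single", "married", "divorced"},  # enumerated values
        3: lambda v: len(v) == 5,                             # value length check
        4: lambda v: v.replace(".", "", 1).isdigit()
                     and len(v.split(".")[-1]) <= 2,          # fraction format check
    }

    def validate(rule_id, value):
        check = VALIDATION_LUT.get(rule_id)
        # Rules not encoded in the table fall through to the software module.
        return check is None or check(value)

    assert validate(1, "34") and not validate(3, "6270")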
f) OPTIONAL TRANSFORMATION SCHEMA COMPILATION TO GENERATE ONE OR MORE TRANSFORMATION LOOKUP TABLE(S) 128
Following the optional validation that a file is of a given format and follows the required semantics, it may be transformed into a new format using the same file dialect (e.g. XML transformation to XML) or using a totally new dialect (e.g. EDI transformation to XML or XML transformation to EDI). The transformation required is defined by a set of transformation rules. These rules may be written in the standard Extensible Stylesheet Language (XSL) (for a formal definition see http://www.w3.org/Style/XSL) or in a similar type of transformation language. XSL, and specifically XSL Transformations (XSLT) (for a formal definition see http://www.w3.org/TR/xslt), is a standard way to define required transformations of XML files, but may be generalized to other data formats as well, such as higher level XML dialects (e.g. FIX, OFX, HL7, SwiftML, SOAP, WSDL) and other structured message formats (e.g. EDI, HTTP headers, CSV).
An example of an XSL file is shown in FIG. 2. The Transformation Schema Compiler (SWTSC) compiles the transformation schemas into two sets of transformation rules: rules to be executed by the Hardware Transformation Engine (HWTRE) and rules to be executed by a Software Transformation Module (SWTM). The full scope of XSL is supported by the SWTM. In addition, for performance acceleration, one or more of the following simple transformation primitives are supported by the HWTRE and encoded in the Transformation Lookup Table (TRLUT) 128 (a sketch of the marking behavior follows the list):
1. Marking of a token or complex structure for transformation of a given type (the actual transformation to be performed by the SWTM). The transformation type may be a general type (e.g. "convert string to uppercase") or specified by a Transformation Identification Code (TIC) for a specific operation the SWTM will recognize.
2. Removal of a token or complex structure.
3. Truncation of a token or complex structure (e.g. for white-space minimization).
4. Replacement of a token by another token (e.g. for name swaps). In this case, the relevant values and hash values are updated within the output table; however, the actual input stream is not altered.
5. Simple numeric conversions such as decimal fraction to integer conversions or integer increment or decrement conversions. Again, the relevant values and hash values are updated within the output table; however, the actual input stream is not altered.
6. Update a given field in the output table. Allows altering pointers to some predefined value (e.g. for namespace changes in XML).
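The marking behavior of primitives 1 and 4 above may be sketched as follows, emphasizing that the engine annotates output records rather than rewriting the input stream; the TIC codes and record fields are illustrative assumptions:

    TIC_UPPERCASE, TIC_REMOVE, TIC_RENAME = 1, 2, 3    # assumed codes

    def mark_for_transformation(record, tic, argument=None):
        # Attach a transformation directive to a token/rule output record.
        record["tic"] = tic
        if tic == TIC_RENAME:
            # Name swap: only the stored value and hash are updated here;
            # the original input stream is left untouched, as noted above.
            record["value"], record["hash"] = argument, hash(argument)
        return record

    rec = {"value": "emp", "hash": hash("emp")}
    mark_for_transformation(rec, TIC_RENAME, "employee")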
As for the validation rules, the transformation rules may appear in the context of tokens as well as grammar rules. In addition, they may optionally be conditional on the validation results.
III. A METHOD FOR TEXT FILE PROCESSING
It is noted that the lookup tables described in the previous section are useful in the framework of a general text file processing method illustrated in FIG. 6 and described herein for the first time by the present inventor. Thus, after the preprocessing of schema and/or other data associated with a data representation language and/or a dialect of a data representation language 202A, actual text files are processed in accordance with the results of the preprocessing 204A. Optionally, the text files are subjected to a post processing 206 described below.
In general, during the pre-processing stage 202A some or all required software and hardware elements for maximal processing performance are prepared prior to the reception of some or all of the data representation language content to be processed. In some embodiments, these preparations are performed by software elements, using information that is known about the structure of the actual text files that are to be processed.
The resultant tables and state-machines (as well as other information as described in detail below) encoded in hardware and/or in software can be used in the processing stage 204A. In some embodiments, this processing is performed by a combination of tightly coupled hardware and software elements.
Optionally, the text files are subjected to a post processing 206, described below, in which information regarding the processed file(s) is accumulated and information is cached for possible future processing performance enhancements.
It is noted that in some embodiments, the pre-processing stage is a one-time operation that is associated with configuration of the system (for example, when the input content is expected to conform to information known at configuration time). Alternatively, the pre-processing may be done when new data is encountered and the relevant information is not yet ready for its analysis (for example, new files encoded in a data representation language arrive with indications of additional schemas to which the data representation language files should conform). In this case, the new content is not necessarily parsed immediately, but waits until the appropriate information is prepared. Note that this enables support for future data formats and for schema updates. Note also that the results of pre-processing 202A may be used for any incoming data that conforms to the appropriate schema or template.
It is noted that the processing of 202A prepares the system for the arrival of data representation language files in such a way that they are processed with minimal effort and time. Nevertheless, some of the steps of pre-processing 202A may be performed well in advance of the arrival of the data representation language content to be processed, while others may be performed only after the arrival of some or all of one or more data representation language files to be processed. From an algorithmic perspective, the time of arrival of the data representation language file has no impact whatsoever. However, in some embodiments it is desired to reduce the pre-processing required after the arrival of the target data representation language file to a minimum in order to reduce file-handling latency.
a) PREPROCESSING OF SCHEMA AND/OR OTHER DATA ASSOCIATED WITH A DATA REPRESENTATION LANGUAGE AND/OR DIALECT 202A
The preprocessing stage prepares helpful information for optimizing the processing of the content. In some embodiments, this information is based on what is known about the data, such as its validation schema and its transformation template.
For example, the validation schema of an XML file describes what the root element should be, and what values and sub-elements it can hold. A transformation template could describe how to transform the information within the XML file into some other format, such as HTML.
FIG. 7 provides a flow chart of an exemplary pre-processing in accordance with a schema associated with a dialect of the data representation language. It is understood that in some embodiments, certain steps are performed without others.
Thus, after the appropriate data representation language and/or schema 302 is determined at least in part, it is possible to identify schema tokens 305 and grammar 307. In accordance with the identified schema tokens 305 and/or grammar structure (e.g. production rules) 307, one or more state machines are prepared 310. Optionally, a validation schema compilation is effected 312 of schema-specific validation rules identified in accordance with the identified schema tokens and/or grammar. Optionally, a transformation schema compilation is effected 314 of schema-specific transformation rules identified in accordance with the identified schema tokens and/or grammar.
It is noted that the order of steps as described above and illustrated in FIG. 7 is one exemplary order of steps provided for illustrative purposes only, and is not intended to limit the scope of the present invention. In some embodiments, the steps are indeed carried out in the order as described in FIG. 7. In other embodiments, the steps are carried out in an order other than that described in FIG. 7.
Thus, it is noted that the compilation of schema to generate lookup tables 304 may be quite extensive. The preparations include compilation of several complex directives, including regular expressions, grammars, automata, validation schemas, transformation schemas and additional rules (e.g. rules defined by IT or business policies). In some embodiments, a large portion of the preprocessing is expected to be common to most of the files handled at a given location in a network. Therefore, the results of the above mentioned operations may be stored temporarily or permanently in a local cache. The stored information may then be retrieved at a later time instead of performing the actual processing operations repetitively. The cache may be managed using any caching strategy, such as most recently used, most frequently used or other well-known techniques.
In some embodiments, one or more hardware engines are updated 308 with the results of earlier pre-processing steps before processing of text files. In some embodiments, this is carried out by configuring re-programmable logic components. This is only required for results that were not previously loaded into the hardware tables.
As a result, in a common case according to some embodiments, no hardware updates are required before many text files encoded in the data representation language are actually processed.
Certain embodiments of the present invention provide methods, computer readable codes and apparatus for building an integrated circuit for parsing data. According to some embodiments, the integrated circuit includes memory, parsing circuitry and an interface. The parsing circuitry is configured to parse the building blocks that comprise the data. For example, for XML data, the element tag and attributes are the building blocks. Once these building blocks are parsed, further processing may be performed. Using a special binary format and a processing methodology, content and data processing may be significantly improved, while reducing the memory and CPU requirements, as well as the time required for the processing.
One feature provided by many embodiments of the present invention is the ability to implement some parts of the invention (mostly the processing stage 204A) in hardware components such as reprogrammable logic components while maintaining the flexibility required for supporting a broad range of dialects and schemas of the data representation language or markup language. During the preprocessing stage 202A much of this hardware is configured. Appropriate reprogrammable logic components include but are not limited to FPGA and ASIC components.
b) PROCESSING OF TEXT FILES IN ACCORDANCE WITH RESULTS OF PREPROCESSING 204A
In some embodiments, the actual processing of files encoded in the data representation language is performed by a combination of tightly coupled hardware and software elements, and is carried out after some or all of the requisite pre-processing operations have been at least partially completed and some or all of the software and/or hardware engines have been updated accordingly.
FIG. 8 provides a flow chart of an exemplary processing of text files in accordance with previously generated lookup tables 304. It is noted that not every step depicted in FIG. 8 is required in every embodiment, and that while some embodiments provide that the processing is carried out according to the order illustrated in FIG. 8, the specific order provided is not a limitation of the present invention.
As depicted in FIG. 8, the optional first step of processing prerequisites verification 502 is useful for identification of the prerequisites for processing the file and verification that these have been completed. Subsequently, a character set normalization 504 is carried out, including a transformation of the original file character set into a single common character set. Once the original file is transformed into a common character set, a lexical analysis 506 including identification of tokens within the character stream is carried out. Upon identification of these tokens, it is possible to identify specific grammatical structures within the token stream in the context of a syntactical analysis 508. This, in turn, is followed by a semantic analysis 510 which includes identification of certain semantic meanings of the parsed data.
In accordance with the semantic meanings identified during the semantic analysis 510, one or more validation and/or transformation directives are optionally carried out. The initial validation 512 includes execution of limited hardware based validation. The initial transformation 514 includes execution of limited hardware based transformation. The subsequent final validation 516 includes the completion of the validation process in software, while the final transformation 518 includes the completion of the transformation process in software.
During the entire procedure, various types of errors or exceptions may occur. These are gracefully handled by an optional exception handling module following the occurrence of the first error or exception.
c) PROCESSING PREREQUISITES VERIFICATION 502
Before processing of text files of a data representation language, the software and hardware processing engines are prepared to handle the actual processing. Exemplary ways of determining processing prerequisites include but are not limited to:
1. The source of the file. This may be defined by the Layer 2 or Layer 3 addresses, system interface IDs or other addressing information available (e.g. HTTP headers).
2. The file type. This may be defined by the file name (e.g. a file name suffix) or by an identification tag in the beginning of the file (e.g. a magic number).
3. The file content. In some cases, the processing prerequisites will be identified within the file itself in a well-defined manner (e.g. XML files may mention the XML schemas to be used for their own validation).
In the latter case, where the file content dictates the prerequisites, the file may be processed twice. Initially, it will be processed as a generic file according to its type as identified by one of the first two methods. During this processing phase, only the required prerequisite information is extracted. Then the full pre-processing may be completed, and only then is a second, full processing phase performed. Such a recursive procedure may take place more than once if, during the second processing phase, additional prerequisites are identified.
This recursive procedure is described in the flow chart of FIG. 9. First, one or more text files are processed 204B generically using directives appropriate for the data representation language. Based on this processing, a set of previously unknown dialect-specific characteristics of previously processed text files is determined 202B. In accordance with currently known characteristics of the dialect or sub-language of previously processed text files, appropriate hardware and/or look-up tables are updated 210. Subsequently, the text files are once more reprocessed 204B using the updated hardware and/or look-up tables. Optionally, during the course of reprocessing, additional dialect-specific characteristics are detected 202B, and thus the stages of updating hardware and/or look-up tables 210 and processing text files using updated hardware and/or look-up tables 204B are repeated.
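The recursive procedure of FIG. 9 may be summarized by the following Python sketch; the function names are placeholders rather than an actual API:

    def process_until_stable(text_file, compile_schema, process, max_rounds=4):
        tables = {}                          # generic language tables only
        for _ in range(max_rounds):
            result, new_schemas = process(text_file, tables)
            if not new_schemas:              # no unknown prerequisites remain
                return result
            for schema in new_schemas:       # update look-up tables / hardware
                tables.update(compile_schema(schema))
        raise RuntimeError("processing prerequisites did not converge")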
d) CHARACTER SET NORMALIZATION 504
All tokens are defined over a common character set. Therefore, before token identification can commence, all input streams must be converted to this common denominator.
In some embodiments, this is performed by the Hardware Character Normalization Engine (HWCNE).
Essentially, the HWCNE is a simple engine that transforms a byte stream into a character stream. The characters are identified according to the appropriate expected character set definition. Typically a character may be defined by a single byte or by a small number of consecutive bytes. In any case, each character is transformed into a single 16-bit word constituting its UTF-16 representation. For UTF-16 encoded files, no transformation is required and the HWCNE operation becomes transparent.
In some embodiments, the output of the HWCNE includes at least one of the following (a software sketch follows the list):
1. The character identified.
2. The beginning index of the character in the input stream (to allow following engines to continue to reference the original file indexes).
3. The length of the character in the input string.
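A software analogue of this normalization, annotating each decoded character with its index and byte length in the original stream, is sketched below; note that the HWCNE itself emits 16-bit UTF-16 words, whereas this sketch yields native characters:

    def normalize(data: bytes, encoding="utf-8"):
        i = 0
        while i < len(data):
            for width in (1, 2, 3, 4):       # try consecutive byte counts
                try:
                    ch = data[i:i + width].decode(encoding)
                    yield (ch, i, width)     # (character, index, length)
                    i += width
                    break
                except UnicodeDecodeError:
                    continue
            else:
                raise ValueError("undecodable byte at offset %d" % i)

    print(list(normalize("a€b".encode("utf-8"))))
    # [('a', 0, 1), ('€', 1, 3), ('b', 4, 1)]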
e) LEXICAL ANALYSIS 506
Once the input stream has been converted into a stream of characters, the token identification process, termed lexical analysis, may take place. The Hardware Token Identification Engine (HWTIE) receives a stream of characters and transforms it into a stream of tokens and accompanying semantic information. In some embodiments, the HWTIE uses the Lexical lookup table 108 in order to identify the tokens. It is initialized at an initial tokenization state before the first character and after each token is identified. For each character and current state, the HWTIE calculates the memory address to be accessed in the Lexical lookup table 108 and reads the corresponding record. This record may contain information regarding the tokenization result (e.g. if a token has been identified), an offset to a consecutive lookup iteration, the algorithm of computing the next state (e.g. selection of a deterministic lookup or a hash function dependent lookup and which hash function to use) and additional semantic information (e.g. indication that a specific transformation rule should be applied to the token).
In some embodiments, the HWTIE calculates at least one of the following values during the tokenization process:
1. The token identifier (as defined by the TKLUT when the token is identified).
2. A prefix of the token (e.g. the first N characters of the string representing the token).
3. The hash of the token value (e.g. an N-bit value representing the token value). The hash function is computed on the normalized character representation (and not on the original byte stream representing the character).
4. The numeric value of the token (e.g. the integer represented by the token string). This may only be applicable for a subset of the tokens.
5. The token index within the input stream.
6. The token length within the input stream (expressed in bytes based on the appropriate lengths of the characters in the original file).
7. A validation rule to be applied to the token.
8. A transformation rule to be applied to the token.
9. An invalid token indication.
All these values are organized within a token record that is sent for further processing to the next processing step.
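A hypothetical token record assembled from the values listed above, as forwarded to the parsing step, might be pictured as follows; field names and types are illustrative:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TokenRecord:
        token_id:   int             # 1: identifier assigned by the lexical table
        prefix:     str             # 2: first N characters of the token
        value_hash: int             # 3: hash of the normalized token value
        number:     Optional[int]   # 4: numeric value, where applicable
        index:      int             # 5: token index within the input stream
        length:     int             # 6: token length in bytes of the source file
        val_rule:   Optional[int]   # 7: validation rule to be applied
        tr_rule:    Optional[int]   # 8: transformation rule to be applied
        invalid:    bool = False    # 9: invalid token indication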
For every received character, one or two dependent memory access cycles need to be completed. For efficient memory bandwidth utilization as well as increased performance, multiple character streams may be tokenized in parallel. Alternatively, a single character stream may be broken into multiple streams at well-defined points identified as token boundaries even prior to entering the HWTIE. As a result, the HWTIE may process one character every clock cycle, which in some embodiments is beyond 1 Gigabit per second of throughput (dependent on the character encoding) while requiring in some embodiments two external memory devices at the most (and possibly a combination of an off chip memory device and one on chip memory block).
The number of states that can be supported by the Lexical lookup table 108, if implemented in standard SDRAM technology, may reach tens of thousands of states and beyond. This is expected to be more than sufficient for any typical structured file format and certainly more than required for most common programming languages, including complex languages such as C or Java.
f) SYNTACTICAL ANALYSIS 508
After the input stream has been converted into token records as defined above, the token records may be used to identify the syntactical structure of the input. This process is commonly known as parsing or syntactical analysis. The Hardware Parsing Engine (HWPE) receives a stream of tokens and produces a stream of parsing records. A parsing record may be produced for every input token or for a stream of input tokens. In some cases, more than one parsing record may be produced per token (e.g. in the case that a single token has a complex meaning that may be expressed as multiple tokens or if a token requires a specific transformation action but would otherwise not be represented by a parsing record by itself).
In some embodiments, the HWPE uses the Syntax Lookup Table 114 and an auxiliary Parsing Stack (PSTK) in order to identify the production rules of a given grammar used to produce the token stream. Note that the token stream is evaluated in the context of a single grammar (identified prior to the initialization of the processing procedure), although the Syntax Lookup Table 114 may contain information regarding multiple grammars simultaneously.
The HWPE is initialized at an initial parsing state before the first token of the file is received. In addition, the PSTK is initialized to contain an end of stack identifier. For each token, current state and head of stack content, the HWPE calculates the address in the Syntax Lookup Table SYLUT 114 to be accessed and reads the corresponding record. This record may contain information regarding the parsing results (e.g. if a production rule has been identified or a syntax error has occurred), action to be taken (e.g. push/pop the current state, parsing record and token or grammar variable to/from the PSTK), an offset to a consecutive lookup iteration, semantic information as defined for the production rule (see section "Generation of Syntax Lookup Table 114" above) and the algorithm of computing the next state (e.g. selection of a deterministic lookup or a hash function dependent lookup and which hash function to use).
For every received token, a series of production rules may need to be followed. Therefore, tokens may be accumulated in a FIFO memory prior to being processed by the HWPE to allow for variable processing delays. During the parsing process, the HWPE may store several temporary values as part of the general context or within the PSTK. These values are required for various calculations performed by the HWPE. In some embodiments, the HWPE may calculate at least one of the following values to be included in the parsing records as output:
1. Production rule identifier (e.g. the type of statement being parsed).
2. Parsing record index (may be allocated before the record is sent to the output and hence may be lower than the index of records sent out earlier). May be based on one of several stored counter values (e.g. to enable separate indexing for different rule types).
3. Beginning of record in the input stream (e.g. the beginning of a token within a production rule - not necessarily the first token).
4. Length of the record in the input stream (e.g. the sum of lengths of a specified subset of tokens [or grammar variables] constituting a production rule).
5. Beginning of a specified sub-element (token or grammar variable) of the production rule.
6. Length of a specified sub-element (token or grammar variable) of the production rule.
7. Context value(s) (e.g. a value stored earlier in the parsing process, such as a namespace identifier for XML files).
8. Record prefix (e.g. the prefix of a specified token in the production rule).
9. Value (e.g. the value of a specified token included in the production rule).
10. A number of hash values (the hash value of one or more of the tokens constituting the rule).
11. A number of combined hash values. Each of these values is calculated as the combined hash value of the record with one of a number of hash values stored in the record of the parent production rule (residing on the top of the PSTK). If the hash values of the parent production rule are invalid, the result of the combination will be invalid.
12. A number of counters incremented based on the counter values of previously processed production rules (i.e. rule children).
13. A number of record indexes of previously processed production rules (i.e. rule children).
14. A validation rule to be applied to the parsing record.
15. The result of a comparison between two specified calculated values (e.g. two hash values).
16. A transformation rule to be applied to the parsing record.
17. A syntax error indication.
For every production rule, one or two memory access cycles may need to be completed. For efficient memory bandwidth utilization as well as increased performance, multiple token streams may be parsed in parallel. Alternatively, a single token stream may be broken into multiple streams at well-defined points identified as major syntax boundaries even prior to entering the HWPE (not all data representation languages are likely to allow this but certain variants, such as EDI messages, will). As a result, the HWPE may, in some embodiments, process up to one token every clock cycle. A typical token of some embodiments may constitute between one and tens of characters or beyond. According to some embodiments, for an average token length of four characters, this may support beyond 4 Gigabits per second of throughput (dependent on the character encoding) while requiring two external memory devices at the most (and possibly a combination of an off chip memory device and one on chip memory block). The throughput of the HWPE may be much higher than that of the HWTIE. This allows traversing multiple production rules without reading additional tokens and without exhausting a possible FIFO memory preceding the HWPE.
The number of states that can be supported by the Syntax Lookup Table 114, if implemented in standard SDRAM technology, may reach tens of thousands of states and beyond. This is expected to be more than sufficient for any typical structured file format and certainly more than required for most common programming languages, including complex languages such as C or Java. This ensures that many grammars may be stored together in the Syntax Lookup Table 114 and activated according to the processing requirements of any given input file.
In Tables 1A-1B and 2A-2B, provided in accordance with the exemplary input of FIGS. 1-3, each row represents a single parsing record for the "emp.xml" input example that was introduced earlier. Note that these tables are generated as a single stream of parsing records with separate running record indexes for XML elements and XML attributes, effectively partitioning the records into two tables. For clarity of presentation, these tables are presented separately as Tables 1A-1B and 2A-2B. In general, an arbitrary number of tables may be generated in this fashion. Note also that additional records may be generated internally but discarded later on. It is stressed that the fields and elements in the tables provided are merely provided as an illustrative example, and it will be appreciated that tables that include extra fields or rows or that lack certain fields or rows are still within the scope of the invention.
Each instance in the element table (see Table 1) includes information for an element in the XML file. This information includes a unique identifier for the element, length of the element, length of the value of the element, hash values (detailed below), index to the first attribute (i.e. its position in the attributes table below), index to the parent element, index to the last child of the element, index to the previous sibling, and sibling index. The HWPE outputs the records one at a time, adding indices to mark the type of record (for example, if it is an element or an attribute) and the parsing record index (PRI, the record index in the table). Note that when using random access memory (RAM), the records may be put directly in order, even when they are output in a different order, by using the PRI. The hash values are for the tag of the element, the path from the root element to the element, the path from the parent element to the element, the path from the grandparent to the element, and the path from the great-grandparent to the element (obviously, this can be further optimized to save more [or fewer] paths so that the compromise between space and performance is best for the applications at hand). In addition, the following pointers to the XML file (i.e. the offset from the start of the file to the first character of the relevant object) are added: start of the element tag, start of the value of the element. Table 1 includes the information gathered for the production rule related to XML elements.
In this table, "PRI" is the parsing record index attached to the element, "type" is the type of production rule that generated the record (this table includes all records that were generated by the "element" production rule), "V type" is the type of the value of the element, "V(8)" is the 8 byte prefix of the value of the element, "V(N)" is the numeric value of the element (where a value may be computed), "1st" is the index of the first attribute of the element in the attributes table, "#" is the number of attributes the element has, "P" is the index of the processing record of the parent element (which is also within this table), "C#" is the number of the children elements of the element, "LC" is the last child element of the element, "PS" is the previous sibling element of the element, "S#" is the sibling index for the element within the parent element, "h(V)" is the hash of the value of the element, "h(T)" is the hash of the tag of the element, "h(R)" is the hash of the path from the root element to the element, "h(P)" is the hash of the path from the parent to the element, "h(GP)" is the hash of the path from the grandparent to the element, "s(T)" is the offset of the element's tag in the input XML file, "len(T)" is the length of the tag of the element, "s(V)" is the index of the element's value in the input XML file, "len(V)" is the length of the element's value, "TR" is the transformation rule to be applied for the element, "VR" is the validation rule to be applied for the element, and FSME, which is an abbreviation for "For SeMantic analysis Engine", is the list of fields that are relevant for semantic analysis and will be used by the HWSME. The identifiers point out which fields to act upon (e.g. compare) when a semantic rule is executed.
Each instance in the attribute table (see Table 2) includes information for an attribute in the XML file. This information includes a unique identifier for the attribute, an index to its element, index to the next attribute (i.e. its position in the attributes table below), length of the attribute, length of the value of the attribute, and hash values (detailed below).
The hash values are for the attribute, the path from the root element to the element and to the attribute, the path from the parent element to the element and to the attribute, the path from the grandparent to the element and to the attribute, and the path from the great-grandparent to the element and to the attribute (obviously, this can be further optimized to save more [or fewer] paths so that the compromise between space and performance is best for the applications at hand). Additional hash values may be calculated in order to accelerate the processing of specific dialects of data representation languages. In addition, at least one of the following pointers to the XML file (i.e. the offset from the start of the file to the first character of the relevant object) may optionally be added: start of the attribute, start of the value of the attribute.
In Table 2:
- "PRI" is the parsing record index attached to the attribute;
- "type" is the type of production rule that generated the record (this table includes all records generated by the "attribute" production rule);
- "V type" is the type of the value of the attribute;
- "V(8)" is the 8-byte prefix of the value of the attribute;
- "V(N)" is the numeric value of the attribute (where a value may be computed);
- "PE" is the index of the parent element of the attribute in the elements table;
- "IE" is the index of the attribute within the element;
- "P", "C#", "LC", "PS", and "S#" are fields not used for attributes (only for elements);
- "h(V)" is the hash of the value of the attribute;
- "h(A)" is the hash of the attribute's name;
- "h(R)" is the hash of the path from the root element to the element and the attribute;
- "h(P)" is the hash of the path from the parent to the element and the attribute;
- "h(GP)" is the hash of the path from the grandparent to the element and the attribute;
- "s(A)" is the offset of the attribute's name in the input XML file;
- "len(A)" is the length of the attribute's name;
- "s(V)" is the offset of the attribute's value in the input XML file;
- "len(V)" is the length of the attribute's value;
- "TR" is the transformation rule to be applied for the attribute;
- "VR" is the validation rule to be applied for the attribute; and
- "FSME" is the list of fields that are relevant for semantic analysis.
Note that while some fields in Table 2 are similar to those in Table 1, some fields have a different meaning. This may be achieved by the way the processing is configured. Note also that maintaining the same structure for each row in the tables may speed up the processing. However, it is possible to have different rows in case different methods are used for the processing.
g) SEMANTIC ANALYSIS 510
Following the syntactical analysis, a stream of parsing records is formed on which semantic analysis may be applied. The Semantic Analysis Engine (HWSME) performs semantic analysis resulting in a semantically augmented stream of parsing records. Typically, for every parsing record received one semantically augmented parsing record is produced. The HWSME uses the Semantic Lookup Table 120 to deduce which validation and/or transformation rules may need to be applied to the parsing records.
The HWSME may be a stateless engine (in contrast to the HWTIE and HWPE) and may perform the semantic analysis of the parsing records one record at a time, regardless of the history of records previously processed by it. Using a subset of the record fields (such as the type and the hash value of the record, as indicated by the HWPE or HWTIE), the HWSME reads an entry from the Semantic Lookup Table 120. This entry specifies which validation and transformation rules may need to be applied to the parsing record. These rules will be applied later by the hardware and software validation and transformation engines (see below).
Note that when hash values are used, two approaches may be taken. The first approach assumes that the probability of a hash collision is negligible (for example, 10^-48) and thus that the hash value identifies the compared object uniquely. The second approach is to verify that the value of the compared object is identical to the expected one whenever the hash value matches the expected hash value. The latter may be slower in implementation, as it requires value comparison (for example, of strings) in each case (though the hash value used may be shorter in this case), but it is accurate in all cases.
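A non-authoritative sketch in Python, assuming a 160-bit hash (SHA-1 is used here purely as an example of a wide hash) and access to the original strings, contrasting the two approaches:

```python
import hashlib

def h(s: str) -> int:
    # A wide (160-bit) hash makes an undetected collision negligibly likely.
    return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big")

# Approach 1: trust the hash value alone; a collision would go undetected.
def matches_fast(observed_hash: int, expected_hash: int) -> bool:
    return observed_hash == expected_hash

# Approach 2: on a hash match, also compare the underlying strings,
# trading extra work for accuracy in all cases.
def matches_exact(observed: str, expected: str) -> bool:
    return h(observed) == h(expected) and observed == expected
```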
In some embodiments, the output of the HWSME includes the original parsing record received from the HWPE with possible additional information including but not limited to:
1. A set of validation rule identifiers to be applied to the record.
2. A set of transformation rule identifiers to be applied to the record.

In addition, the HWSME may store statistics related to certain semantic rules, such as the number of instances of a certain record type with a given value.
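A minimal sketch (Python; the table contents, key structure, and field names are hypothetical) of this stateless augmentation step:

```python
# Hypothetical semantic lookup table, keyed by (record type, path hash),
# mapping to (validation rule ids, transformation rule ids).
SEMANTIC_LUT = {
    ("element", 0x1A2B): ((3,), (7,)),
    ("attribute", 0x3C4D): ((5,), ()),
}

def augment(record: dict) -> dict:
    """Augment one parsing record; no state is carried between records,
    so records may be processed in any order or in parallel."""
    key = (record["type"], record["h_root"])
    vr, tr = SEMANTIC_LUT.get(key, ((), ()))
    record["validation_rules"] = vr
    record["transformation_rules"] = tr
    return record
```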
In some embodiments, for every parsing record, one or two memory access cycles may need to be completed. For efficient memory bandwidth utilization as well as increased performance, multiple record streams may be analyzed in parallel. Alternatively, a single record stream may be broken into multiple streams (since the HWSME may be stateless, this decomposition is relatively simple). As a result, the HWSME may analyze up to one parsing record every clock cycle. Hence, the HWSME throughput is limited by the HWPE throughput and does not require any input buffering. Only one or two external memory devices are required by the HWSME (and possibly a combination of one off chip memory device and one on chip memory block).
The number of semantic rules that can be supported by the Semantic Lookup Table 120, if implemented in standard SDRAM technology may reach many tens of thousands and beyond. This is expected to be more than sufficient for any typical structured file format.
h) INITIAL VALIDATION 506
Following the semantic analysis, a stream of semantically augmented parsing records is formed on which validation and transformation rules may be applied. The HWVE performs the initial step of validation resulting in a validated stream of parsing records. Typically, for every parsing record received one validated parsing record is produced.
The HWVE uses the Validation Lookup Table 126 to deduce which fields of the parsing records require validation and which validation actions need to be performed on them.
The HWVE may be a stateless engine (in contrast to the HWTIE and HWPE) and may perform the validation actions of the parsing records one record at a time, regardless of the history of records previously processed by it. Using the validation rule identifiers supplied by the previous processing steps, the HWVE reads the validation parameters (such as range boundaries and validation functions) from the Validation Lookup Table 126. For more information on the supported validation actions see section "Validation schema compilation" above.
The output of the HWVE includes the original parsing record received from the HWSME with optional additional information including at least one of:
1. An indication that the parsing record was validated (and found valid or invalid).
2. A validation error indicator (e.g. range exception, type mismatch).
3. An indicator that the current record requires validation but was not validated by the HWVE (e.g. due to hardware implementation limitations).

For every parsing record, one or two memory access cycles may need to be completed.
For efficient memory bandwidth utilization as well as increased performance, multiple record streams may be validated in parallel. Alternatively, a single record stream may be broken into multiple streams (since the HWVE is stateless, this decomposition is relatively simple). As a result, the HWVE may validate up to one parsing record every clock cycle. Hence, the HWVE throughput is limited by the HWSME throughput and does not require any input buffering. Only one or two external memory devices are required by the HWVE (and possibly a combination of one off chip memory device and one on chip memory block).
The number of validation records that can be supported by the Validation Lookup Table 126, if implemented in standard SDRAM technology may reach many tens of thousands and beyond. This is expected to be more than sufficient for any typical structured file format.
For the present example, the records described earlier in Tables 1-2 are scanned for the relevant fields (such as name and position) so that the validation information is available; the values are then tested against the rules defined in the pre-processing stage. For example, the "sex" attribute is verified to be either "male" or "female". In case the validation fails, the HWVE sets appropriate flags, which will later trigger alerts and notifications for the control application. Note that the example file "emp.xml" conforms to the sample validation file "emp.dtd", so its validation is successful.
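As an illustrative sketch only (Python; the record field names are assumptions), such an enumeration check over the stored offset/length fields might look like this:

```python
ALLOWED_SEX_VALUES = {"male", "female"}  # enumeration from the emp.dtd example

def validate_sex(record: dict, source_text: str) -> bool:
    """Check an attribute value against the allowed enumeration.

    The value is read back from the input file using the stored
    offset/length fields rather than a separately copied string.
    """
    value = source_text[record["s_v"]:record["s_v"] + record["len_v"]]
    ok = value in ALLOWED_SEX_VALUES
    if not ok:
        # On failure, set a flag for the control application (cf. Table 3).
        record["flags"] = record.get("flags", []) + ["enumeration_violation"]
    return ok
```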
Table 3 shows the flags that were generated during the processing stage as described above. Note that the flags may also refer to information searched for within the data representation language file. For example, it is possible to denote whether the term "confidential" is found within the data representation language file (or even in a specific location, for example in the root element of an XML file).
Table 3 includes the results of the processing and the list of errors that were encountered. Table 3 holds variables that may be used by the application to tell if the XML is well formed, passes validation etc. Note that this is only a sample of the information that may be saved for the processed XML.
i) INITIAL TRANSFORMATION 506

Following the initial validation, a stream of validated parsing records is produced, on which initial transformation rules may be applied. The HWTRE performs the initial step of transformation, resulting in a partially transformed stream of parsing records. Typically, for every parsing record received, one transformed or untouched parsing record is produced. The HWTRE uses the Transformation Lookup Table 128 to deduce which fields of the parsing records require transformation and which transformation actions need to be performed on them.
The HWTRE may be a stateless engine (similar to the HWVE) and may perform the transformation actions of the parsing records one record at a time regardless of the history of records previously processed by it. Using the transformation rule identifiers supplied by the previous processing steps, the HWTRE reads the transformation parameters (such as transformation functions) from the transformation lookup table TRLUT 128. For more information on the supported transformation actions see section "TRANSFORMATION SCHEMA COMPILATION TO GENERATE ONE OR MORE TRANSFORMATION LOOKUP TABLE(S) 128" above.
In some embodiments, the output of the HWTRE includes the original parsing record received from the HWVE with possible additional information including at least one of:
1. An indication whether the parsing record requires transformation.
2. The required transformation type.
3. The applied transformation rule identifier.
In some embodiments, for every parsing record, one or two memory access cycles may need to be completed. For efficient memory bandwidth utilization as well as increased performance, multiple record streams may be transformed in parallel. Alternatively, a single record stream may be broken into multiple streams (since the HWTRE is stateless, this decomposition is relatively simple). As a result, the HWTRE in some embodiments may transform up to one parsing record every clock cycle. Hence, the HWTRE throughput is limited by the HWVE throughput and does not require any input buffering. Only one or two external memory devices are required by the HWTRE (and possibly a combination of one off chip memory device and one on chip memory block).
The number of transformation records that can be supported by the Transformation Lookup Table 128, if implemented in standard SDRAM technology may reach many tens of thousands and beyond. This is expected to be more than sufficient for any typical structured file format. The transformed and validated parsing records produced by the HWTRE are sent back to the software processing modules for completion of the process.
In some embodiments, at the end of the transformation step, a transformation instructions table is created. This table includes the list of operations required to generate the required HTML output. The processing that remains is generating the actual output by following these instructions, which are essentially offsets and lengths for copying strings from the input file, as well as strings prepared in the pre-processing stage. A typical implementation may store the relevant strings from the pre-processing stage (that is, the ones that may be required for the output) in a smaller file and then reference this file. In the current example, the records described earlier in Tables 1-2 are scanned for the relevant fields (such as name and position) so that the transformation information is available. For example, the "name" element is recognized, and appropriate entries are prepared for building the appropriate output.
In this example, the result of the processing for the "emp.xml" file is shown in Table 4 (for simplicity, the offset values in this example refer to strings in the original "emp.xsl" file). Each line in the table holds an instruction which builds an additional part of the required output file. Note that the references to the "emp.xsl" file are actually constant strings known at the pre-processing stage, while the references to the "emp.xml" file are actually pointers to strings within the input file. In Table 4, "TRI" is the transformation record index, "OP" is the operation required ("const" for outputting a constant string known at the end of the pre-processing stage, "copy" for copying a string from the source text file), "Source" is a reference to the place where the required operand of the operation resides, "Offset" is the offset of the required string from the start of the source operand, and "Len" is the number of bytes to output. Each entry in Table 4 is effectively a view of the transformation rule that was applied to an element or an attribute, together with the appropriate offset and length of the operand string. Using this table, the final transformation can be very efficient, as the required transformed output is built by simple primitives that copy strings directly to the destination output stream.
j) FINAL VALIDATION 506
In some embodiments, the first software module to receive the transformed and validated parsing records is the SWVM. The SWVM is responsible for completing all the required validation directives that the hardware engines did not perform (either because they were not capable of performing them or because they were not instructed to perform them). The SWVM uses indications received from the hardware engines pointing at the records that require further validation. The actual validation is performed by executing the required validation directives as defined in the pre-processing phase. This is done using standard software validation techniques. One important result of the final validation, in some embodiments, is a decision whether the input data representation language file is valid or not. In case the file fails validation, the output records will, in this example, also indicate which validation rules were violated.
k) FINAL TRANSFORMATION 506
Following the validation of the SWVM the records are sent to the SWTM. The SWTM is responsible for completing all the required transformation directives that the hardware engines did not perform (either because they were not capable of performing them or because they were not instructed to perform them). The SWTM uses indications received from the hardware engines pointing at the records that require further transformation. The actual transformation is performed by executing the required transformation directives as defined in the pre-processing phase. This is done using standard software transformation techniques.
For the example "emp.xsl" file, the resulting file after transformation to HTML is provided in FIG. 11.
l) EXCEPTION HANDLING
In every phase of the data representation language processing, various exceptions may occur. These exceptions may be the result of malformed data representation language files or implementation restrictions. In the first case, the file may be partially processed (e.g. if a simple lexical error is found) or it may be totally incomprehensible. In the second case, processing is delegated to standard software tools for functional completeness at the price of reduced performance.
m) WIRE-SPEED IMPLEMENTATION
Note that the tables may be created "on the fly". Note also that creating the tables requires only a simple stack that depends on the depth of the XML structure and not on the amount of the information in the XML file.
This implementation is specifically optimized for hardware acceleration, as the sequential access to the file and the low memory requirements allow for one-pass processing that generates the required tables and allows further information to be easily deduced from them. Another optimization could be the elimination of tables that are not required for a specific application. For example, in cases where the application is interested only in elements but not in attributes, the list of attributes may be removed.
Note also that the tables may be unified into a single table if this simplifies the implementation, if for application reasons it would result in better performance, or for any other reason.
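For illustration, a minimal Python sketch of such one-pass table construction, where the only state is a stack whose size is bounded by the nesting depth of the XML structure rather than by the amount of information in the file. The event representation is an assumption for the sketch:

```python
def build_element_table(events):
    """One pass over (kind, name) parse events, where kind is
    'start' or 'end'; attributes are omitted for brevity.

    The stack holds indices of currently open elements, so memory
    grows with nesting depth, not with file size.
    """
    table, stack = [], []
    for kind, name in events:
        if kind == "start":
            parent = stack[-1] if stack else None
            table.append({"tag": name, "parent": parent, "children": 0})
            if parent is not None:
                table[parent]["children"] += 1
            stack.append(len(table) - 1)
        elif kind == "end":
            stack.pop()
    return table

# Example: <a><b/><c/></a>
# build_element_table([("start","a"), ("start","b"), ("end","b"),
#                      ("start","c"), ("end","c"), ("end","a")])
```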
n) POST PROCESSING 206
During the processing of data representation language files, various statistics are optionally gathered regarding the nature of the file, the resources it required, and hardware and software performance metrics (e.g. the production rate of tokens per second or parsing records per second). These statistics are fed back into the pre-processing mechanisms to control caching algorithms as well as performance optimizations related to the structure of the tokens or grammars fed to the hardware engines.
6) USAGE OF HASHES AND OTHER ABBREVIATED REPRESENTATIONS
A feature provided by certain embodiments of the present invention is the ability to use and compare hashes instead of complete strings. Abbreviated representations such as hash values are calculated for one or more data items of the data representation language (e.g. XML elements, attributes, values), and the abbreviated representation (e.g. hash value) rather than the full representation is used in comparisons. It can therefore be decided efficiently whether the data conforms to a schema or whether a transformation should occur for some part of the data. Hash values are simple and fast to process (by hardware as well as by software), and thus it is possible to perform very fast and efficient processing of data representation language content.
The hash values may be used for search expressions (such as are used by XPath and XQuery for XML files) as well as for transformation expressions (such as are used by XSL for XML files) and actions.
In some embodiments, hash values are not unique, and thus it is possible for two different expressions to have the same hash value (this is called a "hash collision", and such expressions are called "hash synonyms"). The algorithm described may take different approaches to address this issue. The first is to use a hash function that generates large values (for example, 160-bit values) and to assume that the probability of a collision is negligible (for example, 10^-48). This approach results in very efficient processing, with the risk of a collision going undetected. The other approach is to verify the expression (for example, by comparing the strings) when its hash value signals that it is a synonym of the expected expression. Obviously, the latter approach requires more processing and thus yields slower throughput for the data processing.
One particular usage of abbreviated representations or hash values, particular to markup languages or data representation languages providing tree representations, is described in FIG. 10. According to the technique disclosed in FIG. 10, first a tree representation of part or all of the text file is provided or derived 402. Then paths are calculated between a selected reference node 404 and one or more selected 406 target nodes. After deriving data indicative of a path between a selected target node and the reference node 408, a representation of the path (e.g. a hash value) is derived and stored 412 in volatile or non-volatile memory. Optionally, this process of deriving a path 410 and storing the representation 412 is also carried out for additional nodes with a fixed relation to a selected target node, such as an ancestor of the target node, e.g. a parent or grandparent of the target node.
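A simplified sketch (Python; the (tag, children) tree encoding and the hash function are assumptions made for the example) of the procedure of FIG. 10, storing path hashes for the root, parent, and grandparent paths of every node:

```python
import hashlib

def path_hash(names) -> int:
    # Hash a path expressed as a sequence of tag names.
    return int.from_bytes(hashlib.sha1("/".join(names).encode()).digest(), "big")

def store_path_hashes(node, ancestors=(), store=None):
    """Walk a (tag, children) tree and store, per node, hashes of the
    root path and of the parent/grandparent sub-paths (steps 404-412).
    For shallow nodes the sub-paths degenerate to the root path."""
    store = {} if store is None else store
    tag, children = node
    path = ancestors + (tag,)
    store[id(node)] = {
        "h_root": path_hash(path),             # root -> node
        "h_parent": path_hash(path[-2:]),      # parent -> node
        "h_grandparent": path_hash(path[-3:]), # grandparent -> node
    }
    for child in children:
        store_path_hashes(child, path, store)
    return store

# Example: tree = ("emp", [("name", []), ("sex", [])])
# store_path_hashes(tree)
```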
In the description and claims of the present application, each of the verbs "comprise", "include" and "have", and conjugates thereof, is used to indicate that the object or objects of the verb are not necessarily a complete listing of members, components, elements or parts of the subject or subjects of the verb.
The present invention has been described using detailed descriptions of embodiments thereof that are provided by way of example and are not intended to limit the scope of the invention. The described embodiments comprise different features, not all of which are required in all embodiments of the invention. Some embodiments of the present invention utilize only some of the features or possible combinations of the features. Variations of embodiments of the present invention that are described and embodiments of the present invention comprising different combinations of features noted in the described embodiments will occur to persons of the art. The scope of the invention is limited only by the following claims.

Claims

WHAT IS CLAIMED IS:
1. A method of text file processing, the method comprising: a) providing a schema associated with a dialect of a data representation language; and b) processing said schema to generate at least one look-up table encoding a plurality of directives for text processing in accordance with said schema.
2. The method of claim 1 wherein said schema is provided electronically, and said stage of generating includes only electronic processing of said electronically provided schema.
3. The method of claim 1 wherein said stage of generating said look-up table includes effecting a compiling of said schema.
4. The method of claim 3 wherein said compiling includes an electronic identification of at least one production rule of a grammar of said schema.
5. The method of claim 4 wherein said compiling includes identification of at least one semantic directive associated with a said production rule of said grammar.
6. The method of claim 5 wherein said semantic directive is selected from the group consisting of a classification semantic directive and an action semantic directive.
7. The method of claim 5 wherein said semantic directive is selected from the group consisting of validation directives and transformation directives.
8. The method of claim 5 wherein said compiling includes a compiling of said identified semantic directive into a said lookup table.
9. The method of claim 1 wherein files encoded in said data representation language are representable as a tree structure.
10. The method of claim 1 wherein said data representation language is a tag delimited language.
11. The method of claim 1 wherein said data representation language is selected from the group consisting of XML and EDI.
12. The method of claim 11 wherein said dialect is an XML dialect selected from the group consisting of FIX, OFX, SwiftML, SOAP, WSDL, HL7, EDI and AccordXML.
13. The method of claim 1 wherein said schema is provided in a format selected from the group consisting of a DTD format and an XSD format.
14. The method of claim 1 wherein said schema is a transformation schema.
15. The method of claim 14 wherein said transformation schema is provided in a format selected from the group consisting of XSL.
16. The method of claim 1 wherein said generating includes at least partially implementing said lookup table in hardware.
17. The method of claim 16 wherein said hardware includes at least one re-programmable logic component selected from the group consisting of an FPGA component, an ASIC component, and a gate array component.
18. The method of claim 16 wherein a plurality of said lookup tables are generated and said processing includes determining a first subset of said plurality to be implemented in software and a second subset of said plurality to be implemented in hardware.
19. The method of claim 18 wherein said first subset of said plurality includes at least one said lookup table that encodes at least one directive to perform a relatively simple operation.
20. The method of claim 19 wherein a said relatively simple operation is selected from the group consisting of a type checking operation, a range checking operation, and a fixed increment operation.
21. The method of claim 18 wherein said first subset of said plurality includes at least one said lookup table that encodes at least one directive to perform a relatively complex operation.
22. The method of claim 21 wherein a said relatively complex operation is selected from the group consisting of a sorting operation, transactional operations, transactional operations requiring communication with a remote entity, and cryptographic operations.
23. The method of claim 1 wherein at least one said lookup table encodes at least one semantic directive.
24. The method of claim 23 wherein said semantic directive includes an operation to be performed when at least one grammar item selected from the group consisting of a token, a production rule and a combination of a said token and a said production rule is identified.
25. The method of claim 23 wherein a said operation to be performed upon said identification of a said token is selected from the group consisting of a storing of a position of said token in an input stream, a storing of a length of said token, a storing of a prefix of said token, a calculation and storing of a numerical value corresponding to said token, a storing of a token type of said token, a discarding of said token, a counting of said token, and a storing of a pointer to a semantic rule.
26. The method of claim 23 wherein a said operation to be performed upon identification of a said production rule is selected from the group consisting of a storing of a position of a token associated with said production rule, a storing of a number of tokens associated with said production rule, a storing of a character prefix associated with said production rule, a calculation and a storing of a numerical value associated with a token of said production rule, a calculation and storing of an abbreviated representation of said production rule, a calculation and storing of an abbreviated representation of a hierarchy of rules including said production rule, a storing of a rule type of said production rule, a discarding of input related to said rule, a counting of said production rule, a storing of an index of said production rule in a rule type table, a storing of an index of at least one sub-element of said production rule, a storing of a value of a counter associated with at least one sub-element of said production rule, an inheriting of an index of said production rule to at least one sub-element of said production rule, a storing of at least one index inherited from a parent rule of said production rule, a comparison of a specific calculated value with a stored value associated with said production rule, a storing of a pointer to a specific semantic rule, and a storing of an indication of fields to be used to deduce required said production rule identifiers.
27. The method of claim 26 wherein said semantic rule is selected from the group consisting of a validation rule and a transformation rule.
28. The method of claim 23 wherein a said operation to be performed upon identification of a combination of a said token and a said production rule is selected from the group consisting of a storing of a pointer to a specific semantic rule and a counting of said production rule.
29. The method of claim 28 wherein said semantic rule is selected from the group consisting of a validation rule and a transformation rule.
30. The method of claim 23 wherein at least one said semantic lookup table is a stateless lookup table.
31. The method of claim 1 wherein at least one said lookup table is a semantic lookup table encoding at least one semantic directive.
32. The method of claim 31 wherein said semantic directive is selected from the group consisting of a semantic classification directive and a semantic action directive.
33. The method of claim 31 wherein at least one said semantic directive is selected from the group consisting of a validation directive and a transformation directive.
34. The method of claim 33 wherein at least one said validation directive is selected from the group consisting of a value comparison directive, a simple value comparison directive, a multiple value comparison directive, a range checking directive, a type checking directive, an integer range checking directive, a fraction range checking directive, a value length directive, and a value length checking directive.
35. The method of claim 33 wherein at least one said transformation directive is selected from the group consisting of a directive to mark a structure for transformation of a given type, a directive to mark a structure for transformation of a given type specified by a transformation identification code, a directive to remove a token, a directive to remove a complex structure, a directive to truncate a token, a directive to truncate a complex structure, a directive to replace a first token with a second token, a numerical conversion, and an updating of a given field in an output table.
36. The method of claim 34 wherein said range checking directive is a directive for context specific range checking.
37. The method of claim 33 wherein an array of at least one transformation directive encodes a transformation between said dialect and a different dialect of said data representation language.
38. The method of claim 33 wherein an array of at least one transformation directive encodes a transformation between said data representation language and a different data representation language.
39. The method of claim 33 wherein an array of at least one transformation directive encodes a transformation between said data representation language and a language other than a data representation language.
40. The method of claim 31 wherein at least one said semantic directive is a path validation directive.
41. A method of text processing, the method comprising: a) receiving a plurality of text files encoded in a data representation language; b) for at least one said text file, determining a dialect of said data representation language; c) generating from a definition of said dialect at least one look-up table encoding directives for text processing in accordance with said determined dialect; and d) using at least one said look-up table, processing said respective text file in accordance with said determined dialect.
42. The method of claim 41 wherein at least one said look-up table is implemented at least in part in hardware.
43. The method of claim 41 wherein a plurality of said text files are subjected to said processing, and said stage of processing includes: i) determining a set of common or heavy operations; ii) determining a set of uncommon or light operations; iii) for a subset of said plurality of said text files, performing said set of common or heavy operations; and iv) performing on said subset of said text files said set of uncommon or light operations.
44. The method of claim 41 wherein said determining is effected at least in part using a first hardware module and said processing is effected at least in part using a second said hardware module.
45. The method of claim 44 wherein said first and second hardware modules are configured to effect a pipeline processing.
46. The method of claim 41 wherein data associated with at least one said look-up table is cached, and said stage of processing includes retrieving at least some of said cached data.
47. The method of claim 46 wherein said caching includes caching at a plurality of locations within a network.
48. The method of claim 41 wherein said definition of said dialect includes a schema, and said generating of said look-up table includes electronic processing of said schema.
49. The method of claim 48 wherein said processing of said schema includes effecting a compiling of said schema.
50. The method of claim 41 wherein before commencing processing of at least one said text file a hardware update of a said lookup table is performed.
51. The method of claim 41 wherein said determining of said dialect includes identifying a string selected from the group consisting of a file name of a said text file and a file type of a said text file.
52. The method of claim 41 wherein said determining of said dialect includes identifying a file source of a said text file.
53. The method of claim 41 wherein said determining of said dialect includes parsing at least some of said text file.
54. The method of claim 53 wherein said determining of said dialect includes effecting a first pass over said respective text file, and said processing using said at least one look-up table includes effecting a second pass over said respective text file.
55. The method of claim 41 wherein said determining, generating and processing is performed iteratively more than once on a said text file.
56. The method of claim 41 wherein more than one said text file is subjected to said processing in parallel.
57. The method of claim 41 wherein one said look-up table encodes a production rule of a grammar.
58. The method of claim 57 wherein said processing of said text file includes effecting in accordance with at least one said look-up table at least one grammatical analysis of a said text file selected from the group consisting of a syntactic analysis and a semantic analysis.
59. The method of claim 58 wherein said syntactic analysis includes recording at least one value selected from the group consisting of a production rule identifier, a parsing record index, a beginning of a record in an input stream, a length of a record in an input stream, a beginning of a specified sub-element of a production rule, a length of a specified sub-element of a production rule, a context value, a value stored earlier in a parsing process, a record prefix, a prefix of a specified token in a production rule, a value associated with a specific token in a production rule, a hash value associated with at least one token associated with a production rule, a number of combined hash values, a number of counters incremented based on counter values of previously processed production rules, a number of record indexes of previously processed production rules, and a syntax error indication.
60. The method of claim 58 wherein said semantic analysis includes recording at least one value selected from the group consisting of a validation rule to be applied to a parsing record, a result of a comparison between two calculated values, and a transformation rule to be applied to a parsing record.
61. A method of generating data useful for fast text processing, the method comprising: a) providing a text file representable as a tree having a plurality of node elements; b) selecting a reference element; c) for at least one said node element, determining a path within said tree between said respective node element and said reference element; and d) storing an abbreviated representation of said path.
62. The method of claim 61 wherein said reference node is a fixed distance from a root node of said tree.
63. The method of claim 62 wherein said reference node is said root node of said tree.
64. The method of claim 61 wherein said determining of said path and said storing of said abbreviated representation is carried out for at least one said node element which is a descendant of said reference node.
65. The method of claim 64 wherein said determining of said path and said storing of said abbreviated representation is carried out only for said node elements which are descendants of said reference node.
66. The method of claim 61 wherein said determining and said storing is effected for all said nodes having one or more predefined depths within said tree.
67. The method of claim 61 wherein said abbreviated representation is a hash value.
68. The method of claim 61 wherein said abbreviated representation of said path is mapped to a representation of data associated with said node.
69. The method of claim 61 further comprising: d) for at least some said respective node elements, mapping an abbreviated representation of a path between an ancestor of said node element and said node element to a representation of data associated with said node.
70. A system for accelerating text file processing, the system comprising: a) a schema input for receiving a schema defining a dialect structure of a data representation language; and b) a look-up table generator for processing said schema to generate at least one look-up table encoding directives for text processing in accordance with said schema.
71. The system of claim 70 wherein said look-up data generator includes a schema compiler for effecting a compiling of said schema.
72. The system of claim 70 wherein said look-up data generator is operative to implement at least one said look up table in hardware.
73. A system for text processing, the system comprising: a) an input for receiving at least one text file encoded in a data representation language; and b) a text processing engine including at least one look-up table encoding a plurality of text-processing directives in accordance with a schema of said data representation language, said text processing engine for processing at least one said received text file.
74. The system of claim 73 further comprising: c) a dialect determiner for determining a said dialect of a said text file.
75. The system of claim 73 wherein at least one said look-up table is implemented at least in part in hardware.
76. The system of claim 73 wherein at least one said look-up table encodes at least one semantic directive.
77. The system of claim 73 wherein said text processing engine is distributed within a computer network.
78. The system of claim 73 further comprising: c) an exception handling module for handling exceptions generated by said processing of said at least one text file.
79. The system of claim 73 wherein said text processing engine is operative to reconfigure at least one said look-up table while effecting said processing of a said received text file.
80. The system of claim 73 wherein said text processing engine includes a hardware character normalization engine, and said processing of said received text file includes generating a character stream from a byte stream of said received text file using said hardware character normalization engine.
81. The system of claim 73 wherein said text processing engine includes a hardware token identification engine, and said processing of said received text file includes identifying tokens within a character stream representative of said text file using said hardware token identification engine.
82. The system of claim 73 wherein said text processing engine includes a hardware parsing engine, and said processing of said received text file includes receiving a stream of tokens representative of said text file and producing a stream of parsing records using said hardware parsing engine.
83. The system of claim 82 wherein said hardware parsing engine uses a said look-up table encoding at least one syntactic said text-processing directive of said dialect.
84. The system of claim 73 wherein said text processing engine further includes at least one hardware semantic analysis engine.
85. The system of claim 84 wherein a said hardware semantic analysis engine is selected from the group consisting of a hardware validation engine and a hardware transformation rule engine.
86. A system for generating data useful for fast text processing, the system comprising: a) an input for receiving a text file representable as a tree having a plurality of node elements; b) a path determiner for determining a path within said tree between said respective node element and a reference element; c) a path representor for deriving an abbreviated representation of said determined path; and d) a storage for said abbreviated representation.
87. The system of claim 86 wherein said reference element is a root element of said tree.
88. The system of claim 86 wherein said path representor is operative to derive a hash value as a said abbreviated representation of said determined path corresponding to said node.
89. The system of claim 86 wherein said storage is operative to store a map between a representation of a said node and an abbreviated representation of a said path corresponding to said represented node.
90. A method of processing a corpus of at least one text file encoded in a data representation language, the method comprising: a) processing at least one text file of said corpus using directives associated with said data representation language; b) from results of said processing, determining a schema of a dialect associated with said processed files; c) modifying a set of at least one lookup table encoding a plurality of directives for processing of text files of the corpus; d) processing at least one text file in accordance with said generated or updated lookup tables.
91. The method of claim 90 wherein steps b, c and d are repeated at least once after said processing in accordance with said modified lookup tables.
92. The method of claim 90 wherein said modifying of said lookup table includes at least one of generating at least one said lookup table and updating at least one said lookup table.
93. The method of claim 90 wherein said modifying of said lookup table includes updating hardware associated with said lookup table.
94. The method of claim 90 wherein at least one said lookup table encodes at least one semantic directive.
95. A computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for: a) providing a schema associated with a dialect of a data representation language; and b) processing said schema to generate at least one look-up table encoding a plurality of directives for text processing in accordance with said schema.
96. A computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for: a) receiving a plurality of text files encoded in a data representation language; b) for at least one said text file, determining a dialect of said data representation language; c) generating from a definition of said dialect at least one look-up table encoding directives for text processing in accordance with said determined dialect; and d) using at least one said look-up table, processing said respective text file in accordance with said determined dialect.
97. A computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for: a) providing a text file representable as a tree having a plurality of node elements; b) for at least one said node element, determining a path within said tree between said respective node element and a selected reference element; and c) storing an abbreviated representation of said path.
98. A method of accelerating processing of HTTP headers, the method comprising: a) receiving a plurality of HTTP headers; b) determining common patterns among said plurality of said HTTP headers; c) generating from said determined common patterns at least one look-up table encoding directives for text processing in accordance with said determined common patterns.
99. The method of claim 98 further comprising: d) using at least one said look-up table, processing an HTTP header.
100. The method of claim 98 wherein at least one said look-up table is implemented at least in part in hardware.
PCT/IL2005/000521 2004-05-19 2005-05-19 Method and system for processing of text content WO2005111824A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US57212304P 2004-05-19 2004-05-19
US60/572,123 2004-05-19

Publications (2)

Publication Number Publication Date
WO2005111824A2 true WO2005111824A2 (en) 2005-11-24
WO2005111824A3 WO2005111824A3 (en) 2007-03-08

Family

ID=35394798

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2005/000521 WO2005111824A2 (en) 2004-05-19 2005-05-19 Method and system for processing of text content

Country Status (1)

Country Link
WO (1) WO2005111824A2 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6466971B1 (en) * 1998-05-07 2002-10-15 Samsung Electronics Co., Ltd. Method and system for device to device command and control in a network
US20040205549A1 (en) * 2001-06-28 2004-10-14 Philips Electronics North America Corp. Method and system for transforming an xml document to at least one xml document structured according to a subset of a set of xml grammar rules
US20050014494A1 (en) * 2001-11-23 2005-01-20 Research In Motion Limited System and method for processing extensible markup language (XML) documents
US20050240392A1 (en) * 2004-04-23 2005-10-27 Munro W B Jr Method and system to display and search in a language independent manner

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105659274B (en) * 2013-10-01 2020-04-14 艾尼克斯股份有限公司 Method and apparatus for decoding data stream in reconfigurable platform
CN105659274A (en) * 2013-10-01 2016-06-08 艾尼克斯股份有限公司 Method and device for decoding data streams in reconfigurable platforms
EP2858323A1 (en) * 2013-10-01 2015-04-08 Enyx SA A method and a device for decoding data streams in reconfigurable platforms
AU2014331143B2 (en) * 2013-10-01 2019-02-21 Enyx Sa A method and a device for decoding data streams in reconfigurable platforms
WO2015049305A1 (en) * 2013-10-01 2015-04-09 Enyx Sa A method and a device for decoding data streams in reconfigurable platforms
US10229426B2 (en) 2013-10-01 2019-03-12 Enyx Sa Method and a device for decoding data streams in reconfigurable platforms
CN109871529B (en) * 2017-12-04 2023-10-31 三星电子株式会社 Language processing method and device
CN109871529A (en) * 2017-12-04 2019-06-11 三星电子株式会社 Language processing method and equipment
CN109614593A (en) * 2018-11-09 2019-04-12 深圳市鼎阳科技有限公司 Human-computer interaction device and its multilingual implementation method, device and storage medium
CN109614593B (en) * 2018-11-09 2023-06-30 深圳市鼎阳科技股份有限公司 Man-machine interaction equipment, multilingual implementation method and device thereof and storage medium
CN109961495A (en) * 2019-04-11 2019-07-02 深圳迪乐普智能科技有限公司 A kind of implementation method and VR editing machine of VR editing machine
CN111045661B (en) * 2019-12-04 2023-07-04 鼎蓝惠民信息技术(西安)有限公司 XML Schema generation method based on semantic and feature codes
CN111045661A (en) * 2019-12-04 2020-04-21 西安鼎蓝通信技术有限公司 XML Schema generating method based on semantic and feature code
CN111444254A (en) * 2020-03-30 2020-07-24 北京东方金信科技有限公司 SK L system file format conversion method and system
CN113949438A (en) * 2021-09-24 2022-01-18 成都飞机工业(集团)有限责任公司 Unmanned aerial vehicle communication method, device, equipment and storage medium
CN116821271A (en) * 2023-08-30 2023-09-29 安徽商信政通信息技术股份有限公司 Address recognition and normalization method and system based on voice-shape code
CN116821271B (en) * 2023-08-30 2023-11-24 安徽商信政通信息技术股份有限公司 Address recognition and normalization method and system based on voice-shape code

Also Published As

Publication number Publication date
WO2005111824A3 (en) 2007-03-08

Similar Documents

Publication Publication Date Title
US7458022B2 (en) Hardware/software partition for high performance structured data transformation
US7437666B2 (en) Expression grouping and evaluation
US7555709B2 (en) Method and apparatus for stream based markup language post-processing
US8250062B2 (en) Optimized streaming evaluation of XML queries
Green et al. Processing XML streams with deterministic automata and stream indexes
US7328403B2 (en) Device for structured data transformation
US7590644B2 (en) Method and apparatus of streaming data transformation using code generator and translator
US7287217B2 (en) Method and apparatus for processing markup language information
KR101093271B1 (en) System for data format conversion for use in data centers
US7146352B2 (en) Query optimizer system and method
Barbosa et al. Efficient incremental validation of XML documents
US20060212859A1 (en) System and method for generating XML-based language parser and writer
US7853936B2 (en) Compilation of nested regular expressions
WO2005111824A2 (en) Method and system for processing of text content
WO2006116649A2 (en) Parser for structured document
Chiu et al. A compiler-based approach to schema-specific xml parsing
Dai et al. A 1 cycle-per-byte XML parsing accelerator
Møller Document Structure Description 2.0
US20090177765A1 (en) Systems and Methods of Packet Object Database Management
US20080092037A1 (en) Validation of XML content in a streaming fashion
Gao et al. A high performance schema-specific xml parser
Zhang Efficient XML stream processing and searching
Bettentrupp et al. A Prototype for Translating XSLT into XQuery.
Yen et al. XBMS-an open XML bibliography management system
Michel Representation of XML Schema Components

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 69(1) EPC (EPOFORM 1205A DATED 17.07.07)

122 Ep: pct application non-entry in european phase

Ref document number: 05741999

Country of ref document: EP

Kind code of ref document: A2

WWW Wipo information: withdrawn in national office

Ref document number: 5741999

Country of ref document: EP