WO2002027551A2 - Nested conditional relations (ncr) model and algebra - Google Patents

Nested conditional relations (ncr) model and algebra Download PDF

Info

Publication number
WO2002027551A2
WO2002027551A2 PCT/US2001/024353 US0124353W WO0227551A2 WO 2002027551 A2 WO2002027551 A2 WO 2002027551A2 US 0124353 W US0124353 W US 0124353W WO 0227551 A2 WO0227551 A2 WO 0227551A2
Authority
WO
WIPO (PCT)
Prior art keywords
name
data
dept
tag
query
Prior art date
Application number
PCT/US2001/024353
Other languages
French (fr)
Inventor
Denise L. Draper
Original Assignee
Nimble Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nimble Technology, Inc. filed Critical Nimble Technology, Inc.
Priority to AU2001281023A priority Critical patent/AU2001281023A1/en
Publication of WO2002027551A2 publication Critical patent/WO2002027551A2/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/123Storage facilities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2452Query translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation

Definitions

  • the described technology relates generally to accessing data and particularly to accessing data from data sources with diverse formats.
  • XML extensible markup language
  • Figure 1 illustrates the schema of this XML document.
  • Figure 2 represents a Joinln graph (JIG) for the match expression of Table 8.
  • Figure 3 is a block diagram illustrating the overall organization of an execution program generated by the data integration engine.
  • Figure 4 is a block diagram illuslrating the function to generate an execution program.
  • Figure 5 is a flow diagram illustrating processing of the generate extract program function in one embodiment.
  • Figure 6 is a flow diagram illustrating the processing of the generate extract plan function in one embodiment.
  • Figure 7 is a flow diagram illustrating processing of the match expression function in one embodiment.
  • Figure 8 is a flow diagram illustrating the processing of the create Joinln graph function in one embodiment.
  • Figure 9 is a flow diagram illustrating processing of the generate Joinln graph into one embodiment.
  • Figure 10 illustrates the tables of the data store.
  • Figure 11 illustrates the results of the sorted outer union for the tables of Figure 10.
  • Figure 12 illustrates the SQL query for each of the tables of Figure 10 paid to generate the sorted outer union.
  • Figure 13 is a flow diagram illustrating the processing of generating a sorted outer union.
  • Figure 14 is a flow diagram illustrating processing of a generate SQL query function in one embodiment.
  • Figure 15 is a block diagram illustrating an extract program.
  • Figure 16 is a flow diagram that illustrates code of a join node of an extract program in one embodiment.
  • Figure 17 illustrates the output of the nodes of the extraction plan to Figure 15.
  • Figure 18 illustrates a final NCR structure.
  • Figure 19 illustrates the Correspondence Tree.
  • the data integration engine accepts queries using a standard query language such as XML-QL, executes those queries against the multiple data stores, and returns the results.
  • the data stores may include relational databases, hierarchical databases, file systems, application data available via APIs, and so on.
  • a query may reference data that resides in different data stores.
  • the data integration engine allows operations such as joins across multiple data stores.
  • the data integration engine uses XML as the data model in which the data from the various data stores is represented.
  • the data integration engine processes a query by parsing the query into an internal representation, compiling and optimizing the internal representation into a physical execution representation, and then executing the execution representation. By providing a uniform and data model, the data integration engine allows access to data stores in diverse formats.
  • the data integration engine executes a query on a data store by first providing a mapping of the data store format into an XML format.
  • the query for the data store is based on XML format.
  • the data integration engine upon receiving a query, generates a native query for the data store from the received query using the provided mapping.
  • the data integration engine executes the native query to generate data in a native format needed to generate the results of the received query.
  • the data integration engine then converts the data in the native format into data in a format referred to as nested conditional relations ("NCR").
  • NCR nested conditional relations
  • the data integration engine then applies various operators (e.g., joins and unions) to the data in NCR format to generate the query results in an NCR format.
  • the data integration engine then converts the results in the NCR format into an XML format. In this way, the integration engine can provide access to various data sources in different formats.
  • a nested conditional relation is a table in which each row may have a different schema and each column is either a primitive type or a nested NCR.
  • the schema of each row in an NCR is indicated by a tag, which can be considered to be the zero column of the row.
  • certain rows of the table may represent employees of a company and have columns named "first name,” “last name,” “phone number,” and so on.
  • Other rows in the table may represent departments within the company and have columns named "department name,” "department head,” and so on.
  • the tag for a row indicates whether the row is an employee or a department row.
  • a column for a certain type of row may itself contain a nested conditional relation.
  • an employee row may include a column named "skills" that contains a table with sub-rows containing information relating to computer skills and accounting skills of the employee.
  • the table may itself be a nested conditional relation in that each sub-row may include a tag indicating whether the row represents a computer skill or an accounting skill.
  • the nesting of nested conditional relations may occur to an arbitrary level.
  • the NCR format is described below in detail.
  • Tables 1-3 illustrate an example of data that is stored in a data store such as a relational database.
  • the relational database contains three tables: DEPARTMENTS table, EMPLOYEES table, and BUILDINGSDOCS table.
  • the DEPARTMENTS table contains one row for each department of an organization. As illustrated by Table 1, the organization has a finance and an engineering department.
  • the DEPARTMENTS table contains two columns: name and contact.
  • the name column contains the name of the department
  • the contact column contains the employee identifier of the contact person for the department. For example, the first row of the table indicates that the department is "finance" and that the contact employee is "E1247.”
  • the EMPLOYEES table contains a row for each employee in the organization. Each row includes seven columns: ID, Fname, Lname, Dept, Bldg, Office, and Manager.
  • the ID column uniquely identifies the employee
  • the Fname column contains the first name of the employee
  • the Lname column contains the last name of the employee
  • the Dept column identifies the employee's department
  • the Bldg column identifies the building in which the employee is located
  • the Office column identifies the employee's office within the building
  • the Manager column identifies the employee's manager.
  • the Dept column contains one of the values from the Name column of the DEPARTMENTS table.
  • the BUILDINGSDOCS table contains a row for each office within each building of the organization.
  • the BUILDINGSDOCS table contains four columns: Building, Office, Phone, and MaintContact.
  • the Building column identifies a bwlding
  • the Office column identifies an office within the building
  • the Phone column contains the phone number associated with that office
  • the MaintContact column identifies the employee who is the maintenance contact for the office.
  • the combination of the Building and Office columns uniquely identifies each row.
  • the Bldg and Office columns of the EMPLOYEES table identifies a row within the BUILDINGSDOCS table.
  • Table 4 is an example of data stored as an XML document.
  • the XML document includes the root element ⁇ deptlist> that has a name attribute and that contains a ⁇ dept> element corresponding to each department within an organization.
  • Each ⁇ dept> element contains an ⁇ employee> element for each employee within the department.
  • Each ⁇ employee> element contains a ⁇ name> element and optionally an ⁇ off ⁇ ce> element.
  • the ⁇ name> element includes a ⁇ first> element and ⁇ last> element.
  • the ⁇ office> element includes a phone attribute.
  • the schema of an XML document may be represented by an XML data type definition ("DTD") of the document.
  • Figure 1 illustrates the schema of this XML document.
  • node 104 is the parent of nodes 105 and 108, which are children of node 104.
  • Node 101 corresponds to the ⁇ deptlist> element and has one child node 102, which corresponds to the ⁇ dept> element.
  • Node 102 has two child nodes, 103 and 104.
  • Node 104 corresponds to the name attribute of the ⁇ dept> element and node 104 corresponds to the ⁇ employee> element.
  • Node 104 has two child nodes 105 and 108.
  • Node 105 corresponds to the ⁇ name> element and has two child nodes 106 and 107.
  • Node 106 corresponds to the ⁇ first> element, and node 107 corresponds to the ⁇ last> element.
  • Node 108 corresponds to the ⁇ office> element and has one child node 109, which corresponds to the phone attribute.
  • mapping technique is particularly useful in situations where a legacy database, such as the example database of Tables 1-3, is to be accessed using queries designed for XML data, such as the example of Table 4.
  • the XML schema may be previously defined and many different applications for accessing data based on that XML schema may have also been defined.
  • one such application may be a query of the data.
  • An example query for semi-structured data may be an XML transform that is designed to input data in XML format and output a subset of the data in XML format.
  • a query for the database of Tables 1-3 may be a request to list the ID of each employee in the "Finance" department.
  • a query system inputs a semi-structured query and uses a mapping table to generate a structured query, such as an SQL query, that is appropriate for accessing the legacy database.
  • a mapping table for generating that mapping table is described in the following.
  • Table 5 is a portion of the mapping table generated in accordance with the mapping technique that maps the XML schema of Table 4 to the legacy database of Tables 1-3.
  • the mapping table contains one row for each parent-child relationship of the XML schema.
  • the mapping is further described in U.S. Patent Application entitled “Method and Apparatus for Storing Semi-Structured Data in a Structured Manner.”
  • the XML schema defines eight parent-child relationships such as the relationship between node 102 and node 104.
  • the mapping table contains eight rows. Each row uniquely identifies a parent-child relationship using the ParentName and ChildName columns.
  • the parent-child relationship of node 102 and node 104 is represented by row 3 as indicated by the ParentName of "dept” and the ChildName of "employee.”
  • Each row maps the parent-child relationship to the table in the legacy database that corresponds to that relationship.
  • the Table column indicates that the "dept-employee" relationship maps to the EMPLOYEES table.
  • the query system could use only the ParentName, ChildName, and Table columns of the mapping table to generate a structured query from a semi- structured query. For example, if the legacy database had used the same column names as defined by the elements of the XML schema (e.g., "employee” rather than "ID”), then only these three columns would be needed to generate the structured query.
  • the query system could input a semi- structured query with only these three columns and generate a structured query.
  • the mapping table includes a parent key column ("PKey") and a child key column ("CKey").
  • the parent key column contains the name of the column that identifies the parent of the parent-child relationship.
  • the child key column contains the name of the column that identifies the child of the parent-child relationship.
  • the parent is identified by the "dept” column and the child is identified by the "ID” column in the EMPLOYEES table.
  • Table 5 also includes a column named "A/E" to indicate whether the row corresponds to an element within the semi- structured data or an attribute of an element with semi-structured data.
  • rows 7 and 8 some of the parent and child keys actually consist of multiple columns that uniquely identify a row in the corresponding table.
  • the rows of the BUILDINGSDOCS table are uniquely identified by a combination of the Building and Office columns.
  • the query system maps the selections within the semi-structured query to selections within a structured query.
  • the following illustrates the basic format of that mapping when the structured query is an SQL format.
  • the query system also allows chaining of keys to effectively navigate through the hierarchy defined by the semi-structured data.
  • the query system uses the joint concept of relationship databases to effect this chaining of keys. The following illustrates chaining:
  • the TABLE1, PKEYl, and CKEYl parameters are derived from the first parent-child relationship in the chain, and the TABLE2 , PKEY2, and CKEY2 parameters are derived from the second parent-child relationship in the chain.
  • the child key associated with the first parent-child relationship matches the parent key associated with the second parent-child relationship.
  • the mapping table also contains the value rows corresponding to each leaf node, that is a node that is not a parent node.
  • the leaf nodes of Figure 1 are nodes 103, 106, 107, and 109.
  • each value row identifies an XML element or attribute, the table in the legacy database that contains an element, and the name of the column in the table that contains the value for that element or attribute.
  • Table 6 illustrates the four value rows for the mapping associated with Tables 1-3 and Table 4.
  • the "A/E” column identifies whether the row is an attribute or element; the "Name” column identifies the name of the element and attributes; the "Table” column identifies the legacy table; the “Key” column identifies the key for that table; and the "Value” column identifies the name of the column where the value is stored.
  • Table 7 illustrates a query that is to be applied to the data of Tables 1-3.
  • the query indicates to return the first and last names and phone number of each employee in the engineering department.
  • the data integration engine generates a "match expression" for a logical match operation ("LMatch") for the query when compiling the query.
  • the logical match operation supports operations for performing XML navigation.
  • the match expression defines a tree of navigations. Each node of the tree indicates a navigation type (e.g., child, parent, or sibling), a navigation condition (e.g., a condition on the name of the child), whether the navigation is required, whether there should be a binding to the target of the navigation (i.e., a value returned with the specified name), and whether the result should be nested.
  • Jo- Table 8 illustrates a match expression for the XML of Table 4 for the query of Table 7.
  • Each row of Table 8 represents a different navigation path.
  • the first row represents a navigation path from the root of the deplist element to its child element of the dept element and then to the name attribute of the dept element.
  • the remaining rows represent different branches on the tree.
  • the second row represents the branch of ro ⁇ t(deplist), c (dept), c ⁇ i(employee), c ⁇ i(name), and child(_ ⁇ st).
  • the symbols prefixed with "$" represent bindings.
  • FIG. 2 represents a Joinln graph (JIG) for the match expression of Table 8.
  • the Joinln graph is a data structure that facilitates the optimization of the query to be executed against the data store.
  • This JIG indicates that the Departments, Employees, and Buildhigdocs tables of the data store are to be joined together.
  • This JIG also indicates the bmdings (e.g., $first) and the join columns (e.g., Name and Dept).
  • the format of the JIG is described below in detail.
  • the JIG is generated from the match expression using the mapping.
  • the data integration engine then generates the query to be executed. The following query is generated.
  • FIG. 3 is a block diagram illustrating the overall organization of an execution program generated by the data integration engine.
  • An execution program consist of an extract program 310 and a construct program 320.
  • a compiler of the data integration engine generates the execution program during a compilation phase.
  • the extract program is a series of operations on a data extracted from the data sources.
  • the extract program represents a graph of the operations.
  • the leaf nodes 311 of the extract program represents a sorted outer union operation applied to the data stores 312.
  • the compiler generates a query for each data store in the native query language of the data store to retrieve the results of the sorted outer union.
  • the compiler generates the sorted outer union using the LMatch operation, Joinln graph, and mapping.
  • the generated query is applied to each data store.
  • the construct program accesses the root node 313 of the extract program which retrieves the results generated by the extract program.
  • the construct program collects the data and formats it into an XML output. As discussed below in more detail, the output of each operation of the extract program is in a nested conditional relation format.
  • Figures 4-9 are flow diagrams illustrating processing of the compiler of the data integration engine in one embodiment.
  • Figure 4 is a block diagram illustrating a function to generate an execution program. The function first generates the extract program and then generates the construct program. In block 401, the function invokes a generate extract program function to generate an extract program for the specified query against the specified data stores. In block 402, the function invokes the generate construct program function to generate a construct program to generate the results from the extracted data.
  • Figure 5 is a flow diagram illustrating processing of the generate extract program function in one embodiment.
  • the function generates an extract plan
  • hi block 502 the function identifies fragments of the extract plan.
  • a fragment of an extract plan are the set of operations that are applied to data derived from a single data source. Operations that apply to data from multiple data sources are grouped into one fragment.
  • the function optimizes the operations of the fragments and then returns.
  • Figure 6 is a flow diagram illustrating the processing of the generate extract plan function in one embodiment.
  • the function receives the XML query.
  • the function generates a match expression for the logical match associated with the data store.
  • the function creates the Joinln graph from the match expression using the mapping for the data store.
  • the function generates the native query from the Joinln graph.
  • the function indicates additional processing to generate the extract plan from the Joinln graph.
  • Blocks 602-604 illustrate the generation of the native query for the sorted outer union of the leaf nodes of the extract plan. The ellipses indicate other processing performed by the function.
  • the function then returns.
  • Figure 7 is a flow diagram illustrating processing of the match expression function in one embodiment.
  • This function is passed an XML node representing the data store and returns the match expression.
  • This function is recursively invoked for each child node of the passed XML node.
  • the function initializes the subtree to the XML node.
  • the function loops creating a match expression for each child node.
  • the function selects the next child node of the XML node.
  • decision block 703 if all the child nodes have already been selected, then the function returns, else the function continues at block 704.
  • the function recursively invokes the create match expression function passing the child node and receiving a child sub-tree in return.
  • the function adds the child sub-tree to the sub-tree and then loops to block 702 to select the next child.
  • Figure 8 is a flow diagram illustrating the processing of the create Joinln graph function in one embodiment.
  • the function invokes the generate Joinln graph passing the match expression and receiving the Joinln graph in return.
  • the function merges nodes of the Joinln graph.
  • the function processes merging of adjoining nodes of the Joinln graph and then returns.
  • Figure 9 is a flow diagram illustrating processing of the generate Joinln graph function into one embodiment.
  • This function is passed a match expression and returns a Joinln graph.
  • the function is recursively invoked for each child node of the passed match expression.
  • the function sets the Joinln graph to a node corresponding to the root of the match expression.
  • the function retrieves the mapping rows that can further the path from the root.
  • the function selects the next child node of the match expression.
  • decision block 903 if all the children have already been selected, the function returns, else the function continues at block 904.
  • the function recursively invokes the generate Joinln graph function passing the selected child node of the match expression and receiving a child Joinln graph in return.
  • Figures 10-15 illustrate the generation of an SQL query for a sorted outer union node of an extract program.
  • Figure 10 illustrates the tables of the data store.
  • the arrows between the tables illustrate joins between tables.
  • arrow 1001 represents a join between the third column of table 1.1 and the first column of table 2.
  • Figure 11 illustrates the results of the sorted outer union for the tables of Figure 10.
  • Figure 12 illustrates the SQL query for each of the tables of Figure 10 that are used to generate the sorted outer union.
  • Figure 13 is a flow diagram illustrating the processing of a function to generate a sorted outer union.
  • the function selects the next table of the source data store.
  • decision block 1302 if all the tables have already been selected, the function continues at block 1304, else the function continues at block 1303.
  • the function invokes the generate SQL query for the selected table and then loops to block 1301 to select the next table.
  • the function executes the generate SQL queries against the tables.
  • the function aggregates of the result of the queries into a table.
  • the function sorts the results and then returns.
  • Figure 14 is a flow diagram illustrating processing of a generate SQL query function in one embodiment.
  • the function outputs a select, from, and where clause for the query.
  • the function loops selecting each table in a join path of the data store.
  • the function selects the next table in the path.
  • decision block 1403 if all the tables have already been selected, then the function returns, else the function continues at block 1404.
  • the function adds the table to the from clause.
  • the function adds the table to the where clause.
  • the function loops selecting each column of the selected table.
  • the function selects the next column.
  • decision block 1407 if all the columns have already been selected, then the function loops to block 1402 to select the next table of the path, else the function updates the select clause with the column and then loops to block 1406 to select the next column. Columns of tables not in the selected path are set to null.
  • FIG. 15 is a block diagram illustrating an extract program.
  • Each of the leaf nodes 1501-1505 represent SQL queries that are applied to a data source.
  • Node 1506 represents a nesting of the results of nodes 1501 and 1502.
  • Node 1507 represents a nesting of the results of nodes 1506 and 1503.
  • Node 1508 represents a selection on the results of node 1507.
  • Node 1509 represents a nesting of the results of nodes 1504 and 1505.
  • Node 1510 represents a join of the results of nodes 1508 and 1509.
  • Node 1511 represents a projection of the results of node 1510.
  • Node 1512 represents the construct program that accesses the extract program.
  • Figure 16 is a flow diagram that illustrates code of a join node of an extract program in one embodiment.
  • the processing of each node of extract program is performed a pipeline manner, that is each node returns only the data needed to satisfy the next request from the construct program.
  • decision block 1601 if the right node is a fully processed, then the function continues at block 1602, else the function continues at block 1605.
  • decision block 1602 if the left node of the join is fully processed, then the function returns, else the function continues at block 1603.
  • the function retrieves at the next results from the left node.
  • the function initializes the right node based on the results returned from the left node.
  • the function retrieves the next results from the right node.
  • decision block 1606 if the results returned from the right node are contained in nested table, then the function returns an iterator for that table, else the function returns the results.
  • the iterator for a table is an optimization that allows nodes higher in the extract program to retrieve subsequent rows of the nested table without having to invoke lower-level nodes in the extract program.
  • Figure 17 illustrates the output of the nodes of the extraction plan to Figure 15.
  • construct program 1712 invokes root node 1711 of the extract program, that invocation is propagated down to the leaf nodes.
  • the SQL query of node 1701 returns result 1713
  • the SQL query of node 1702 returns result 1714.
  • Node 1706 indicates to nest results of nodes 1701 and 1702.
  • result 1714 is nested within result 1713 as indicated by result 1715.
  • Node 1703 returns result 1716.
  • Node 1707 nests result 1716 within result 1715.
  • the subscript within node 1707 specifies a target for the nesting. In this case, the subscript 2 indicates to nest within the third column of result 1715.
  • Result 1717 represents the result of the nesting.
  • Node 1708 represents selection on the result 1717.
  • the target represented by subscript 2.1 indicates to select the third column and the first row within the third column.
  • the result of the selection is result 1718.
  • Results 1719-1723 illustrates the results of the other nodes of the extract program.
  • the LMatch operator performs navigation-based selection over XML input.
  • the following example illustrates an XMLQL syntax fragment and the LMatch instance that is created to model it inside the compiler:
  • the "se//(a,$a) — child(b,—) — child(c,$c)" is a match expression.
  • the match expression is a tree with three nodes.
  • the general structure of the XMLQL pattern is translated into an isomorphic pattern within the match expression.
  • the XMLQL variables become "bindings" within the navigations.
  • the LMatch operator is one of the logical operators of the internal language of the data integration engine.
  • the LMatch operator is generally the "first" operator that is applied to input data and is responsible for converting XML input in to NCRs that are then further processed by the query engine.
  • the LMatch operator is a logical operator only in that one of the actions of the Compiler is to convert LMatch operators into a data source- dependent form (e.g., SQL for relational databases, or QLL for QL-Lite data sources).
  • the LMatch operator defines a match against XML data.
  • the pattern is defined by the "match expression,” which is a tree of navigation steps. Each navigation step describes a "movement" from a source element or attribute to a target element or attribute.
  • the parameters of the navigation step that govern navigation are as following:
  • the type of movement or navigation (child, parent, descendant etc.)
  • the navigation types are based on XPath axes.
  • LMatch matching is top-down on the tree of navigation steps. That is, the match begins at the root of both the XML document and the root of the match expression. Matches for the first navigation step are sought in the entire XML document. If the first navigation is a root navigation, then it matches the root of the XML document (where we interpret root to be the root element, not the document item, as defined in DOM). If the first navigation step is something other than root, it is as a navigation from the root.
  • the algorithm proceeds recursively: given a matched node, attempt each of the child navigations from the navigation tree (where child here means "child in the navigation tree," rather than child type node). Each attempted navigation will itself yield a new set of zero or more matches, which are then continued in the next level of the recursion, and so forth. While the recursion proceeds down the navigation tree, the navigations do not necessarily proceed "down" the XML tree; navigation types can move in arbitrary directions within the XML document (e.g., ancestor or precedingjsibling).
  • the first navigation step may yield a single element ⁇ a>.
  • the second step may yield a set of ⁇ b> elements, some of which contain ⁇ c> elements and some of which do not.
  • the evaluation of an LMatch operator is a three stage process: first, match the pattern within the LMatch operator against some source of XML; second, connect columns in the LMatch pattern with their associated items in the information set of the XML source; and thirs, structure those connected columns (the extracted information) into an NCR as indicated by the nesting settings on individual navigations. That is, an LMatch operator specifies a structural pattern that is sought after in a document, specifies which parts of that pattern should be returned, and specifies how the returned parts should be organized. The output of an LMatch operator is an NCR that contains the returned parts, organized as specified.
  • NCR column names some piece of information returned from an element or attribute node that has been matched.
  • Each navigation step may have one or more of the column types present.
  • the type of the column is derived from the type of the corresponding contents of the XML document (except for the table column).
  • NCR N cM ⁇ i(b,$b,$btable)— child(c,$c) —child( ⁇ ,$ ⁇ )
  • the root (top-level) navigation may also be nested or unnested. hi addition, the
  • LMatch operator like other operators, provides an additional column that names the its entire schema.
  • the child(b) navigation is a nested navigation that results in a nested NCR, named $btable, in the result.
  • This NCR will contain columns $b (because $b is a column on the child(b) navigation) and $c (because $c is a column on a navigation in child(b)'s subtree).
  • Figure 18 illustrates a final NCR structure.
  • a depth-first traversal of the match expression of an LMatch operator is used to construct the columns of the output NCR.
  • the LMatch operator also defines an ordering of the columns as well as their structure and names.
  • the results differ based on whether the navigation is nested. If the navigation is a nested navigation, then a nested NCR is created, which will contain the matches. But if the navigation is not nested, then the results are combined via a cross-product with all the other columns in the same table. So, if one ⁇ b> element contained multiple ⁇ c> elements, the $btable would contain the corresponding ⁇ b>- ⁇ c> pairs. Navigations that are not nested can be treated as a special case of nested navigations. Thus, an LMatch operator can be evaluated as if all navigations are nested. Then, for each navigation that is actually nested, a an LFlatten operation can be used to remove the table corresponding to the nesting.
  • a subtree column results in the entire XML subtree, tags and all, being returned as an atomic value. (This corresponds to the ELEMENT_AS notation in XML-QL.)
  • the compiler transforms this column into a more complex LMatch expression that "pulls apart" the entire subtree contents and modifies the rest of the execution unit to reconstruct the result back into a subtree when needed.
  • subtree columns exist initially, but they are replaced with more complex patterns.
  • the subtree columns are modeled in the NCR schema as a single, static column. After the rewrite, they begin with a table-valued column containing the nested results.
  • Advantages of the LMatch operator being a single, complex operation include:
  • mapping onto those operations is enabled.
  • the LMatch operator combines two kinds of capabilities into a single operator: navigational operations and composition of the results into a complex structure (the NCR). This allows a concise representation of a very common idiom.
  • the LMatch operator can be matched against a tree that represents an XML generator, rather than the actual XML document. For example,
  • the XML RDB Map can be interpreted as a generator of an XML document from a relational database. Matching the LMatch operator against an XML RDB Map is a fundamental step in converting the XML query into SQL.
  • the Construct Program of a query can be interpreted as a generator of an XML document from an NCR. Matching the LMatch operator against a Construct Program is a fundamental step in composing views.
  • the algorithm for matching against tree-structured XML generators is very similar to the algorithm for matching against XML input directly.
  • One difference is that where matching against an XML document generates tuples of output, matching against a generator generally produces a Correspondence Tree, which encodes all the potential correspondence points between the nodes of the generator and the navigation steps of the LMatch.
  • An XML generator is a tree (actually, a forest suffices) where the nodes in the tree represent the generation of XML elements or attributes or their values, and arcs between nodes represent inclusion. For example: element("person”) — attribute("ssn") — value() — element("name”) — value() — element("address”) — value()
  • the XML generator also indicates the arity of each arc.
  • the values for arity are optional (0 or 1), singular (exactly one) and multiple (0 or more). If an arc is marked multiple, then the generator can generate more than one instance of the child node for each parent instance. In the above example, if the arc between "person" and "name” were marked multiple, then a person could have zero or more names.
  • the arity of an arc is indicated by a subscript on the arc as shown in the following: element("person") — s attribute("ssn") — value()
  • the Correspondence Tree tracks which navigation steps in the LMatch operator correspond with which nodes in the XML generator.
  • the Correspondence Tree would be isomorphic to the LMatch navigation graph except for one thing: any given navigation step might match against multiple nodes in the generator.
  • the following is an example of an XML generator, an LMatch operator, and the corresponding
  • the LMatch selft($ ⁇ so ⁇ ) ⁇ — child(ssn) 2
  • Figure 19 illustrates the Correspondence Tree.
  • the subscripts on nodes in the generator and LMatch distinguish otherwise identical nodes when they appear in the Correspondence Tree.
  • the Correspondence Tree is "read” as: "The root navigation has a single match, namely the elementC'person" ⁇ node of the XML generator. From this generator node, the next LMatch navigation, c ⁇ i(name) 3 , is matched against two different generator nodes, and so on.
  • the Correspondence Tree is a bipartite graph.
  • a bipartite graph is one in which nodes come in two different alternating types. In this case, the node types are called navigation nodes (which reference navigation steps, and are pictorially indicated with brachets [ ]) and choice nodes (which reference generator nodes, and are pictorially indicated with braces ⁇ ⁇ ).
  • a bipartite graph is interpreted as having two different kinds of arcs, which are indicated by lines of different weights: light lines are choice arcs (arcs from navigation to choice nodes, choosing amongst multiple correspondences) and heavy lines are navigation arcs (arcs from choice to navigation nodes, following the navigation relationships in the LMatch operator).
  • a correspondence is a (navigation step, generator node) pair of a correpondence tree.
  • a correspondence is derived from a choice node by including the navigation step from the parent. For example, the following subgraph of a correspondence tree yields the following correspondence: subtree: [ c (name) 3 ] — ⁇ S: element("name") 4 ⁇ correspondence: ⁇ c d(name) 3 , element("name") 4 ⁇
  • the following matching algorithm generates the Correspondence Tree, given an LMatch operator and an XML generator as input.
  • the algorithm is a top-down recursion over the LMatch navigation graph.
  • the XML generator has the following operations:
  • the LMatch operator is limited to the following navigation types: root, child, self.
  • the nested flag on LMatch navigation steps is irrelevant to matching.
  • the LMatch operator provides the following pseudo code for accessing the match expression:
  • ChoiceNode. ne GeneratorNode, arity
  • the form of the algorithm is mutual recursion between two functions, each of which extends the graph by one level, or fails to do so (because there is no match).
  • the subroutines return boolean values indicating whether or not they were successful; this value is then used to determine whether or not to continue and whether or not to actually add nodes to the graph.
  • the following is the pseudo code for the addNavs function:
  • the rows of the nested table are that list of data.
  • the list of data can contain only 0 or 1 rows.
  • the NCR column is essentially a field that is filled in the by value.
  • the list of data is "joined against" the existing rows of the containing table. If the navigation is optional, the join is an outer join, if required, an inner join. If the list has multiple entries, the effect is a cross product against the other contents of the table.
  • the LMatch operation could contain only optional navigations or only required navigations in cases where the data will be present. Similarly, it is possible to eliminate the need to handle joins or cross products by limiting the LMatch operator to only allow unnested navigations when the data is at most singular.
  • LMatch operators Two type of normalization that can be performed on LMatch operators are removal of (non-root) self navigations and removal of implicit cross-products.
  • logical operators include the LSelect, LFlatten, and LBox operators.
  • the LSelect operator removes tuples from a table based on some condition.
  • the LFlatten operator flattens a nested table within an NCR. The operator is applied to a single nested table, and the process of flattening removes that table.
  • the LFlatten operator has a boolean parameter "outer" indicating whether the flattening operation should behave like an inner or left outer join — that is, if the nested table is empty, does flattening remove the containing row or not.
  • the LBox operator serves to introduce an artificial level of nesting within a table.
  • a singular relationship between a child navigation and its parent navigation is identified by examining the XML schema of the data that the LMatch operates against. Initially, the matching algorithm has been run. After that, it can be deteimined, for each navigation step, which place(s) in the schema the LMatch operator could match. From that information, and from the cardinality information available in the schema, it can be identified whether the singular condition holds.
  • the implementation can be styled in a bottom-up or top-down traversal, but note that in either case compensating operators are to be inserted at both the bottom and top of the chain.
  • FIGS. 20-25 illustrate normalization.
  • This condition can be detected in a postprocessing step, but it would require information from both the LMatch (the singularity of steps) and correlated information from the logical extraction program (the presence of LBox and LFlatten/LSelect); thus, this optimization may be implemented as an integral part of the recursive algorithm.
  • NCR Nested Conditional Relations
  • NCR extends relational algebra in two ways. First, it makes relations heterogeneous (i.e., allows them to contain records of different types). Each record is accompanied by a tag, describing its type, hence the term conditional relation. Second, relations can be nested. The value of an attribute can be either atomic (e.g., int, float, string) or another NCR.
  • a homogenous table is one that has rows of the same type.
  • a flat table has atomic fields, that is fields in the first normal form. The following table is a traditional relational data model.
  • Each field in the row is an atomic type.
  • the table is a set of such rows. Its type is:
  • NCRs Nested Conditional Relation model
  • a heterogeneous collection, or conditional relation is a relation which may have rows of different types.
  • the Dept and Persons persons tables below are of the traditional relational model.
  • a heterogeneous table consisting of departments and persons is obtained by interleaving the rows of the Depts and Persons tables as shown below: DepartmentsPersons
  • the Departments rows have four fields and the Persons rows have only three fields.
  • a tag is added to each row.
  • the value of the tag can be either Dept or Pers.
  • Each row has a structure that depends on this tag.
  • the type of such a row is called a tagged union type, and is denoted as:
  • the type of the entire table DepartmentsPersons is a set of a tagged union type:
  • the "Dept” tag in the condition indicates to select rows with a Dept tag.
  • the "FIoor>3 " indicates to select rows that have Floor>3. All rows that do not have the Dept tag are selected intact.
  • the condition applies to both Dept rows and Pers rows. For Dept rows, the condition specifies Floor>3; for Pers rows, the condition specifies Salary>50000. The result consists of the highlighted rows below:
  • the Pers rows have an Assignments field that is a non- atomic field.
  • the field contains a nested condition relation in that its sub-rows can be of type Project or Committee.
  • the following projection operation select the Name and Phone fields of the Dept rows and the Name field of the Pers rows.
  • Projections and selections can be combined and applied to the inner relations. This is done with combined operator, called Combo, that does both the selection and the projection.
  • a combo operator takes an argument p, called a.p-former, that describes what selections and projections are to be done.
  • the p-former generalizes both the argument in a selection, ⁇ p , and that in a projection ⁇ p .
  • the following is an example of a p-former: p ⁇ Dept/(Floor>3):[Name,Floor]
  • the Combo operator selects a submatrix via a combination of selections (on tags and predicates) and projections (on fields). 2 Operands and Typing
  • NCR is strongly typed which will allow type inference and type checking.
  • the basic manipulable types are tables and single-valued attributes.
  • the basic types are primitives or sets which are defined as:
  • Variant type: v :: tag:r where tag is a label, called the tag of the variant type.
  • a table is defined that includes both PeopIe(name: String, ssn ⁇ nt) rows and Employees(name: String, ssnant, salary: float) rows.
  • Vpeopie "People: [name: String, ssn:int]
  • NCR type describing a table consisting of both people and employees is: tpeopi mpioyees : ⁇ ⁇ People: [name:String, ssn:int]
  • the Project operator returns a different subset of the attributes from each variant type.
  • the project operator denoted Up, is parameterized by a list, p, of elements of the form tag:[set of attributes]. "P” is called a. projection p-former. It determines both the type of the project operator and its semantics (which columns are being projected out).
  • ⁇ g l- a q>, ⁇ • t q ⁇ , ⁇ ' a q ⁇ q2 • t q2'---' a q' ⁇ q ⁇ qiqp, -I
  • the project operator only affects tuples with tags that are mentioned in p. Tuples with other tags are simply copied to the output, unaffected.
  • the following table shows the results of applying a project operator.
  • the input table is an EmpsDeptsSites NCR with Emp, Dept, and Site tags. Emp:
  • the project operator is equivalent to ⁇ Emp:[name]
  • the Site records are included in the result by default. 2.2 Select
  • the Select operator returns a different subset of records for each variant type.
  • the select operator denoted ⁇ p , is parameterized by a list, p, of elements of the form tag/condition. "P" is called a selection p-former. It determines both the type of the select operator and its semantics, hi the following, conditions and expressions are defined.
  • Scalar operator if e l5 e 2 are two expressions of base types bi, b 2 respectively, then Q ⁇ op e 2 is an expression of type b, where op is an operator: op is one of (+, -, *, /), or a string operator (concat, substr), or any user-defined function on scalar values.
  • NCR expression if ei, e 2 , ... are expressions of types X ⁇ , t 2 , ... , then fi ⁇ e ⁇ , ... ) is an expression of type t, where f is an NCR expression f: ti x t 2 x ... — >• t.
  • Contexts are not used in selections, but are used later in the combo operator.
  • a selection p-former is an expression:
  • the selection condition(s) only apply to tuples with tags mentioned in p. Tuples with other tags are simply copied to the output, unaffected.
  • the select operator operates only on the top level, in that it decides for each top level record whether to keep it or toss it.
  • the condition c can look deep inside the current record (e.g., by using existential/universal quantifiers).
  • the rename operator (p) renames tags and/or attributes. It is a generalization of the rename operator in the relational algebra:
  • This operator renames tagi to tag'i, and renames the fields in the record of type tagi by changing an to an', • • • ; renames tag 2 to tag 2 ', and so on. All tags and/or labels that are not mentioned are left unchanged.
  • the output type is "isomorphic" to the input type: only the variant and record labels at the top level have changed.
  • the "identity" mapping aj — » a; can be abbreviated to a; in specifying a renaming.
  • the extend operator ( ⁇ ) adds new fields to a record, each computed by some expression from other fields.
  • tag n [Cni:e nl , c ⁇ :esammlung2, ... ]>
  • ey, ... have types ej j : r; — » ty 1
  • n' is a record type obtained by adding the fields [ci ⁇ t ⁇ 1 , c_:l_ ...] to n.
  • the Combo operator ( ⁇ ) combines Project, Select, Rename, and Extend, and does this to arbitrary nesting levels.
  • the Combo operator is parameterized by an argument that is a deeply structured expression combining arguments of Project (IT), Select ( ⁇ ), Rename (p), and Extend ( ⁇ ). Such an expression is called a p-former.
  • combo does not implicitly copy "the other" tags and labels to the output, but deletes them. This allows both copying and deleting. (For convenience, another version of the combo operator may also be defined that copies by default.)
  • a p-former, p has an input type t and an output type , and the p-former is denoted by: p :: t ⁇ f
  • the p-former takes an input value of type t and returns either an output value of type t' or "nothing.”
  • ⁇ p the type of the Combo operation is ⁇ t ⁇ — » ⁇ t 1 ⁇ .
  • Inner p-formers For each p-former, there is a context (which, recall, is a sequence of record types, (r ls ... , r n )). Inner p-formers have a context consisting of all record types of the surrounding records. The top-level p- former has an empty context.
  • the (a — b) p components rename an attribute a to b, and apply recursively the p- former p on the value of that attribute.
  • the c.e components introduce a new attribute
  • the p-formers p 1; ..., p k first apply to the values vn, ..., Vik respectively. If any of them returns "nothing," then the record p- former returns "nothing.” Otherwise, let vi', ..., V k ' be the values returned by the - formers, and let wi, ... , w p be the values record p-former returns the record [bi:vi'
  • the condition c is checked first. If it returns false, then the p-former returns "nothing 1 Otherwise, it returns whatever p returns, but changes the tag to tag'.
  • the c may use existential/universal quantifiers on the set fields of r.
  • the set p-former Given a set ⁇ x ls ..., x n ⁇ , the set p-former first applies the p-former p to each element in the set. Let yi, ..., yk be all values returned by p (t ' .e., excluding "nothing"), then the set p-former returns ⁇ yi, ... , yk ⁇
  • the operator Given a set ⁇ x l3 ..., x n ⁇ , the operator first applies the p-former p to each element in the set. Let yi, ..., yk be all values returned by p (i.e., excluding "nothing"). Then ⁇ p returns ⁇ y l3 ..., y k ⁇
  • combo operator can then be abbreviated as:
  • a Projection operator is a particular case of the Combo operator. For example ⁇ Emp:[name]
  • a Selection operator is a particular case of the Combo operator. For example
  • p ⁇ Emp— >Employee:[nan ⁇ e— >person,ssn,saI,phone— ⁇ contact]]
  • p ⁇ Emp— >Employee:[nan ⁇ e— >person,ssn,saI,phone— ⁇ contact]]
  • Combo can be used to homogenize a collection. For example, in EmpsDeptsSites there are three different kinds of records, and all share a name attribute. The following Combo operator extracts all names and constructs a homogeneous collection:
  • Combo can be used to dispatch records to different types (e.g., transforming a homogeneous collection into a heterogeneous one).
  • the following Combo operator splits Emp's into Regular and HighPaid:
  • the input has type
  • conditions that are applied to a tag may overlap. For example, employees with saKlOOk may be dispatched to some type, and those with sal>50k to another type. In this case, records that satisfy both conditions will contribute to both outputs.
  • the semantics of combo is that tags not mentioned in the p-former are dropped from the output.
  • the "simple” combo has a complementary semantics: only tags/labels that are to be modified need be mentioned. By default, all others are copied to the output. Moreover, the "simple” combo only does one single action, possible at some depth in the NCR.
  • a simple p-former is a p-former that includes only a single selection (i.e., for a single tag), a single projection, renaming of a single tag or label, or an extension with a single new label.
  • a single selection i.e., for a single tag
  • a single projection i.e., for a single tag
  • renaming a single tag or label
  • an extension with a single new label.
  • it is defined like a p-former with additional syntactic restrictions that ensure that only one action is performed (only a projection, a selection, a renaming, or an extension).
  • the other parts of the simple p-former, that do the copying, are omitted.
  • the simple p-former is illustrated by the following examples:
  • Omitted tags (Emp, normal) are added and omitted labels (name, floor) are added too.
  • the superscript 0 indicates that the combo is "simple" (i.e., all missing tags and labels in p have to be added).
  • the semantics is defined as follows. Given a p-former, p::t -» t',(simple or not) define the completion of p to be c(p) : t — » t" obtained from p by "completing" the missing tags according to the type t (i.e., c(p) should actually be denoted c(p,t); it can be defined inductively; omitted).
  • the semantics of the simple combo operator is:
  • the simple combos are strictly less powerful than combos, because they do not allow deletion of tags from the output type. For example, consider the combo ⁇ ⁇ Dept:[m g r]>(EmpsDeptsSites). Its output type is ⁇ Dept: [mgr]> ⁇ . It may be expressed as a sequence of simple combos:
  • the match operator generalizes the combo operator by relaxing some of its restrictions.
  • the match operator is parameterized by an m-former, m, that is defined inductively like a p-former in Combo, with two generalizations:
  • the output type is technically illegal in the type system, because each record field needs to be either atomic or a set. It can be normalized by pulling out all variant types to the top level and flattening the records.
  • the normalized type is defined to be the output type of ⁇ . We call norm the normalization operation.
  • the norm operation constructs new tag names and new field names by concatenating existing names.
  • the norm operation concatenates two record types, or, if the two types are unions of m and n record types respectively, then constructs a new union type with mn record types, by considering all pairs of concatenations.
  • tag n :r n >) ⁇ tagtag ⁇ :r ⁇
  • tagn. ⁇ n >) ulabelta gl (norm(r i) ) ® ... ⁇ ulabeltagn(norm(r n ) )
  • Join takes a tuple from each input table and tests this combination to see if it meets a predicate; if so, it returns a combined tuple as the result.
  • Join allows specification of a different predicate and name and a different output tag for each pair wise combination of types from the two tables. Not all pairs must have a predicate and output tag: those that do not contribute to the join.
  • An "inner" join is described in the following.
  • join produces a combined tuple as its output.
  • rl ® r2 [ah: th, al 2 :tl 2 , ..., al n : tl n , a2 ⁇ : t2 1; ..., a2 m : t2_J
  • rl [ah: th, al 2 :tl 2 , ..., al coordinate: tl encounter]
  • r2 [a2 ⁇ : t2 ⁇ , ..., a2 m : t2 m ]
  • a join on the two tables can be performed with one join predicate for (Dept x Emp) and another for (Dept x Site):
  • An outer join has the same syntax as an inner join, with minor additional restrictions.
  • the semantics differs:
  • tags tag'i, • • • , tag' p are distinct, and they are also distinct both from tagi i, ... , tagl n and from tag2 l3 ... , tag2 m
  • the outer join is illustrated by a data integration scenario.
  • Sourcel ⁇ sourcel:[namel: string, phonel int, conf :real]> ⁇
  • Source2 ⁇ source2:[name2: string, phone2:int, conf2:real]> ⁇
  • rawIntegratedData The type of rawIntegratedData is: rawIntegratedData : ⁇ sourcel:[namel, phonel, confl]
  • I source2 [name2, phone2, conf2] I integrated: [namel, name2,phonel,phone2,coni ,conf2]> ⁇
  • source 1 may be trusted more than source 2, have some complex rules on how to deal with conflicting information, and drop records where the confidence is too low:
  • integratedData ⁇ p (rawIntegratedData) where the p-former p encapsulates the integration logic:
  • integratedData ⁇ certain: [name, phone], uncertain: [name, phonel, phone2]> ⁇ that is, in some cases both phone numbers are kept since the confidence does not favor one over the other.
  • the Nest operator works like a left outer join, but it nests all matching children within the tuple of each parent. Nesting is commonly done in XML-QL subqueries, and any time there is a l:n parent- child hierarchy in the output. Nest has left outer join semantics, rather than inner join semantics, and that it preserves the order of the parent relation. Nest can rename both the tag types for the nested tuples and for the parent tuples.
  • NeSt ⁇ Dept:[info: ⁇ Emp/($l.dname $2.dept)
  • Site/($l.loc $2.sname)>]>(DeptS, EmpsSites)
  • the result has type:
  • the union of two union types, Ui ⁇ u 2 is defined to consist of all tags in both types, provided that a tag occurring in both Ui and u 2 has identical types in Ui and u 2 .
  • a U B for various types of A and B is illustrated in the following:
  • a U B has type ⁇ Emp:[name, phone]
  • AUB has type ⁇ Emp:[name, phone]
  • the output type is:
  • Aggregate operators are included in the combo operator.
  • the five aggregates in SQL and: count, sum, min, max, avg. Count is treated slightly differently.
  • Let agg be one of sum, min, max, avg, and let b be the base type to which it applies (b can be int, real, or string when agg is min or max, int or real when agg is sum, and real only when agg is avg).
  • u ⁇ tagi : n I ... I tagn : r consume> and let ei : ri — b, ..., e trash : r n — » b be expressions. Then the following is an expression:
  • n,q,c stand also for name, quantity, category.
  • the computation of the total quantities for all categories is performed in three steps: First, all categories are computed:
  • Cat distinct( ⁇ indigenous ⁇ Cat [category - c]
  • the type is ⁇ Cat:[c, total]> ⁇ .
  • NCR-QL XML-QL
  • NCR-QL from-case-where-construct
  • EmpsDeptsSites is the relation defined earlier containing tuples about Employees, Departments, and Sites.
  • HOwnersCities is a relation containing tuples about HomeOwners and Cities. The following NCR-QL query performs some join between the two:
  • NCR-QL query is translated into the algebra using a match operator:
  • the types are (with base types omitted):
  • Joinj ⁇ mll (Rpurchase) X JU ⁇ ml2 (R 12 ) X j ⁇ ...
  • ⁇ q ( ⁇ p (R)) ⁇ r (R) (the combo-combo law)
  • q k > , _ ) ⁇ q ⁇
  • Pm>) ⁇ ...
  • the expression (e) and the condition (c) have a single record argument. (i.e., only use $1, which, by convention, may be omitted)
  • p is a record p-former
  • p i( - b 1 ) : p l ⁇ ,...,(a lk ⁇ b k ) : p k ,c, : eorial...,c p : e p ] .
  • BetsyManages/true - Betsy [name — » name:_]
  • Site/(city " Seattle") — > ⁇ HighEarner: [name — name:_J>
  • Betsy [name - name: _J
  • ⁇ p (Nesting(R, S)) ⁇ p3 (Nest geometric ⁇ ⁇ pl (R), ⁇ p2 (S))) (generic combo-nest law) where pi, p2, p3 are p-formers and n' is an n-former to be described next. The following can identity holds. But, pi and p2 are chosen to do
  • tag 2m /c im "> is a selection that is the same for every i l, ... ,k.
  • ⁇ P (R IX] j S) ⁇ p3 ( ⁇ pl (R)X]j ⁇ p2 (S)) (the combo-join law) where pi, p2, p3 are defined below.
  • cy : rn x r 2j -> bool is the join condition
  • C y ": rn x r2j — bool is a selection predicate. Not all combinations of tags tag;/ must occur in p, but to simplify notations they are included.
  • Cy'(x,y) di j (x) AND eij(y) AND fij(x,y). That is dij(x) contains all conditions that only inspect only values from the left join operand, ej j (y) those conditions that only inspect values from the right join operand, while f]j(x,y) contains conditions that inspect values from both operands and cannot be separated.
  • pi ⁇ tagh / di ⁇ tagli, ...,taglcut/ d combat ⁇ taglcut>
  • p2 ⁇ tag2 1 / ei -> tag2 1; .... tag2 m / e m - tag2 m >
  • condition c 2 ij k looks at three records: x from R, y from S, and z from T. Decomposing it into two pieces:
  • j4 ⁇ tag 2 j, tag 3k / c 4jk ⁇ tag 5jk
  • j3 ⁇ tag ⁇ , tag 5 j k / c 3i j k ⁇ tagijk

Abstract

A model of nested conditional relations ('NCR') is provided for facilitating the integration of multiple data stores with incompatible formats. A data integration engine accepts queries using a standard query language such as XML, executes the queries against the multiple data stores and returns the results. The data integration engine uses XML as the data model in which the data from the various data stores is represented. The data integration converts data in the native format into data in a format referred to as nested conditional relations ('NCR'), applies various operators (e.g. joins, unions) to the data in NCR format to generate the query results in an NCR format and converts the results from NCR format into an XML format.

Description

NESTED CONDITIONAL RELATIONS (NCR) MODEL AND ALGEBRA
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Application No. 60/222,070 filed August 1, 2000 and is related to U.S. Patent Application No. 09/718,228 filed November 21, 2000, U.S. Patent Application No. 09/517,131 filed March 2, 2000 and U.S. Patent Application No. 09/517,468 filed March 2, 2000, which are hereby incorporated by reference.
BACKGROUND
The described technology relates generally to accessing data and particularly to accessing data from data sources with diverse formats.
Large organizations may have their digital data stored in various data stores, such as databases and file systems, in diverse and incompatible formats. Different groups within the large organizations may have created their own data stores to meet the needs of the group. Each group would typically select its own type of data storage system and format to meet its particular needs. Traditionally, these data stores were created independently of any other data stores within the organization. As a result, the various data stores of an organization often contained duplicate and inconsistent data.
Recently, these large organizations have adopted standards such as the extensible markup language ("XML") for representing data in a uniform format. The use of XML by each group within an organization increases the compatibility of the data stores. It is, however, difficult for organizations to provide an XML interface to each of its existing data stores. The organizations would need to expend considerable resources to provide a mapping between their existing data stores or other sources of data and the XML formats.
It would be desirable to have a system that would facilitate the integrating of data stores with incompatible formats. BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 illustrates the schema of this XML document.
Figure 2 represents a Joinln graph (JIG) for the match expression of Table 8.
Figure 3 is a block diagram illustrating the overall organization of an execution program generated by the data integration engine.
Figure 4 is a block diagram illuslrating the function to generate an execution program.
Figure 5 is a flow diagram illustrating processing of the generate extract program function in one embodiment.
Figure 6 is a flow diagram illustrating the processing of the generate extract plan function in one embodiment.
Figure 7 is a flow diagram illustrating processing of the match expression function in one embodiment.
Figure 8 is a flow diagram illustrating the processing of the create Joinln graph function in one embodiment.
Figure 9 is a flow diagram illustrating processing of the generate Joinln graph into one embodiment.
Figure 10 illustrates the tables of the data store.
Figure 11 illustrates the results of the sorted outer union for the tables of Figure 10.
Figure 12 illustrates the SQL query for each of the tables of Figure 10 paid to generate the sorted outer union.
Figure 13 is a flow diagram illustrating the processing of generating a sorted outer union.
Figure 14 is a flow diagram illustrating processing of a generate SQL query function in one embodiment.
Figure 15 is a block diagram illustrating an extract program.
Figure 16 is a flow diagram that illustrates code of a join node of an extract program in one embodiment.
Figure 17 illustrates the output of the nodes of the extraction plan to Figure 15.
Figure 18 illustrates a final NCR structure. Figure 19 illustrates the Correspondence Tree.
DETAILED DESCRIPTION
A method and system for providing data integration of multiple data stores with diverse formats is provided. In one embodiment, the data integration engine accepts queries using a standard query language such as XML-QL, executes those queries against the multiple data stores, and returns the results. The data stores may include relational databases, hierarchical databases, file systems, application data available via APIs, and so on. A query may reference data that resides in different data stores. The data integration engine allows operations such as joins across multiple data stores. In one embodiment, the data integration engine uses XML as the data model in which the data from the various data stores is represented. The data integration engine processes a query by parsing the query into an internal representation, compiling and optimizing the internal representation into a physical execution representation, and then executing the execution representation. By providing a uniform and data model, the data integration engine allows access to data stores in diverse formats.
In one embodiment, the data integration engine executes a query on a data store by first providing a mapping of the data store format into an XML format. The query for the data store is based on XML format. The data integration engine upon receiving a query, generates a native query for the data store from the received query using the provided mapping. The data integration engine then executes the native query to generate data in a native format needed to generate the results of the received query. The data integration engine then converts the data in the native format into data in a format referred to as nested conditional relations ("NCR"). The data integration engine then applies various operators (e.g., joins and unions) to the data in NCR format to generate the query results in an NCR format. The data integration engine then converts the results in the NCR format into an XML format. In this way, the integration engine can provide access to various data sources in different formats.
A nested conditional relation is a table in which each row may have a different schema and each column is either a primitive type or a nested NCR. The schema of each row in an NCR is indicated by a tag, which can be considered to be the zero column of the row. For example, certain rows of the table may represent employees of a company and have columns named "first name," "last name," "phone number," and so on. Other rows in the table may represent departments within the company and have columns named "department name," "department head," and so on. The tag for a row indicates whether the row is an employee or a department row. A column for a certain type of row may itself contain a nested conditional relation. For example, an employee row may include a column named "skills" that contains a table with sub-rows containing information relating to computer skills and accounting skills of the employee. The table may itself be a nested conditional relation in that each sub-row may include a tag indicating whether the row represents a computer skill or an accounting skill. The nesting of nested conditional relations may occur to an arbitrary level. The NCR format is described below in detail.
The following example illustrates a data store, a mapping for the data store, a query, an LMatch representation for the query, a Joinln graph for the query, and an SQL query used to retrieve the data from the data source. Tables 1-3 illustrate an example of data that is stored in a data store such as a relational database. The relational database contains three tables: DEPARTMENTS table, EMPLOYEES table, and BUILDINGSDOCS table.
TABLE 1 DEPARTMENTS
Figure imgf000005_0001
TABLE 2 EMPLOYEES
Figure imgf000005_0002
TABLE 3 BUILDINGSDOCS
Figure imgf000005_0003
Figure imgf000006_0001
The DEPARTMENTS table contains one row for each department of an organization. As illustrated by Table 1, the organization has a finance and an engineering department. The DEPARTMENTS table contains two columns: name and contact. The name column contains the name of the department, and the contact column contains the employee identifier of the contact person for the department. For example, the first row of the table indicates that the department is "finance" and that the contact employee is "E1247." The EMPLOYEES table contains a row for each employee in the organization. Each row includes seven columns: ID, Fname, Lname, Dept, Bldg, Office, and Manager. The ID column uniquely identifies the employee, the Fname column contains the first name of the employee, the Lname column contains the last name of the employee, the Dept column identifies the employee's department, the Bldg column identifies the building in which the employee is located, the Office column identifies the employee's office within the building, and the Manager column identifies the employee's manager. The Dept column contains one of the values from the Name column of the DEPARTMENTS table. The BUILDINGSDOCS table contains a row for each office within each building of the organization. The BUILDINGSDOCS table contains four columns: Building, Office, Phone, and MaintContact. The Building column identifies a bwlding, the Office column identifies an office within the building, the Phone column contains the phone number associated with that office, and the MaintContact column identifies the employee who is the maintenance contact for the office. The combination of the Building and Office columns uniquely identifies each row. The Bldg and Office columns of the EMPLOYEES table identifies a row within the BUILDINGSDOCS table.
Table 4 is an example of data stored as an XML document.
TABLE 4
<deptlist>
<deptname="Finance"> <employee> <name><lϊrst>Bobby</lϊrst><last>Darrows</last></name> <office phone="xll02"/> </employee> <employee>
<name><first>Alice</first><last>LeGlass< last></name>
<office phone="xll03"/> </employee>
</dept>
<dept name="Engineering"> <employee>
<name><fϊrst>David</first><last>McKinzie</last></name> </employee> <employee>
<name><&st>Misha</first><last>Niev</last></name> <office phone="x0150'7> </employee>
</dept> </deptlist>
The XML document includes the root element <deptlist> that has a name attribute and that contains a <dept> element corresponding to each department within an organization. Each <dept> element contains an <employee> element for each employee within the department. Each <employee> element contains a <name> element and optionally an <offιce> element. The <name> element includes a <first> element and <last> element. The <office> element includes a phone attribute. The schema of an XML document may be represented by an XML data type definition ("DTD") of the document. Figure 1 illustrates the schema of this XML document. As this figure illustrates, the schema is specified as a tree-like hierarchy with the nodes of the tree having parent-child relationships. For example, node 104 is the parent of nodes 105 and 108, which are children of node 104. Node 101 corresponds to the <deptlist> element and has one child node 102, which corresponds to the <dept> element. Node 102 has two child nodes, 103 and 104. Node 104 corresponds to the name attribute of the <dept> element and node 104 corresponds to the <employee> element. Node 104 has two child nodes 105 and 108. Node 105 corresponds to the <name> element and has two child nodes 106 and 107. Node 106 corresponds to the <first> element, and node 107 corresponds to the <last> element. Node 108 corresponds to the <office> element and has one child node 109, which corresponds to the phone attribute.
The mapping technique is particularly useful in situations where a legacy database, such as the example database of Tables 1-3, is to be accessed using queries designed for XML data, such as the example of Table 4. The XML schema may be previously defined and many different applications for accessing data based on that XML schema may have also been defined. For example, one such application may be a query of the data. An example query for semi-structured data may be an XML transform that is designed to input data in XML format and output a subset of the data in XML format. For example, a query for the database of Tables 1-3 may be a request to list the ID of each employee in the "Finance" department. The subset of that data that is output corresponds to the results of the query represented by the XSL transform. One skilled in the art would appreciate that queries can be represented in other formats such as XML-QL. When a legacy database is to be accessed, the data is not stored using XML format. Thus, in one embodiment, a query system inputs a semi-structured query and uses a mapping table to generate a structured query, such as an SQL query, that is appropriate for accessing the legacy database. The mapping technique for generating that mapping table is described in the following.
Table 5 is a portion of the mapping table generated in accordance with the mapping technique that maps the XML schema of Table 4 to the legacy database of Tables 1-3.
TABLE 5
Figure imgf000008_0001
The mapping table contains one row for each parent-child relationship of the XML schema. The mapping is further described in U.S. Patent Application entitled "Method and Apparatus for Storing Semi-Structured Data in a Structured Manner." As shown in Figure 1, the XML schema defines eight parent-child relationships such as the relationship between node 102 and node 104. Thus, the mapping table contains eight rows. Each row uniquely identifies a parent-child relationship using the ParentName and ChildName columns. For example, the parent-child relationship of node 102 and node 104 is represented by row 3 as indicated by the ParentName of "dept" and the ChildName of "employee." Each row maps the parent-child relationship to the table in the legacy database that corresponds to that relationship. In the example of row 3, the Table column indicates that the "dept-employee" relationship maps to the EMPLOYEES table. The query system could use only the ParentName, ChildName, and Table columns of the mapping table to generate a structured query from a semi- structured query. For example, if the legacy database had used the same column names as defined by the elements of the XML schema (e.g., "employee" rather than "ID"), then only these three columns would be needed to generate the structured query. For example, if the semi-structured query requested an identifier of all employees within the finance department and the DEPARTMENTS table contained an "employee" column rather than an "ID" column, then the query system could input a semi- structured query with only these three columns and generate a structured query. In the more general case where the columns of the legacy database are arbitrarily named, the mapping table includes a parent key column ("PKey") and a child key column ("CKey"). The parent key column contains the name of the column that identifies the parent of the parent-child relationship. The child key column contains the name of the column that identifies the child of the parent-child relationship. For example, in row 3, the parent is identified by the "dept" column and the child is identified by the "ID" column in the EMPLOYEES table. Thus, to generate the structured query to retrieve the ID of an employee within the finance department, the query that uses a select clause of EMPLOYEES. dept="Finance" would be used. Table 5 also includes a column named "A/E" to indicate whether the row corresponds to an element within the semi- structured data or an attribute of an element with semi-structured data. As illustrated by rows 7 and 8, some of the parent and child keys actually consist of multiple columns that uniquely identify a row in the corresponding table. For example, the rows of the BUILDINGSDOCS table are uniquely identified by a combination of the Building and Office columns.
The query system maps the selections within the semi-structured query to selections within a structured query. The following illustrates the basic format of that mapping when the structured query is an SQL format.
SELECT {TABLE}JCKEY} FROM {TABLE}
WHERE {TABLE}.{PKEY} = pkey
The TABLE, CKEY, and PKEY parameters are replaced by the corresponding values from the row in the mapping table for the parent-child relationships specified by the selection. In other words, this query will find all the children given the key for the parent. The following illustrates the format of the mapping when the query represents the identification of the idea of all employees within the finance department.
SELECT EMPLOYEES.ΓD
FROM EMPLOYEES
WHERE EMPLOYEES.Dept = "Finance"
The query system also allows chaining of keys to effectively navigate through the hierarchy defined by the semi-structured data. The query system uses the joint concept of relationship databases to effect this chaining of keys. The following illustrates chaining:
SELECT {TABLE2}.{CKEY2}
FROM {TABLE1}, {TABLE2}
WHERE {TABLE1}.{PKEY1} = pkey && {TABLE1}JCKEY1}=
{TABLE2}.{PKEY2}
The TABLE1, PKEYl, and CKEYl parameters are derived from the first parent-child relationship in the chain, and the TABLE2 , PKEY2, and CKEY2 parameters are derived from the second parent-child relationship in the chain. The child key associated with the first parent-child relationship matches the parent key associated with the second parent-child relationship. The following is an example of the chaining to identify the building for the employees of the finance department.
SELECT BUΓLDINGSDOCS.BUΓLDING
FROM EMPLOYEES, BUTLDINGSDOCS WHERE EMPLOYEES = "Finance" && EMPLOYEES.BLDG = BUILDINGDOCS.BUILDING && EMPLOYEES.OFFICE = BUrLDINGDOCS.OFFICE
In one embodiment, the mapping table also contains the value rows corresponding to each leaf node, that is a node that is not a parent node. The leaf nodes of Figure 1 are nodes 103, 106, 107, and 109. In one embodiment, each value row identifies an XML element or attribute, the table in the legacy database that contains an element, and the name of the column in the table that contains the value for that element or attribute. Table 6 illustrates the four value rows for the mapping associated with Tables 1-3 and Table 4.
TABLE 6
Figure imgf000011_0002
The "A/E" column identifies whether the row is an attribute or element; the "Name" column identifies the name of the element and attributes; the "Table" column identifies the legacy table; the "Key" column identifies the key for that table; and the "Value" column identifies the name of the column where the value is stored.
Table 7 illustrates a query that is to be applied to the data of Tables 1-3. The query indicates to return the first and last names and phone number of each employee in the engineering department.
Table 7
Figure imgf000011_0001
The data integration engine generates a "match expression" for a logical match operation ("LMatch") for the query when compiling the query. The logical match operation supports operations for performing XML navigation. The match expression defines a tree of navigations. Each node of the tree indicates a navigation type (e.g., child, parent, or sibling), a navigation condition (e.g., a condition on the name of the child), whether the navigation is required, whether there should be a binding to the target of the navigation (i.e., a value returned with the specified name), and whether the result should be nested.
Jo- Table 8 illustrates a match expression for the XML of Table 4 for the query of Table 7. Each row of Table 8 represents a different navigation path. For example, the first row represents a navigation path from the root of the deplist element to its child element of the dept element and then to the name attribute of the dept element. The remaining rows represent different branches on the tree. For example, the second row represents the branch of roøt(deplist), c (dept), c <i(employee), c <i(name), and child(_ιst). The symbols prefixed with "$" represent bindings.
Table 8
Figure imgf000012_0001
Figure 2 represents a Joinln graph (JIG) for the match expression of Table 8. The Joinln graph is a data structure that facilitates the optimization of the query to be executed against the data store. This JIG indicates that the Departments, Employees, and Buildhigdocs tables of the data store are to be joined together. This JIG also indicates the bmdings (e.g., $first) and the join columns (e.g., Name and Dept). The format of the JIG is described below in detail. The JIG is generated from the match expression using the mapping. The data integration engine then generates the query to be executed. The following query is generated.
SELECT EMPLOYEES.Fname, EMPLOYEES. Lname, BUILDlNGSDOCS.phone FROM DEPARTMENTS, EMPLOYEES, BUILDINGSDOCS WHERE DEPARTMENTS.Name = EMPLOYEES.Dept AND
EMPLOYEES.Bldg = BUILDINGDOCS.Building AND
EMPLOYEES.Office = BUTLDINGSDOCS.Office AND
DEPARTMENTS.NAME = "Engineering"
Figure 3 is a block diagram illustrating the overall organization of an execution program generated by the data integration engine. An execution program consist of an extract program 310 and a construct program 320. A compiler of the data integration engine generates the execution program during a compilation phase. The extract program is a series of operations on a data extracted from the data sources. The extract program represents a graph of the operations. The leaf nodes 311 of the extract program represents a sorted outer union operation applied to the data stores 312. The compiler generates a query for each data store in the native query language of the data store to retrieve the results of the sorted outer union. The compiler generates the sorted outer union using the LMatch operation, Joinln graph, and mapping. During execution of the extract program, the generated query is applied to each data store. The construct program accesses the root node 313 of the extract program which retrieves the results generated by the extract program. The construct program collects the data and formats it into an XML output. As discussed below in more detail, the output of each operation of the extract program is in a nested conditional relation format.
Figures 4-9 are flow diagrams illustrating processing of the compiler of the data integration engine in one embodiment. Figure 4 is a block diagram illustrating a function to generate an execution program. The function first generates the extract program and then generates the construct program. In block 401, the function invokes a generate extract program function to generate an extract program for the specified query against the specified data stores. In block 402, the function invokes the generate construct program function to generate a construct program to generate the results from the extracted data.
Figure 5 is a flow diagram illustrating processing of the generate extract program function in one embodiment. In block 501, the function generates an extract plan, hi block 502, the function identifies fragments of the extract plan. A fragment of an extract plan are the set of operations that are applied to data derived from a single data source. Operations that apply to data from multiple data sources are grouped into one fragment. In block 503, the function optimizes the operations of the fragments and then returns.
Figure 6 is a flow diagram illustrating the processing of the generate extract plan function in one embodiment. In block 601, the function receives the XML query. In block 602, the function generates a match expression for the logical match associated with the data store. In block 603, the function creates the Joinln graph from the match expression using the mapping for the data store. In block 604, the function generates the native query from the Joinln graph. The function indicates additional processing to generate the extract plan from the Joinln graph. Blocks 602-604 illustrate the generation of the native query for the sorted outer union of the leaf nodes of the extract plan. The ellipses indicate other processing performed by the function. The function then returns.
Figure 7 is a flow diagram illustrating processing of the match expression function in one embodiment. This function is passed an XML node representing the data store and returns the match expression. This function is recursively invoked for each child node of the passed XML node. In block 701, the function initializes the subtree to the XML node. In block 702-705, the function loops creating a match expression for each child node. In a block 702, the function selects the next child node of the XML node. In decision block 703, if all the child nodes have already been selected, then the function returns, else the function continues at block 704. In block 704, the function recursively invokes the create match expression function passing the child node and receiving a child sub-tree in return. In block 705, the function adds the child sub-tree to the sub-tree and then loops to block 702 to select the next child.
Figure 8 is a flow diagram illustrating the processing of the create Joinln graph function in one embodiment. In block 801, the function invokes the generate Joinln graph passing the match expression and receiving the Joinln graph in return. In block 802, the function merges nodes of the Joinln graph. In block 803, the function processes merging of adjoining nodes of the Joinln graph and then returns.
Figure 9 is a flow diagram illustrating processing of the generate Joinln graph function into one embodiment. This function is passed a match expression and returns a Joinln graph. The function is recursively invoked for each child node of the passed match expression. In block 901, the function sets the Joinln graph to a node corresponding to the root of the match expression. The function retrieves the mapping rows that can further the path from the root. In block 902, the function selects the next child node of the match expression. In decision block 903, if all the children have already been selected, the function returns, else the function continues at block 904. In block 904, the function recursively invokes the generate Joinln graph function passing the selected child node of the match expression and receiving a child Joinln graph in return. In block 905, the function adds the child Joinln graph to the Joinln graph and then loops to block 902 to select the next match expression. Figures 10-15 illustrate the generation of an SQL query for a sorted outer union node of an extract program. Figure 10 illustrates the tables of the data store. The arrows between the tables illustrate joins between tables. For example, arrow 1001 represents a join between the third column of table 1.1 and the first column of table 2.L Figure 11 illustrates the results of the sorted outer union for the tables of Figure 10. Figure 12 illustrates the SQL query for each of the tables of Figure 10 that are used to generate the sorted outer union.
Figure 13 is a flow diagram illustrating the processing of a function to generate a sorted outer union. In block 1301, the function selects the next table of the source data store. In decision block 1302, if all the tables have already been selected, the function continues at block 1304, else the function continues at block 1303. In block 1303, the function invokes the generate SQL query for the selected table and then loops to block 1301 to select the next table. In block 1304, the function executes the generate SQL queries against the tables. In block 1305, the function aggregates of the result of the queries into a table. In block 1306, the function sorts the results and then returns.
Figure 14 is a flow diagram illustrating processing of a generate SQL query function in one embodiment. In block 1401, the function outputs a select, from, and where clause for the query. In blocks 1402-1408, the function loops selecting each table in a join path of the data store. In block 1402, the function selects the next table in the path. In decision block 1403, if all the tables have already been selected, then the function returns, else the function continues at block 1404. In block 1404, the function adds the table to the from clause. In block 1405, the function adds the table to the where clause. In block 1406-1408, the function loops selecting each column of the selected table. In block 1406, the function selects the next column. In decision block 1407, if all the columns have already been selected, then the function loops to block 1402 to select the next table of the path, else the function updates the select clause with the column and then loops to block 1406 to select the next column. Columns of tables not in the selected path are set to null.
Figure 15 is a block diagram illustrating an extract program. Each of the leaf nodes 1501-1505 represent SQL queries that are applied to a data source. Node 1506 represents a nesting of the results of nodes 1501 and 1502. Node 1507 represents a nesting of the results of nodes 1506 and 1503. Node 1508 represents a selection on the results of node 1507. Node 1509 represents a nesting of the results of nodes 1504 and 1505. Node 1510 represents a join of the results of nodes 1508 and 1509. Node 1511 represents a projection of the results of node 1510. Node 1512 represents the construct program that accesses the extract program.
Figure 16 is a flow diagram that illustrates code of a join node of an extract program in one embodiment. In one embodiment, the processing of each node of extract program is performed a pipeline manner, that is each node returns only the data needed to satisfy the next request from the construct program. In decision block 1601, if the right node is a fully processed, then the function continues at block 1602, else the function continues at block 1605. In decision block 1602, if the left node of the join is fully processed, then the function returns, else the function continues at block 1603. In block 1603, the function retrieves at the next results from the left node. In block 1604, the function initializes the right node based on the results returned from the left node. In block 1605, the function retrieves the next results from the right node. In decision block 1606, if the results returned from the right node are contained in nested table, then the function returns an iterator for that table, else the function returns the results. The iterator for a table is an optimization that allows nodes higher in the extract program to retrieve subsequent rows of the nested table without having to invoke lower-level nodes in the extract program.
Figure 17 illustrates the output of the nodes of the extraction plan to Figure 15. When the construct program 1712 invokes root node 1711 of the extract program, that invocation is propagated down to the leaf nodes. The SQL query of node 1701 returns result 1713, and the SQL query of node 1702 returns result 1714. Node 1706 indicates to nest results of nodes 1701 and 1702. In this case, result 1714 is nested within result 1713 as indicated by result 1715. Node 1703 returns result 1716. Node 1707 nests result 1716 within result 1715. The subscript within node 1707 specifies a target for the nesting. In this case, the subscript 2 indicates to nest within the third column of result 1715. (Columns are identified starting with column 0.) Result 1717 represents the result of the nesting. Node 1708 represents selection on the result 1717. The target represented by subscript 2.1 indicates to select the third column and the first row within the third column. The result of the selection is result 1718. Results 1719-1723 illustrates the results of the other nodes of the extract program.
LMatch Operation
The LMatch operator performs navigation-based selection over XML input. The following example illustrates an XMLQL syntax fragment and the LMatch instance that is created to model it inside the compiler:
<aXb><c>$c</></></> ELEMENT_AS $a
LMatch ( $ results , "self (a, $a) -cήild (b, ~ ) -child ( c, $c) " )
The "se//(a,$a) — child(b,—) — child(c,$c)" is a match expression. In this example, the match expression is a tree with three nodes. The general structure of the XMLQL pattern is translated into an isomorphic pattern within the match expression. The XMLQL variables become "bindings" within the navigations. The LMatch operator is one of the logical operators of the internal language of the data integration engine. The LMatch operator is generally the "first" operator that is applied to input data and is responsible for converting XML input in to NCRs that are then further processed by the query engine. The LMatch operator is a logical operator only in that one of the actions of the Compiler is to convert LMatch operators into a data source- dependent form (e.g., SQL for relational databases, or QLL for QL-Lite data sources).
The LMatch operator defines a match against XML data. The pattern is defined by the "match expression," which is a tree of navigation steps. Each navigation step describes a "movement" from a source element or attribute to a target element or attribute. The parameters of the navigation step that govern navigation are as following:
• The type of movement or navigation (child, parent, descendant etc.) The navigation types are based on XPath axes.
• The name of the target element or attribute, which may be a wild card.
• Whether the target should be an element, an attribute, or either.
• Whether the navigation is optional or not.
LMatch matching is top-down on the tree of navigation steps. That is, the match begins at the root of both the XML document and the root of the match expression. Matches for the first navigation step are sought in the entire XML document. If the first navigation is a root navigation, then it matches the root of the XML document (where we interpret root to be the root element, not the document item, as defined in DOM). If the first navigation step is something other than root, it is as a navigation from the root.
Once a node or set of nodes have been identified for the first navigation, the algorithm proceeds recursively: given a matched node, attempt each of the child navigations from the navigation tree (where child here means "child in the navigation tree," rather than child type node). Each attempted navigation will itself yield a new set of zero or more matches, which are then continued in the next level of the recursion, and so forth. While the recursion proceeds down the navigation tree, the navigations do not necessarily proceed "down" the XML tree; navigation types can move in arbitrary directions within the XML document (e.g., ancestor or precedingjsibling).
If an attempted navigation yields zero matches from some source node, then that navigation is said to have failed. If the navigation was not marked as optional=true, then the failure of the naviagation causes the source node to be "unmatched." The following match expression illustrates the failure of a navigation; se /(a,$a)— child(b,$b)— child(c,$c)
The first navigation step may yield a single element <a>. The second step may yield a set of <b> elements, some of which contain <c> elements and some of which do not. When the final navigation is evaluated, it will, for some <b> elements, yield no results. If the navigation is optional (optional=true), then all the <b> elements are included in the result. If, however, the navigation is required (optional=false), then those <b> elements that contain no <c> elements are removed from the set of matches for child b) from the root <a> element. The result contains only <b> elements that actually contain <c> elements. If no <b> elements remain after this process, then the failure propagates upward, ''urrmatching'' the <a> elements (unless the child(b) navigation was optional).
The evaluation of an LMatch operator is a three stage process: first, match the pattern within the LMatch operator against some source of XML; second, connect columns in the LMatch pattern with their associated items in the information set of the XML source; and thirs, structure those connected columns (the extracted information) into an NCR as indicated by the nesting settings on individual navigations. That is, an LMatch operator specifies a structural pattern that is sought after in a document, specifies which parts of that pattern should be returned, and specifies how the returned parts should be organized. The output of an LMatch operator is an NCR that contains the returned parts, organized as specified.
The parameters of the LMatch that govern how results are constructed are these:
• The set of columns returned. An NCR column "names" some piece of information returned from an element or attribute node that has been matched. There are several kinds of columns: o value: the contents of a simple element or attribute o subtree: the entire element (not applicable to attributes) o name: the name of the element or attribute o text: the text value of an element (used to extract text from mixed-content elements; in a simple element it is equivalent to 'value') o table: the table column gives a name to the entire set of results when nested— true
• Whether or not the results of the navigation should be nested.
Each navigation step may have one or more of the column types present. The type of the column is derived from the type of the corresponding contents of the XML document (except for the table column).
These columns are structured into an NCR based on the nested flag and the table column: linested=true, then the table column was specified, and the navigation creates a nested NCR. This NCR contains all the other columns for this navigation step, as well as all the columns generated by the subtree of navigations beneath it. For example: selfip,-)— NcM<i(b,$b,$btable)— child(c,$c) —child(ά,$ά)
The root (top-level) navigation may also be nested or unnested. hi addition, the
LMatch operator, like other operators, provides an additional column that names the its entire schema. The child(b) navigation is a nested navigation that results in a nested NCR, named $btable, in the result. This NCR will contain columns $b (because $b is a column on the child(b) navigation) and $c (because $c is a column on a navigation in child(b)'s subtree). Figure 18 illustrates a final NCR structure.
A depth-first traversal of the match expression of an LMatch operator is used to construct the columns of the output NCR. As a result, the LMatch operator also defines an ordering of the columns as well as their structure and names.
When a navigation matches multiple times, then the results differ based on whether the navigation is nested. If the navigation is a nested navigation, then a nested NCR is created, which will contain the matches. But if the navigation is not nested, then the results are combined via a cross-product with all the other columns in the same table. So, if one <b> element contained multiple <c> elements, the $btable would contain the corresponding <b>-<c> pairs. Navigations that are not nested can be treated as a special case of nested navigations. Thus, an LMatch operator can be evaluated as if all navigations are nested. Then, for each navigation that is actually nested, a an LFlatten operation can be used to remove the table corresponding to the nesting.
A subtree column results in the entire XML subtree, tags and all, being returned as an atomic value. (This corresponds to the ELEMENT_AS notation in XML-QL.) The compiler transforms this column into a more complex LMatch expression that "pulls apart" the entire subtree contents and modifies the rest of the execution unit to reconstruct the result back into a subtree when needed. As a result, subtree columns exist initially, but they are replaced with more complex patterns. Before they are rewritten, the subtree columns are modeled in the NCR schema as a single, static column. After the rewrite, they begin with a table-valued column containing the nested results.
Advantages of the LMatch operator being a single, complex operation include:
1. When queries are generated for query languages which themselves contain some form of matching operations, then mapping onto those operations is enabled.
2. Certain optimizations that may be done on navigational matching are better enabled by capturing succinctly the navigation that is being done. In particular, reasoning about substitution of a descendant relation with a union of paths, and vice versa. Also, also reasoning about document order relations. 3. The LMatch operator combines two kinds of capabilities into a single operator: navigational operations and composition of the results into a complex structure (the NCR). This allows a concise representation of a very common idiom. The LMatch operator can be matched against a tree that represents an XML generator, rather than the actual XML document. For example,
• The XML RDB Map can be interpreted as a generator of an XML document from a relational database. Matching the LMatch operator against an XML RDB Map is a fundamental step in converting the XML query into SQL.
• The Construct Program of a query can be interpreted as a generator of an XML document from an NCR. Matching the LMatch operator against a Construct Program is a fundamental step in composing views.
The algorithm for matching against tree-structured XML generators is very similar to the algorithm for matching against XML input directly. One difference is that where matching against an XML document generates tuples of output, matching against a generator generally produces a Correspondence Tree, which encodes all the potential correspondence points between the nodes of the generator and the navigation steps of the LMatch.
An XML generator is a tree (actually, a forest suffices) where the nodes in the tree represent the generation of XML elements or attributes or their values, and arcs between nodes represent inclusion. For example: element("person") — attribute("ssn") — value() — element("name") — value() — element("address") — value()
The XML generator also indicates the arity of each arc. The values for arity are optional (0 or 1), singular (exactly one) and multiple (0 or more). If an arc is marked multiple, then the generator can generate more than one instance of the child node for each parent instance. In the above example, if the arc between "person" and "name" were marked multiple, then a person could have zero or more names. The arity of an arc is indicated by a subscript on the arc as shown in the following: element("person") — sattribute("ssn") — value()
Melement("name") — value()
0element("address") — value() When no arity is indicated, singular is assumed. If it is not possible to derive arity information from the generator, then multiple is assumed, since it is the most general case.
The Correspondence Tree tracks which navigation steps in the LMatch operator correspond with which nodes in the XML generator. The Correspondence Tree would be isomorphic to the LMatch navigation graph except for one thing: any given navigation step might match against multiple nodes in the generator. The following is an example of an XML generator, an LMatch operator, and the corresponding
Correspondence Tree:
The XML generator: elementC'person' i — attribute("ssn")2 — value()3
— element("name")4 — value()5
— element("name")6 — value()7
The LMatch: selft($ςχsoή)\ — child(ssn)2
Figure imgf000022_0001
Figure 19 illustrates the Correspondence Tree.
The subscripts on nodes in the generator and LMatch distinguish otherwise identical nodes when they appear in the Correspondence Tree. The Correspondence Tree is "read" as: "The root navigation has a single match, namely the elementC'person"^ node of the XML generator. From this generator node, the next LMatch navigation, c <i(name)3, is matched against two different generator nodes, and so on.
The Correspondence Tree is a bipartite graph. A bipartite graph is one in which nodes come in two different alternating types. In this case, the node types are called navigation nodes (which reference navigation steps, and are pictorially indicated with brachets [ ]) and choice nodes (which reference generator nodes, and are pictorially indicated with braces { } ). A bipartite graph is interpreted as having two different kinds of arcs, which are indicated by lines of different weights: light lines are choice arcs (arcs from navigation to choice nodes, choosing amongst multiple correspondences) and heavy lines are navigation arcs (arcs from choice to navigation nodes, following the navigation relationships in the LMatch operator).
A correspondence is a (navigation step, generator node) pair of a correpondence tree. A correspondence is derived from a choice node by including the navigation step from the parent. For example, the following subgraph of a correspondence tree yields the following correspondence: subtree: [ c (name)3 ] — { S: element("name")4 } correspondence: { c d(name)3, element("name")4 }
The following matching algorithm generates the Correspondence Tree, given an LMatch operator and an XML generator as input. The algorithm is a top-down recursion over the LMatch navigation graph.
The XML generator has the following operations:
XMLGenerator.root () -5> ordered list of GeneratorNode
Generate.rNode. type ( ) -5> { "element" I "attribute" | "value" }
GeneratorNode.name () -^ GName
GeneratorNode . genChildren ( ) --> list of GeneratorNode
GeneratorNode. arity (childNode) -► { IIS II 1 lljyjll 1 IIQ1I J
In this example, the LMatch operator is limited to the following navigation types: root, child, self. The nested flag on LMatch navigation steps is irrelevant to matching. The LMatch operator provides the following pseudo code for accessing the match expression:
LMatch . root ( ) - NavStep
NavStep type() -5 { "root" | 'child" | "self" }
NavStep ea() -> { "element" | "attribute" 1 "either" }
NavStep name ( ) -> NName
NavStep navChildren ( ) -_> list of NavStep
NavStep optional ( ) -S boolean
There is also a fimction, nameMatch (GNa e, NName) -» boolean, that returns true or false as the name from a generator node matches the name of an LMatch navigation. The Correspondence Tree provides the following operations:
CorrespondenceTree.root () -j NavigationNode CorrespondenceTree.createRoot ( NavStep ) NavigationNode.ne ( NavStep ) NavigationNode.navStep () -_> NavStep
// Model . navStep () . type () NavigationNode.typef) -5 { "root" | "child" | "self" }
// And . navStep () . name () NavigationNode .name ( ) -_ NName NavigationNode.choiceChildren( ) -> list of ChoiceNode
NavigationNode. addChoiceChild( ChoiceNode )
ChoiceNode. ne ( GeneratorNode, arity )
ChoiceNode. generatorNode ( ) -_► GeneratorNode
ChoiceNode. type () -> { "element" | "attribute" | "value" } // ditto
ChoiceNode. name () -> GName // ditto
ChoiceNode. arity () - { "S" | "M" | "0" } // ditto
ChoiceNode .navChildren ( ) - list of NavigationNode
ChoiceNode . addNavChild( NavigationNode )
The following illustrates the BuildCorrespondence function that is invoked to build a Correspondence Tree for an XML generator and an LMatch operator:
// Assume a rooted LMatch match expression;
// normalize the LMa tch to make this true if necessary.
BuildCorrespondence ( XMLGenerator g, LMatch lm )
{
// create the correspondence tree ct <- new CorrespondenceTree nn <- new NavigationNode ( lm.root() )
// bootstrap the first level of expansion,
// matching root against roots ct. addRoot ( nn ) foreach( gn in g.getRoot() ) { en <- new ChoiceNode ( gn, "M" ) if ( addNavs ( nn, en ) ) nn->addChoiceChild( en )
} return ct
}
The form of the algorithm is mutual recursion between two functions, each of which extends the graph by one level, or fails to do so (because there is no match). The subroutines return boolean values indicating whether or not they were successful; this value is then used to determine whether or not to continue and whether or not to actually add nodes to the graph. The following is the pseudo code for the addNavs function:
// From a given corresponding navigation and choice node pair, // extend the choice node for each child navigation of the
// navstep. boolean addNavs ( NavNode nn, ChoiceNode en )
{ foreach ( step in nn. navStep () .navChildren () ) { stepnavnode <- new NavigationNode ( step ) success <- addChoices ( en, stepnavnode ) // if a navigation is optional, we include the navNode, // even if it failed (the navNode will have no choice children) if ( success I | step. optional ( ) ) cn->addNavChild ( stepnavnode ) else // failure of a required navigation/ abort return false
}
// if no required navigation failed, return true return true
The following is the pseudo code for the addChoices function:
// Given a location in the generator and a requested
// navigation, "follow" the navigation in the generator tree,
// finding a new layer of correspondences . boolean addChoices ( ChoiceNode en, NavNode nn )
{ success <- false foreach ( gn in follo ( cn.generatorNode () , nn. navStep () ) { choicenode <- new ChoiceNode ( gn, cn.generatorNode () .arity (gn) ) thissuccess <- addNavs ( nn, choicenode ) if ( thissuccess ) { nn->addChoiceChild ( choicenode ) success <- true } }
// return true if at least one choice worked out return success;
The following is the pseudo code for the follow function:
// Implement the actual navigation; this would be extended // with more types of navigation as the LMatch is extended // (and would probably require the generator to support more // powerful navigations as well, at least parent () ) . List<GeneratorNode> follow ( GeneratorNode gn, NavStep nav ) {
List<GeneratorNode> result <- (); switch ( nav. type () ) { case "self" : if ( nameMatch ( gn .name ( ) , nav.name ( ) ) result. add ( gn ) case "child" : foreach ( gnkid in gn. genChildren ( ) ) if ( (nav.ea() == "element" || nav.ea() == "either") && gnkid. type () == "element" && nameMatch ( gn. name ( ) , na . ame ( ) ) ) result. add( gn ) else if ( (nav.ea() == "attribute" || nav.ea() == "either")
&& gnkid. type () == "attribute"
&& nameMatch ( gn.name(), nav. name () ) ) result. add ( gn ) case "root" : foreach ( r in xmlGenerator. root ( ) ) // [1] if ( namematch ( r. name ( ) , nav. name ( ) ) result. add ( r ) } return result }
The BuildCorrespondence algorithm presented above does not match against actual XML data. However, an XML document may be considered a degenerate XML generators with singular-arity arcs and constant value nodes and and NCR is built rather than a Correspondence Tree . The relationship between a Correspondence Tree and an NCR is as follows:
• The values for any particular navigation are the concatenation of the values for each choice below that navigation; the result is a list of data.
• For nested navigations, the rows of the nested table are that list of data.
• For an unnested navigation that is at most singular, then the list of data can contain only 0 or 1 rows. In this case, the NCR column is essentially a field that is filled in the by value.
• In general for unnested navigations, the list of data is "joined against" the existing rows of the containing table. If the navigation is optional, the join is an outer join, if required, an inner join. If the list has multiple entries, the effect is a cross product against the other contents of the table.
Because navigations can result in failure that propagates recursively upwards, matches to the leaves are evaluated before committing to any results. Alternatively, the LMatch operation could contain only optional navigations or only required navigations in cases where the data will be present. Similarly, it is possible to eliminate the need to handle joins or cross products by limiting the LMatch operator to only allow unnested navigations when the data is at most singular.
Two type of normalization that can be performed on LMatch operators are removal of (non-root) self navigations and removal of implicit cross-products. The normahzed LMatch operator would consist only of a single root self navigation and following child navigations, where for each child navigation, nested=true. Alternatively, the normalization could cover either (nested=true) or (nested=false and optional=false and the child is known to exist in a strict 1:1 relationship with the parent). Additional normalizations, such as requiring optional=true on all nested child steps, may also possible.
To normalize the LMatch operator, additional operators are inserted to the Logical Extract Program to compensate for the changes to the LMatch operator. These logical operators include the LSelect, LFlatten, and LBox operators. The LSelect operator removes tuples from a table based on some condition. The LFlatten operator flattens a nested table within an NCR. The operator is applied to a single nested table, and the process of flattening removes that table. The LFlatten operator has a boolean parameter "outer" indicating whether the flattening operation should behave like an inner or left outer join — that is, if the nested table is empty, does flattening remove the containing row or not. The LBox operator serves to introduce an artificial level of nesting within a table.
A singular relationship between a child navigation and its parent navigation is identified by examining the XML schema of the data that the LMatch operates against. Initially, the matching algorithm has been run. After that, it can be deteimined, for each navigation step, which place(s) in the schema the LMatch operator could match. From that information, and from the cardinality information available in the schema, it can be identified whether the singular condition holds.
The first version of the algorithm generates an LMatch that contains a single, top-level set/ navigation and otherwise contains only child navigations. All navigations (including the self at the top) have nested=true. The resulting navigations may have optional-true or optional=false. The implementation can be styled in a bottom-up or top-down traversal, but note that in either case compensating operators are to be inserted at both the bottom and top of the chain.
The table below illustrates the various cases that can arise. The right-hand column has examples of the transformations. Here is a sample XML document this can be tested against:
<a>al<b>bl</b> <c>cl<d>dl</dx/c></a> <a>a2<b>bl</bxb>b2</b>
<c>cl<d>dl</dx/c>
<c>c2<d>d2</dx/c></a> <a>a3
<c>cl<d>dl</dX/c>
<c>c2<d>d2</dxd>d3</dx/cx/a> <a>a4<b>bl</bx/a>
Figures 20-25 illustrate normalization.
Figure imgf000028_0001
In one embodiment, the following optimization may be applied. If an LBox is followed by the flattening of all its columns, the nested tables can be joined with a sequence of LJoin operators (as cross products) instead. This optimization could be performed either during this algorithm, or as a post-processing step. To illustrate, the last example above could be rewritten as shown in Figure 25: Alternatively, the normalization can be modified to state that only nested=true are added to child steps that can have multiple (or, possibly, optional) values. This normalization is may be easier for inputs to create NCR' in which 1: 1 elements are listed as flat columns of a row; any nesting on these columns may need to be added by an explicit LBox operator. In the case of a child step that has nested=false, and the step has been marked as singular without changing the value of the nested flag and without adding an LFlatten operation. The other steps do not change; in particular elision of a self step in the general case may result in adding an LBox, possibly followed by a LSelect or LFlatten operation. However, if all child navigations of a se/ navigation are singular, then the LBox and corresponding LFlatten can be omitted. The corresponding LSelect needs to be changed to a test on the NULL-ness of the columns, rather than a test on the emptiness of a nested table. This condition can be detected in a postprocessing step, but it would require information from both the LMatch (the singularity of steps) and correlated information from the logical extraction program (the presence of LBox and LFlatten/LSelect); thus, this optimization may be implemented as an integral part of the recursive algorithm.
Nested Conditional Relations (NCR) Model and Algebra
NCR extends relational algebra in two ways. First, it makes relations heterogeneous (i.e., allows them to contain records of different types). Each record is accompanied by a tag, describing its type, hence the term conditional relation. Second, relations can be nested. The value of an attribute can be either atomic (e.g., int, float, string) or another NCR.
1.1 Relational Algebra
Traditional relational data models have tables that are homogeneous and flat and selection and projection operators select a subset of rows and fields in a table. A homogenous table is one that has rows of the same type. A flat table has atomic fields, that is fields in the first normal form. The following table is a traditional relational data model.
Figure imgf000030_0001
Each row in the table is a record of the same type:
[Name : string, ID : string, Phone : int, Floor : int] Each field in the row is an atomic type. The table is a set of such rows. Its type is:
Depts: {[Name : string, ID : string, Phone : int, Floor : int]}
The following selection operation (σ): σ Fioor>3 (Depts) results in a subset of the original table, consisting of the highlighted rows below:
Figure imgf000030_0002
The following projection operation (TI):
TI Name, Phone (C Floor>3 (Depts)) results in a subset of the original table, consisting of the highlighted rows below:
Figure imgf000031_0001
T e Nested Conditional Relation model (NCRs) has tables that are heterogeneous and nested and have generalized version of selection and projection the select a subset of the fields.
1.2 Conditional Relations
A heterogeneous collection, or conditional relation is a relation which may have rows of different types. The Dept and Persons persons tables below are of the traditional relational model.
Figure imgf000031_0002
Persons
Figure imgf000031_0003
A heterogeneous table consisting of departments and persons is obtained by interleaving the rows of the Depts and Persons tables as shown below: DepartmentsPersons
Dept: Dept: Pers: Pers: Dept: Dept: Pers: Dept: Dept:
Dept:
Figure imgf000032_0001
The Departments rows have four fields and the Persons rows have only three fields. To represent such a table, a tag is added to each row. The value of the tag can be either Dept or Pers. Each row has a structure that depends on this tag. The type of such a row is called a tagged union type, and is denoted as:
<Dept: [Name : string, DD : string, Phone : int, Floor : int] | Pers: [SSN : int, Name: string, Salary: int]> A value of this type is either a record of type [Name : string, ID : string, Phone : int, Floor : int] preceded by the tag Dept, or a record of type [SSN : int, Name: string, Salary: int] preceded by a tag Pers.
The type of the entire table DepartmentsPersons is a set of a tagged union type:
DepartmentsPersons: {<Dept: [Name : string, ID : string, Phone : int, Floor : int] |
Pers: [SSN : int, Name: string, Salary: int]>}
The following selection operator selects rows with all departments above the 3rd floor: σ <Dept(Fioor>3)> (DepartmentsPersons)
The "Dept" tag in the condition (i.e., "Dept/(Floor>3)") indicates to select rows with a Dept tag. The "FIoor>3 " indicates to select rows that have Floor>3. All rows that do not have the Dept tag are selected intact. Thus, the type of the result is:
{<Dept: [Name : string, ID : string, Phone : int, Floor : int] | Pers: [SSN : int, Name: string, Salary: int]>} The result is the highlighted rows as shown below:
Dept: Dept:
Pers: Pers: Dept: Dept: Pers: Dept: Dept:
Dept:
Figure imgf000033_0001
The following selection operation selects departments above the 3r floor AND people earning more than 50000: σ <DeP (Fioor>3) i Pers/(Saiary>5oooo)> (DepartmentsPersons)
The condition applies to both Dept rows and Pers rows. For Dept rows, the condition specifies Floor>3; for Pers rows, the condition specifies Salary>50000. The result consists of the highlighted rows below:
DepartmentsPersons
Dept: Dept: Pers: Pers: Dept: Dept: Pers: Dept:
Dept:
Dept:
Figure imgf000034_0001
The same result can be achieved by applying the two selections in sequence, that is: σ <Dept (Fioor>3) | Pers/(Saiary>50Qoo)> (DepartmentsPersons) = = σ <Dept/(Fioor>3)> (o"<pers/(Saiary>5oooo)> (DepartmentsPersons)) = σ<pers/(Saiary>5oooo)> (o"<DePt/(Fioor>3)> (DepartmentsPersons))
The following projection operation projects out the Name and Phone fields for the Dept rows and the Name field for the Pers rows:
TI <Dept:[Name,Phone] | Pers:[Name]> (σ <Dep</(Floor>3) | Pers/(Salary>50000)> (DepartmentsPerSOns))
The type of the result is:
{<Dept: [Name : string, Phone : int] | Pers: [Name: string]>}
The results consist of the highlighted fields below: DepartmentsPersons
Dept: Dept: Pers: Pers: Dept: Dept:
Pers:
Dept:
Dept:
Dept:
Figure imgf000035_0001
If only a subset of the tags are mentioned in the projection operator, then all rows tagged with the other tags are left unchanged in the result. Thus, the expression above is equivalent to:
TI <Dept:[Name,Phone] | Pers:[Name]> (θ" <Dept/(Floor>3) | Pers/(Salary>50Q00)> (DepartmentsPersons))
= IT <Dept:[Name,Phone]>(π<perS:[ ame]>(σ<Dept/(Floor>3)> (σ<Pers/(Salaιy>50000)>(DepartmentsPerSθns))))
- TI <Dept:|WameIPhone]>(σ<Dept/(Floor>3)>(TI<Pers:[Naιne]> (o<pers (Salary>50Q00)>(DepartmentsPerSOns))))
1.3 Nested Conditional Relations
The following table illustrates NCRs. The Pers rows have an Assignments field that is a non- atomic field. The field contains a nested condition relation in that its sub-rows can be of type Project or Committee.
DepartmentsPersons
Dept: Dept:
Pers:
Pers:
Dept: Dept:
Pers:
Dept: Dept: Dept:
Figure imgf000036_0001
The type of this table is:
{<Dept: [Name : string, ID : string, Phone : int, Floor : int] | Pers: [SSN : int, Name: string,
Assignments: {<Project: [Name: string, Lang:string] Committee: [Name: string]>}]>} Rows tagged with Dept are flat records. The rows tagged with Pers have an Assignments field with its value as another NCR, with its rows tagged either with Project or with Committee. In this example, there is only one level of nesting, in general, however, there may be arbitrary levels.
The following projection operation select the Name and Phone fields of the Dept rows and the Name field of the Pers rows.
II <Deptpfame^hone^erS | ame]> (DepaitmentsPerSOns)
The result is a flat relation of type:
{<Dept: [Name : string, Phone : int] | Pers: [Name: string]>}
Projections and selections can be combined and applied to the inner relations. This is done with combined operator, called Combo, that does both the selection and the projection. A combo operator takes an argument p, called a.p-former, that describes what selections and projections are to be done. The p-former generalizes both the argument in a selection, σp, and that in a projection πp. The following is an example of a p-former: p = <Dept/(Floor>3):[Name,Floor] | Pers/(Name like "S%"):
[Name, Assignments: {<Project/(Lang="C++"): [Name]|
Committee/(Name="Promotions")[Name]>}]>
Then the combo operator is written as:
Σp(DepartmentsPersons) This combo operator applies the selection condition (Floor > 3) and projects on the Name and Floor fields for the Dept rows. The combo operator also selects on the condition (Name like "S%") then projects on Name and Assignments fields on the Pers row. Furthermore, this combo operator processes Assignments recursively as follows. This combo operator applies the selection condition (Lang = "C++") and projects on the Name on Project rows and selects on (Name = "Promotions") and projects on the Name field on Committee rows. The type of the result of this combo operator is:
{<Dept: [Name : string, Floor : int] | Pers: [Name: string, Assignments: {<Project: [Name: string] |
Committee: [Name: string]>}]>}
The results of this combo operator is the highlighted fields below: DepartmentsPersons
Dept: Dept:
Pers:
Pers:
Dept: Dept:
Pers:
Dept: Dept: Dept:
Figure imgf000038_0001
The Combo operator selects a submatrix via a combination of selections (on tags and predicates) and projections (on fields). 2 Operands and Typing
An NCR is strongly typed which will allow type inference and type checking. The basic manipulable types are tables and single-valued attributes. The basic types are primitives or sets which are defined as:
Base or primitive types: b ::= integer | long integer | string | float | decimal [precision.accuracy] | ... Manipulable types: t ::= b Member of base type t ::= {u} Set (table) of tuples from "union type"
Record (i.e., row), variant, and union types are defined as:
Record type: r ::= [a1: tla a2: t2, ..., an: t„] where ai , a2 ... an are distinct labels called attributes, or fields, or just labels.
Variant type: v ::= tag:r where tag is a label, called the tag of the variant type.
Union type: u ::= <vι I V2 I ... I vn> where vi , v2 , ... , vn are variant types having distinct tags.
In the following example, a table is defined that includes both PeopIe(name: String, ssnύnt) rows and Employees(name: String, ssnant, salary: float) rows. The record types for People and Employees are defined as follows: rPeopie : := [name: String, ssn:int] rempioyees ::= [name: String, ssn:int, salary:float]
The variants are defined as:
Vpeopie : "People: [name: String, ssn:int]
Vempioyees ::= Employees: [name: String, ssnint, salary:float] and the union type is:
Upeo ieEmpioyee ..-<People:[name:String,ssn:infJ |
Employees: [name: String,ssn:int,salary:float]>
The NCR type describing a table consisting of both people and employees is: tpeopi mpioyees :~ {<People: [name:String, ssn:int] |
Employees: [name:String, ssn:int, salary:float]>}
Operators
A formal algebra using these types is defined using the following general principles. • Each operator should support tables with heterogeneous rows (i.e. , a table of union types).
• A different action can be specified for an operator for each unique row type.
• Operators focus on the common case and a generalized map operator handles complex cases.
2.1 Project
The Project operator returns a different subset of the attributes from each variant type. The project operator, denoted Up, is parameterized by a list, p, of elements of the form tag:[set of attributes]. "P" is called a. projection p-former. It determines both the type of the project operator and its semantics (which columns are being projected out).
Defn. projection p-former:
Given a union type with u = <tag_ : | ... |tagra: tn>, where = [a.x , : tx 1 , a12 : t12 , ... , alnι : tln ] , ... , tn= [aml Λ,a.αα:ttti2t...,siα. : t- l , a projection p-former with input u is an expression:
Figure imgf000040_0001
where the indexes satisfy:
Figure imgf000040_0002
l≤i„ <i12 <...<i, ≤n
l≤iql <iq2 <...<iqPq <nlq
The "type" of the projection p-former is: p : : u -» u' where u' is defined as follows. Let i/, j=l, ... ,m be:
Figure imgf000040_0003
:tiiu'ah12 :t12'-"'alpι :thlpι]
g = l-aq>,ι • tqι,ι'aq2 • t q2'---'aq'Φq ^qiqp, -I
Figure imgf000040_0004
The output type is u' = <tagt: t'ι| ... |tagm: t'n>.
Defn. Project Operator:
Let p, u, u' be as above, and let t={u}, t'={u'}. The projection operator is: πp(t) = f
The project operator only affects tuples with tags that are mentioned in p. Tuples with other tags are simply copied to the output, unaffected. The following table shows the results of applying a project operator. The input table is an EmpsDeptsSites NCR with Emp, Dept, and Site tags. Emp:
Dept:
Emp:
Site:
Emp:
Dept:
Emp:
Emp:
Dept:
Emp:
Figure imgf000041_0002
The following projection operator is applied to the EmpsDeptsSites table:
π<Emp:[name] | Dept: [name,mgr] | Site: [name,city]>(EnipsDeptsSites)
The result is the following table:
Figure imgf000041_0001
The project operator is equivalent to π<Emp:[name] | Dept [name,mgr] >(EmpsDeptsSites). The Site records are included in the result by default. 2.2 Select
The Select operator returns a different subset of records for each variant type. The select operator, denoted σp, is parameterized by a list, p, of elements of the form tag/condition. "P" is called a selection p-former. It determines both the type of the select operator and its semantics, hi the following, conditions and expressions are defined.
Defn. Expressions:
Given k record types rls ..., rk, and a type t, an expression e of type t with arguments rl3 ..., ric, in notation: e : rι x ... x rk - t is defined inductively below. Expressions occur in a context, which is defined to be a sequence of record types, (r^, r2', ..., rn'); unless specified, the context is empty, n=0.
1. Attribute: if n = [..., a : t, ...], then $i.a is an expression of type t.
2. Context: if rj- = [..., a : t, ...], then $[j].a is an expression of type t.
3_, Scalar operator: if el5 e2 are two expressions of base types bi, b2 respectively, then Qι op e2 is an expression of type b, where op is an operator:
Figure imgf000042_0001
op is one of (+, -, *, /), or a string operator (concat, substr), or any user-defined function on scalar values. NCR expression: if ei, e2, ... are expressions of types Xι, t2, ... , then fi^e^, ... ) is an expression of type t, where f is an NCR expression f: ti x t2 x ... — >• t. When k=l, then only $1 can occur (i.e., not $2, $3, ...) and $l.a is abbreviated with a.
Contexts are not used in selections, but are used later in the combo operator.
Defn. Conditions (predicates) on records:
Given k record types X\, ..., a condition with arguments rls ..., r , in notation: c : ϊι x ... x r — > bool is defined inductively as follows:
1. if ei, e2 are expressions of base types bl5 b2 respectively, both with arguments x\, ..., r , then a oprel e2 is a condition with arguments rls ..., rk, where oprel is < <=, >, >= or string operations such as substr, prefix, suffix, like, etc. 2_ if e is an expression of type (<...| tag:[a1:t1,...,an:tn] |...>} and el3 ..., e„ are expressions of types ti, ... , tn respectively, then:
<tag:[aι:eι, ..., an:e„]> IN e is condition. 3_ if e is an expression of type {... }, then: exists(e) is a condition. 4. true, false are conditions. 5^ if cl, c2 are conditions, then so are Ci and c2, Ci or c2, not ct.
Defn. Selection p-former:
A selection p-former is an expression:
p = < tag/c1 | ... | tagIk/ck > and its "type" is:
p::<tagι:rι | ... | tag_ τ^> → < tag : t ... | taglk : tlk >
where ij,...,ik e {l,...,n} , and Cj is a condition, c- : r — » bool , forj=l,...,k
Defn. Selection operator:
Let p be a selection p-former. The selection operator is: σ P({t}) = {t}
The selection condition(s) only apply to tuples with tags mentioned in p. Tuples with other tags are simply copied to the output, unaffected.
When the following select operator is applied to the EmpsDeptsSites table described above
C> <Emp / (sal>500Q0) | Dept / (mgr="Betsy") | Site / (true)> (EmpsDeptsSites) the result is:
Emp:
Emp:
Site:
Emp:
Dept:
Figure imgf000043_0001
The same operator can be expressed as:
O" <Emp /(sal>50000) | Dept /(mgr="Betsy") > (EmpsDeptsSites)
The select operator operates only on the top level, in that it decides for each top level record whether to keep it or toss it. However, the condition c can look deep inside the current record (e.g., by using existential/universal quantifiers).
2.3 Rename
The rename operator (p) renames tags and/or attributes. It is a generalization of the rename operator in the relational algebra:
Defn. Renaming p-former:
A renaming p-former is an expression: ρ = <tagι->tagι,:[aιι→anl, a12→a12', ... ] | tag2→tag2':[a21->a21', a 2→a22' > Defn. Renaming operator:
Given a renaming p-former p, the renaming operator is:
Figure imgf000044_0001
This operator renames tagi to tag'i, and renames the fields in the record of type tagi by changing an to an', • • • ; renames tag2 to tag2', and so on. All tags and/or labels that are not mentioned are left unchanged. The output type is "isomorphic" to the input type: only the variant and record labels at the top level have changed. The "identity" mapping aj — » a; can be abbreviated to a; in specifying a renaming.
2.4 Extend
The extend operator (ε) adds new fields to a record, each computed by some expression from other fields.
Defn. Extension p-former:
Given union types u = <tagι. ι,...,tagn:r„> and u' = <tagi.τι',...,tagn:rπ'>, an extension p-former of type u -> u' is an expression: p = <tagι:[cιι:en, c12:e12, ... ] | ... | tagn:[Cni:enl, cώ:e„2, ... ]> where the expressions ey, ... have types ejj : r; — » ty1, and n' is a record type obtained by adding the fields [ci^tϋ1, c_:l_ ...] to n.
Defn. Extension operator:
Given an extension p-former p, the extension operator is: εP({t}) = {f}.
The meaning is that new labels cπ, c12, ... are added to the corresponding records, and their values are computed by the exrepssions en, e12, ... The values of all the other labels are copied into the output unchanged. As for the other operators, not all tags need to be mentioned: the missing ones are copied to the output.
2.5 Combo
The Combo operator (Σ) combines Project, Select, Rename, and Extend, and does this to arbitrary nesting levels. The Combo operator is parameterized by an argument that is a deeply structured expression combining arguments of Project (IT), Select (σ), Rename (p), and Extend (ε). Such an expression is called a p-former. Unlike the other operators, combo does not implicitly copy "the other" tags and labels to the output, but deletes them. This allows both copying and deleting. (For convenience, another version of the combo operator may also be defined that copies by default.)
The rules below define a p-former inductively. A p-former, p, has an input type t and an output type , and the p-former is denoted by: p :: t → f The p-former takes an input value of type t and returns either an output value of type t' or "nothing." When used in a Combo operator, ∑p, the type of the Combo operation is {t} — » {t1}. Consider a selection in the relational algebra, σ age<20 , (age<20) is a p-former: it takes a record and returns either the same record (if age<20) or nothing (otherwise).
For each p-former, there is a context (which, recall, is a sequence of record types, (rls ... , rn)). Inner p-formers have a context consisting of all record types of the surrounding records. The top-level p- former has an empty context.
1. The "identity p-former":
_ :: t — > t for every type t returns its input, unchanged
2. The "record p-former":
if pi :: t → t'i , .... :: tit → t'k , are p-formers, {ύ, ..., ik} c {1, 2, ..., n} and ei: .. ,an:tn] - ti", ... , ep: [ai :tι, ... ,an:tn] -> tp" are expressions, then
Kai.bι) :>-JAk → b .- pt. c, : e1,...,cp : ep] :: r -» r', where r = [ai:ti,...,aa:tn] and r' = [bι:tι,,..., bk:tk ,,c1:t1", ..., Cp:tp"].
The (a — b) : p components rename an attribute a to b, and apply recursively the p- former p on the value of that attribute. The c.e components introduce a new attribute
(c) whose value is computed by the expression e over r. If the context of p is (r1} ..., r„), then the context for each of p1; ..., p is (r1; ..., rn, r).
The following condition applies to p-formers in Combo operators:
(*) a. ,...,aik are distinct and i, ..., ep are scalar expressions
(The b's and c's are also distinct, which follows implicitly from their use in the type [bi :ti', ... , b :t ',Ci :ti", ... ,cp:tp"]). The restriction ensures that every complex value is copied to the output at most once, and new values being produced are scalar: this enables simple, pipeline computation.
Given an input value
Figure imgf000045_0001
. . ., an : vn], the p-formers p1; ..., pk first apply to the values vn, ..., Vik respectively. If any of them returns "nothing," then the record p- former returns "nothing." Otherwise, let vi', ..., Vk' be the values returned by the - formers, and let wi, ... , wp be the values record p-former returns the record [bi:vi'
Figure imgf000045_0002
3. The "variant p-former" is parameterized by a selection condition, c:
if c : r - bool and p : : r -» r' then (tag/c -> tag'):p :: tag:r - tag':r'
The condition c is checked first. If it returns false, then the p-former returns "nothing 1 Otherwise, it returns whatever p returns, but changes the tag to tag'. The c may use existential/universal quantifiers on the set fields of r.
4. The "union" p-former is:
if ι::vii -> vι', ..., ρk :: v;k → vk' and { , ..., ik} c { 1, 2, .... n} then <pi | ... | ρk> :: <vι | ... |vn> -» <vι'|... | v '> Given a value of type <vι |... | V > , union p-former checks that it is of one of the variant types v;ι,... ,v;k: if not, it returns "nothing," otherwise it returns whatever the corresponding p-former returns. Neither the variant types Vi , ... , Vk nor the variant types Vi1,... , Vk' are disjoint: however identical tags in the latter correspond to identical types, i.e., <v\ \... | Vk are a correctly formed type.
5. The "set" p-former is:
ifp :: u — » u' then {p} :: {u} -> {u'}
Given a set {xls ..., xn}, the set p-former first applies the p-former p to each element in the set. Let yi, ..., yk be all values returned by p (t'.e., excluding "nothing"), then the set p-former returns {yi, ... , yk}
Defn. Combo operator: ifp is a p-former, p :: t -> f, then:
∑ p ((t}) = ({t*})
Given a set {xl3 ..., xn}, the operator first applies the p-former p to each element in the set. Let yi, ..., yk be all values returned by p (i.e., excluding "nothing"). Then Σ p returns {yl3 ..., yk}
The following p-former is defined for the EmpsDeptsSites table: p = <Emp/(sal>50000) →HighEarner:[name- name:J | Dept/(mgr= "Betsy") — >BetsyManages:[name→name:_] | Site/(true) — »Site: [name— »name:^city— >city : J>
The following combo operation is applied to the table:
p (EmpsDeptsSites)
If(a→ a) is abbreviated with a, p:_ with p, and tag/(true) with tag, then combo operator can then be abbreviated as:
∑<Emp/(sal>5QQ0O)- HιghEarner:[nanιe] | Dept (mgι= "Betsy")→Bets Manages:[name] | Sitc:[name,city]> (EmpsDeptsSites)
The results of this combo operator is:
HighEarner:
HighEarner:
Site:
HighEarner:
BetsyManages:
Figure imgf000046_0001
This example illustrates the use of contexts. The following relation (base types omitted) is used. Pers :{<a: [name, birthday, projects: {<b:[title, deadline, modules:{<c:[id,date]}]>}]>}
Each person has a set of projects, each project has a set of modules. a. The combo Σp2(Pers) returns all persons that work on a project whose deadline is on their birthday. Its definition needs two p-formers: pi = <b/($l.deadline=$[l].birthday):_> p2 = <a/(Exists(Σpl($l.projects)):_>
The context expression $[l].birthday in pi, which retrieves the birthday from one level higher up, is used in the "context" of p2, which defines the outer context to be a record of type [name, birthday, project]. b. The combo Σp3(Pers) deletes from every person all projects whose deadline is on the person's birthday: p3 = <a:[name:_J birthdays projects: {<b/($l. deadline != $[1] .birthday) :_>}]> c. This illustrates the use of a $[2] context. The combo Σp4(Pers) operates as before, but in addition it deletes all modules whose dates are on the person's birthday: p4 = <a:[name:_J birthdays projects: {<b/($l.deadline != $[1] .birthday): [titles deadline:^, modules:{<c/($l.date != $[2].birthday):_ >}
]>
1. A Projection operator is a particular case of the Combo operator. For example π<Emp:[name] | Dept: [name,mgr] | Site: [name,cit ]>(EmpsDeptsSites) is the Same as ∑<Emp:[name] | Dept: [name,mgr] | Site: [name,city]>(EmpsDeptsSites)
2. A Selection operator is a particular case of the Combo operator. For example
C<Emp /(sal>50000) | Dept/(mgr="Betsy") | Site/ (trae)> (EmpsDeptsSites) is the Same aS ∑<Emp / (sal>50000) | Dept / (mgr="Betsy") | Site / (true)> (EmpsDeptsSites)
3. A Renaming operator is a particular case of the Combo operator. For example, given the renaming p-former: p = <Emp— >Employee:[nanιe— >person,ssn,saI,phone— ^contact]] |
Dept- Depart: [name— teammember,loc— »place,mgr] |
Site— >Site:[name,city]> the renaming operator
Pp(EmpsDeptsSites) is the same as
Σp(EmpsDeptsSites).
4. The most general form of a Combo operator is like a submatrix selection. There is no copying and no unnesting involved.
5. Combo can be used to homogenize a collection. For example, in EmpsDeptsSites there are three different kinds of records, and all share a name attribute. The following Combo operator extracts all names and constructs a homogeneous collection:
∑<Emp→Res:[name] | Dept→Res: [name] 1 Site→Res: [name]>(EmpsDeptsSites).
The result is of type {<Res:[name:string]>} which is a homogeneous collection. 6. Combo can be used to dispatch records to different types (e.g., transforming a homogeneous collection into a heterogeneous one). The following Combo operator splits Emp's into Regular and HighPaid:
∑<Emp/(sal<100k)→ReguIar:[name, phone] | Emp/(sal>100k)-»HighPaid: [name,phone]>(Emps)
The input has type
{<EmpIoyee:[name, phone, salary]>},
(base types omitted) while the output has type
{<Regular:[name] | HighPaid: [name, phone]>}
In general conditions that are applied to a tag may overlap. For example, employees with saKlOOk may be dispatched to some type, and those with sal>50k to another type. In this case, records that satisfy both conditions will contribute to both outputs.
2.6 The Simple Combo
The semantics of combo is that tags not mentioned in the p-former are dropped from the output. The "simple" combo has a complementary semantics: only tags/labels that are to be modified need be mentioned. By default, all others are copied to the output. Moreover, the "simple" combo only does one single action, possible at some depth in the NCR.
Defn. Simple p-former:
A simple p-former is a p-former that includes only a single selection (i.e., for a single tag), a single projection, renaming of a single tag or label, or an extension with a single new label. Formally, it is defined like a p-former with additional syntactic restrictions that ensure that only one action is performed (only a projection, a selection, a renaming, or an extension). The other parts of the simple p-former, that do the copying, are omitted. The simple p-former is illustrated by the following examples:
<Emp/(saI>=50000)> this simple p-former is a selection and is equivalent to
<Emp/(sal>=50000):_ | Dept:_ | Other:_>, i.e., copy the other tags unchanged. Thus, to get the equivalent "real" p-former, the omitted tags Dept and Other need to be added.
<Dept:[name, projects]> this simple p-former is a projection and is equivalent to
<Emp:_ I Dept:[name:^, projects:_], Other:_>.
<Dept:[projects:{<urgent/(deadline>="10/10/2010")>}]> this is a selection on the inner relation projects, and is equivalent to
<Emp:_ I Dept:[name:_, floor:_, projects: {<normal:_ | urgent/(deadline>="10/10/2010"):_>}] | Other:_>.
Omitted tags (Emp, normal) are added and omitted labels (name, floor) are added too.
<Dept[projects:{<urgent:[name, team]>}]> this is a projection at a deeper level.
Tags and labels are added, except where projection is done:
<Emp:_ I Dept:[name:_, floor:_, projects: {<normal:[name:^ team:_|, urgent:_>}] |
Other: >
Defn. Simple combo:
Given a simple p-former p, the simple combo operator is:
Figure imgf000049_0001
The superscript 0 indicates that the combo is "simple" (i.e., all missing tags and labels in p have to be added). The semantics is defined as follows. Given a p-former, p::t -» t',(simple or not) define the completion of p to be c(p) : t — » t" obtained from p by "completing" the missing tags according to the type t (i.e., c(p) should actually be denoted c(p,t); it can be defined inductively; omitted). The semantics of the simple combo operator is:
Figure imgf000049_0002
Example
Given NCR EmpsDeptsSites from above, the following are simple combos:
0<Emp/(sai>5θooo):j> (EmpsDeptsSites) is the same as:
∑<Emp/(sal>50000)]:_ | Dept:_>(EmpsDeptsSites)
∑°<DePt [mgr]> (EmpsDeptsSites) is the same as: ∑oeptt∞gr] ] Emp^> (EmpsDeptsSites)
The simple combos are strictly less powerful than combos, because they do not allow deletion of tags from the output type. For example, consider the combo Σ<Dept:[mgr]>(EmpsDeptsSites). Its output type is {<Dept: [mgr]>}. It may be expressed as a sequence of simple combos:
∑ <Emp/false:_> (∑ <Dept:[mgr]> (EmpsDeptsSites)) where the second simple combo is needed to eliminate all Emp records. But, the output type of this expression is {<Dept:[mgr] | Emp:[name, ssn, sal, phone]>}. A type checker could be modified to recognize that some conditions are (equivalent to) false and eliminate the corresponding tag from the output type. If so, combo operators could be expressed as a sequence of simple combo operators.
2.7 Match
The match operator, Ωm, generalizes the combo operator by relaxing some of its restrictions. The match operator is parameterized by an m-former, m, that is defined inductively like a p-former in Combo, with two generalizations:
1. In the record m-former:
[(all → b1) : p ,...,(alk → b ^,^ : e„...,cp : ep] :: [ai:ti,...,a„:t„] - [bι:tι',..., bk:tk',ci:t ", ..., cp:tp"] the labels an, ... , a,k are not required to be distinct, and the expressions el3 ... , ep are not restricted to be scalars. (That is the restriction (*) is dropped.) This allows data values to be copied.
Example: Ω<perS:[Name→Namei,Phone,Name→Name2]>, copies the Name value calling it Namel and Name2. Such copying can be expensive, when the value being copied is a large sub-relation. This is unlike Combo where no copying is done. 2. There exists an "unnest" m-former, with no corresponding p-former: if m : : u — > u' then unnest(m) :: {u} — u'
(This is different from the set p-former, {p} : {u} -» {u'}). Unnest flattens an inner relation. When two or more unnest m-formers are used in a record m-former, their result consists of a Cartesian product: again, this can be expensive.
Whenever an unnest m-former is used inside another m-former m : : t -» t', the result type t' is not a legal type any more, but an "extended" type, t' can be converted back into a legal type using a mapping norm.
Defn. Match operator: if m : : t — > t' is a m-former, where t is a type and t1 an extended type. Then:
Ωm({t}) = {norm(t')} Example:
Ω<τ:[A:unnesto, B:unnestQ]> (^T^ A:{<Ti:[a:tϋ,b:ti2] | T2:[c:t 13, d:t14]>},
B:{<T3:[e:t21, f:t22] | T4:[g:t 23, h:t24]>}]
>}) = {<T:[ A:<Tι:[a:tn,b:tι2] | T2:[c:t 13, d:t14]>,
B:<T3:[e:t2ι, f:t22] | T4:[g:t 23, h:t2 ]>] >})
The only change in the output type is that some braces { ... } have been erased. The output type is technically illegal in the type system, because each record field needs to be either atomic or a set. It can be normalized by pulling out all variant types to the top level and flattening the records. The normalized type is defined to be the output type of Ω. We call norm the normalization operation. The following example illustrates the normalization operation: norm <T:[ A:<Ti:[a:tii,b:ti2] | T2:[c:t ι3, d:t14]>, B: T3:[e:t21, f:t22] | T4:[g:t 13, h:t14]>]>
= <TT1T3:[Aa:t1ι,Ab:t12Be:t2i,Bf:t22] | TT1T4:[Aa:t1i,Ab:ti2Bg:t23,Bh:t24] | TT2T3:[Ac:t13,Ad:t14Be:t21,Bf:t22] | TT2T4:[Ac:t13,Ad:t14Bg:t23,Bh:t24] >
The norm operation constructs new tag names and new field names by concatenating existing names.
Definition of norm. The symbol "®" is used to denote the following operation between record and/or union types:
[aπ:tn, a12:t12, ..., alm:tιm] Θ [a21:t21, a22:t22, ..., a2n:t2n] = [aπ:tπ, a12:t12, ..., alm:tlm, a21:t21, a22:t22, ..., a2n:t2n] r ® <tagi : τ | tag2 : r21 ... | tag„:r„> = <tagt : r ® rx | tag2 : r ® r21 ... | tagn:r ® rn>
<tag1 : τι I tag2 : r21 ... | tagn:rn> ® r = <tagι : ri ® r | tag2 : r2 <E> r | ... | tag„:r„ ® r> <tagn : rn I tag12 : r121 ... | taglm:rlm> ® <tag21 : r21 1 tag22 : r221 ... | tag2„:r2n> = < tagiftagzj : m ® r2j | i = 1, m, j = l,n >
In the last line tag concatenation is used. The norm operation concatenates two record types, or, if the two types are unions of m and n record types respectively, then constructs a new union type with mn record types, by considering all pairs of concatenations.
The symbol "Θ" is used to denote the union of two disjoint union types:
<tagn : ru I tag12 : r121 ... | taglm:rlm> θ <tag21 : r21 1 tag22 : r 21 ... | tag2n:r2n> = <tagn : rn I ... I taglm:rlm | tag21 : r21 1 ... | tag2„:r2„>
Two functions rlabela and ulabeltag. are defined that add or concatenate a record label a, or a tag to a type: rlabela(t) = [a:t] where t is an atomic type or a set type rlabela([aι:tι, a2:t2, ..., a„:t„]) = [aaι:tι, aa :t2, ..., aa„:t„] rlabela(<tagι:rι | tag2:r21 ... | tagn:r„>) =
<tagι:rlabela(ri) | tag :rlabela(r2) | ... | tagn:rlabela(r„)> ulabeltag(r) = <tag:r> where r is a record type
Figure imgf000051_0001
| tag2:r21 ... | tagn:rn>) = <tagtagι:rι | tagtag2:r21 ... | tagtag„:r„ >:
For the definition of the function norm, the definition of types ("normal types") is extended to "extended types. " These are precisely the types t' that can occur in m-formers, m :: t->t'.
= b
= {u}
= u /* this is an extended type and "illegal" under the normal definition */
= [a-ι-tι, a2:t2, ..., a„:t„]
: = <tagι : ri | tag2 : r21 ... | tag„:rn>
The norm operation is defined by: norm(b) = b norm({u}) = {norm(u)} norm([ai:ti, a2:t2, ..., an:tn]) =
Figure imgf000051_0002
® ... ® rlabelan(norm(tn)) norm(<tagι:rι j tag2:r21... | tagn.τn>) = ulabeltagl(norm(r i) ) ® ... θ ulabeltagn(norm(rn) )
The operation norm(t) results in a normal type; and if t is a normal type then norm(t) = t.
2.8 Join
Join takes a tuple from each input table and tests this combination to see if it meets a predicate; if so, it returns a combined tuple as the result. Join allows specification of a different predicate and name and a different output tag for each pair wise combination of types from the two tables. Not all pairs must have a predicate and output tag: those that do not contribute to the join. An "inner" join is described in the following.
Defn. Conditional predicate on two tuples:
The Join and Nest operations need conditions on two records: c : n x η - bool
Defn. Combined tuple:
Join produces a combined tuple as its output. Recall the "® " operator, which returns a new combined tuple. rl ® r2 = [ah: th, al2:tl2, ..., aln: tln, a2ι: t21; ..., a2m: t2_J where rl = [ah: th, al2:tl2, ..., al„: tl„], r2 = [a2ι: t2ι, ..., a2m: t2m]
(There is an implicit condition here that the labels in the two records are disjoint.)
Defn. j -former:
A j -former of type : j :: <tagh:rh | ... | tagln:rl„>, < tag21:r2i| ... | tag2m:r2m> -> <tag', : rl ® r2, I ... I tag* : rl, ® r2, > is an expression of the form: j = <tagl ,tag2 /c, → tag\ | ... | tagllp ,tag2Jp /cp - tag'p > where:
1. ii, ...,ip e {1, ..., n}, ji, ...,jp e {1, ..., m} such that all pairs (ii i), ..., (ip, jP) are distinct (hence p <n2), and all tags tag'i, ..., tag'pare distinct
2. ck : rl, x r2j — > bool, for k = 1, ..., p.
Defn. Join operator:
Let j : : tl x t2 — > t be a j-former. Then the join operator is: Xi ({tl}> {t2}) = {t>
2.8.1.1. LI Example
The following NCRs are used to illustrate a join of two tables:
Figure imgf000052_0001
Figure imgf000053_0001
A join on the two tables can be performed with one join predicate for (Dept x Emp) and another for (Dept x Site):
IX] <Emp,Dept/($l.dept=$2.dname)→EmpLoc | Site,Dept /($l.loc=$2.sname)→FullDept> (EmpSitβS, Depts) returns the following NCR:
Figure imgf000053_0002
2.9 Outer Join
An outer join has the same syntax as an inner join, with minor additional restrictions. The semantics differs:
Defn. oj -former:
Let tl = <tagh:rh I • • • I tagln:rln> and t2 = < tag2ι:r2ι| ... | tag2m:r2m> An oj-former for type: oj:: tl, t2 → tl θ t2 θ
Figure imgf000054_0001
® r2Λ | ... |tag'p : rl ® r2Jp > is an expression of the form: oj = <taglJag2/c1 - tag\ | ... | tagllp Jag2Jp/cp -> tag'p > where:
1. ii, ...,ip 6 {1, ..., n}, j1; ...,jp e {1, ..., m} such that all pairs (i } ), ..., (iP, jP) are distinct (hence p <mn), and all tags tag'i, ..., tag'p are distinct
2. ck : rl, x r2Jk - bool, for k = 1, ..., p.
3. all tags tag'i, • • • , tag'p are distinct, and they are also distinct both from tagi i, ... , tagln and from tag2l3 ... , tag2m
Defn. Outer Join operator:
Let oj :: tl x t2 — » t be an oj-former. The outer-join operator is:
Figure imgf000054_0002
Example
The outer join is illustrated by a data integration scenario. There are two sources of persons' phone numbers where each source has an attribute that tells us the confidence in that piece of information:
Sourcel: {<sourcel:[namel: string, phonel int, conf :real]>} Source2: {<source2:[name2: string, phone2:int, conf2:real]>}
The integration is done in two steps. First, the outer join is computed, then a selection/project that encapsulates the logic of the integration is applied. The first step results in the raw data: rawIntegratedData = Sourcel Cd [X] □ 0j Source2 where the oj-former is:
oj = <sourcel, source2/($l.namel=$2.name2) - integrated >
The type of rawIntegratedData is: rawIntegratedData :{<sourcel:[namel, phonel, confl]
I source2:[name2, phone2, conf2] I integrated: [namel, name2,phonel,phone2,coni ,conf2]>}
That is, it will include all "integrated" records, as well as "dangling tuples" from each source. The second step can now apply an arbitrarily sophisticated integration logic. For example, source 1 may be trusted more than source 2, have some complex rules on how to deal with conflicting information, and drop records where the confidence is too low:
integratedData = Σp(rawIntegratedData) where the p-former p encapsulates the integration logic:
p = <sourcel/(confl > 0.1) — certain: [namel -» name, phonel — » phone] I source2(conf2 > 0.5) — > certain: [name2 — > name, phone2 — phone]
I integrated/(confl. > 0.4) -» certain: [namel -» name, phonel — > phone] I integrated/(conf <0.4 and conf2>0.7) -» certain : [namel— name, phone2— »phone]
] integrated/(confl in 0.1..0.4 and conf2 in 0.5..0.7) → uncertain: [namel — name, phonel, phone2] >
The type of integratedData is: integratedData: {<certain: [name, phone], uncertain: [name, phonel, phone2]>} that is, in some cases both phone numbers are kept since the confidence does not favor one over the other.
2.10 Nest
The Nest operator works like a left outer join, but it nests all matching children within the tuple of each parent. Nesting is commonly done in XML-QL subqueries, and any time there is a l:n parent- child hierarchy in the output. Nest has left outer join semantics, rather than inner join semantics, and that it preserves the order of the parent relation. Nest can rename both the tag types for the nested tuples and for the parent tuples.
Defn. n-former.
Given n x m predicates:
Cy = rli x r2j - bool and n new labels, bl3 ... , bn.. An n-former of type: n :: <tagh : rh | ... | tagl„ : rln> <tag2ι: r2ι| ... | tag2m: r2m> - <tagι:r! ® [b1:{u2}] | ... | tag „ : r„ ® [bn:{u2}]> is given by an expression: n = <tagπ : [b; : <tag21/cϋ | ... | tag2m/cim>] | i = 1, ..., n>
Defn. Nest operator:
Let n :; u1; u2 -> u be an n-former. Then Nestn: ({ui}, {u2}) = {u}
Example This example uses same Depts and EmpsSites relations of types:
Depts :{<Dept: [dname, loc]>}
EmpsSites: {<Emp: [ename, ssn, dept] | Site: [sname, city]>} and performs the operation:
NeSt <Dept:[info: <Emp/($l.dname=$2.dept) | Site/($l.loc = $2.sname)>]>(DeptS, EmpsSites)
The result has type:
{<Dept: [dname, loc, info:{<Emp: [ename, ssn, dept] | Site: [sname, city]>}]>} and is depicted below:
Figure imgf000056_0001
2.11 Union
The union of two union types, Ui Θ u2, is defined to consist of all tags in both types, provided that a tag occurring in both Ui and u2 has identical types in Ui and u2.
Defn. Union operator:
U ({<rϋ, ri2, ..., rin>}, {<rjl5 rj2, ..., rjm>} = {<r.ι, ri2, ..., r;n> θ , rj2, ..., rjm>} Example
The union operator is illustrated using the Depts and EmpsSites tables from the join example. The operation: U (Depts, EmpsSites) results in the NCR:
Dep I dname | loc | Dept: Dept: Dept: Emp: Emp: Site:
Emp: Emp: Emp: Emp:
Figure imgf000057_0001
The expression A U B, for various types of A and B is illustrated in the following:
1. If A:{<Emp:[name, phone]>}, B:{<Dept:[name, floor]>}, then A U B has type {<Emp:[name, phone] | Dept: [name, floor]>} and denotes their disjoint union (i.e., no duplicates are introduced, and duplicate elimination is not needed).
2. If A:{<Emp:[name, phone] | Mngr:[name, beeper]>} and B:{<Dept:[name, floor] | Mngr:[name, beeper]>} then AUB has type {<Emp:[name, phone] | Mngr:[name, beeper] | Dept: [name, floor]>} and means: take the disjoint union of Emp's from A and Dept's from B, and take the regular union of Mngr's from both A and B. The output type is:
{<Emp:[name, phone] | Mngr:[name, beeper] | Emp [name, phone]>}. We need to do duplicate elimination on Mngr's. If the type of Mngr in B is changed, such that the types of Mngr in A and B do not coincide any more, then A U B is illegal.
3. If both A and B have the same type, say {<Emp:[name, phone] | Mngr: [name, beeper]>}, then A U B is the regular union and the output type is the same.
2.11 Distinct
The distinct operator removes duplicates: distinct: {t} -> {t}
2.12 Aggregates
Aggregate operators are included in the combo operator. The five aggregates in SQL and: count, sum, min, max, avg. Count is treated slightly differently. Let agg be one of sum, min, max, avg, and let b be the base type to which it applies (b can be int, real, or string when agg is min or max, int or real when agg is sum, and real only when agg is avg). Consider a union type: u = <tagi : n I ... I tagn : r„> and let ei : ri — b, ..., e„ : rn — » b be expressions. Then the following is an expression:
agg<tagi/ei | ... tagn/e„> : {u} → b
Expressions can be used in combo operators.
Example. The following example uses the NCR:
Products: {<Indigenous:[name, quantity, category], Imported: [n,q,c]>}
Where n,q,c stand also for name, quantity, category. The computation of the total quantities for all categories is performed in three steps: First, all categories are computed:
Cat = distinct(∑ indigenous → Cat [category - c] | Imported → Cat: [c]>(PrθduCts))
The type is {<Cat:[c]>}. Second, the products are nested by categories:
Groups = Nest <Cat:[Prods:<Indigenous/($l.c=$2.categor ) | Imported/($l.c=S2.c)>(Cat, Products)
The type is {<Cat:[c,Prods:{<Indigenous:[...], Imported: [...]>}]>}.
Third, the sum of all quantities is computed:
Answer = Σ <Cat:[c, total:sum<Indigenoιιs/quantity | Imported/q>(Prods)]>(GrOUps)
The type is {<Cat:[c, total]>}.
3 NCR-QL
The following illustrates how XML-QL could be mapped into the algebra. However, XML-QL works over XML data, not NCRs, so an NCR version is defined of XML-QL ("NCR-QL"), that works on NCRs.
The analogy between XML-QL and NCR-QL is the following:
XML-QL: where-construct
NCR-QL: from-case-where-construct
3.1 Query 1
EmpsDeptsSites is the relation defined earlier containing tuples about Employees, Departments, and Sites. HOwnersCities is a relation containing tuples about HomeOwners and Cities. The following NCR-QL query performs some join between the two:
From EmpsDeptsSites, HOwnersCities
Case (Emp:[name: $X, ssn: $Y, sal: $Z, phone: $U], HOwner:[lastname:$V, ziρ:$W]) : (Where $X=$N AND $Z > 100000
Construct EHO:[name:$X,ssn:Y,zip$W]) I (Dept: [name: $X, loc: $Y, mgr: $Z], HOwner:[lastname:$N, zip:$W]) :
(Where $Z=$N
Construct DHO:[name:$Z, dept:$X, zip:$W] I (Deρt:[name:$X, loc:$Y, mgr:$Z], City:[cityname:$N, place:$W]):
(Where $Y = $W
Construct DC: [name :$X, city: $N])
All combinations of tags from EmpsDeptsSites (defined above) and HOwnersCities (some other NCR) listed in the Case statement are inspected, and in each case a different output tag is produced. The corresponding algebra expression is:
Σp(EmpSites > j Depts) where: j = <Emp,HOwner/(name=lastname AND sal > 100000)→EHO
I Dept,HOwner /(mgr=lastname)— »DHO
I Dept, City / (loc=place) → DC>
p = <EHO:[name,ssn,zip], DHO:[mgr→name, name→dept,zip], DC:[name,cityname— »city]>
3.2 Query 2
This query illustrates patterns over sets and how they are translated into the algebra. It roughly corresponds to the XML-QL pattern:
<products> <product> <name> $n </>
<orders> <order> <date> $d </> </> </product> </products> IN Products where <orders> is a set. The NCR-QL query is more powerful since it handles heterogeneous collections of products and orders.
From Products
Case (SeattleProduct:[name:$n, price:$p, orders:$x]): Construct (From $x Case (order: [customer: $c, date:$d]): Where $p<100
Construct usProductDate: [name:$n,date:$d]) I (ParisProduct: [nome$n,prix:$p,orders:$x]): Construct (From $x Case (euOrder:[countιy:$c,date:$d]): Where $p>35
Construct euProductDate:[name:$n, date:$d] I (usOrder:[city:$c,date:$d]): Construct importProductDate:[name:$n,date:$d])
Here Product has type (base types omitted):
{<SeattleProduct: [name,price,orders: {<order: [customer,date]>}] | ParisProduct: [nome,prix,orders:{<euOrder:[country,date] | importOrder:[city,date]>}]>}
While XML-QL had a single pattern, nested patterns in NCR-QL are needed to match over nested sets. The NCR-QL query is translated into the algebra using a match operator:
Σpm(Products)) where: m = <SeattleProduct:[name:_, price:^ orders:unnest{<order:[customer:_, date:_J>}] I ParisProduct: [nome:_, prix:_, orders: unnest {<euOrder: [country :_, date:_] I usOrder:[city:_, date:_]>}]>
Note that Ωm(Products) has type:
{<SeattleProduct: [name,price,orders:<order: [customer,date]>] |
ParisProduct: [nome,prix,orders:<euOrder:[country,date] | importOrder: [city,date]>]>}
Which is the same type as for Products, with inner braces erased. It normalizes to:
{<SeattleProduct.order:[name, price, orders. customer, orders. date] I ParisProduct. euOrder:[nome, prix, orders. country, orders. date] I ParisProduct.importOrder:[orders.city, orders. date] >}
Hence, p is:
<SeattleProduct.order/price<100 - UsProductDate: [name ->• name, orders.date - date] I ParisProduct. euOrder/prix>35 -» EuProductDate: [name -» name, orders.date -» date] I ParisProductimportOrder -> ImportProductDate: [name -» name, orders.date -> date]>
3.3 Query 3
The nesting in NCR-QL is illustrated using subqueries. From Products
Case (product: [id:$x, name:$n, price:$p]: (Where $p<100 Construct ProductWit Orders: [name:$n, orders: From Orders
Case (order: [pid:$y, quantity:$q, date:$d]): Where $x=$y AND $q>5555 Construct order: [date: $d] ] )
The types are (with base types omitted):
Products: {<product:[id, name, price]>} Orders:{<order:[pid,quantity,date]>} The algebra expression equivalent to Query 3 is:
Σp(Nest„(Products, Orders)) where n is: n = <product: [orders: <order/id=pid>]> and p is:
Figure imgf000061_0001
~ ProductWithOrders): [name, orders:{<order/quantity>5555:[date]>}]>
Notice that Nestn(Products, Orders) has type:
Nestn(Products, Orders) : {<product:[id, name, price, orders: {<order: [pid, quantity, date]>}]>}
3.4 General Form of Queries
A general form of algebra expressions for NCR-QL (and, hence, XML-QL) queries is:
Σp(Nestni (Joini , Nest-_>(.roin2, ... Nest-k-! (Joink-ι , Join ) ...))) where:
Joinj = Ωmll(R„) XJU Ωml2(R12) X ...
Figure imgf000061_0002
Joink = ΩπjdCRk!) tXljki ΩπjaCRjώ) [X] |.jki
4 Algebraic Laws
The following three kinds of algebraic laws are formed:
• Push selections and projections down. Selections and projections are captured by the Combo operator, ∑p, hence laws are needed that commute Combo with other operators
• Join reordering: associativity, commutativity.
• Join-nest associativity.
4.1 Laws that Push Combo (=Selections and Projections) Down
4.1.1 The Combo-Combo Law
Let p : tl — _., and q : t2 — > t3 be two p-formers. Then:
Σqp(R)) = Σr(R) (the combo-combo law) Here r is a new p-former defined as r=ppcompose(q,p), by induction on q first, and, where needed, by induction on p second.
/* q = identity */ ppcompose(_, p) = p / * q = record p-former */ ppcomposefl b^ -»cι):qι, ..., (blfc →ck):qk, ck+1:e1,ck+2:e2,...,cr:er.k], _) = [(b -»cι):qι, ...,
(blk ->c ):qk, ck+ι:eι,ck+2'.e2,...,Cr'.Qτ-k] ppcompose^b^ ->cι):qι, ..., (blk - ck):qk, ck+1:e1,Ck+2:e2,...,cr:er-k],[( ah →bι):pι, ...,
(aJm - bs):ps, bs+1:f1,cs+2:f2,...,bm:fm-s] =
[(aJtι →c1):ppcompose(q1, p^ ), ..., (aJ ( →Ct): ppcompose(qk, plt ),
Ct+ι:fIl+1_s , ct+2: fll+2_i 5..., ck: fIk_. , ck+ι: ej p , ck+2: e2 ° p ,..., cr: er.k ° p ]
/* here (jι, ..., jm} c {1, 2, ..., n} and {i ..., ik} c {1, 2, ..., m}, with i1 < i2 <...< it
Figure imgf000062_0001
p is the second p-former, p = [( a ->bι):pl5 ..., (aJm -»bs):ps, bs+1:f1,cs+2:f2,...,bm:fm. s], and e ° p means "apply the p-former p first, then e" and is defined below */ /* q = variant p-former */ ppcompose((tag'/c' — > tag"):q, _) = (tag/c - tag'):q ppcompose((tag'/c' — tag"):q, (tag/c — tag1): p) = (tag/(c and c'°p)→ tag'):ppcompose(q,p)
/* here c'-p is defined below and means: apply p first, then check c' */ /* q = union p-former */ ppcompose(<qi | ... | qk> , _ ) = <qι | ... | qk> ppcompose(<qi | ... j qk> , <pι | ... | Pm>) = < ... | ppcompose(q;, pj) | ... > /* here each qi is paired with all those pj that have an output tag that matches the input tag in qi : according to the definitions there could be one ore more */ /* q = set p-former */ ppcompose({q}, _) = {q} ppcompose({q}, {p}) = {ppcompose(q,p)}
Composition of a p former with an expression, e ° p , and a p-former with a condition, c=p, are defined. In both cases, the expression (e) and the condition (c) have a single record argument. (i.e., only use $1, which, by convention, may be omitted), while p is a record p-former, p = i( - b1) : p ,...,(alk → bk) : pk,c, : e„...,cp : ep ] .
epcompose($l.bj, p) = $1. a, epcompose($l.Cj, p) = ej epcompose(e op e', p) = epcompose(e,p) op epcompose(e',p) where p is +, -, *, /, ... epcompose(f(e1,e2,...), p) = f(epcompose(el5p), epcompose(e2,p), ...) cpcompose(e op e', p) = epcompose(e, p) op epcompose(e',p) where op is < >, <=, >=,
Figure imgf000062_0002
<tag:[a1:epcompose(e1,p), ..., an: epcompose(en,p)]> IN epcompose(e,p) cpcompose(exists(e), p) = exists(epcompose(e,p)) cpcompose(true, p) = true, cpcompose(false, p) = false cpcompose(cι and C2, p) = cpcompose(c1; p) and cpcompose(c2,p) same for or, not
Example. Let p = <Emp/(sal>50000)→HighEarner:[name-→-richName:_] | Dept/(mgr= "Betsy")^BetsyManages:[name^name:_J | Site/(true)→Site:[name-»name:^city-»city:_J> Define R= ΣP(EmpsDeptsSites)
Let q = <HghEamer/(like(richName,"%Snιith%")- HighEarner:[richName - name: J |
BetsyManages/true - Betsy [name — » name:_] | Site/(city=" Seattle") — >■ HighEarner: [name — name:_J>
And define R' = Σq(R) = ΣqP(EmpsDeptsSites)).
Then R' = Σppcompose(q,p)(EmpsDeptsSites), where: ppcompose(q,p) =
<Emp/(sal>50000 and like(name,"%Smith%")) → HighEarner: [name → name:J |
Dept/(mgr="Betsy" and true) -» Betsy: [name - name: _J |
Site/(true and city=" Seattle") — > HighEarner: [name — » name: _ ]>
4.1.2 Applications of the combo-combo law
1.
Figure imgf000063_0001
h d h i h h b rule. The left hand side becomes:
∑ <Emp/(saI=20000)>(∑ <Dep /(mgr="Smith")>(R)) = = ∑<Emp/(sal=20000) | Dept:_ | Other:_>(∑<Emp:_ | Dept/(mgr=" Smith") | Other: _>(R))
= ∑<Emp/(sal=2000Q) | Dept:/(mgr=" Smith") | Other:_>(R)
The right hand side is similarly:
∑ <Dept/(mgr="Smith")>( ∑ <Emp/(sal=20000)>(R)) = = ∑<Emp:_ I Dept (mgr="Smith") | Other:_>( ∑<Emp/(sal=2000Q) | Dept:_ | Other:_>(R)) = ∑<Emp/(saI=20000) | Dept:/(mgr="Smith") | Other:_>(R)
Hence the two are equal.
2. Commuting inner selections.
∑ <Dept:[project:{<urgent/(deadIine>="10/10/2010") >}]>( <Dept:[project:{<urgcnt/(budget<=10000) >}]>(R)) = = ∑ <Dept:[project:{<urgent/ budget<=10000)>}]>(∑ <Dept:[project:{<urgent/(deadline>="10/l0/2010") >}]>(R))
Indeed, take the left hand side and expand the simple combos into combos, then apply the combo-combo law
∑ <Dept:[project:{<urgent (deadline>="10/l0/2010") >}]>(∑ <Dept:[project:{<urgent/(budget<=100Q0) >}]>(R)) = = ∑<Emp:_ I Dept:[name:_, mgr:_, project: {<urgent/(deadline>="10/10/2010") | normal:_>}]>(∑<Emp:_ | Dept:[name:_, mgr:_, project:{<urgent/(budget<=10000):_ | normal:_>}]>(.-^-J ) ~
= ∑<Emp:_ I Dept:[name:_, mgr: , project:{<urgenf(deadliπe>="10/l 0/2010" and budget<= 10000) | normal._>}]>
The right hand side is treated similarly and results in the same expression, hence they are equal.
3. Pushing predicates to sources. Consider a selection on the predicate (name like "Smith%"). The source supports some predicates, but not this one. Suppose it supports the predicate (name like "%Smith%"). The predicate is: c = (name like "Smith%") the source predicate is: cs = (name like "%Smith%") The optimizer knows the following implication: c => cs which is equivalent to: c = c and cs Hence it can perform the following optimization:
∑ <Emp/(name like "Smith%")>(R) = ∑ <Emp/(c)>(R) ~ ∑ <Emp/(c and cs)>(R)
= ∑ <Emp/(o)>( ∑ <Emp/(os)>(R)) = ∑ <Emp/(name like "Smith%")>( ∑ <Emp/(name like "%Smith%")>(R))
4.1.3 The Combo-Nest Law
Let n :: uι, U2→ u be an n-former and p : : u — • u< be a p-former. Then the following holds:
Σp(Nest„(R, S)) = Σp3(Nest„< Σpl(R), ∑p2(S))) (generic combo-nest law) where pi, p2, p3 are p-formers and n' is an n-former to be described next. The following can identity holds. But, pi and p2 are chosen to do
1 ... tag2m:r2m>.
Figure imgf000064_0001
h f h n = <tagϋ : [b; : <tag 1/cil | ... | tag2m/cim>] | i = 1, ... , n> The p-former p has the form: p = <pι I ■ ■ • I Pk> where: p; = tagljt / c/ → tagtfq., q' : rljt ® [b^ :{u2}] → bool, i = 1, .., k, where (ji, ..., j_} c {l, 2, ..., n}
Combo-nest law 1 : Assume that, for some i=l, ... ,k, the predicate c;' ignores the nested part (formally: $1. bj_ does not occur in q'), and that qi is the identity p-former. That is:
Pi = tag^ / ci' -→tag3i:_
Define: p.- tag^ / c;' → tag1Jt :_ /* i.e., tag^ replaces tag3; in pi */ pi = <Pi'> ρ3 = <pι I ... I pi_ι I tagu_ - tag3i:_ | ρi+1 1 ... ρ„>
The combo-nest law is:
Σp(Nestn(R, S)) = ∑p3(Nest„( Σpl(R), S)) (combo-nest law 1) Here n-n and p2 = _. Combo-nest law 2: Assume that, for every i=l, ... ,k, the predicate c;' only depends on its first argument (i.e., it ignores the nested part, bj ), and that q; leaves all fields unchanged except for bj where it applies a selection, and that, moreover, that selection is the same for all i=l, ...,k. More precisely: qi = [....b →b :r]
where ... contains only identity p-formers, i.e., a- a:_ for attributes a, and where: r= <tag2i/cii"| ... | tag2m/cim"> is a selection that is the same for every i=l, ... ,k. Define:
Pi- tagij, ci' → tagXj_ :_ for i = 1, ..., k
/* i.e., _ replaces qi and tag^ replaces tag3; in p, */ pi = <pι' I ... | p„'> p2 = r ρ3 = < tagiji → tagi3ι :_ | ... | tagljk → tagi3k :_ > The combo-nest law is:
Σp(Nest„(R, S)) = Σp3(Nestn( Σpl(R), ∑p2(S))) (combo-nest law 2)
4.1.4 The Combo-Join Law
Let j :: ul, u2 → u be a j-former and p::u - u' be a p-former. Then the following holds:
ΣP(R IX] j S) = Σp3pl(R)X]j Σp2(S)) (the combo-join law) where pi, p2, p3 are defined below.
Notations:
j = <taglj, tag2j / Cij -» tagij' 1 i = 1, ..., n, j = 1, ..., m > p=
Figure imgf000066_0001
tagij" : q ij I i = 1,...,n, j = 1, ...,m>
Here cy : rn x r2j -> bool is the join condition, while Cy": rn x r2j — bool is a selection predicate. Not all combinations of tags tag;/ must occur in p, but to simplify notations they are included. Assume that Cy'(x,y) = dij(x) AND eij(y) AND fij(x,y). That is dij(x) contains all conditions that only inspect only values from the left join operand, ejj(y) those conditions that only inspect values from the right join operand, while f]j(x,y) contains conditions that inspect values from both operands and cannot be separated. The conditions on the left operand are independent of j, i-e.,: dii(x) = di2(x) = ... = dim(x) = di(x) for i=l , ... , n and similarly: ey(x) = e2j(x) = ... = enj(x) = βj(x) for j=l, ... , m
This can often be achieved, by manipulating boolean conditions, factoring out the common parts and pushing the specific parts into fij(x,y). Then define:
pi = <tagh / di → tagli, ...,tagl„/ d„→ tagl„> p2 = <tag21 / ei -> tag21; .... tag2m/ em - tag2m> p3= <tagij'/fij→ tagij" : q ij I i = 1, ... ,n, j = 1, ... ,m>
4.2 Laws that commute joins and nests
4.2.1 Join Associativity and Commutativity
Join commutativity holds:
RIXJ S =
Figure imgf000066_0002
(join commutativity law)
Join associativity: (RlXJji s)|X]j2T = R[X]j3 (S ] j4T) Goin associativity law) holds for various choices of the j-formers j3 and j4.
In R^XJ ji S only a subset of all pairs of tags need to have join conditions. To simplify notations, however, each join considers all pairs of tags. Hence: jl = <tagπ, tag2j / ciij → tag4ij | i = 1,..., n, j = 1, ..., m> j2 = <tag4ij, tag3k / c2ijk - tagijk | i = 1,..., n, j = 1, ..., m, k = 1,..., ρ>
The condition c2ijk looks at three records: x from R, y from S, and z from T. Decomposing it into two pieces:
C2ijk(x,y,z) = c3ijk(x,y,z) AND c jk(y,z) such that the part inspecting only y and z is the same for all i=l, ... ,n. That is, in order to define c4jk(y,z) we inspect each condition C2rjk(x,y,z), ..., C2nj (x,y,z) and factor out what all of them do in common with y and z. Then define:
j4 = <tag2j, tag3k / c4jk → tag5jk | j=l , ... ,m, k = 1 , ... ,p> j3 = <tagπ, tag5jk / c3ijk→ tagijk | i=l,...,n, j=l,...,m, k=l,...,p>
4.2.2 The Join-Nest Rule
The following holds:
R X! Ji (Nest"i(S, T)) = Nestn2((RX] j2 S), T) (join-nest law) provided that jl looks only at the S-component in Nestnl(S, T). In that case, j2 is "the same" as j 1, except that it does not get to see the nested attribute, which jl did not use anyway. Similarly, n2 is "the same" as nl, except that now it gets to see an R component, which it ignores.
From the above description, it will be appreciated that although the specific embodiments of the technology have been described for purposes of illustration, various modifications may be made without deviating from the scope of the invention. Accordingly, the invention is not limited except by the appended claims.

Claims

CLAIMSI/We claim:
1. A computer-readable medium containing a data structure, the data structure having rows and columns, the data structure comprising: a tag column containing tag values that identify a type of the row for each of the rows of the table, each type identifies the columns of that row; and a column of row that includes a plurality of sub-rows, each sub-row representing data of the column for that row.
2. The computer-readable medium of claim 1 wherein each sub-row has a tag column that identifies a type of the sub-row.
3. The computer-readable medium of claim 2 wherein sub-rows are nested to multiple levels.
4. The computer-readable medium of claim 1 wherein the data structure represents a nested conditional relation.
5. A method in a computer system for generating a sorted outer union of relational tables, the method comprising: providing a plurality of relational tables; for each of the tables, generating a query for the table that results in a column for each of column of each table and that results in a value for each column of each table on a path of joins to the table and null for each column of each table not on a path of joins to the table; executing each of the generated queries to generate results; combing the results of the queries into a single results table; and sorting the results table.
6. The method of claim 5 including wherein in the query is an SQL query.
7. The method of claim 5 wherein the generating of a query includes: for each table in the path, adding the table to a from clause; adding a join to a where clause; and adding each column of the table to a select clause.
8. A method in a computer system for executing a query on data collection in diverse formats, the method comprising: providing a mapping of each data collection format to an XML format; receiving a query for a data collection based on the XML format; generating a native query for the data collection from the received query using the provided mapping; requesting execution of the native query to generate data in native format; converting the data in the native format to a nested conditional relation (NCR) format; and applying operators to the data in the NCR format to generate query results in NCR format; and converting the NCR results into an XML format.
PCT/US2001/024353 2000-08-01 2001-08-01 Nested conditional relations (ncr) model and algebra WO2002027551A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001281023A AU2001281023A1 (en) 2000-08-01 2001-08-01 Nested conditional relations (ncr) model and algebra

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US22207000P 2000-08-01 2000-08-01
US60/222,070 2000-08-01

Publications (1)

Publication Number Publication Date
WO2002027551A2 true WO2002027551A2 (en) 2002-04-04

Family

ID=22830685

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/024353 WO2002027551A2 (en) 2000-08-01 2001-08-01 Nested conditional relations (ncr) model and algebra

Country Status (3)

Country Link
US (1) US7756904B2 (en)
AU (1) AU2001281023A1 (en)
WO (1) WO2002027551A2 (en)

Families Citing this family (61)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AUPO489297A0 (en) * 1997-01-31 1997-02-27 Aunty Abha's Electronic Publishing Pty Ltd A system for electronic publishing
US7293228B1 (en) 1997-01-31 2007-11-06 Timebase Pty Limited Maltweb multi-axis viewing interface and higher level scoping
US7707159B2 (en) * 2000-03-02 2010-04-27 Actuate Corporation Method and apparatus for storing semi-structured data in a structured manner
US7152062B1 (en) 2000-11-21 2006-12-19 Actuate Corporation Technique for encapsulating a query definition
AU2001281023A1 (en) 2000-08-01 2002-04-08 Nimble Technology, Inc. Nested conditional relations (ncr) model and algebra
US7873649B2 (en) * 2000-09-07 2011-01-18 Oracle International Corporation Method and mechanism for identifying transaction on a row of data
US20050086584A1 (en) 2001-07-09 2005-04-21 Microsoft Corporation XSL transform
US7251776B2 (en) * 2001-07-13 2007-07-31 Netview Technologies, Inc. System and method for efficiently and flexibly utilizing spreadsheet information
US20030041305A1 (en) * 2001-07-18 2003-02-27 Christoph Schnelle Resilient data links
US7363310B2 (en) * 2001-09-04 2008-04-22 Timebase Pty Limited Mapping of data from XML to SQL
US7281206B2 (en) * 2001-11-16 2007-10-09 Timebase Pty Limited Maintenance of a markup language document in a database
US7496599B2 (en) * 2002-04-30 2009-02-24 Microsoft Corporation System and method for viewing relational data using a hierarchical schema
US7457810B2 (en) * 2002-05-10 2008-11-25 International Business Machines Corporation Querying markup language data sources using a relational query processor
WO2003107174A1 (en) * 2002-06-13 2003-12-24 Cerisent Corporation Xml database mixed structural-textual classification system
AU2003236543A1 (en) * 2002-06-13 2003-12-31 Mark Logic Corporation A subtree-structured xml database
AU2003245506A1 (en) 2002-06-13 2003-12-31 Mark Logic Corporation Parent-child query indexing for xml databases
US7877399B2 (en) * 2003-08-15 2011-01-25 International Business Machines Corporation Method, system, and computer program product for comparing two computer files
US8037102B2 (en) 2004-02-09 2011-10-11 Robert T. and Virginia T. Jenkins Manipulating sets of hierarchical data
US9646107B2 (en) 2004-05-28 2017-05-09 Robert T. and Virginia T. Jenkins as Trustee of the Jenkins Family Trust Method and/or system for simplifying tree expressions such as for query reduction
US7464375B2 (en) * 2004-06-24 2008-12-09 International Business Machines Corporation Method for flattening hierarchically structured flows
US7593923B1 (en) * 2004-06-29 2009-09-22 Unisys Corporation Functional operations for accessing and/or building interlocking trees datastores to enable their use with applications software
US7882147B2 (en) * 2004-06-30 2011-02-01 Robert T. and Virginia T. Jenkins File location naming hierarchy
US7620632B2 (en) * 2004-06-30 2009-11-17 Skyler Technology, Inc. Method and/or system for performing tree matching
US9171100B2 (en) 2004-09-22 2015-10-27 Primo M. Pettovello MTree an XPath multi-axis structure threaded index
US7801923B2 (en) * 2004-10-29 2010-09-21 Robert T. and Virginia T. Jenkins as Trustees of the Jenkins Family Trust Method and/or system for tagging trees
US7627591B2 (en) 2004-10-29 2009-12-01 Skyler Technology, Inc. Method and/or system for manipulating tree expressions
US7636727B2 (en) 2004-12-06 2009-12-22 Skyler Technology, Inc. Enumeration of trees from finite number of nodes
US7630995B2 (en) 2004-11-30 2009-12-08 Skyler Technology, Inc. Method and/or system for transmitting and/or receiving data
US9390132B1 (en) * 2009-10-16 2016-07-12 Iqor Holdings, Inc. Apparatuses, methods and systems for a universal data librarian
US8195693B2 (en) 2004-12-16 2012-06-05 International Business Machines Corporation Automatic composition of services through semantic attribute matching
US7620641B2 (en) * 2004-12-22 2009-11-17 International Business Machines Corporation System and method for context-sensitive decomposition of XML documents based on schemas with reusable element/attribute declarations
US20060136483A1 (en) * 2004-12-22 2006-06-22 International Business Machines Corporation System and method of decomposition of multiple items into the same table-column pair
US8316059B1 (en) 2004-12-30 2012-11-20 Robert T. and Virginia T. Jenkins Enumeration of rooted partial subtrees
US7475070B2 (en) * 2005-01-14 2009-01-06 International Business Machines Corporation System and method for tree structure indexing that provides at least one constraint sequence to preserve query-equivalence between xml document structure match and subsequence match
US8615530B1 (en) 2005-01-31 2013-12-24 Robert T. and Virginia T. Jenkins as Trustees for the Jenkins Family Trust Method and/or system for tree transformation
US7681177B2 (en) 2005-02-28 2010-03-16 Skyler Technology, Inc. Method and/or system for transforming between trees and strings
US8356040B2 (en) 2005-03-31 2013-01-15 Robert T. and Virginia T. Jenkins Method and/or system for transforming between trees and arrays
US7685150B2 (en) * 2005-04-19 2010-03-23 Oracle International Corporation Optimization of queries over XML views that are based on union all operators
US7899821B1 (en) 2005-04-29 2011-03-01 Karl Schiffmann Manipulation and/or analysis of hierarchical data
US20060253476A1 (en) * 2005-05-09 2006-11-09 Roth Mary A Technique for relationship discovery in schemas using semantic name indexing
US8166059B2 (en) 2005-07-08 2012-04-24 Oracle International Corporation Optimization of queries on a repository based on constraints on how the data is stored in the repository
US20070067371A1 (en) * 2005-09-19 2007-03-22 Sbc Knowledge Ventures, L.P. Database structure and method
US20070067343A1 (en) * 2005-09-21 2007-03-22 International Business Machines Corporation Determining the structure of relations and content of tuples from XML schema components
US7664742B2 (en) 2005-11-14 2010-02-16 Pettovello Primo M Index data structure for a peer-to-peer network
US7529758B2 (en) 2006-02-10 2009-05-05 International Business Machines Corporation Method for pre-processing mapping information for efficient decomposition of XML documents
US7913241B2 (en) * 2006-06-13 2011-03-22 Oracle International Corporation Techniques of optimizing XQuery functions using actual argument type information
US7577642B2 (en) * 2006-07-13 2009-08-18 Oracle International Corporation Techniques of XML query optimization over static and dynamic heterogeneous XML containers
US20080016088A1 (en) * 2006-07-13 2008-01-17 Zhen Hua Liu Techniques of XML query optimization over dynamic heterogeneous XML containers
AR071136A1 (en) * 2008-03-31 2010-05-26 Thomson Reuters Glo Resources SYSTEMS AND METHODS FOR TABLE OF CONTENTS
US7958112B2 (en) 2008-08-08 2011-06-07 Oracle International Corporation Interleaving query transformations for XML indexes
US8631028B1 (en) 2009-10-29 2014-01-14 Primo M. Pettovello XPath query processing improvements
US20120016901A1 (en) * 2010-05-18 2012-01-19 Google Inc. Data Storage and Processing Service
US9171039B2 (en) * 2011-09-29 2015-10-27 Sap Se Query language based on business object model
US9454588B2 (en) * 2012-08-14 2016-09-27 International Business Machines Corporation Custom object-in-memory format in data grid network appliance
WO2014038069A1 (en) * 2012-09-07 2014-03-13 株式会社東芝 Structured document management device, method, and program
US9811579B1 (en) * 2012-11-21 2017-11-07 Christopher A. Olson Document relational mapping
US10929858B1 (en) * 2014-03-14 2021-02-23 Walmart Apollo, Llc Systems and methods for managing customer data
US10333696B2 (en) 2015-01-12 2019-06-25 X-Prime, Inc. Systems and methods for implementing an efficient, scalable homomorphic transformation of encrypted data with minimal data expansion and improved processing efficiency
WO2017031082A1 (en) * 2015-08-14 2017-02-23 California Institute Of Technology Algebraic query language (aql) database management system
US10983966B2 (en) * 2016-04-22 2021-04-20 International Business Machines Corporation Database algebra and compiler with environments
KR101731579B1 (en) * 2016-09-07 2017-05-12 주식회사 비트나인 Database capable of intergrated query processing and data processing method thereof

Family Cites Families (48)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4014004A (en) 1975-08-04 1977-03-22 Harris Corporation Automatic report register
US4947320A (en) * 1988-07-15 1990-08-07 International Business Machines Corporation Method for referential constraint enforcement in a database management system
US5572583A (en) 1992-04-17 1996-11-05 Bell Atlantic Advanced intelligent network with intelligent peripherals interfaced to the integrated services control point
US5421001A (en) * 1992-05-01 1995-05-30 Wang Laboratories, Inc. Computer method and apparatus for a table driven file interface
US5852825A (en) 1994-12-05 1998-12-22 Trimble Navigation Limited Form data message formatting method, program and system
US5913214A (en) 1996-05-30 1999-06-15 Massachusetts Inst Technology Data extraction from world wide web pages
US6052693A (en) * 1996-07-02 2000-04-18 Harlequin Group Plc System for assembling large databases through information extracted from text sources
US6516321B1 (en) * 1996-07-30 2003-02-04 Carlos De La Huerga Method for database address specification
US5826258A (en) 1996-10-02 1998-10-20 Junglee Corporation Method and apparatus for structuring the querying and interpretation of semistructured information
US5970490A (en) 1996-11-05 1999-10-19 Xerox Corporation Integration platform for heterogeneous databases
US5956720A (en) 1997-02-06 1999-09-21 At & T Corp Method and apparatus for web site management
US5909225A (en) * 1997-05-30 1999-06-01 Hewlett-Packard Co. Frame buffer cache for graphics applications
US5983232A (en) * 1997-09-29 1999-11-09 Triada, Ltd. Virtual structured information system
US6269368B1 (en) 1997-10-17 2001-07-31 Textwise Llc Information retrieval using dynamic evidence combination
US6571243B2 (en) 1997-11-21 2003-05-27 Amazon.Com, Inc. Method and apparatus for creating extractors, field information objects and inheritance hierarchies in a framework for retrieving semistructured information
US6076087A (en) 1997-11-26 2000-06-13 At&T Corp Query evaluation on distributed semi-structured data
US6094649A (en) 1997-12-22 2000-07-25 Partnet, Inc. Keyword searches of structured databases
US6016497A (en) * 1997-12-24 2000-01-18 Microsoft Corporation Methods and system for storing and accessing embedded information in object-relational databases
US6012098A (en) 1998-02-23 2000-01-04 International Business Machines Corp. Servlet pairing for isolation of the retrieval and rendering of data
US6356920B1 (en) 1998-03-09 2002-03-12 X-Aware, Inc Dynamic, hierarchical data exchange system
US6493699B2 (en) * 1998-03-27 2002-12-10 International Business Machines Corporation Defining and characterizing an analysis space for precomputed views
US6154738A (en) 1998-03-27 2000-11-28 Call; Charles Gainor Methods and apparatus for disseminating product information via the internet using universal product codes
JPH11296541A (en) * 1998-04-14 1999-10-29 Fujitsu Ltd Structured data management system, and computer-readable recording medium recorded with structured data managing program
US6263332B1 (en) 1998-08-14 2001-07-17 Vignette Corporation System and method for query processing of structured documents
US6308179B1 (en) 1998-08-31 2001-10-23 Xerox Corporation User level controlled mechanism inter-positioned in a read/write path of a property-based document management system
US6487566B1 (en) 1998-10-05 2002-11-26 International Business Machines Corporation Transforming documents using pattern matching and a replacement language
US6678269B1 (en) * 1998-10-05 2004-01-13 Alcatel Network switching device with disparate database formats
US6826553B1 (en) * 1998-12-18 2004-11-30 Knowmadic, Inc. System for providing database functions for multiple internet sources
US6507856B1 (en) 1999-01-05 2003-01-14 International Business Machines Corporation Dynamic business process automation system using XML documents
US6535896B2 (en) 1999-01-29 2003-03-18 International Business Machines Corporation Systems, methods and computer program products for tailoring web page content in hypertext markup language format for display within pervasive computing devices using extensible markup language tools
US6507857B1 (en) 1999-03-12 2003-01-14 Sun Microsystems, Inc. Extending the capabilities of an XSL style sheet to include components for content transformation
US6343287B1 (en) 1999-05-19 2002-01-29 Sun Microsystems, Inc. External data store link for a profile service
US6356906B1 (en) 1999-07-26 2002-03-12 Microsoft Corporation Standard database queries within standard request-response protocols
US6389429B1 (en) * 1999-07-30 2002-05-14 Aprimo, Inc. System and method for generating a target database from one or more source databases
US6601071B1 (en) * 1999-08-04 2003-07-29 Oracle International Corp. Method and system for business to business data interchange using XML
WO2001017286A1 (en) 1999-09-02 2001-03-08 Ho William J Mobile and online information storage and management system
US6754648B1 (en) 1999-09-30 2004-06-22 Software Ag Method for storing and managing data
US6339776B2 (en) 1999-10-04 2002-01-15 International Business Machines Corporation Dynamic semi-structured repository for mining software and software-related information
US6418448B1 (en) * 1999-12-06 2002-07-09 Shyam Sundar Sarkar Method and apparatus for processing markup language specifications for data and metadata used inside multiple related internet documents to navigate, query and manipulate information from a plurality of object relational databases over the web
US6604100B1 (en) 2000-02-09 2003-08-05 At&T Corp. Method for converting relational data into a structured document
US7124144B2 (en) * 2000-03-02 2006-10-17 Actuate Corporation Method and apparatus for storing semi-structured data in a structured manner
US6581062B1 (en) * 2000-03-02 2003-06-17 Nimble Technology, Inc. Method and apparatus for storing semi-structured data in a structured manner
US7152062B1 (en) * 2000-11-21 2006-12-19 Actuate Corporation Technique for encapsulating a query definition
US6449620B1 (en) 2000-03-02 2002-09-10 Nimble Technology, Inc. Method and apparatus for generating information pages using semi-structured data stored in a structured manner
AU2001281023A1 (en) 2000-08-01 2002-04-08 Nimble Technology, Inc. Nested conditional relations (ncr) model and algebra
US6714939B2 (en) * 2001-01-08 2004-03-30 Softface, Inc. Creation of structured data from plain text
US6697818B2 (en) 2001-06-14 2004-02-24 International Business Machines Corporation Methods and apparatus for constructing and implementing a universal extension module for processing objects in a database
US6799184B2 (en) 2001-06-21 2004-09-28 Sybase, Inc. Relational database system providing XML query support

Also Published As

Publication number Publication date
US20020133497A1 (en) 2002-09-19
US7756904B2 (en) 2010-07-13
AU2001281023A1 (en) 2002-04-08

Similar Documents

Publication Publication Date Title
WO2002027551A2 (en) Nested conditional relations (ncr) model and algebra
Melnik et al. Rondo: A programming platform for generic model management
Fegaras et al. Optimizing object queries using an effective calculus
Fagin et al. Clio: Schema mapping creation and data exchange
Melnik Generic model management: concepts and algorithms
Fernandez et al. A semi-monad for semi-structured data (ICDT version)
Atzeni et al. Model-independent schema and data translation
EP1684192A1 (en) Integration platform for heterogeneous information sources
Sur et al. An XQuery-based language for processing updates in XML
Jain et al. Translating XSLT programs to efficient SQL queries
Guravannavar et al. Rewriting procedures for batched bindings
Michel et al. A generic mapping-based query translation from SPARQL to various target database query languages
Koupil et al. A unified representation and transformation of multi-model data using category theory
Holubová et al. Multi-model data modeling and representation: State of the art and research challenges
Kozankiewicz et al. Updatable XML views
Lano et al. Model transformation development using automated requirements analysis, metamodel matching, and transformation by example
Koupil et al. A universal approach for multi-model schema inference
van Emde Boas et al. Storing and evaluating Horn-clause rules in a relational database
Paparizos et al. A physical algebra for XML
Magnani et al. A unified approach to structured and XML data modeling and manipulation
Fisher et al. Algebraic transformation and optimization for XQuery
Song et al. SDTA: An Algebra for Statistical Data Transformation
Yerneni Mediated query processing over autonomous data sources
Omelayenko et al. Knowledge transformation for the semantic web
Manukyan On a Conceptual Data Model with Orientation to Data Integration.

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP