WO2000007123A1

WO2000007123A1 - Methods of deleting information in n-gram tree structures

Info

Publication number: WO2000007123A1
Application number: PCT/US1999/017133
Authority: WO
Inventors: Tao Zhang
Original assignee: Triada, Ltd.
Priority date: 1998-07-28
Filing date: 1999-07-28
Publication date: 2000-02-10
Also published as: AU5239499A

Abstract

Methods of executing delete operations in an NGRAM tree structure are disclosed. One may choose a starting point for a given deletion procedure at the root, a leaf node, or an internal node in the NGRAM tree structure. During the deletion, one or more leaf nodes of the NGRAM tree may be constrained, and the deletion may be propagated up and down the nodes of the tree at different levels.

Description

METHODS OF DELETING INFORMATION IN N-GRAM TREE STRUCTURES

Field of the Invention

This invention relates generally to deleting information values in an NGRAM system, similar to the deletion of records in a relational database. In particular, the invention is directed toward the deletion of information values in both single and joined NGRAM tree structures.

Background of the Invention

In a relational database, deletion of records is carried out in two steps. First, a query is performed with a constraint imposed on one or more fields, after which the records that satisfy the constraint are obtained. Then the obtained records may be deleted by eliminating them from the database.

In an NGRAM system, of the type described in commonly assigned U.S. Patent Nos. 5,245,337; 5,293,164; and 5,592,667, a record is represented by a set of tokens (or indices) stored at all non-leaf nodes, and dictionary data values stored at all leaf nodes. Deletion of a record means a set of deletions and modifications of the involved tokens and dictionaries. In an NGRAM tree structure, one may carry out deletion of records, starting from a leaf node and propagating throughout the NGRAM tree structure, if a constraint is imposed on the leaf node.

For example, to delete records having a given set of values at leaf node L, the deletion operation may start at the leaf node L. If a constraint is imposed on a number of leaf nodes, deletion of records may start at the closest common ancestor node, followed by a propagation of the deletion throughout the NGRAM structure.

In general, there are two different strategies for deletion of records in an NGRAM system. One is to carry out the deletion physically throughout the NGRAM memory structure. The other is to set a number of flags at each node structure, indicating the tokens and dictionaries related to the records being deleted. With this second strategy, no physical deletion is carried out in the memory structure; it is equivalent to a virtual deletion.

The former strategy may lead to a very sophisticated and complex procedure. However, the resulting structure after deletion is clean and conducive to general data manipulation such as performing a query and carrying out data mining or OLAP analyses. On the other hand, the latter deletion strategy may be less demanding, but may be problematic with respect to general data manipulation. This invention is primarily directed toward the physical deletion of records in an NGRAM system.

Summary of the Invention

This invention resides in methods of executing delete operations in an NGRAM system. Various alternative procedures are disclosed, based upon what information properties the system has before the deletion and on what information properties one may wish to preserve in the system after the deletion.

In general, there are two kinds of NGRAM systems that have different information properties. One is an undisturbed system, which corresponds to an original NGRAM system transformed from a raw database. This system has, among other properties, a fundamental NGRAM information property: any two consecutive indices in a look-down (or output lookup) table at any non-leaf node in an NGRAM system satisfy

t_{i+1} - t_i < 2, i = 1, 2, ..., C - 1 (1)

where C is the cardinality of the node, and t_i and t_{i+1} are the ith and (i + 1)th indices. Here, a look-down table is a stream of left or right child tokens whose addresses are the parent node tokens.

A fundamental structure function may be used to describe the relationship between any two pairing parent and child tokens. This function is given by:

t_p = t_c + N_repeat (2)

where t_c is a child token, t_p is the pairing parent token, and N_repeat is the number of repeated child tokens appearing before the child token t_c.
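To make the undisturbed property concrete, the following Python sketch checks a look-down table against equation (1) and against the structure function, read here as t_p = t_c + N_repeat at the first appearance of each child token. It is an illustration only: the function names, the 1-based token numbering, and the plain-list representation of a look-down table are assumptions, not part of the disclosed memory structures.

```python
def satisfies_eq1(look_down):
    """Equation (1): consecutive look-down entries differ by less than 2."""
    return all(b - a < 2 for a, b in zip(look_down, look_down[1:]))


def is_undisturbed(look_down):
    """Check t_p = t_c + N_repeat at the first appearance of every child token.

    look_down[p - 1] is the child token paired with parent token p
    (1-based tokens are an assumption of this sketch).
    """
    seen = set()
    repeats = 0  # N_repeat: repeated child tokens encountered so far
    for parent, child in enumerate(look_down, start=1):
        if child in seen:
            repeats += 1
            continue
        if parent != child + repeats:
            return False
        seen.add(child)
    return True


undisturbed_stream = [1, 1, 1, 2, 3, 4, 2, 3]
reassigned_stream = [1, 4, 5, 2, 6, 3]
```

Under these assumptions, `undisturbed_stream` passes both checks, while `reassigned_stream` fails both, as a disturbed system would.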

Another kind of NGRAM system is a disturbed or reassigned system, which does not preserve the above information property. The relationships between parent and child tokens can be any sequence other than that described by the above structure function. Generally speaking, there may exist an infinite number of sequences, each of which corresponds to a reassigned NGRAM system. Such reassignment is typically used for a specific purpose. Most notably, reassignment is used to conserve memory by reassigning the left and right look-down tables in the hashing sequence; that is, the left look-down table is sorted for left hashing, and the right look-down table is sorted for right hashing.

In an undisturbed NGRAM system, there are two principal procedures for deletion, depending on the resulting structure one may wish to have after a deletion. If an undisturbed NGRAM system is expected after a deletion, a general procedure is employed, which requires reassignment of the memory structures in order to preserve the original information properties. The reassigned deletion procedure leaves no sign of disturbance due to deletion. All the information properties preserved before a deletion are unchanged after the deletion, including the information property described above. This deletion is the most difficult to implement, but is also the most general. If an undisturbed structure need not be retained, a less general, partial procedure may be used, which requires no reassignment of the memory structure.

In a disturbed NGRAM system, only the less general deletion procedure without reassignment of undeleted tokens may be used. The resulting system after a deletion is also a disturbed system.

One may choose a starting point for a given deletion procedure, such as the root node in an NGRAM tree structure, since this is the place where a record number is identified. Once a record number is identified for a record being deleted, the procedure is as follows:

- delete the token at the root node that describes the record being deleted,
- propagate the deletion down to every lower level node, and
- delete all the tokens and dictionary values that describe different parts of the record being deleted.

This procedure is instructive, but less efficient than it could be, because the record numbers for the records being deleted are usually identified by performing a query constrained at one or more leaf nodes. If the constraint is imposed on a single leaf node, two traversals between the leaf node and the root node are necessary for a deletion starting at the root node. However, only one traversal between the two nodes is needed if the deletion starts at the leaf node where the constraint is imposed.

If a constraint is imposed on a number of leaf nodes, deletion may start at the closest common ancestor node of all constrained leaf nodes. The procedure then propagates the deletion up and down in all directions. Such a procedure is disclosed herein, although at least a portion of the procedure may be applied directly to the deletion starting at the root node as well.
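As an illustration of how the starting node might be located, the sketch below finds the closest common ancestor of a set of constrained leaf nodes. The parent-pointer dictionary, the node names, and the function name are all hypothetical; the source does not specify how the tree is traversed.

```python
def closest_common_ancestor(parent, leaves):
    """Return the deepest node that lies on every leaf-to-root path."""
    def path_to_root(node):
        path = [node]
        while parent.get(node) is not None:
            node = parent[node]
            path.append(node)
        return path

    paths = [path_to_root(leaf) for leaf in leaves]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first leaf, the first shared node is the closest one.
    return next(node for node in paths[0] if node in common)


# Hypothetical tree: root -> (n1, n2), n1 -> (a, b), n2 -> (c, n3), n3 -> (d, e)
PARENT = {
    "root": None, "n1": "root", "n2": "root",
    "a": "n1", "b": "n1", "c": "n2", "n3": "n2", "d": "n3", "e": "n3",
}
```

With a single constrained leaf, the function returns that leaf itself, which matches the case where the deletion starts at the constrained leaf node.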

Brief Description of the Drawings

FIGURE 1 shows the main branch and side branches as subtree structures, wherein the main branch starts with a leaf node and ends with the root node, and wherein all dotted circles indicate subtree structures; and

FIGURE 2 shows the main branch and side branches as subtree structures, wherein the main branch starts with a non-leaf node and ends with the root node, and wherein all dotted circles indicate subtree structures.

Detailed Description of the Invention

Before describing the invention in detail, several terms will be defined, as follows:

An information value is a value that can be either a raw data value or a token (index) representing the memory address of a combination of data values.

A complete deletion of a token in a non-leaf node removes the token and the corresponding count from the node memory structure, because the count of deletions is equal to the count of occurrences. A complete deletion of a dictionary value in a leaf node removes the value and the corresponding count from the leaf-node memory structure, since the count of deletions is equal to the count of occurrences.

A partial deletion of a token in a non-leaf node modifies the token and the associated count in the node memory structure, because the count of deletions is less than the count of occurrences.

A partial deletion of a dictionary value in a leaf node modifies the value and the associated count in the leaf-node memory structure, because the count of deletions is less than the count of occurrences.

The main branch of an NGRAM tree structure is the segment between a starting node for deletion and the root node. All leaf and non-leaf nodes on the main branch are main-branch nodes. All other nodes that are child nodes of the main-branch nodes are side-branch nodes.

A hybrid node is a node that stores at least two sets of information values. One set must be a set of tokens and the other set can be either a set of raw data values or a set of tokens of different type.

A foreign node is a node storing values that represent references to the referenced node values. The referenced node values are tokens (indices) that represent the memory addresses of the referenced data values.

A look-down table is a mapping between tokens of a non-leaf node and tokens of its child node, which stores a set of the child tokens in such a sequence that the pairing tokens of the parent node are incremental. The purpose is to obtain the pairing child token for any given parent token. This is a one-to-one operation. A more conventional term for a look-down table is an "output lookup table."

A lookup table is defined as a mapping between tokens of a non-leaf node and tokens of its child node, which stores a set of the parent tokens in a hashing sequence in which all parent tokens pairing with the same child token are stored in the same hashing list. The purpose is to obtain all the pairing parent tokens for a given child token, by looking up the hashing list indexed by the child token. This is a one-to-many operation. A more conventional term for a lookup table is an "input lookup table" or "lookup hash table."
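The two table definitions can be contrasted with a small Python sketch. It assumes, purely for illustration, that a node's child tokens are available as a stream indexed by 1-based parent token; the function names and the dictionary-based storage are not taken from the source.

```python
from collections import defaultdict


def build_look_down(child_stream):
    """Look-down table: parent token -> its pairing child token (one-to-one)."""
    return {parent: child for parent, child in enumerate(child_stream, start=1)}


def build_lookup(child_stream):
    """Lookup table: child token -> hashing list of all pairing parent tokens."""
    lists = defaultdict(list)
    for parent, child in enumerate(child_stream, start=1):
        lists[child].append(parent)
    return dict(lists)


left_stream = [1, 1, 1, 2, 3, 4, 2, 3]
look_down = build_look_down(left_stream)
lookup = build_lookup(left_stream)
```

Here `look_down[4]` answers "which child token pairs with parent token 4", while `lookup[2]` answers "which parent tokens pair with child token 2", which is the direction needed when a deletion is propagated upward.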

Reassignment of a node structure is to reassign the mapping between tokens of a non-leaf node and tokens of its child node, or to reassign the mapping between tokens of a leaf node and dictionaries at the leaf node. Compacting a tree is to save memory storage by reassigning all non-leaf node structures in a specified hashing sequence.

DELETION IN AN N-GRAM TREE STRUCTURE

A deletion operation starts from one end of the main branch where the end node is not the root node, and propagates through all the main-branch nodes as well as all the side-branch nodes and their substructures.

Deletion Starting at a Leaf Node

Suppose a constraint is imposed on a leaf node, say leaf node N_1. The constrained leaf node defines one end of the main branch. The root node is the other end. Assume there are M main-branch nodes N_1, N_2, ..., N_M, where N_2 is the parent node of node N_1, N_3 is the parent node of node N_2, ..., and N_M is the root node and the parent node of node N_{M-1}. The constraint is such that all records that have dictionary values equal to d_1, d_2, ..., d_{m_1} at node N_1 are deleted. We may start the deletion from node N_1.

Deletion at a leaf node N_1 on the main branch:

• Find dictionary values d_1, d_2, ..., d_{m_1} and their addresses t^1_1, t^1_2, ..., t^1_{m_1} in the leaf node memory structure file. Remove the dictionary values d_1, d_2, ..., d_{m_1} from the memory structure file.

• Find the corresponding counts in the count file and remove the counts from the count file.

• Propagate tokens t^1_1, t^1_2, ..., t^1_{m_1} up to the parent node N_2 for deletion.

Set i = 2 and propagate the deletion up to node N_i by following a procedure described below, for deletion at a non-leaf node N_i on the main branch. The propagation stops at the root node and all side-branch leaf nodes.
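The leaf-node step above can be sketched in Python as follows. The sketch assumes the leaf node is held as a list of dictionary values with a parallel list of counts, and that a token is the 1-based position of a value in that list; these representations and the function name are illustrative assumptions, not the disclosed file formats.

```python
def delete_at_leaf(dictionary, counts, values_to_delete):
    """Remove dictionary values and their counts; return the tokens to propagate.

    Returns (new_dictionary, new_counts, deleted_tokens), where deleted_tokens
    are the 1-based addresses that would be passed up to the parent node.
    """
    doomed = set(values_to_delete)
    tokens = [tok for tok, value in enumerate(dictionary, start=1) if value in doomed]
    kept = [(value, count) for value, count in zip(dictionary, counts)
            if value not in doomed]
    new_dictionary = [value for value, _ in kept]
    new_counts = [count for _, count in kept]
    return new_dictionary, new_counts, tokens
```

The returned tokens correspond to the addresses that the procedure propagates to the parent node N_2; reassignment of the surviving tokens is deliberately left out of this sketch.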

Deletion Starting at a Non-Leaf Node

Suppose a constraint is imposed on a number of leaf nodes. Find the closest common ancestor node, N_1, of the constrained leaf nodes. The common ancestor node N_1 defines one end of the main branch. The root node is the other end. Assume there are M main-branch nodes N_1, N_2, ..., N_M, where N_2 is the parent node of node N_1, N_3 is the parent node of node N_2, ..., and N_M is the root node and the parent node of node N_{M-1}. The constraint is such that all records consisting of combinations of data values described by tokens t^1_1, t^1_2, ..., t^1_{m_1} at node N_1 are deleted, where m_1 is the number of tokens being deleted. The tokens being deleted are obtained by performing a query with specified conditions set on the constrained leaf nodes.

1. Deletion at a non-leaf node N_1 on the main branch:

• Store tokens t^1_1, t^1_2, ..., t^1_{m_1} in an array D. Store their counts of deletions, which are equal to the counts of occurrences, in an array C^D. Set arrays R, C^R, and X to NULL. Carry out the deletion in the subtree structure in which node N_1 is the root node, by following a procedure for deletion at a side-branch node, to be described below.

• Propagate tokens t^1_1, t^1_2, ..., t^1_{m_1} up to the parent node N_2 for deletion. Set i = 2 and continue with the following procedure.

2. Deletion at a non-leaf node N_i on the main branch:

• Load a lookup table of the mapping between the tokens of node N_i and the tokens of node N_{i-1} from the memory structure of node N_i. The lookup table is in a hashing form, as a set of hashing lists. The tokens of node N_{i-1} are the list indices of the hashing lists in which the tokens of node N_i are the list elements. Find the t^{i-1}_1 th, t^{i-1}_2 th, ..., t^{i-1}_{m_{i-1}} th lists, which contain the tokens of node N_i to be deleted. Get all the tokens of node N_i from the above lists for deletion. Assume these tokens are t^i_1, t^i_2, ..., t^i_{m_i}, where m_i ≥ m_{i-1}, and their counts of deletions are c^i_1, c^i_2, ..., c^i_{m_i}, equal to the respective counts of occurrences.

• Remove all the left and right child tokens that pair with tokens t^i_1, t^i_2, ..., t^i_{m_i} from the memory structure of node N_i. Remove all the corresponding counts from the count file.

• Assume the sibling node of node N_{i-1} is S_{i-1}. Get all the tokens of the sibling node S_{i-1} that pair with tokens t^i_1, t^i_2, ..., t^i_{m_i}. Eliminate repeated tokens by summing over the counts of deletions for each distinct token. Assume these tokens are s^i_1, s^i_2, ..., s^i_{k_i}, where k_i is the number of distinct tokens. The corresponding counts of deletions are c_1, c_2, ..., c_{k_i}.

• Identify the tokens among s^i_1, s^i_2, ..., s^i_{k_i} for complete deletion. Such a token may be identified by checking whether its count of deletions is equal to its count of occurrences. Store the identified tokens and their counts of deletions in arrays D and C^D.

• The remaining tokens among s^i_1, s^i_2, ..., s^i_{k_i} are for partial deletion, since the count of deletions is less than the corresponding count of occurrences for each of them. Store the tokens for partial deletion and their counts of deletions in arrays R and C^R.

• If an undisturbed NGRAM system is not required after deletion, propagate the deleted tokens in D and R, and their counts of deletions in C^D and C^R, to the sibling node S_{i-1}, by following a procedure for deletion at a side-branch node, to be described below.

• If an undisturbed NGRAM system is required after deletion, reassignment of undeleted tokens is in order:

- Reassign the undeleted tokens for nodes N_i and N_{i-1}, using the following formula:

t* = t - N_d (3)

where t is a given undeleted token and t* is the corresponding token after reassignment. N_d is the number of tokens being deleted that are less than t.

- Identify each token in R for deletion of the first appearance. Store each recognized token for deletion of the first appearance and the largest token appearing before the first undeleted appearance of each recognized token in pairs in an array X. Reassign the undeleted tokens of node S_{i-1} by using the reassignment schemes and formulas for deletion at a side-branch node, to be described below.

- Propagate arrays D, R, X, C^D, and C^R to the sibling node S_{i-1}, by following a procedure for deletion at a side-branch node, to be described below.

• If i = M, stop. Otherwise, set i = i + 1 and repeat the procedure for deletion at a non-leaf node N_i on the main branch.

Deletion at a Side-Branch Node

In general, a side-branch node can be a leaf or non-leaf node. A side-branch node may be considered as the root node of a subtree structure. The subtree structure may be a single-leaf tree structure if the side-branch node is a leaf node. In this case, only the procedure for deletion at a leaf node may be used. Or the subtree structure may be a non-cardinal tree structure if the side-branch node is a non-leaf node. In this case, deletion at a side-branch node is treated as deletion starting at the root node of a tree structure, with the deletion propagated down throughout the tree structure.

In the following, we describe a general procedure for deletion of a set of tokens starting from the root node in a tree or subtree structure. The procedure describes how to carry out deletion in a left-hashing memory structure in an explicit form. The procedure applies to all three different cases:

1) undisturbed NGRAM tree structure before and after deletion;

2) undisturbed NGRAM tree structure before deletion and reassigned NGRAM tree structure after deletion; and

3) reassigned NGRAM tree structure before and after deletion.

In the case of 1), the exact procedure described in this section may be followed. In the cases of 2) and 3), the procedure may be simplified by neglecting reassignment of undeleted tokens. It is also possible that a part of a NGRAM tree structure is reassigned and the rest is not. For example, output tokens in the root node memory structure may be dropped in an undisturbed NGRAM tree structure. In this case, mapping between the root-node output tokens and record numbers is lost, equivalent to reassignment of the root-node tokens. The other internal nodes, however, are not necessarily reassigned.

1. If it is a leaf node, go to deletion at a leaf node. If arrays D, R, X, C^D, and C^R are not all NULLs, go to the next step. Otherwise, identify the root-node tokens for deletion and get the count for each root-node token being deleted, from the count file for the root-node. Store all the root-node tokens for complete deletion and their counts of deletions in two arrays D[n] and C^D[n] respectively, where n is the number of the root-node tokens being deleted. Set arrays R, X, and C^R, to be NULL. In the following, D stores tokens for complete deletion and C^D stores their counts of deletions. Array R stores tokens for partial deletions and C^R stores the counts of deletions. Array X stores every token for partial deletion of the first appearance and the largest undeleted token appearing before the first undeleted appearance of the token being deleted partially.

2. From arrays R and D, get a set of n parent-node tokens being deleted (partially and completely) at a non-leaf node, and find pairs of left- and right-child tokens associated with the parent-node tokens in correspondence. Store the three sets of tokens in a composite buffer B sequentially, in the order of a stream of left-child tokens followed by a stream of right-child tokens followed by a stream of parent-node tokens. Sort the left-child tokens in the composite buffer B in an incremental sequence. Subsort the right-child tokens which pair with the same left-child token.

3. Get unique left-child tokens from B and sum over the counts of deletions for each unique left-child token.

(a) If a left-child token appears once and once only in B, get count of deletions for the left-child token, from the count of deletions for the pairing parent-node token. The count of deletions for the parent-node token is obtained from either C^D (if the parent-node token is in D for complete deletion) or C^R (if the parent-node token is in R for partial deletion).

(b) If a left-child token appears more than once in B, get count of deletions for the left-child token by summing over counts of deletions for all the pairing parent-node tokens, stored in C^D and/or C^R.

Store the unique left-child tokens and their counts of deletions in two arrays T_l[n_l] and C^T_l[n_l] respectively, where n_l is the number of unique left-child tokens in B. For each token in T_l (say T_l[i], i = 0, 1, 2, ..., n_l - 1), compare its count of deletions in C^T_l (C^T_l[i]) and its count of occurrences. The count of occurrences can be obtained either by reading the (T_l[i] + 1)th count in the count file for the left-child node, or by summing over all the counts of occurrences for the pairing parent tokens.

(a) If the count of deletions (or C^T_l[i]) is smaller than the count of occurrences, it is a partial deletion.

i. Store the token and its count of deletions, C^T_l[i], in two arrays R_l[m_l] and C^R_l[m_l] for partial deletion, where m_l is the total number of tokens for partial deletion.

ii. If the first appearance is to be deleted, store the token and the largest token before the first undeleted appearance of the token being partially deleted in array X_l.

(b) If C^T_l[i], the count of deletions, is equal to the count of occurrences, the (T_l[i] + 1)th count in the left-child count file, store the token and its count of deletions, C^T_l[i], in two arrays D_l[k_l] and C^D_l[k_l] for complete deletion, where k_l is the total number of tokens for complete deletion. Here, n_l = m_l + k_l.
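The aggregation and classification in steps 2 and 3 might be sketched as below for one child side. The sketch is a simplification under stated assumptions: the composite buffer B is reduced to a list of (parent token, child token) pairs, counts are held in dictionaries rather than count files, and the function and parameter names are hypothetical.

```python
from collections import Counter


def classify_child_deletions(pairs, deletion_counts, occurrences):
    """Split child tokens into complete and partial deletions.

    pairs: (parent_token, child_token) for each parent token being deleted
    deletion_counts: parent_token -> count of deletions (from C^D or C^R)
    occurrences: child_token -> count of occurrences at the child node
    Returns (complete, partial), each mapping child token -> count of deletions.
    """
    summed = Counter()
    for parent, child in pairs:
        # A child token accumulates the deletion counts of all pairing parents.
        summed[child] += deletion_counts[parent]

    complete, partial = {}, {}
    for child in sorted(summed):
        if summed[child] == occurrences[child]:
            complete[child] = summed[child]
        else:
            partial[child] = summed[child]
    return complete, partial
```

For the left-child stream 1,1,1,2,3,4,2,3, deleting parent tokens 2, 3, and 6 (each with a deletion count of 1) gives a partial deletion of left-child token 1 and a complete deletion of left-child token 4. The bookkeeping for array X is omitted here.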

4. Get unique right-child tokens from B and sum over the counts of deletions for each unique right-child token.

(a) If a right-child token appears once and once only in B, get count of deletions for the right-child token from the count of deletions for the pairing parent-node token. The count for the parent-node token is obtained from either C^D (if the parent-node token is in D for a complete deletion) or C^R (if the parent-node token is in R for a partial deletion).

(b) If a right-child token appears more than once in B, get count of deletions for the right-child token by summing over counts of deletions for all the pairing parent-node tokens, stored in C^D and/or C^R.

Store the unique right-child tokens and their counts of deletions in two arrays T_r[n_r] and C^T_r[n_r] respectively, where n_r is the number of unique right-child tokens in B. For each token in T_r (say T_r[i], i = 0, 1, 2, ..., n_r - 1), compare its count of deletions in C^T_r (C^T_r[i]) and its count of occurrences. The count of occurrences can be obtained either by reading the (T_r[i] + 1)th count in the count file for the right-child node, or by summing over all the counts of occurrences for the pairing parent tokens.

(a) If C^T_r[i], the count of deletions, is smaller than the count of occurrences, it is a partial deletion.

i. Store the token and its count of deletions, C^T_r[i], in two arrays R_r[m_r] and C^R_r[m_r] for partial deletion, where m_r is the total number of tokens for partial deletion.

ii. If the first appearance is to be deleted, store the token and the largest token before the first undeleted appearance of the token being partially deleted in array X_r.

(b) If C^T_r[i] is equal to the (T_r[i] + 1)th count in the count file for the right-child node, store the token and its count of deletions, C^T_r[i], in two arrays D_r[k_r] and C^D_r[k_r] for complete deletion, where k_r is the total number of tokens for complete deletion. Here, n_r = m_r + k_r.

5. Delete tokens in a node memory structure:

(a) Partial deletion:

i. Recall that X stores parent-node tokens b_1, b_2, ..., b_N and e_1, e_2, ..., e_N, where b_1 is the smallest token for deletion of the first appearance, and e_1 is the largest token before the first undeleted appearance of token b_1. In general, b_i is the ith smallest token for deletion of the first appearance, and e_i is the largest token before the first undeleted appearance of token b_i, where i = 1, 2, ..., N. N is the number of tokens for deletion of the first appearance. If N is not zero, do the following:

A. If a parent-node token is not among b_1, b_2, ..., b_N in X, reassign it by

t' = t - (i - j) (4)

where i is the number of tokens among b_1, b_2, ..., b_N which are smaller than t, and j is the number of tokens among e_1, e_2, ..., e_N which are smaller than t.

B. If a parent-node token is among b_1, b_2, ..., b_N in X, let us say b_k, k = 1, 2, ..., N, reassign it by

t' = t + (e_k - t) - (i - 1 - j) = e_k - (i - 1 - j) (5)

where i is the number of tokens among b_1, b_2, ..., b_N which appear before the first undeleted reappearance of t, and j is the number of tokens among e_1, e_2, ..., e_N which appear before the first undeleted reappearance of t.

The above result is shown by an example: assume the streams of pairs are 1,1,1,2,3,4,2,3 and 1,2,3,4,5,2,6,3 for left- and right-child tokens, respectively. Suppose we want to delete the second and third pairs. Since the first appearance of the left-child token 1 is not to be deleted, no reassignment of left-child tokens is needed. The first appearances of the right-child tokens 2 and 3 are deleted, so reassignment of right-child tokens is in order. According to equation (4), the right-child tokens 4, 5, and 6 become 2, 3, and 5 respectively. The right-child tokens 2 and 3 become 4 and 6 respectively, according to equation (5). The new right-child tokens are 1(1), 2(4), 3(5), 4(2), 5(6), 6(3), where tokens in brackets are old ones and the others are the corresponding reassigned tokens.

Another example shows the meaning of the first undeleted reappearance. Deleting the second, third, and sixth tokens among 1,2,3,4,5,2,6,3,2 leads to 1(1), 2(4), 3(5), 4(6), 5(3), 6(2).

ii. Similarly, X_l stores left-child tokens b_1, b_2, ..., b_{N_l} and e_1, e_2, ..., e_{N_l}, where b_1 is the smallest token for deletion of the first appearance, and e_1 is the largest token before the first undeleted appearance of token b_1. In general, b_i is the ith smallest token for deletion of the first appearance, and e_i is the largest token before the first undeleted appearance of token b_i, where i = 1, 2, ..., N_l. N_l is the number of tokens for deletion of the first appearance. If N_l is not zero, do the following:

A. If a left-child token is not among b_1, b_2, ..., b_{N_l} in X_l, reassign it by t' = t - (i - j), where i is the number of tokens among b_1, b_2, ..., b_{N_l} which are smaller than t, and j is the number of tokens among e_1, e_2, ..., e_{N_l} which are smaller than t.

B. If a left-child token is among b_1, b_2, ..., b_{N_l} in X_l, let us say b_k, k = 1, 2, ..., N_l, reassign it by t' = e_k - (i - 1 - j), where i is the number of tokens among b_1, b_2, ..., b_{N_l} which appear before the first undeleted reappearance of t, and j is the number of tokens among e_1, e_2, ..., e_{N_l} which appear before the first undeleted reappearance of t.

iii. Analogously, X_r stores right-child tokens b_1, b_2, ..., b_{N_r} and e_1, e_2, ..., e_{N_r}, where b_1 is the smallest token for deletion of the first appearance, and e_1 is the largest token before the first undeleted appearance of token b_1. In general, b_i is the ith smallest token for deletion of the first appearance, and e_i is the largest token before the first undeleted appearance of token b_i, where i = 1, 2, ..., N_r. N_r is the number of tokens for deletion of the first appearance. If N_r is not zero, do the following:

A. If a right-child token is not among b_1, b_2, ..., b_{N_r} in X_r, reassign it by t' = t - (i - j), where i is the number of tokens among b_1, b_2, ..., b_{N_r} which are smaller than t, and j is the number of tokens among e_1, e_2, ..., e_{N_r} which are smaller than t.

B. If a right-child token is among b_1, b_2, ..., b_{N_r} in X_r, let us say b_k, k = 1, 2, ..., N_r, reassign it by t' = e_k - (i - 1 - j), where i is the number of tokens among b_1, b_2, ..., b_{N_r} which appear before the first undeleted reappearance of t, and j is the number of tokens among e_1, e_2, ..., e_{N_r} which appear before the first undeleted reappearance of t.
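The outcome that the partial-deletion reassignment is meant to produce can be cross-checked with a short Python sketch. It does not implement equations (4) and (5); instead, as a deliberately simpler substitute, it rescans the surviving stream and hands out new tokens in order of first undeleted appearance, which is the ordering an undisturbed system exhibits. The function name and the 1-based positions are assumptions of the sketch.

```python
def renumber_after_deletion(stream, deleted_positions):
    """Renumber child tokens by first undeleted appearance.

    stream: child tokens in parent-token order
    deleted_positions: 1-based positions in the stream that are deleted
    Returns (mapping old_token -> new_token, surviving stream in new tokens).
    """
    deleted = set(deleted_positions)
    survivors = [tok for pos, tok in enumerate(stream, start=1)
                 if pos not in deleted]
    mapping = {}
    for tok in survivors:
        if tok not in mapping:
            mapping[tok] = len(mapping) + 1
    return mapping, [mapping[tok] for tok in survivors]
```

Applied to the right-child stream 1,2,3,4,5,2,6,3 with the second and third pairs deleted, the mapping sends old tokens 4, 5, 6 to 2, 3, 5 and old tokens 2, 3 to 4, 6, in agreement with the worked example in the text. A rescan like this touches the whole stream, whereas the formulas in the text work from the arrays X alone.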

(b) Complete deletion:

Reassignment of tokens after partial deletion should be taken into account.

i. Get the left-hashing lists whose list indices appear in R_l. Remove the list elements that store parent-node tokens appearing in D and the corresponding right-child tokens.

ii. Remove the hashing lists whose list indices appear in D_l (all the output tokens in these lists must appear in D).

iii. Sort the left-child tokens in D_l. Reassign left-child tokens by t' = t - (i + 1) if D_l[i] < t < D_l[i + 1], where i = 0, 1, 2, ..., k_l - 2. For those left-child tokens that are larger than D_l[k_l - 1], perform t' = t - k_l. Here, t is a left-child token before the deletion and t' is the reassigned left-child token.

iv. Sort the right-child tokens in D_r. Reassign right-child tokens by t' = t - (i + 1) if D_r[i] < t < D_r[i + 1], where i = 0, 1, 2, ..., k_r - 2. For those right-child tokens that are larger than D_r[k_r - 1], perform t' = t - k_r. Here, t is a right-child token before the deletion and t' is the reassigned right-child token.

v. Sort the parent-node tokens in D. Reassign parent-node tokens by t' = t - (i + 1) if D[i] < t < D[i + 1], where i = 0, 1, 2, ..., k - 2, and k is the total number of tokens in D. For those parent-node tokens that are larger than D[k - 1], perform t' = t - k. Here, t is a parent-node token before the deletion and t' is the reassigned parent-node token.

6. Deletion in a count file:

(a) Reassign counts of occurrences for tokens appearing in R by C' = C - C^R[i], i = 0, 1, 2, ..., m - 1, where m is the total number of tokens in R. C is a count of occurrences, stored in the count file for a parent-node token appearing in R, C^R[i] is the count of deletions for the same token, and C' is the reassigned count.

(b) Remove counts of occurrences from the count file for parent-node tokens appearing in D, since the count of deletions is equal to the count of occurrences for each token in D. Reassignment of positions for counts due to reassignment of parent-node tokens should be taken into account.

7. Deletion in a list count file:

(a) Reassign the list count for each of the lists whose list indices appear in R_l by C' = C - C'', where C'' is the total number of parent-node tokens in the list that appear in D. C is the list count stored in the list count file for a partially deleted list. C' is the reassigned list count.

(b) Remove list counts from the list count file for lists whose list indices appear in D_l.

Reassignment of positions for counts due to reassignment of left-child tokens should be taken into account.
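The shift applied after complete deletion, and the matching update of a count file, can be illustrated as follows. The sketch assumes 1-based tokens, a count file held as a Python list indexed by token, and dictionary or set inputs for the arrays R and D; all names are hypothetical. The shift is written as "subtract the number of deleted tokens smaller than t", which is what t' = t - (i + 1) and t' = t - k amount to for a sorted array D.

```python
from bisect import bisect_left


def shift_token(token, deleted_sorted):
    """t' = t - (number of completely deleted tokens smaller than t)."""
    return token - bisect_left(deleted_sorted, token)


def update_count_file(counts, complete, partial):
    """Apply steps 6(a) and 6(b) to a count file.

    counts: list where counts[t - 1] is the count of occurrences of token t
    complete: set of tokens in D (removed together with their counts)
    partial: token -> count of deletions, as in R and C^R
    """
    new_counts = []
    for token, count in enumerate(counts, start=1):
        if token in complete:
            continue  # 6(b): count removed, later counts move up one position
        new_counts.append(count - partial.get(token, 0))  # 6(a): C' = C - C^R
    return new_counts
```

Because removed counts are simply skipped, the position of each surviving count in the new list agrees with `shift_token` applied to its old token, which is the consistency the text asks for between reassigned tokens and count positions.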

8. Set R = R_l, C^R = C^R_l, D = D_l, C^D = C^D_l, X = X_l, and n = n_l. If the left-child node is an internal node, propagate the deletion down to the left-child node, and repeat the above procedure for the left-child node. If the left-child node is a leaf node, proceed to deletion at a leaf node.

9. Set R = R_r, C^R = C^R_r, D = D_r, C^D = C^D_r, X = X_r, and n = n_r. If the right-child node is an internal node, propagate the deletion down to the right-child node, and repeat the above procedure for the right-child node. If the right-child node is a leaf node, proceed to deletion at a leaf node.

10. Deletion at a leaf node:

(a) For each leaf token appearing in R, reassign its count in the count file by C' = C - C^R[i], i = 0, 1, 2, ..., m - 1, where C represents the original count of occurrences for a leaf token, stored in the count file. C^R[i] represents the count of deletions for the same leaf token being deleted. C' is the reassigned count. Now, X stores leaf-node tokens b_1, b_2, ..., b_N and e_1, e_2, ..., e_N, where b_1 is the smallest token for deletion of the first appearance, and e_1 is the largest token before the first undeleted appearance of token b_1. In general, b_i is the ith smallest token for deletion of the first appearance, and e_i is the largest token before the first undeleted appearance of token b_i, where i = 1, 2, ..., N. N is the number of tokens for deletion of the first appearance. If N is not zero, do the following:

i. If a leaf token is not among b_1, b_2, ..., b_N in X, reassign it by t' = t - (i - j), where i is the number of tokens among b_1, b_2, ..., b_N which are smaller than t, and j is the number of tokens among e_1, e_2, ..., e_N which are smaller than t.

ii. If a leaf token is among b_1, b_2, ..., b_N in X, let us say b_k, k = 1, 2, ..., N, reassign it by t' = e_k - (i - 1 - j), where i is the number of tokens among b_1, b_2, ..., b_N which appear before the first undeleted reappearance of t, and j is the number of tokens among e_1, e_2, ..., e_N which appear before the first undeleted reappearance of t.

(b) For each leaf token appearing in D, remove its count of occurrences from the count file, and delete the dictionary it represents. Relocations of the dictionaries indexed by the reassigned leaf tokens should be done accordingly. Reassigned positions of counts in the count file should also be consistent with the reassignment of the corresponding tokens after deletion.

Figure 1 shows the main branch and side branches as subtree structures. The main branch starts with a leaf node and ends with the root node. All dotted circles indicate subtree structures.

Figure 2 shows the main branch and side branches as subtree structures. The main branch starts with a non-leaf node and ends with the root node. All dotted circles indicate subtree structures.
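The count handling in the leaf-node deletion step can be illustrated with a minimal Python sketch. It is not the procedure of the specification: it assumes a hypothetical in-memory layout in which the dictionary and the count file are lists indexed by leaf token, it merges the sets R and D into a single mapping of deletion counts, and it replaces the b/e-based token reassignment rule with a simple dense renumbering of the surviving tokens. The names delete_at_leaf, dictionary, counts, deletions and remap are illustrative only.

```python
def delete_at_leaf(dictionary, counts, deletions):
    """Apply per-token deletion counts at a leaf node (illustrative sketch).

    dictionary -- list of data values, indexed by leaf token (assumed layout)
    counts     -- list of occurrence counts, indexed by leaf token
    deletions  -- dict mapping leaf token -> number of occurrences to delete

    Returns the new dictionary, the new count list, and a map from each
    surviving old token to its reassigned token.
    """
    new_dictionary, new_counts, remap = [], [], {}
    for token, (value, count) in enumerate(zip(dictionary, counts)):
        removed = deletions.get(token, 0)
        if removed > count:
            raise ValueError(f"token {token}: cannot delete {removed} of {count}")
        if removed and removed == count:
            # Every occurrence is deleted: drop the count and the dictionary entry.
            continue
        # Token survives: C' = C - (count of deletions), stored at its new position.
        remap[token] = len(new_dictionary)
        new_dictionary.append(value)
        new_counts.append(count - removed)
    return new_dictionary, new_counts, remap


# Hypothetical data: four leaf tokens, two of which lose all occurrences.
values, occurrences, remap = delete_at_leaf(
    ["red", "green", "blue", "cyan"], [3, 1, 2, 5], {0: 1, 1: 1, 2: 2}
)
```

Returning the old-to-new token map keeps the sketch consistent with the requirement that count positions and dictionary locations follow the reassigned tokens; a parent node would use the same map to rewrite its references to this leaf.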

Deletion in Join NGRAM Tree Structures

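Previous sections describe how to perform deletions in a single NGRAM tree structure. In join NGRAM tree structures, the descriptions apply without modification unless the tokens being deleted are referenced by tokens at a foreign node. When they are so referenced, the deletion may be handled, beyond trivial rejection, in one of the following ways: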

(1) Change the foreign node, whose tokens reference tokens being deleted, to a hybrid node. The foreign-node tokens that reference the tokens being deleted are replaced by the data values that are represented by the referenced tokens;

(2) Propagate the deletion of the referenced tokens to the NGRAM tree structure where the foreign node is located. To do this, we get all the foreign-node tokens that reference the deleted tokens and propagate them up to the root node of the tree structure where the foreign node is located. Then, we perform deletions of the resulting root-node tokens, starting at the root node; or

(3) Reassign the foreign-node tokens that reference the deleted tokens to some other values. This involves updating operations.
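Option (1) can be sketched in Python under simplifying assumptions. The representation below is hypothetical and not taken from the specification: a foreign node is modeled as a list of entries, each entry is either a reference to a token in the other tree (Ref) or an inline data value (Value), and the referenced tokens are assumed to map directly to single data values through a plain dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Foreign-node entry that references a token in another tree (assumed model)."""
    token: int


@dataclass(frozen=True)
class Value:
    """Hybrid-node entry that stores the data value inline (assumed model)."""
    data: str


def to_hybrid(foreign_entries, referenced_values, deleted_tokens):
    """Replace references to tokens being deleted by the values they represent.

    Must be called before the referenced tokens are actually removed, since
    the values are looked up through them.
    """
    hybrid = []
    for entry in foreign_entries:
        if isinstance(entry, Ref) and entry.token in deleted_tokens:
            hybrid.append(Value(referenced_values[entry.token]))
        else:
            # References to surviving tokens, and existing inline values, are kept.
            hybrid.append(entry)
    return hybrid


# Hypothetical data: token 0 of the referenced tree is about to be deleted.
node = to_hybrid([Ref(0), Ref(1), Ref(0)], {0: "Detroit", 1: "Ann Arbor"}, {0})
```

After the conversion the node holds a mix of references and inline values, which is what makes it a hybrid node in this sketch. The same membership test on deleted_tokens would also identify the foreign-node tokens to propagate up to the root under option (2).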

I claim:

Claims

1. A method of deleting a record in an NGRAM tree structure having a plurality of nodes at different levels containing token and dictionary values, the method comprising the steps of: choosing a starting point containing information identifying the record to be deleted; deleting a token at the root node which describes the record; propagating the deletion down to every lower-level node in the tree structure; and at each lower-level node, deleting all the tokens and dictionary values associated with the record being deleted.

2. The method of claim 1, wherein the starting point is the root node of the tree structure.

3. The method of claim 1, wherein the NGRAM tree structure represents a disturbed NGRAM system, and wherein the tree structure after deletion also represents a disturbed system.

4. The method of claim 1, wherein the NGRAM tree structure represents an undisturbed NGRAM system, and wherein the method further includes the step of reassigning undeleted tokens such that the tree structure after deletion is also an undisturbed system.

5. The method of claim 1, wherein the step of choosing a starting point containing information identifying the record to be deleted includes the step of identifying a record to be deleted by performing a query constrained at one or more leaf nodes.

6. The method of claim 5, wherein the constraint is imposed on a single leaf node, and wherein the method further includes the step of traversing between the leaf node and the root node as necessary for deletion starting at the root node.

7. The method of claim 5, wherein the constraint is imposed on a plurality of leaf nodes, and wherein the method further includes the step of traversing between the nodes as needed if the deletion starts at the leaf node where the constraint is imposed.

8. The method of claim 5, wherein the constraint is imposed on a plurality of leaf nodes, and wherein the method further includes the steps of: initiating the deletion at the closest common ancestor node of all constrained leaf nodes; and propagating the deletion up and down the tree structure.

9. A method of performing a delete operation in an NGRAM tree structure having a main branch, one or more side branches, and a plurality of leaf nodes, comprising the steps of: imposing a constraint on one or more of the leaf nodes; and initiating the deletion operation with respect to the constrained node.

10. The method of claim 9, wherein only a single leaf node is constrained, and wherein the deletion operation is initiated with respect to this node.

11. The method of claim 9, wherein a plurality of leaf nodes are constrained, and wherein the deletion operation is initiated at the closest common ancestor node of all constrained leaf nodes.

12. The method of claim 9, further including the step of propagating the deletions throughout the main branch and side branches of the tree structure.

13. The method of claim 9, further including the step of deleting tokens with respect to a non-leaf node on the main branch by: removing all the tokens being deleted from the node structure and from the parent node structure, except for the root node; and removing all the corresponding counts of occurrences from the node count file.

14. The method of claim 9, further including the step of deleting dictionary values for a leaf node on the main branch by: removing all the dictionaries being deleted from the leaf-node structure; and removing the corresponding leaf tokens from its parent node structure.

15. The method of claim 9, further including the step of deleting tokens for a non-leaf node on a side branch by: removing tokens whose counts of deletions are equal to the counts of occurrences from the node structure and from its parent node structure; and removing the corresponding counts of occurrences from the count file of the node.

16. The method of claim 9, further including the step of deleting tokens for a non-leaf node on a side branch by: modifying tokens whose counts of deletions are less than the counts of occurrences by subtracting the counts of deletions from the corresponding counts of occurrences respectively.

17. The method of claim 9, further including the step of deleting a dictionary for a leaf node on a side branch by: removing dictionaries whose counts of deletions are equal to the counts of occurrences from the leaf node structure; removing the corresponding tokens from its parent node structure; and removing the corresponding counts of occurrences from the count file of the leaf node.

18. The method of claim 9, further including the step of deleting a dictionary for a leaf node on a side branch by: modifying dictionaries whose counts of deletions are less than the counts of occurrences by subtracting the counts of deletions from the corresponding counts of occurrences respectively.

19. The method of claim 9, further including the step of reassigning undeleted and modified tokens and dictionaries in order to preserve complete information properties if such properties were preserved prior to the deletion operation.

20. A method of performing delete operations in a join NGRAM tree structure, comprising the steps of: changing a foreign node to a hybrid node; and a) replacing the foreign node tokens that reference the tokens being deleted by data values described by the referenced tokens; or b) propagating the deletion of the referenced tokens through the NGRAM tree structure where the foreign node is located.