METHODS OF DELETING INFORMATION IN N GRAM TREE STRUCTURES
Field of the Invention
This invention relates generally to deleting information values in an NGRAM system, similar to the deletion of records in a relational database. In particular, the invention is directed toward the deletion of information values in both single and joined NGRAM tree structures.
Background of the Invention
In a relational database, deletion of records is carried out in two steps. First, a query is performed with a constraint imposed on one or more fields, after which the records that satisfy the constraint are obtained. Then the obtained records may be deleted by eliminating them from the database. In an NGRAM system, of the type described in commonly assigned U.S. Patent Nos. 5,245,337; 5,293,164; and 5,592,667, a record is represented by a set of tokens (or indices) stored at all non-leaf nodes, and dictionary data values stored at all leaf nodes. Deletion of a record means a set of deletions and modifications of the involved tokens and dictionaries. In an NGRAM tree structure, one may carry out deletion of records, starting from a leaf node and propagating throughout the NGRAM tree structure, if a constraint is imposed on the leaf node.
For example, to delete records having a given set of values at leaf node L, the deletion operation may start at the leaf node L. If a constraint is imposed on a number of leaf nodes, deletion of records may start at the closest common ancestor node, followed by a propagation of the deletion throughout the NGRAM structure.
In general, there are two different strategies for deletion of records in an NGRAM system. One is to carry out the deletion physically throughout the NGRAM memory structure. The other is to set a number of flags at each node structure, indicating the tokens and dictionaries related to the records being deleted. In the latter case, no physical deletion is carried out in the memory structure, which is equivalent to virtual deletion.
The former strategy may lead to a very sophisticated and complex procedure. However, the resulting structure after deletion is clean and conducive to general data manipulation such as performing a query and carrying out data mining or OLAP analyses. On the other hand, the latter deletion strategy may be less demanding, but may be problematic with respect to general data manipulation. This invention is primarily directed toward the physical deletion of records in an NGRAM system.
Summary of the Invention
This invention resides in methods of executing delete operations in an NGRAM system. Various alternative procedures are disclosed, based upon what information properties the system has before the deletion and on what information properties one may wish to preserve in the system after the deletion.
In general, there are two kinds of NGRAM systems that have different information properties. One is an undisturbed system, which corresponds to an original NGRAM system transformed from a raw database. This system has, among other properties, a fundamental NGRAM information property: two consecutive indices in a look-down (or output lookup) table at any non-leaf node in an NGRAM system satisfy:
$$t_{i+1} - t_i < 2, \quad i = 1, 2, \ldots, C - 1 \qquad (1)$$
where C is the cardinality of the node, and $t_i$ and $t_{i+1}$ are the ith and (i+1)th indices. Here, a look-down table is a stream of left or right child tokens whose addresses are the parent node tokens.
A fundamental structure function may be used to describe the relationship between any two pairing parent and child tokens. This function is given by:
$$t_p = t_c + N_{repeat} \qquad (2)$$
where $t_c$ is a child token, $t_p$ is the pairing parent token, and $N_{repeat}$ is the number of repeated child tokens appearing before the child token $t_c$.
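As an illustration, the structure function can be tested against a look-down table. This sketch reads equation (2) as $t_p = t_c + N_{repeat}$, applied at the first appearance of each child token, with 1-based tokens and the table held as a Python list. Both the reading and the function name `satisfies_structure_function` are assumptions made for this example.

```python
def satisfies_structure_function(look_down):
    """Check t_p == t_c + N_repeat at the first appearance of each child token.

    t_p is the 1-based position (parent token) in the look-down table,
    t_c is the child token stored there, and N_repeat counts the earlier
    entries that repeat an already-seen child token.
    """
    seen = set()
    n_repeat = 0
    for t_p, t_c in enumerate(look_down, start=1):
        if t_c in seen:
            n_repeat += 1          # a repeated child token
        elif t_p != t_c + n_repeat:
            return False           # first appearance violates the relation
        else:
            seen.add(t_c)
    return True


print(satisfies_structure_function([1, 1, 1, 2, 3, 4, 2, 3]))  # True
print(satisfies_structure_function([1, 4, 5, 2, 6, 3]))        # False
```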
Another kind of NGRAM system is a disturbed or reassigned system, which does not preserve the above information property. The relationships between parent and child tokens can be any sequence other than that described by the above structure function. Generally speaking, there may exist an infinite number of sequences, each of which corresponds to a reassigned NGRAM system. Such reassignment is typically used for a specific purpose. Most notably, reassignment is used to conserve memory by reassigning the left and right look-down tables in the hashing sequence; that is, the left look-down table is sorted for left hashing, and the right look-down table is sorted for right hashing.
In an undisturbed NGRAM system, there are two principal procedures for deletion, depending on the resulting structure one may wish to have after a deletion. If an undisturbed NGRAM system is expected after a deletion, a general procedure is employed, which requires reassignment of the memory structures in order to preserve the original information properties. The reassigned deletion procedure leaves no sign of disturbance due to deletion. All the information properties preserved before a deletion are unchanged after the deletion, including the information property described above. This deletion is the most difficult to implement, but is also the most general. If it is not necessary to retain an undisturbed structure, a less general, partial procedure may be used, which requires no reassignment of the memory structure.
In a disturbed NGRAM system, only the less general deletion procedure without reassignment of undeleted tokens may be used. The resulting system after a deletion is also a disturbed system.
One may choose a starting point for a given deletion procedure, such as the root node in an NGRAM tree structure, since this is the place where a record number is identified. Once a record number is identified for a record being deleted, the procedure is as follows:
- delete the token at the root node that describes the record being deleted,
- propagate the deletion down to every lower level node, and
- delete all the tokens and dictionary values that describe different parts of the record being deleted.
This procedure is instructive, but less efficient than it could be, because the record numbers for the records being deleted are usually identified by performing a query constrained at one or more leaf nodes. If the constraint is imposed on a single leaf node, deletion starting at the root node requires two traversals between the leaf node and the root node. However, only one traversal between the two nodes is needed if the deletion starts at the leaf node where the constraint is imposed.
If a constraint is imposed on a number of leaf nodes, deletion may start at the closest common ancestor node of all constrained leaf nodes. The procedure then propagates the deletion up and down in all directions. Such a procedure is disclosed herein, although at least a portion of the procedure may be applied directly to the deletion starting at the root node as well.
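Locating the closest common ancestor of the constrained leaf nodes can be sketched as follows. This is a minimal Python illustration, assuming the tree is available as a mapping from each node to its parent; the node labels, the `parent` mapping, and the function name are hypothetical.

```python
def closest_common_ancestor(parent, leaves):
    """Return the deepest node that lies on the root path of every leaf.

    parent maps each node to its parent; the root maps to None.
    """
    def root_path(node):
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path[::-1]              # root first

    common = None
    for level in zip(*(root_path(leaf) for leaf in leaves)):
        if all(node == level[0] for node in level):
            common = level[0]          # still on a shared path
        else:
            break                      # the paths diverge here
    return common


# Hypothetical tree: root -> (A, B), A -> (L1, L2), B -> (L3, L4)
parent = {"root": None, "A": "root", "B": "root",
          "L1": "A", "L2": "A", "L3": "B", "L4": "B"}

print(closest_common_ancestor(parent, ["L1", "L2"]))  # A
print(closest_common_ancestor(parent, ["L1", "L3"]))  # root
print(closest_common_ancestor(parent, ["L1"]))        # L1
```

With a single constrained leaf the function returns the leaf itself, which matches the case where deletion starts at the leaf node.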
Brief Description of the Drawings
FIGURE 1 shows the main branch and side branches as subtree structures, wherein the main branch starts with a leaf node and ends with the root node, and wherein all dotted circles indicate subtree structures; and
FIGURE 2 shows the main branch and side branches as subtree structures, wherein the main branch starts with a non-leaf node and ends with the root node, and wherein all dotted circles indicate subtree structures.
Detailed Description of the Invention
Before describing the invention in detail, several terms will be defined, as follows:
An information value is a value that can be either a raw data value or a token (index) representing the memory address of a combination of data values.
A complete deletion of a token in a non-leaf node removes the token and the corresponding count from the node memory structure, because the count of deletions is equal to the count of occurrences. A complete deletion of a dictionary value in a leaf node removes the value and the corresponding count from the leaf-node memory structure, since the count of deletions is equal to the count of occurrences.
A partial deletion of a token in a non-leaf node modifies the token and the associated count in the node memory structure, because the count of deletions is less than the count of occurrences.
A partial deletion of a dictionary value in a leaf node modifies the value and the associated count in the leaf-node memory structure, because the count of deletions is less than the count of occurrences.
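The distinction between complete and partial deletion reduces to a comparison of two counts. The sketch below assumes the counts are held in Python dictionaries keyed by token (or by dictionary value for a leaf node); the function name `classify_deletions` and the sample counts are illustrative only.

```python
def classify_deletions(occurrences, deletions):
    """Split tokens into complete and partial deletions.

    occurrences: token -> count of occurrences stored for the node
    deletions:   token -> count of deletions requested
    Returns two dicts (complete, partial), each token -> count of deletions.
    """
    complete, partial = {}, {}
    for token, n_del in deletions.items():
        n_occ = occurrences[token]
        if n_del > n_occ:
            raise ValueError(f"token {token}: more deletions than occurrences")
        if n_del == n_occ:
            complete[token] = n_del    # token and count are removed
        else:
            partial[token] = n_del     # count is reduced, token remains
    return complete, partial


occurrences = {1: 3, 2: 2, 3: 1}
deletions = {1: 1, 2: 2}
complete, partial = classify_deletions(occurrences, deletions)
print(complete)  # {2: 2}
print(partial)   # {1: 1}
```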
The main branch of an NGRAM tree structure is the segment between a starting node for deletion and the root node. All leaf and non-leaf nodes on the main branch are main-branch nodes. All other nodes that are child nodes of the main-branch nodes are side-branch nodes.
A hybrid node is a node that stores at least two sets of information values. One set must be a set of tokens and the other set can be either a set of raw data values or a set of tokens of different type.
A foreign node is a node storing values that represent references to the referenced node values. The referenced node values are tokens (indices) that represent the memory addresses of the referenced data values.
A look-down table is a mapper between tokens of a non-leaf node and tokens of its child node, which stores a set of the child tokens in such a sequence that the pairing tokens of the parent node are incremental. The purpose is to obtain the pairing child token for any given parent token. This is a one-to-one operation. A more conventional term for a look-down table is an "output lookup table."
A lookup table is defined as a mapping between tokens of a non-leaf node and tokens of its child node, which stores a set of the parent tokens in a hashing sequence in which all parent tokens pairing with the same child token are stored in the same hashing list. The purpose is to obtain all the pairing parent tokens for a given child token, by looking up the hashing list indexed by the child token. This is a one-to-many operation. A more conventional term for a lookup table is an "input lookup table" or "lookup hash table."
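The relationship between the two tables can be illustrated by deriving a lookup table from a look-down table. This is a sketch only: it assumes 1-based parent tokens given by list position, and it uses a Python dictionary of lists in place of the hashing lists. The function name `build_lookup` is hypothetical.

```python
from collections import defaultdict


def build_lookup(look_down):
    """Invert a look-down table (parent -> child) into a lookup table
    (child -> list of pairing parent tokens)."""
    lookup = defaultdict(list)
    for parent_token, child_token in enumerate(look_down, start=1):
        lookup[child_token].append(parent_token)
    return dict(lookup)


left_look_down = [1, 1, 1, 2, 3, 4, 2, 3]
lookup = build_lookup(left_look_down)

print(left_look_down[4 - 1])  # one-to-one: parent token 4 -> child token 2
print(lookup[2])              # one-to-many: child token 2 -> parents [4, 7]
```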
Reassignment of a node structure is to reassign the mapping between tokens of a non-leaf node and tokens of its child node, or to reassign the mapping between tokens of a leaf node and dictionaries at the leaf node. Compacting a tree is to save memory storage by reassigning all non-leaf node structures in a specified hashing sequence.
DELETION IN AN N-GRAM TREE STRUCTURE
A deletion operation starts from one end of the main branch where the end node is not the root node, and propagates through all the main-branch nodes as well as all the side-branch nodes and their substructures.
Deletion Starting at a Leaf Node
Suppose a constraint is imposed on a leaf node, say leaf node $N_1$. The constrained leaf node defines one end of the main branch. The root node is the other end. Assume there are M main-branch nodes $N_1, N_2, \ldots, N_M$, where $N_2$ is the parent node of node $N_1$, $N_3$ is the parent node of node $N_2$, ..., and $N_M$ is the root node and the parent node of node $N_{M-1}$. The constraint is such that all records that have dictionary values equal to $d_1, d_2, \ldots, d_{m_1}$ at node $N_1$ are deleted. We may start the deletion from node $N_1$.
Deletion at a leaf node $N_1$ on the main branch:
• Find dictionary values $d_1, d_2, \ldots, d_{m_1}$ and their addresses $t^1_1, t^1_2, \ldots, t^1_{m_1}$ (the superscript is the node index) in the leaf node memory structure file. Remove the dictionary values $d_1, d_2, \ldots, d_{m_1}$ from the memory structure file.
• Find the corresponding counts in the count file and remove the counts from the count file.
• Propagate tokens $t^1_1, t^1_2, \ldots, t^1_{m_1}$ up to the parent node $N_2$ for deletion.
Set i = 2 and propagate the deletion up to node $N_i$ by following the procedure described below for deletion at a non-leaf node $N_i$ on the main branch. The propagation stops at the root node and all side-branch leaf nodes.
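The three leaf-node steps above can be sketched in a few lines. In this illustration the memory structure file and the count file are stood in for by two Python dictionaries keyed by token; these containers, the function name `delete_at_leaf`, and the sample values are assumptions, and no reassignment of the remaining tokens is attempted.

```python
def delete_at_leaf(dictionary, counts, values_to_delete):
    """Remove the given dictionary values and their counts from a leaf node.

    dictionary: token -> dictionary value (stand-in for the memory structure file)
    counts:     token -> count of occurrences (stand-in for the count file)
    Returns the sorted tokens that must be propagated up to the parent node.
    """
    targets = set(values_to_delete)
    tokens = sorted(t for t, value in dictionary.items() if value in targets)
    for t in tokens:
        del dictionary[t]   # remove the dictionary value
        del counts[t]       # remove the corresponding count
    return tokens


dictionary = {1: "red", 2: "green", 3: "blue"}
counts = {1: 3, 2: 2, 3: 1}
to_propagate = delete_at_leaf(dictionary, counts, ["green"])
print(to_propagate)  # [2]
print(dictionary)    # {1: 'red', 3: 'blue'}
```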
Deletion Starting at a Non-Leaf Node
Suppose a constraint is imposed on a number of leaf nodes. Find the closest common ancestor node, $N_1$, of the constrained leaf nodes. The common ancestor node $N_1$ defines one end of the main branch. The root node is the other end. Assume there are M main-branch nodes $N_1, N_2, \ldots, N_M$, where $N_2$ is the parent node of node $N_1$, $N_3$ is the parent node of node $N_2$, ..., and $N_M$ is the root node and the parent node of node $N_{M-1}$. The constraint is such that all records consisting of combinations of data values described by tokens $t^1_1, t^1_2, \ldots, t^1_{m_1}$ at node $N_1$ (the superscript is the node index) are deleted, where $m_1$ is the number of tokens being deleted. These tokens are obtained by performing a query with the specified conditions set on the constrained leaf nodes.
1. Deletion at a non-leaf node $N_1$ on the main branch:
• Store tokens $t^1_1, t^1_2, \ldots, t^1_{m_1}$ in an array D. Store their counts of deletions, equal to the counts of occurrences, in an array CD. Set arrays R, CR, and X to NULL. Carry out deletion in the subtree structure in which node $N_1$ is the root node, by following the procedure for deletion at a side-branch node, to be described below.
• Propagate tokens $t^1_1, t^1_2, \ldots, t^1_{m_1}$ up to the parent node $N_2$ for deletion.
Set i = 2 and continue with the following procedure.
2. Deletion at a non-leaf node $N_i$ on the main branch:
• Load a lookup table of the mapping between the tokens of node $N_i$ and the tokens of node $N_{i-1}$ from the memory structure of node $N_i$. The lookup table is in a hashing form of a set of hashing lists. The tokens of node $N_{i-1}$ describe the list indices of the hashing lists in which the tokens of node $N_i$ are the list elements. Find the $t^{i-1}_1$th, $t^{i-1}_2$th, ..., $t^{i-1}_{m_{i-1}}$th lists, which contain the tokens of node $N_i$ to be deleted. Get all the tokens of node $N_i$ from the above lists for deletion. Assume these tokens are $t^i_1, t^i_2, \ldots, t^i_{m_i}$, where $m_i \geq m_{i-1}$, and their counts of deletions are $c^i_1, c^i_2, \ldots, c^i_{m_i}$, equal to the counts of occurrences respectively.
• Remove all the left and right child tokens that pair with tokens $t^i_1, t^i_2, \ldots, t^i_{m_i}$ from the memory structure of node $N_i$. Remove all the corresponding counts from the count file.
• Assume the sibling node of node $N_{i-1}$ is $S_{i-1}$. Get all the tokens of the sibling node $S_{i-1}$ that pair with tokens $t^i_1, t^i_2, \ldots, t^i_{m_i}$. Eliminate repeated tokens by summing over the counts of deletions for each distinct token. Assume these tokens are $s^i_1, s^i_2, \ldots, s^i_{k_i}$, where $k_i \geq m$. The corresponding counts of deletions are $c_1, c_2, \ldots, c_{k_i}$.
• Identify the tokens among $s^i_1, s^i_2, \ldots, s^i_{k_i}$ for complete deletion. Such a token may be identified by checking whether its count of deletions is equal to its count of occurrences. Store the identified tokens and their counts of deletions in arrays D and CD.
• The remaining tokens among $s^i_1, s^i_2, \ldots, s^i_{k_i}$ are for partial deletion, since the count of deletions is less than the corresponding count of occurrences for each of them. Store the tokens for partial deletion and their counts of deletions in arrays R and CR.
• If an undisturbed NGRAM system is not required after deletion, propagate the deleted tokens in D and R, and their counts of deletions in CD and CR, to the sibling node $S_{i-1}$, by following the procedure for deletion at a side-branch node, to be described below.
• If an undisturbed NGRAM system is required after deletion, reassignment of undeleted tokens is in order:
- Reassign the undeleted tokens for nodes $N_i$ and $N_{i-1}$, using the following formula:
$$t^* = t - N_d \qquad (3)$$
where t is a given undeleted token and $t^*$ is the corresponding token after reassignment. $N_d$ is the number of tokens being deleted that are less than t.
- Identify each token in R for deletion of the first appearance. Store each recognized token for deletion of the first appearance and the largest token appearing before the first undeleted appearance of each recognized token in pairs in an array X. Reassign the undeleted tokens of node $S_{i-1}$ by using the reassignment schemes and formulas for deletion at a side-branch node, to be described below.
- Propagate arrays D, R, X, CD, and CR to the sibling node $S_{i-1}$, by following the procedure for deletion at a side-branch node, to be described below.
• If i = M, stop. Otherwise, set i = i + 1 and repeat the procedure for deletion at a non-leaf node $N_i$ on the main branch.
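One pass of the main-branch procedure can be sketched as follows. The sketch assumes each token of the node is stored with its pair of child tokens in a Python dictionary, that the first element of each pair belongs to the main-branch child and the second to the sibling, and that every parent token has a stored count. The names `propagate_up` and `reassign` are hypothetical, and the array X is not built here.

```python
from bisect import bisect_left
from collections import Counter


def propagate_up(pairs, parent_counts, deleted_child_tokens):
    """Find the parent tokens to delete and classify the sibling tokens.

    pairs:         parent token -> (main-branch child token, sibling token)
    parent_counts: parent token -> count of occurrences
    Returns (parents, D, CD, R, CR), where D/CD hold sibling tokens for
    complete deletion and R/CR hold sibling tokens for partial deletion.
    """
    deleted = set(deleted_child_tokens)
    parents = sorted(p for p, (c, _) in pairs.items() if c in deleted)

    occurrences = Counter()            # sibling token -> count of occurrences
    for p, (_, s) in pairs.items():
        occurrences[s] += parent_counts[p]
    deletions = Counter()              # sibling token -> count of deletions
    for p in parents:
        deletions[pairs[p][1]] += parent_counts[p]

    D = sorted(s for s in deletions if deletions[s] == occurrences[s])
    R = sorted(s for s in deletions if deletions[s] < occurrences[s])
    return parents, D, [deletions[s] for s in D], R, [deletions[s] for s in R]


def reassign(token, deleted_sorted):
    """Equation (3): t* = t - N_d, where N_d counts deleted tokens below t."""
    return token - bisect_left(deleted_sorted, token)


# Pairs taken from the worked example in this description; counts of 1 assumed.
pairs = {1: (1, 1), 2: (1, 2), 3: (1, 3), 4: (2, 4),
         5: (3, 5), 6: (4, 2), 7: (2, 6), 8: (3, 3)}
parent_counts = {p: 1 for p in pairs}

parents, D, CD, R, CR = propagate_up(pairs, parent_counts, [2, 4])
print(parents)               # [4, 6, 7]
print(D, CD)                 # [4, 6] [1, 1]
print(R, CR)                 # [2] [1]
print(reassign(8, parents))  # 5
```

Sibling token 2 is only partially deleted because it also pairs with parent token 2, which survives.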
Deletion at a Side-Branch Node
In general, a side-branch node can be a leaf or non-leaf node. A side-branch node may be considered as the root node of a subtree structure. The subtree structure may be a single-leaf tree structure if the side-branch node is a leaf node. In this case, only the procedure for deletion at a leaf node may be used. Or the subtree structure may be a non-cardinal tree structure if the side-branch node is a non-leaf node. In this case, deletion at a side-branch node is treated as deletion starting at the root node of a tree structure, and the deletion is propagated down throughout the tree structure.
In the following, we describe a general procedure for deletion of a set of tokens starting from the root node in a tree or subtree structure. The procedure describes how to carry out deletion in a left-hashing memory structure in an explicit form. The procedure applies to all three different cases:
1) undisturbed NGRAM tree structure before and after deletion;
2) undisturbed NGRAM tree structure before deletion and reassigned NGRAM tree structure after deletion; and
3) reassigned NGRAM tree structure before and after deletion.
In the case of 1), the exact procedure described in this section may be followed. In the cases of 2) and 3), the procedure may be simplified by neglecting reassignment of undeleted tokens. It is also possible that a part of an NGRAM tree structure is reassigned and the rest is not. For example, output tokens in the root node memory structure may be dropped in an undisturbed NGRAM tree structure. In this case, the mapping between the root-node output tokens and record numbers is lost, which is equivalent to reassignment of the root-node tokens. The other internal nodes, however, are not necessarily reassigned.
1. If it is a leaf node, go to deletion at a leaf node. If arrays D, R, X, CD, and CR are not all NULL, go to the next step. Otherwise, identify the root-node tokens for deletion and get the count for each root-node token being deleted from the count file for the root node. Store all the root-node tokens for complete deletion and their counts of deletions in two arrays D[n] and CD[n] respectively, where n is the number of the root-node tokens being deleted. Set arrays R, X, and CR to NULL. In the following, D stores tokens for complete deletion and CD stores their counts of deletions. Array R stores tokens for partial deletion and CR stores their counts of deletions. Array X stores every token for partial deletion of the first appearance and the largest undeleted token appearing before the first undeleted appearance of the token being deleted partially.
2. From arrays R and D, get a set of n parent-node tokens being deleted (partially and completely) at a non-leaf node, and find pairs of left- and right-child tokens associated with the parent-node tokens in correspondence. Store the three sets of tokens in a composite buffer B sequentially, in the order of a stream of left-child tokens followed by a stream of right-child tokens followed by a stream of parent-node tokens. Sort the left-child tokens in the composite buffer B in an incremental sequence. Subsort the right-child tokens which pair with the same left-child token.
3. Get unique left-child tokens from B and sum over the counts of deletions for each unique left-child token.
(a) If a left-child token appears once and once only in B, get count of deletions for the left-child token, from the count of deletions for the pairing parent-node token. The count of deletions for the parent-node token is obtained from either CD (if the parent-node token is in D for complete deletion) or CR (if the parent-node token is in R for partial deletion).
(b) If a left-child token appears more than once in B, get count of deletions for the left-child token by summing over counts of deletions for all the pairing parent-node tokens, stored in CD and/or CR.
Store the unique left-child tokens and their counts of deletions in two arrays $T_l[n_l]$ and $C_{T_l}[n_l]$ respectively, where $n_l$ is the number of unique left-child tokens in B. For each token in $T_l$ (say $T_l[i]$, $i = 0, 1, 2, \ldots, n_l - 1$), compare its count of deletions in $C_{T_l}$ ($C_{T_l}[i]$) and its count of occurrences. The count of occurrences can be obtained either by reading the $(T_l[i] + 1)$th count in the count file for the left-child node, or by summing over all the counts of occurrences for the pairing parent tokens.
(a) If the count of deletions (or $C_{T_l}[i]$) is smaller than the count of occurrences, it is a partial deletion.
i. Store the token and its count of deletions, $C_{T_l}[i]$, in two arrays $R_l[m_l]$ and $C_{R_l}[m_l]$ for partial deletion, where $m_l$ is the total number of tokens for partial deletion.
ii. If the first appearance is to be deleted, store the token and the largest token before the first undeleted appearance of the token being partially deleted in array $X_l$.
(b) If $C_{T_l}[i]$, the count of deletions, is equal to the count of occurrences, the $(T_l[i] + 1)$th count in the left-child count file, store the token and its count of deletions, $C_{T_l}[i]$, in two arrays $D_l[k_l]$ and $C_{D_l}[k_l]$ for complete deletion, where $k_l$ is the total number of tokens for complete deletion. Here, $n_l = m_l + k_l$.
4. Get unique right-child tokens from B and sum over the counts of deletions for each unique right-child token.
(a) If a right-child token appears once and once only in B, get count of deletions for the right-child token from the count of deletions for the pairing parent-node token. The count for the parent-node token is obtained from either CD (if the parent-node token is in D for a complete deletion) or CR (if the parent-node token is in R for a partial deletion).
(b) If a right-child token appears more than once in B, get the count of deletions for the right-child token by summing over the counts of deletions for all the pairing parent-node tokens, stored in CD and/or CR.
Store the unique right-child tokens and their counts of deletions in two arrays $T_r[n_r]$ and $C_{T_r}[n_r]$ respectively, where $n_r$ is the number of unique right-child tokens in B. For each token in $T_r$ (say $T_r[i]$, $i = 0, 1, 2, \ldots, n_r - 1$), compare its count of deletions in $C_{T_r}$ ($C_{T_r}[i]$) and its count of occurrences. The count of occurrences can be obtained either by reading the $(T_r[i] + 1)$th count in the count file for the right-child node, or by summing over all the counts of occurrences for the pairing parent tokens.
(a) If $C_{T_r}[i]$, the count of deletions, is smaller than the count of occurrences, it is a partial deletion.
i. Store the token and its count of deletions, $C_{T_r}[i]$, in two arrays $R_r[m_r]$ and $C_{R_r}[m_r]$ for partial deletion, where $m_r$ is the total number of tokens for partial deletion.
ii. If the first appearance is to be deleted, store the token and the largest token before the first undeleted appearance of the token being partially deleted in array $X_r$.
(b) If $C_{T_r}[i]$ is equal to the $(T_r[i] + 1)$th count in the count file for the right-child node, store the token and its count of deletions $C_{T_r}[i]$ in two arrays $D_r[k_r]$ and $C_{D_r}[k_r]$ for complete deletion, where $k_r$ is the total number of tokens for complete deletion. Here, $n_r = m_r + k_r$.
5. Delete tokens in a node memory structure:
(a) Partial deletion:
i. Recall that X stores parent-node tokens $b_1, b_2, \ldots, b_N$ and $e_1, e_2, \ldots, e_N$, where $b_1$ is the smallest token for deletion of the first appearance, and $e_1$ is the largest token before the first undeleted appearance of token $b_1$. $b_i$ is the ith smallest token for deletion of the first appearance, and $e_i$ is the largest token before the first undeleted appearance of token $b_i$, where $i = 1, 2, \ldots, N$. N is the number of tokens for deletion of the first appearance. If N is not zero, do the following:
A. If a parent-node token is not among $b_1, b_2, \ldots, b_N$ in X, reassign it by
$$t' = t - (i - j) \qquad (4)$$
where i is the number of tokens among $b_1, b_2, \ldots, b_N$ which are smaller than t, and j is the number of tokens among $e_1, e_2, \ldots, e_N$ which are smaller than t.
B. If a parent-node token is among $b_1, b_2, \ldots, b_N$ in X, let us say $b_k$, $k = 1, 2, \ldots, N$, reassign it by
$$t' = t + (e_k - t) - (i - 1 - j) = e_k - (i - 1 - j) \qquad (5)$$
where i is the number of tokens among $b_1, b_2, \ldots, b_N$ which appear before the first undeleted reappearance of t, and j is the number of tokens among $e_1, e_2, \ldots, e_N$ which appear before the first undeleted reappearance of t.
The above result is shown by an example: assume the streams of pairs are 1,1,1,2,3,4,2,3 and 1,2,3,4,5,2,6,3 for left- and right-child tokens, respectively. Suppose we want to delete the second and third pairs. Since the first appearance of the left-child token 1 is not to be deleted, no reassignment of left-child tokens is needed. The first appearances of the right-child tokens 2 and 3 are deleted, so reassignment of right-child tokens is in order. According to equation (4), the right-child tokens 4, 5, and 6 become 2, 3, and 5 respectively. The right-child tokens 2 and 3 become 4 and 6 respectively, according to equation (5). The new right-child tokens are 1(1), 2(4), 3(5), 4(2), 5(6), 6(3), where tokens in brackets are old ones and the others are the corresponding reassigned tokens. Another example shows the meaning of the first undeleted reappearance. Deleting the second, third, and sixth tokens among 1,2,3,4,5,2,6,3,2 leads to 1(1), 2(4), 3(5), 4(6), 5(3), 6(2).
ii. Similarly, $X_l$ stores left-child tokens $b_1, b_2, \ldots, b_{N_l}$ and $e_1, e_2, \ldots, e_{N_l}$, where $b_1$ is the smallest token for deletion of the first appearance, and $e_1$ is the largest token before the first undeleted appearance of token $b_1$. $b_i$ is the ith smallest token for deletion of the first appearance, and $e_i$ is the largest token before the first undeleted appearance of token $b_i$, where $i = 1, 2, \ldots, N_l$. $N_l$ is the number of tokens for deletion of the first appearance. If $N_l$ is not zero, do the following:
A. If a left-child token is not among $b_1, b_2, \ldots, b_{N_l}$ in $X_l$, reassign it by $t' = t - (i - j)$, where i is the number of tokens among $b_1, b_2, \ldots, b_{N_l}$ which are smaller than t, and j is the number of tokens among $e_1, e_2, \ldots, e_{N_l}$ which are smaller than t.
B. If a left-child token is among $b_1, b_2, \ldots, b_{N_l}$ in $X_l$, let us say $b_k$, $k = 1, 2, \ldots, N_l$, reassign it by $t' = e_k - (i - 1 - j)$, where i is the number of tokens among $b_1, b_2, \ldots, b_{N_l}$ which appear before the first undeleted reappearance of t, and j is the number of tokens among $e_1, e_2, \ldots, e_{N_l}$ which appear before the first undeleted reappearance of t.
iii. Analogously, $X_r$ stores right-child tokens $b_1, b_2, \ldots, b_{N_r}$ and $e_1, e_2, \ldots, e_{N_r}$, where $b_1$ is the smallest token for deletion of the first appearance, and $e_1$ is the largest token before the first undeleted appearance of token $b_1$. $b_i$ is the ith smallest token for deletion of the first appearance, and $e_i$ is the largest token before the first undeleted appearance of token $b_i$, where $i = 1, 2, \ldots, N_r$. $N_r$ is the number of tokens for deletion of the first appearance. If $N_r$ is not zero, do the following:
A. If a right-child token is not among $b_1, b_2, \ldots, b_{N_r}$ in $X_r$, reassign it by $t' = t - (i - j)$, where i is the number of tokens among $b_1, b_2, \ldots, b_{N_r}$ which are smaller than t, and j is the number of tokens among $e_1, e_2, \ldots, e_{N_r}$ which are smaller than t.
B. If a right-child token is among $b_1, b_2, \ldots, b_{N_r}$ in $X_r$, let us say $b_k$, $k = 1, 2, \ldots, N_r$, reassign it by $t' = e_k - (i - 1 - j)$, where i is the number of tokens among $b_1, b_2, \ldots, b_{N_r}$ which appear before the first undeleted reappearance of t, and j is the number of tokens among $e_1, e_2, \ldots, e_{N_r}$ which appear before the first undeleted reappearance of t.
(b) Complete deletion:
Reassignment of tokens after partial deletion should be taken into account.
i. Get the left-hashing lists whose list indices appear in $R_l$. Remove list elements that store parent-node tokens appearing in D and the corresponding right-child tokens.
ii. Remove hashing lists whose list indices appear in $D_l$ (all the output tokens in these lists must appear in D).
iii. Sort the left-child tokens in $D_l$. Reassign left-child tokens by $t' = t - (i + 1)$ if $D_l[i] < t < D_l[i + 1]$, where $i = 0, 1, 2, \ldots, k_l - 2$. For those left-child tokens that are larger than $D_l[k_l - 1]$, perform $t' = t - k_l$. Here, t is a left-child token before the deletion and t' is the reassigned left-child token.
iv. Sort the right-child tokens in $D_r$. Reassign right-child tokens by $t' = t - (i + 1)$ if $D_r[i] < t < D_r[i + 1]$, where $i = 0, 1, 2, \ldots, k_r - 2$. For those right-child tokens that are larger than $D_r[k_r - 1]$, perform $t' = t - k_r$. Here, t is a right-child token before the deletion and t' is the reassigned right-child token.
v. Sort the parent-node tokens in D. Reassign parent-node tokens by $t' = t - (i + 1)$ if $D[i] < t < D[i + 1]$, where $i = 0, 1, 2, \ldots, k - 2$ and k is the total number of tokens in D. For those parent-node tokens that are larger than $D[k - 1]$, perform $t' = t - k$. Here, t is a parent-node token before the deletion and t' is the reassigned parent-node token.
6. Deletion in a count file:
(a) Reassign counts of occurrences for tokens appearing in R by $C' = C - CR[i]$, $i = 0, 1, 2, \ldots, m - 1$, where m is the total number of tokens in R. C is a count of occurrences, stored in the count file for a parent-node token appearing in R. CR[i] is the count of deletions for the same token.
(b) Remove counts of occurrences from the count file for parent-node tokens appearing in D, since the count of deletions is equal to the count of occurrences for each token in D. Reassignment of positions for counts due to reassignment of parent-node tokens should be taken into account.
7. Deletion in a list count file:
(a) Reassign the list count for each of the lists whose list indices appear in $R_l$ by $C' = C - C''$, where $C''$ is the total number of parent-node tokens in the list that appear in D, C is the list count stored in the list count file for a partially deleted list, and $C'$ is the reassigned list count.
(b) Remove list counts from the list count file for lists whose list indices appear in $D_l$.
Reassignment of positions for counts due to reassignment of left-child tokens should be taken into account.
8. Set R = $R_l$, CR = $C_{R_l}$, D = $D_l$, CD = $C_{D_l}$, X = $X_l$, and n = $n_l$. If the left-child node is an internal node, propagate the deletion down to the left-child node, and repeat the above procedure for the left-child node. If the left-child node is a leaf node, proceed to deletion at a leaf node.
9. Set R = $R_r$, CR = $C_{R_r}$, D = $D_r$, CD = $C_{D_r}$, X = $X_r$, and n = $n_r$. If the right-child node is an internal node, propagate the deletion down to the right-child node, and repeat the above procedure for the right-child node. If the right-child node is a leaf node, proceed to deletion at a leaf node.
10. Deletion at a leaf node:
(a) For each leaf token appearing in R, reassign its count in the count file by $C' = C - CR[i]$, $i = 0, 1, 2, \ldots, m - 1$, where m is the total number of tokens in R. C represents the original count of occurrences for a leaf token, stored in the count file. CR[i] represents the count of deletions for the same leaf token being deleted. $C'$ is the reassigned count. Now, X stores leaf-node tokens $b_1, b_2, \ldots, b_N$ and $e_1, e_2, \ldots, e_N$, where $b_1$ is the smallest token for deletion of the first appearance, and $e_1$ is the largest token before the first undeleted appearance of token $b_1$. $b_i$ is the ith smallest token for deletion of the first appearance, and $e_i$ is the largest token before the first undeleted appearance of token $b_i$, where $i = 1, 2, \ldots, N$. N is the number of tokens for deletion of the first appearance. If N is not zero, do the following:
i. If a leaf token is not among $b_1, b_2, \ldots, b_N$ in X, reassign it by $t' = t - (i - j)$, where i is the number of tokens among $b_1, b_2, \ldots, b_N$ which are smaller than t, and j is the number of tokens among $e_1, e_2, \ldots, e_N$ which are smaller than t.
ii. If a leaf token is among $b_1, b_2, \ldots, b_N$ in X, let us say $b_k$, $k = 1, 2, \ldots, N$, reassign it by $t' = e_k - (i - 1 - j)$, where i is the number of tokens among $b_1, b_2, \ldots, b_N$ which appear before the first undeleted reappearance of t, and j is the number of tokens among $e_1, e_2, \ldots, e_N$ which appear before the first undeleted reappearance of t.
(b) For each leaf token appearing in D, remove its count of occurrences from the count file, and delete the dictionary it represents. Relocations of the dictionaries indexed by the reassigned leaf tokens should be done accordingly. Reassigned positions of counts in the count file should also be consistent with the reassignment of the corresponding tokens after deletion.
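The reassignment for partial deletion can be checked against the two worked examples given in step 5(a). The sketch below does two things under stated assumptions. First, it evaluates equation (4) directly, with the b and e values read off by hand from the examples. Second, as a cross-check, it renumbers the surviving tokens in order of first appearance, which is an alternative shortcut and not the formula-based procedure of this description. Both function names are hypothetical.

```python
from bisect import bisect_left


def equation_4(t, b, e):
    """Equation (4): t' = t - (i - j) for a token t that is not in b.

    b and e are sorted lists; i and j count the entries smaller than t.
    """
    i = bisect_left(b, t)
    j = bisect_left(e, t)
    return t - (i - j)


def renumber_by_first_appearance(stream, deleted_positions):
    """Drop the 1-based positions in deleted_positions, then number the
    surviving tokens 1, 2, 3, ... in order of first appearance.
    Returns a dict mapping old token -> new token."""
    mapping = {}
    for position, token in enumerate(stream, start=1):
        if position not in deleted_positions and token not in mapping:
            mapping[token] = len(mapping) + 1
    return mapping


# First example: delete the second and third pairs.
right = [1, 2, 3, 4, 5, 2, 6, 3]
b, e = [2, 3], [5, 6]   # read off by hand: tokens 2 and 3 lose their first appearance
print([equation_4(t, b, e) for t in (4, 5, 6)])  # [2, 3, 5]
print(renumber_by_first_appearance(right, {2, 3}))
# {1: 1, 4: 2, 5: 3, 2: 4, 6: 5, 3: 6}

# Second example: delete the second, third, and sixth tokens.
stream = [1, 2, 3, 4, 5, 2, 6, 3, 2]
print(renumber_by_first_appearance(stream, {2, 3, 6}))
# {1: 1, 4: 2, 5: 3, 6: 4, 3: 5, 2: 6}
```

Both mappings agree with the old(new) listings in the examples, for instance old token 4 becoming 2 and old token 2 becoming 4 in the first case.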
Deletion in Join NGRAM Tree Structures
Previous sections describe how to perform deletions in a single NGRAM tree structure. In join NGRAM tree structures, the descriptions without modification would
deletions beyond trivial rejection.
(1) Change the foreign node, whose tokens reference tokens being deleted, to a hybrid node. The foreign-node tokens that reference the tokens being deleted are replaced by the data values that are represented by the referenced tokens;
(2) Propagate the deletion of the referenced tokens to the NGRAM tree structure where the foreign node is located. To do this, we get all the foreign-node tokens that reference the deleted tokens and propagate them up to the root node of the tree structure where the foreign node is located. Then, we perform deletions of the resulting root-node tokens, starting at the root node; or
(3) Reassign the foreign-node tokens that reference the deleted tokens to some other values. This involves updating operations.
I claim: