CA1214284A

CA1214284A - Sparse array bit map used in data bases

Info

Publication number: CA1214284A
Application number: CA000453210A
Authority: CA
Inventors: Amnon Waisman; Andrew M. Weiss
Original assignee: Wang Laboratories Inc
Current assignee: Wang Laboratories Inc
Priority date: 1983-05-02
Filing date: 1984-05-01
Publication date: 1986-11-18
Also published as: US4606002A; EP0124097A3; EP0124097B1; JPS6037048A; EP0124097A2; JPH0766347B2; DE3484910D1

Abstract

ABSTRACT OF THE DISCLOSURE

A data base system uses a self-descriptive index key format having variable length data fields so that the data base system manipulation is independent of the type and arrangement of the data being stored and retrieved.
The data is characterized by three index variables which represent the data table, the record in that table, and a particular field within that record.
Each table is composed of data imbedded in the B-tree index structure of the data base. In order to access records using the field variables, the data base additionally includes an inverted B-tree index logically related to the original index. The operation of the index is enhanced by the use of data compression and the use of a sparse array bit map to represent the record associated with each field. The index structure within the data base allows each index variable to identify data by means of the index variables independently of the physical location in which the data is stored. The efficiency of storage is further enhanced by storing the data in variable length data records. The sparse array bit maps may also be used to provide inter-record relationships for records stored in different data tables.

Description

SPARSE ARRAY BIT MAP USED IN DATA BASES

This invention is related to methods for storing data on digital computers, and in particular, to an improved data base system for storing and retrieving large amounts of data.

BACKGROUND OF THE INVENTION

Computers are used today to store vast amounts of information about anything conceivable. The data is stored in large data bases. Once the information is stored, it has to be found quickly when needed. One of the techniques used in data bases is the use of inverted lists.
Inverted lists are used in data bases to allow for fast searches. For example, in a hospital data base, a request may be issued to find all the records of patients allergic to penicillin. One slow way to get this information from the data base would be to search each and every one of the patient records, selecting the records with penicillin in the allergy field. A better way and a common practice in data bases is to maintain inverted lists in the data base. In this example, using an inverted list allows a list of all patients allergic to penicillin to be found relatively easily and quickly.

Unfortunately, inverted lists require a lot of storage space, take a lot of time to create, and have to be efficiently organized to enable the data to be found quickly. Usually inverted lists take more storage space than the original data bases which they support.
Most data bases use inverted lists for only selected data items and therefore allow for fast searches only on selected items.
SUMMARY OF THE INVENTION
Briefly, the present invention is a data base in which data is stored in data tables and methods for constructing such a data base. Each data table in the data base includes one or more records, and each record includes one or more fields. Each data table is identified by means of a record index value, each record in a data table by means of a record serial number, and each field in a record by means of a field index value. The record serial numbers for a data table are divided into ranges. Each range contains a predetermined number of record serial numbers, and each range is represented by a consecutive range value. Associated with each data table there is an inverted list including a plurality of keys, each key being associated with a particular field of the data table and representing the occurrence of a particular data value in the field. Each of the keys has associated with it one or more pointers, and each pointer contains one of the range values and a sparse array bit map which indicates which of the records in the range specified by the range value contain an occurrence of the particular data value.
Certain fields in a data table may represent relationships between the records in that data table and selected records from a second data table. The data contained in these fields employs the range values for the ranges to which the selected records belong and a sparse array identifying the specific selected records in the specified range.
The inverted list is stored by means of apparatus which includes means for dividing the list of record serial numbers into ranges and assigning the range values, means for selecting range values for those ranges containing record serial numbers which occur in the inverted list, and means for encoding the positions of the occurring record serial numbers for each selected range in a sparse array for that range.
The data base of the invention is created by the steps of providing the data tables and the records contained therein, assigning a record index value to each data table, assigning a record serial number to each record in the table, dividing the record serial numbers into ranges and assigning a range value to each range, dividing the records into fields containing data values and assigning a field index value to each field, and providing inverted lists associated with the data tables by performing the steps of creating keys, each of which is associated with a particular field and represents the occurrence of a particular data value in the field, and providing pointers associated with the keys, each pointer including one of the range values and a sparse array bit map which indicates which of the records in the range specified by the range value contain an occurrence of the particular data value.

If the data table contains fields which establish relationships between the records in that data table and selected records in another data table, the method further includes the steps of providing fields in the data tables which represent a relationship between each record in the data table and selected records from the other data table and storing data in the fields which represents the selected records by means of the range values specifying the ranges to which the serial numbers of the selected records belong and the sparse array which identifies the selected records in the selected ranges.
The method of storing the inverted list has the steps of dividing the possible record serial numbers into ranges and assigning them range values, selecting the range values for those ranges containing record serial numbers which occur in the inverted list, and for each selected range, encoding the positions of the occurring record serial numbers in the range in a sparse array.
DESCRIPTION OF THE DRAWINGS
The improvements of the present invention over the prior art and the advantages resulting therefrom will become more apparent upon reading the following description of the preferred embodiment in which:
FIG. 1 is a diagram showing a B-tree type of index;
FIG. 2 is a diagram representing a data base which is used in explaining the operation of the present invention;
FIG. 3 shows the data of FIG. 2 as it is kept by the present invention;

FIG. 4 shows the data of FIG. 3 as it is stored in a compressed format;
FIG. 5 shows the organization of an inverted key;
FIGS. 6 and 7 show the organization of a sparse array bit map;
FIG. 8 shows the manner in which bit integers represent range values;

FIG. 9 shows how a sparse array replaces a list of record serial numbers in an inverted list; and
FIG. 10 illustrates the results of an AND-NOT operation on a sparse array.

It will be helpful, before describing the invention, to briefly explain the operation of a B-tree type of index. Referring to FIG. 1, there is shown a B-tree structure for accessing records in an alphabetically arranged data base. The B-tree in FIG. 1 is made up of two levels 10, each in turn comprised of one or more blocks 12. The top level is composed of one block which is made up of a number of entries. Each of the entries is made up of a key, which identifies the data, and a pointer. In the first two levels of the B-tree, the pointers identify the location of blocks on lower levels of the B-tree which provide further indexing of the data. In the last row, the entries in each block are associated with individual records in the data base, and the pointers in the lowest level of a B-tree point to the location of this data.
To use a B-tree index, a program will search along the keys in the top level of the B-tree until it finds a key which indicates where the data is to be found. For example, if FIG. 1 is an index for a list of names and the name to be found is Morris, the program would search along the top level until it found the key (M O) which would contain the name sought. The pointer associated with that key allows the program to locate the proper place to start searching in the next level, in this case, at the beginning of the M's. The same procedure is carried out in the next level which further narrows down the search area until, at the lowest level, the pointer locates the data entry for Morris. B-trees provide significantly faster access and response times than a straight sequential search of the data. B-trees are well-known methods of indexing data, and a further discussion of the use of B-tree indexes can be found in many references.
The data base described herein identifies all data fields within the data base by means of three numbers: the record identifier, the record serial number, and the field identifier. These terms are explained with reference to FIG. 2. FIG. 2 represents part of a data base which might be used by a hospital. The hospital data base would have many data tables including the two shown in FIG. 2, the patient table and the doctor table. Each of these tables is identified by a unique record identifier (RI) number. In FIG. 2, the doctor data is in a table which has an RI of 17, and patient data is in the table whose RI is 47.
Each table is divided into records corresponding with the individual patients and doctors. Each patient or doctor record is identified by its own record serial number (RSN). The data for each person is divided into fields, each field representing a different piece of data associated with that person, and each of these fields has its own field identifier (FI). The data shown in FIG. 2 is only exemplary. In an actual application, a very large amount of data might well be included in the data base including a large number of tables, each with a large number of entries. For example, a large hospital might have to keep data on 2000 patients who are in the hospital at any one time and hundreds of thousands of former patients, not to mention data on doctors, employees, and so forth. Clearly, such a data base will require a large amount of storage. Nevertheless, any data entry in such a data base can be uniquely identified by the three variables RI, RSN, and FI.
Referring to FIG. 3, there is shown the result when the data from the patient data table of FIG. 2 is stored within the B-tree index itself. The entries in FIG. 3 represent the bottom level of a B-tree. In the B-tree shown in FIG. 1, each entry includes a key, to identify the data, and a pointer to locate the index or data. In the data base described herein, the pointers in the lowest level of the B-tree are replaced with the actual data. Another way of putting this is that the data in the B-tree defines itself, or is self-identifying. The key provides a means of locating any particular piece of data. In other words, the key is a "logical address" which, by following the procedure described above, can be used to access a particular piece of data.
The use of self-identifying data has several important benefits. The size of the data base may be changed and additional fields may be added to the records without any need to change programs for accessing data and without any need to reorganize the data base. This is because the logical address is independent of the physical location, or address, of the data in memory. Put another way, no matter how the B-tree is physically rearranged in memory as data is added to or deleted from the data base, and no matter where the data is actually physically located in memory, the data can always be located using the keys embedded in the data base.
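The idea of a key acting as a logical address can be sketched in Python as follows. This is only an illustration under stated assumptions: a sorted list stands in for the bottom level of the B-tree, keys are (RI, RSN, FI) tuples, and the FI numbers and field values are hypothetical rather than taken from FIG. 3.

```python
import bisect

# Sorted (key, data) entries; a sorted list stands in for the B-tree's bottom level.
# RI 47 is the patient table named in the text; FI numbers and values are made up.
entries = sorted([
    ((47, 1, 1), "Smith"),
    ((47, 1, 2), "Boston"),
    ((47, 2, 1), "Jones"),
    ((47, 2, 2), "Lowell"),
])
keys = [key for key, _ in entries]


def get_field(ri, rsn, fi):
    """Look up one field by its logical address, or return None if absent."""
    pos = bisect.bisect_left(keys, (ri, rsn, fi))
    if pos < len(keys) and keys[pos] == (ri, rsn, fi):
        return entries[pos][1]
    return None


def get_record(ri, rsn):
    """Return {FI: data} for every field stored under (RI, RSN)."""
    lo = bisect.bisect_left(keys, (ri, rsn))
    hi = bisect.bisect_left(keys, (ri, rsn + 1))
    return {key[2]: data for key, data in entries[lo:hi]}
```

Because lookups depend only on key order, entries could be moved or new fields added without changing `get_field` or `get_record`.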
Storing the data and keys shown in FIG. 3 might appear to require a lot of overhead in terms of storage space for a large data base. In the present system, this is not the case. Information such as that shown in FIG. 3 is usually compressed before storing. For example, the data of FIG. 3 may be compressed as explained below in connection with FIG. 4. As will be seen, due to the arrangement of the data in the present data base, this compression scheme significantly reduces the amount of data to be stored. It should be appreciated, however, that other schemes of data compression are known to those in the art, and these compression schemes are, in general, applicable to the data base system of the present invention. Thus the use of a particular compression scheme in describing the preferred embodiment should not be taken as a limitation on the invention.
FIG. 4 shows the method by which data is compressed in the embodiment described. In FIG. 4, CL is the compression length, which is the number of initial digits in the current key which are the same as the preceding key; KL is the key length; and DL is the data length. Assume that line 42b in FIG. 4 represents the first piece of data in the patient table shown in FIG. 3. Since there are no preceding entries, the compression length is zero. The key length is 6, the number of digits (bytes) in the RI-RSN-FI key. The data length is 5 for the 5 bytes of data in the name.
In the next line, the compression length is 5, since the first five digits of the key are identical for both the first and second data entries. The key length for the second and subsequent entries becomes 1, since the only change in the key is the FI variable, the RI and RSN values remaining constant for the remaining fields in the first record. Lines 46b-50b in FIG. 4 are similarly compressed.
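The CL/KL/DL scheme just described resembles what is commonly called front (prefix) compression of sorted keys. The following Python sketch shows one way it might work; the six-byte key layout and the sample field values are assumptions made for illustration and are not the contents of FIG. 3 or FIG. 4.

```python
def compress(entries):
    """Turn sorted (key, data) pairs into (CL, KL, DL, key_suffix, data) rows."""
    rows, prev = [], b""
    for key, data in entries:
        cl = 0  # number of leading key bytes shared with the preceding key
        while cl < min(len(key), len(prev)) and key[cl] == prev[cl]:
            cl += 1
        rows.append((cl, len(key) - cl, len(data), key[cl:], data))
        prev = key
    return rows


def decompress(rows):
    """Rebuild the full (key, data) pairs from compressed rows."""
    entries, prev = [], b""
    for cl, _kl, _dl, suffix, data in rows:
        key = prev[:cl] + suffix
        entries.append((key, data))
        prev = key
    return entries


# Hypothetical 6-byte RI-RSN-FI keys for three fields of one patient record.
SAMPLE = [
    (bytes([0, 47, 0, 0, 1, 1]), b"Smith"),
    (bytes([0, 47, 0, 0, 1, 2]), b"12 Elm St"),
    (bytes([0, 47, 0, 0, 1, 3]), b"Boston"),
]
```

With this sample, the first row carries the whole six-byte key and each later row carries only the single byte that changed, mirroring the CL of 0 then 5 and the KL of 6 then 1 described in the text.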
The above-described process gives quick access to data in fields that are to be accessed via the RSN key. Alone, however, this method does not provide for quick access to records based on the value of data in one of the fields associated with that record. For instance, finding a patient whose name is unknown who lives at a particular address or compiling a list of all patients who live in Boston requires a sequential search through all of the address fields of all the patient records.
Access to individual records based on data in the fields may be easily added to the data base structure described above. While most B-tree indexes require a separate B-tree for each field variable which one wishes to search, the present invention requires only a single B-tree to provide access to as many fields as desired. This is done in the following manner.
Generally, in the described embodiment, all data fields will be indexed in the inverse key table, although some fields, such as a "miscellaneous comment" field, may not be. For each field to be indexed, an inverted key table is constructed in the following manner. First, the inverted key must be logically located with respect to the original data in the particular data table under consideration. In the described embodiment, this is done by assigning only odd numbers as RI values and by assigning the associated inverted key the following even number. A B-tree is then constructed with a data structure inverse to the original format, i.e., for each different field value, the inverted table will list all the records which contain that value in that field. Sometimes it is necessary to identify one or more patients in the data base using only address information, for example, all patients living in a particular area. Referring to the data base shown in FIG. 1, a geographical search may be easily implemented using the inverse lists stored in the present invention by adding one to the RI value to generate the key for the associated inverted table and then searching entries under the FI value indicating address.
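A minimal Python sketch of building such an inverted table is given below. It assumes records are held as a dictionary of {RSN: {FI: value}} and uses a plain dictionary in place of the B-tree; the function name, the FI numbers and the sample patients are hypothetical.

```python
from collections import defaultdict


def build_inverted(ri, records, indexed_fields):
    """Map (ri + 1, fi, value) to the ascending list of RSNs holding that value."""
    if ri % 2 == 0:
        raise ValueError("data tables are assumed to use odd RI values")
    inverted = defaultdict(list)
    for rsn in sorted(records):  # ascending RSNs keep every list sorted
        for fi, value in records[rsn].items():
            if fi in indexed_fields:
                inverted[(ri + 1, fi, value)].append(rsn)
    return dict(inverted)


# Hypothetical patient table (RI 47): FI 1 is the name, FI 3 is the city.
patients = {
    1: {1: "Smith", 3: "Boston"},
    2: {1: "Jones", 3: "Lowell"},
    3: {1: "Brown", 3: "Boston"},
}
inverted = build_inverted(47, patients, indexed_fields={3})
```

Looking up `(48, 3, "Boston")` in the result then yields the RSNs of the Boston patients without scanning the patient table.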
Referring to FIG. 5, an example is given of an inverted key for the city field in the data base of FIG. 3. The key is found by adding one to the RI value to get 48; the FI value corresponding to the city field is selected; the B-tree is searched to find these RI and FI values and the data following this is searched for the desired city, in this case, Boston. The numbers following Boston are the RSNs of the patient records for Boston patients. The RSNs are actually functioning as pointers to the data and refer to the records in the patient data table which contain information on the patients living in Boston. The data shown in FIG. 5 can be compressed for storage in a manner similar to the compression scheme described above. As can be seen from an inspection of FIG. 5, the inverted key data will compress greatly.
There are several advantages to this method of indexing the field data. As mentioned above, the key or logical address of the data in the inverted tables is independent of the actual location in memory of the data. This allows the data base to be enlarged or modified without having to change the values of the pointers in the inverted table, and it also makes the data base independent of the type of data stored in it and thus more generally applicable to a wide variety of data bases. The arrangement of the RSNs is also advantageous for searches having multiple field keys, e.g., a search for all patients living in a particular city who have a particular illness. The RSN lists for the city and illness fields will both be arranged in numerical order. This makes it easy to determine a match by comparing two lists of RSNs and selecting the RSNs which match.
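Matching two numerically ordered RSN lists can be done in a single pass. The sketch below shows the standard two-pointer merge; the function name and the sample lists are illustrative assumptions.

```python
def match_sorted(a, b):
    """Return the RSNs present in both ascending lists a and b."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b
        else:
            j += 1  # b[j] cannot appear later in a
    return out
```

Each list is read once, so the cost grows with the combined length of the two lists rather than with their product.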
In terms of both speed and memory requirements, the access and manipulation of data in the inverted table can be further enhanced by a technique which will be referred to as a sparse array bit map. This is a method of compressing the inverted list by representing the existence of individual records in an inverted table by individual bits which require much less storage space than the individual RSNs. Using the inverted table structure shown in FIG. 5, the list of RSNs for each inverted table entry is replaced by a sparse array in which the presence of a few bits represents the occurrence of a particular record in the list. The present invention allows multi-digit RSNs to be replaced with a few bits, as will be seen below. Thus, in a large data base having tens or hundreds of thousands of records, many four- or five-digit RSNs requiring four or five bytes of storage each can be replaced with a few bits.
The sparse array bit map is generated in the following manner. First, the list of all RSNs is divided into ranges. In the described embodiment, each range includes 512 records. Each range is assigned a consecutive range value (RV). Thus, RSNs 0 through 511 would fall into the first range having an RV of 0; RSNs 512 through 1023 would have an RV of 1; and so forth. If one or more RSNs in the inverted list fall within a range, the associated RV is stored in the inverted table. Only ranges with non-null sparse arrays are defined.
The location of each RSN occurring within a range is stored in a sparse array which represents individual RSNs within a range. Referring to FIG. 6, the top line is one byte 52 in which each of the individual bits represents the occurrence in a list of at least one and possibly as many as 64 RSNs in a particular range. Each bit of the top-level byte 52 in FIG. 6 represents a corresponding eight-bit byte, shown on a second level 54 in FIG. 6, and each bit of each byte on the second level represents one eight-bit byte on a third level 56. In each upper level, a bit is set if the corresponding byte on the next lower level contains a one in any of its eight bits. There are 64 bytes having 512 bits in third level 56. Each of these 512 bits represents a corresponding RSN in the range.

The presence of a set bit (represented by a bit having a value of one in this embodiment) in any of the bit positions of byte 52 thus represents the occurrence of between 1 and 64 RSNs in the inverted list. The absence of a set bit (a zero in this embodiment) in any of the bit positions in byte 52 represents the absence of 64 RSNs in the inverted list. (A byte filled with eight zeros is represented by an "x" in the corresponding box in FIG. 6.) Therefore, the presence or absence of 512 individual RSNs can be represented by the data structure shown in FIG. 6.
For each zero in byte 52, the corresponding byte in level 54 and the corresponding eight bytes in level 56 will all be zero, and there is no need to store these nine bytes individually, since they contain redundant information. Similarly, each zero in the second level 54 represents a byte on the third level 56 with eight zeros. Thus, all of the information in the data structure shown in FIG. 6 can be stored by storing only those bytes which contain one or more ones. With the sparse array of FIG. 6, three bytes is the minimum number of bytes which must be stored to represent the occurrence of an RSN in a range. The maximum number of bytes which must be stored is seventy-three, i.e., all the bytes in levels 52, 54, and 56. This will occur only when each of the bottom level 56 bytes has at least one bit set, resulting in seventy-three bytes representing between 64 and 512 RSNs. Thus the reduction in required data storage space depends on the particular pattern of RSNs within the range.
After the sparse array of FIG. 6 is constructed, the data is stored in the inverted table in the manner shown in FIG. 7, where each RV represents the range value for each range which contains at least one RSN in the inverted list, and the range is followed by 3 to 73 bytes of the corresponding sparse array. This data may then be compressed as described above before being stored in the B-tree.
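The three-level structure can be sketched in Python as shown below. Two points are assumptions made for illustration, since the text does not fix them: the most significant bit of each byte stands for the lowest position, and the stored bytes are emitted top level first, then the non-zero second-level bytes, then the non-zero third-level bytes.

```python
def build_sparse_array(offsets):
    """Encode positions 0..511 within one range as 3 to 73 stored bytes."""
    low = {}  # third-level byte index (0..63) -> byte value
    for off in offsets:
        if not 0 <= off < 512:
            raise ValueError("offset must lie within the 512-record range")
        low[off // 8] = low.get(off // 8, 0) | (0x80 >> (off % 8))
    if not low:
        raise ValueError("only non-null sparse arrays are stored")
    mid = {}  # second-level byte index (0..7) -> byte value
    for idx in low:
        mid[idx // 8] = mid.get(idx // 8, 0) | (0x80 >> (idx % 8))
    top = 0
    for idx in mid:
        top |= 0x80 >> idx
    # All-zero bytes are simply never emitted.
    return bytes([top] + [mid[i] for i in sorted(mid)] + [low[i] for i in sorted(low)])
```

A single position yields the three-byte minimum, while one position in each of the 64 third-level bytes yields the seventy-three-byte maximum.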
The maximum number of values in each range is 512 in the described embodiment. With 8-bit bytes, the number contained in each range must be an integral power of eight. Practical considerations of disk sector lengths and access times make 512 more desirable than the next higher power of eight, 4096, in the described embodiment. In other applications, larger or smaller ranges may be preferable.
Determining the range value and the sparse array from an RSN is straightforward. The RSN is divided by the extent of each range, 512 in the described embodiment. The integer part of the result is the range number, and the remainder is the bit position within the sparse array which corresponds to that RSN.
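In Python this division is a single `divmod` call. The sketch below also groups a list of RSNs by range; the function names are illustrative assumptions, and the range size of 512 follows the described embodiment.

```python
RANGE_SIZE = 512  # records per range in the described embodiment


def locate(rsn):
    """Return (range value, bit position within the range) for an RSN."""
    return divmod(rsn, RANGE_SIZE)


def group_by_range(rsns):
    """Map each occupied range value to its ascending bit positions."""
    groups = {}
    for rsn in sorted(rsns):
        rv, position = locate(rsn)
        groups.setdefault(rv, []).append(position)
    return groups
```

For example, `locate(6410)` gives range 12 and position 266, because 12 × 512 = 6144 and 6410 − 6144 = 266.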
In the present embodiment, the range value is stored as a bit integer of one to four bytes. The number of bytes or length of the bit integer is stored in the first two bit positions of the first byte of the bit integer. A value of 00 indicates that the bit integer requires one byte; a value of 01 indicates that the bit integer requires two bytes; and so on. The remaining part of the first byte and any additional bytes stores the range value in binary. This is shown in FIG. 8. The top bit integer represents a range value of 5 and requires only one byte. The second bit integer represents a range value of 100 and requires two bytes. Using this format with a maximum of 4 bytes to represent the bit integer, range values up to approximately one billion can be represented. Bit integers formed in this manner have the additional advantage that all bit integers will collate correctly according to their numeric value.
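A Python sketch of this bit integer format follows. It assumes the two length bits are the most significant bits of the first byte and that the value is stored most significant byte first, which is what makes byte-wise comparison agree with numeric order; the function names are illustrative.

```python
def encode_range_value(rv):
    """Encode rv in 1 to 4 bytes: 2 length bits followed by 6, 14, 22 or 30 value bits."""
    if rv < 0:
        raise ValueError("range value must be non-negative")
    for nbytes in range(1, 5):
        value_bits = 8 * nbytes - 2
        if rv < (1 << value_bits):
            return (((nbytes - 1) << value_bits) | rv).to_bytes(nbytes, "big")
    raise ValueError("range value needs more than 30 bits")


def decode_range_value(data):
    """Decode a bit integer found at the start of data."""
    nbytes = (data[0] >> 6) + 1
    raw = int.from_bytes(data[:nbytes], "big")
    return raw & ((1 << (8 * nbytes - 2)) - 1)
```

Thirty value bits give 2^30 = 1,073,741,824 distinct range values, matching the "approximately one billion" figure in the text.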
Another advantage of the sparse array bit map is the ease with which lists of RSNs may be compared to find the result of logical operations which may be required to define a particular subset of the data base. This is because range values represent sets of RSNs, and set operations are thus applicable to sparse arrays and range values. Set operations include intersection, union, and relative difference functions, which implement logical AND, OR, and AND-NOT functions, respectively.
For example, suppose a list is to be compiled of all patients who live in Boston AND who have had the flu. This is the same as determining the intersection of the RSN lists (represented by sparse arrays) following the "flu" and "Boston" values in the inverted table. The two lists of all Boston patients and all patients with the flu may be taken directly from the inverted lists in table 48 (partially shown in FIG. 4) for the city and illness fields from the patient data table shown in FIG. 2. Next, the two lists are searched for range values which are the same. If one or more entries are found which have the same range number, the sparse arrays must then be compared. Referring to FIG. 6, it can be seen that ANDing the top level bytes from each of two sparse arrays to be ANDed (which are referred to as the "input" arrays below) will produce a byte which represents the top level byte of the "output" sparse array representing the intersection of the city RSNs and the illness RSNs. If the output array top-level byte is null, i.e., all zeros, the process need go no further, since this indicates that there are no common elements in the two inverted lists. If there are one or more bits equal to one in the output top-level byte, the corresponding second level bytes from the input arrays are ANDed. Again, the presence of a null byte indicates that there are no common members from the RSNs represented by that byte. If the ANDing of bytes from the second level 54 results in a byte having a bit equal to one, the process is repeated for the third level 56. If there are common members of the two sets, the series of bytes produced during the above-described operation is the sparse array which represents the intersection of the two sets.
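The level-by-level AND can be sketched in Python as follows. The in-memory form is an assumption chosen for readability: a sparse array is a tuple (top, mid, low) where top is the top-level byte, mid maps a top bit position to its second-level byte, and low maps a (top, second) position pair to its third-level byte, with the most significant bit standing for position 0. The helpers `from_offsets` and `to_offsets` are hypothetical. The sketch also clears an upper-level bit whenever the bytes beneath it AND to zero, a step the text spells out only for the AND-NOT case.

```python
def from_offsets(offsets):
    """Build a (top, mid, low) sparse array from positions 0..511."""
    top, mid, low = 0, {}, {}
    for off in offsets:
        i, j, k = off // 64, (off // 8) % 8, off % 8
        top |= 0x80 >> i
        mid[i] = mid.get(i, 0) | (0x80 >> j)
        low[i, j] = low.get((i, j), 0) | (0x80 >> k)
    return top, mid, low


def to_offsets(arr):
    """List the positions whose bits are set on the third level."""
    return sorted(i * 64 + j * 8 + k
                  for (i, j), byte in arr[2].items()
                  for k in range(8) if byte & (0x80 >> k))


def intersect(a, b):
    """AND two sparse arrays level by level, descending only where both have bits."""
    (top_a, mid_a, low_a), (top_b, mid_b, low_b) = a, b
    top = top_a & top_b
    mid, low = {}, {}
    for i in range(8):
        if not top & (0x80 >> i):
            continue  # no common members under this top-level bit
        m = mid_a[i] & mid_b[i]
        for j in range(8):
            if m & (0x80 >> j):
                byte = low_a[i, j] & low_b[i, j]
                if byte:
                    low[i, j] = byte
                else:
                    m &= ~(0x80 >> j)  # null third-level byte: clear the bit above
        if m:
            mid[i] = m
        else:
            top &= ~(0x80 >> i)  # null second-level byte: clear the top bit
    return top, mid, low
```

If the top-level AND is zero, the loop body never descends, which is the early exit described above.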
A similar procedure is followed to perform a logical OR operation to determine the union of two input arrays. The top-level bytes from the two arrays are first ORed to produce the top-level byte of the output array. Bytes on the second level are treated in one of two different ways. If the top-level bytes of both input arrays have a one in the same bit position, then the associated second-level byte of the output array is created by ORing the individual second-level bytes from each input array. If, however, only one of the top-level bytes has a one in a particular bit position, then the associated second-level byte in the output array is merely the associated second level byte from that input array. The same procedure is followed for the third level.
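A possible Python rendering of the OR procedure is given below, under the same illustrative assumptions as any in-memory sketch of FIG. 6: a sparse array is a tuple (top, mid, low) of the top-level byte, a dictionary of second-level bytes keyed by top bit position, and a dictionary of third-level bytes keyed by (top, second) position pair, with the most significant bit as position 0. The helper names are hypothetical.

```python
def from_offsets(offsets):
    """Build a (top, mid, low) sparse array from positions 0..511."""
    top, mid, low = 0, {}, {}
    for off in offsets:
        i, j, k = off // 64, (off // 8) % 8, off % 8
        top |= 0x80 >> i
        mid[i] = mid.get(i, 0) | (0x80 >> j)
        low[i, j] = low.get((i, j), 0) | (0x80 >> k)
    return top, mid, low


def to_offsets(arr):
    """List the positions whose bits are set on the third level."""
    return sorted(i * 64 + j * 8 + k
                  for (i, j), byte in arr[2].items()
                  for k in range(8) if byte & (0x80 >> k))


def or_level(xs, ys):
    """Combine the stored bytes of one level from two input arrays."""
    out = {}
    for key in xs.keys() | ys.keys():
        if key in xs and key in ys:
            out[key] = xs[key] | ys[key]  # present in both inputs: OR the bytes
        else:
            out[key] = xs.get(key, ys.get(key))  # present in one input: copy it
    return out


def union(a, b):
    """OR two sparse arrays level by level."""
    return a[0] | b[0], or_level(a[1], b[1]), or_level(a[2], b[2])
```

A byte is stored in a dictionary exactly when the bit above it is set, so testing key membership is equivalent to testing the corresponding upper-level bits.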
In the preferred embodiment, the relative difference between two sparse array bit maps is implemented as a logical AND-NOT function. In other words, given two input sparse arrays or sets A and B, the present embodiment determines the values in the set A AND NOT B. It should be noted that the complement of a sparse array is simply determined by taking the relative difference between a full sparse array and the array to be complemented.
Referring to FIG. 10, two simplified sparse arrays having only two levels with three bits per byte are shown to illustrate the procedure for determining the relative difference. The two sparse arrays are designated as A and B. To begin the operation, the A sparse array is copied into the area in which the result sparse array will appear. Starting with top level 72, if a bit is set (i.e., equal to 1 in this embodiment) in the first array, A, and the corresponding bit is not set in the second array, B, the result of A AND-NOT B is merely A, and the byte on the next level 74 corresponding to that bit position remains the same, since it is taken directly from the A sparse array. This is shown by the leftmost byte in the lower level of the result sparse array.
If a bit is not set (i.e., is equal to zero) in the first array, A, the result of A AND-NOT B is 0, and thus the corresponding bit in the result sparse array is reset (i.e., set to zero). This is not shown in FIG. 10.
If the corresponding bits of both A and B are set (i.e., equal to one), then the corresponding bytes on the next lower level must be compared. If a bit in the A byte is set and the corresponding bit in the corresponding B byte is 0, then the corresponding bit in the result sparse array remains set. Otherwise, the corresponding bit is reset (i.e., is set to zero). This is shown in the center and right hand bytes of lower level 74 in FIG. 10. If the result of this operation is a null byte, a zero must be propagated up to the next higher level in the result sparse array. This is shown in the rightmost bytes of lower level 74.
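The AND-NOT procedure, extended from the two-level illustration of FIG. 10 to the three levels of FIG. 6, might look like the Python sketch below. The (top, mid, low) tuple form, the most-significant-bit-first ordering and the helper names are assumptions made for illustration only.

```python
def from_offsets(offsets):
    """Build a (top, mid, low) sparse array from positions 0..511."""
    top, mid, low = 0, {}, {}
    for off in offsets:
        i, j, k = off // 64, (off // 8) % 8, off % 8
        top |= 0x80 >> i
        mid[i] = mid.get(i, 0) | (0x80 >> j)
        low[i, j] = low.get((i, j), 0) | (0x80 >> k)
    return top, mid, low


def to_offsets(arr):
    """List the positions whose bits are set on the third level."""
    return sorted(i * 64 + j * 8 + k
                  for (i, j), byte in arr[2].items()
                  for k in range(8) if byte & (0x80 >> k))


def and_not(a, b):
    """Return A AND-NOT B, starting from a copy of A as the text describes."""
    (top_a, mid_a, low_a), (top_b, mid_b, low_b) = a, b
    top, mid, low = top_a, dict(mid_a), dict(low_a)
    for i in range(8):
        ibit = 0x80 >> i
        if not (top_a & ibit and top_b & ibit):
            continue  # bit absent from A, or set only in A: the copy of A stands
        for j in range(8):
            jbit = 0x80 >> j
            if mid_a[i] & jbit and mid_b[i] & jbit:
                low[i, j] = low_a[i, j] & ~low_b[i, j]
                if not low[i, j]:
                    del low[i, j]
                    mid[i] &= ~jbit  # propagate the null byte up one level
        if not mid[i]:
            del mid[i]
            top &= ~ibit  # propagate the null byte up to the top level
    return top, mid, low
```

The complement mentioned above follows directly: `and_not(from_offsets(range(512)), x)` yields every position in the range that is not in `x`.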
The sparse array bit map technique described above may also be used to link records with other records stored in different data tables having different RI's. This is an efficient way to represent inter-record relationships. Referring back to FIG. 2, there is shown part of a hospital data base in which data about doctors is kept in one table and data pertaining to patients is kept in another table. Suppose, for example, a list of all patients for each doctor is to be added to the data base. This could be done by adding a patient field to the doctor data table in which the names of all patients for each doctor are stored. In the present invention, the patient data for each doctor may be efficiently kept by storing the RSNs for each patient in a sparse array bit map. The logical operations described above can also be used in implementing inter-record queries based on the inter-record connection fields described above.
FIG. 9 shows part of the doctor table from FIG. 2 which includes a field for associating each doctor with his or her patients. The patient field, having an FI of 54, contains the RSNs from the patient table (whose RI is 47) of all patients for each doctor. In the present invention, the numerical representation of each RSN is replaced by the range values and sparse arrays which represent the RSNs. In FIG. 9, for example, the first patient of Doctor Freud has an RSN, shown in column 60, which falls in the first range, having an RV equal to 0. Rather than storing the RSN directly, the present invention stores the range value and the sparse array representative of 165. The range value is shown in column 62, and column 64 shows the values within that range which the sparse array bit map represents. The second through fourth patients having RSNs 6410-6412 all fall within the same range. These patients are represented by a one-byte range value and a sparse array having three bytes. Thus, in this case, the present invention requires only four bytes to represent these three patients.
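The byte counts in this example can be checked with a short Python sketch. It assumes 512 records per range, the one-to-four-byte range value format described earlier, and a stored sparse array consisting of one top-level byte plus every non-zero second- and third-level byte; the function names are illustrative.

```python
def sparse_array_size(positions):
    """Bytes stored for one range: top byte plus non-zero lower-level bytes."""
    low = {pos // 8 for pos in positions}  # occupied third-level bytes
    mid = {idx // 8 for idx in low}        # occupied second-level bytes
    return 1 + len(mid) + len(low) if low else 0


def range_value_size(rv):
    """Bytes needed by the bit integer holding a range value."""
    for nbytes in range(1, 5):
        if rv < (1 << (8 * nbytes - 2)):
            return nbytes
    raise ValueError("range value needs more than 30 bits")


def storage_bytes(rsns, range_size=512):
    """Map each occupied range value to the bytes its entry would occupy."""
    groups = {}
    for rsn in rsns:
        groups.setdefault(rsn // range_size, []).append(rsn % range_size)
    return {rv: range_value_size(rv) + sparse_array_size(positions)
            for rv, positions in groups.items()}
```

RSNs 6410 through 6412 land at positions 266 through 268 of range 12, all inside the same third-level byte, which is why three sparse array bytes plus one range value byte suffice.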
This method has several advantages over storing the patient names themselves. The storage space required by the sparse array is less than would be required by the patient's name. By including a reference to the patient's RSN in the patient table, the data base may easily access the data stored for each patient. Using a sparse array bit map to provide inter-record relationships, a list may be easily compiled not only of all patients of a particular doctor, but also, for example, of all patients of a particular doctor living in a particular area. This method of associating a patient with a doctor also avoids the problem of ambiguity between several patients having the same name in the patient table.
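The combined query mentioned above (patients of a particular doctor living in a particular area) may be sketched as a logical AND of two sets of pointers. For brevity the sketch holds each range as a single flat integer bit map rather than the multi-level sparse array; this simplification, the range size of 512, the function names, and the RSN's in `lives_in_area` are assumptions for illustration only.

```python
RANGE_SIZE = 512  # assumed number of record serial numbers per range


def to_pointers(rsns):
    """Map each range value to an integer whose set bits mark RSN offsets."""
    pointers = {}
    for rsn in rsns:
        rv, off = divmod(rsn, RANGE_SIZE)
        pointers[rv] = pointers.get(rv, 0) | (1 << off)
    return pointers


def and_pointers(a, b):
    """Keep only the RSNs present in both pointer sets."""
    result = {}
    for rv in a.keys() & b.keys():  # ranges absent from either side drop out
        bits = a[rv] & b[rv]
        if bits:
            result[rv] = bits
    return result


def to_rsns(pointers):
    """Expand pointers back into a sorted list of record serial numbers."""
    return [rv * RANGE_SIZE + off
            for rv in sorted(pointers)
            for off in range(RANGE_SIZE)
            if (pointers[rv] >> off) & 1]


freud_patients = to_pointers([165, 6410, 6411, 6412])  # RSNs from the FIG. 9 example
lives_in_area = to_pointers([165, 300, 6411, 9000])    # hypothetical inverted list
print(to_rsns(and_pointers(freud_patients, lives_in_area)))  # [165, 6411]
```

Because ranges that appear in only one operand are skipped without examining any bits, the cost of the operation depends on the number of occupied ranges rather than on the total number of records.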

There has been described a new and useful method for data base storage and access. It should be appreciated that modifications and additions will be apparent to those of ordinary skill in the art in applying the teachings of the invention described herein to various applications. Accordingly, the invention should not be limited by the description herein of a preferred embodiment but, rather, the invention should be construed in accordance with the following claims.

Claims

THE EMBODIMENTS OF THE INVENTION IN WHICH AN EXCLUSIVE
PROPERTY OR PRIVILEGE IS CLAIMED ARE DEFINED AS FOLLOWS:

1. A method of storing data in a database comprising the steps of:
providing a plurality of data tables, each data table including a plurality of records;
identifying each data table by assigning a unique record index value to each data table;
assigning each record within a data table a record serial number unique to that record within the data table;
dividing the record serial numbers of each data table into ranges, each range including a predetermined number of record serial numbers, and each range being assigned a consecutive range value;
dividing the records in each data table into a plurality of fields wherein each field within a data table is identified by a field index value and wherein each field within a data table contains data values of a selected type;

providing a plurality of inverted list tables, each inverted list table being associated with a respective one of the data tables, including the steps of:
creating a plurality of keys, each key being associated with a particular field and representing the occurrence of a particular data value in that field;
providing one or more pointers associated with each key and representative of the record serial numbers of the records which contain the data value represented by the associated key, each pointer including a range value and a sparse array bit map representative of record serial numbers.

2. The method of claim 1 wherein the step of providing a sparse array bit map includes the steps of:
determining a plurality of bottom level bytes, each byte having an equal plurality of bottom level bits, wherein the number of bottom level bits in said bottom level bytes is equal to the number of record serial numbers in each range and wherein each bottom level bit is associated with a respective one of said record serial numbers in each range;
encoding the presence of each record serial number within a range by setting the bottom level bit which is associated with each such record serial number;
determining a plurality of upper level bytes, including a top level byte and a plurality of bytes on one or more intermediate levels such that the number of bits in the bytes in each level is equal to the number of bytes in the next lower level, each bit in each upper level byte being associated with a respective one of the bytes on the next lower level;
setting the bits in the upper level bytes whose associated byte in the next lower level contains at least one set bit; and storing the bottom level and upper level bytes which contain one or more set bytes.

3. The method of claim 2 wherein the step of determining a plurality of upper level bytes includes the steps of:
providing a top level byte having n bits; and providing one intermediate level having n bytes including n² bits, whereby n³ record serial numbers within a range are represented by the sparse array.

4. The method of claim 2 wherein the step of providing a plurality of upper level bytes includes the step of providing a top level byte having eight bits and one intermediate level having eight bytes of eight bits each.
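
By way of illustration only, the capacity recited in claims 3 and 4 can be worked through as follows. The symbols N (record serial numbers per range) and B (bytes stored for one non-empty range) are introduced here for convenience and do not appear in the claims; the byte counts assume that only bytes containing at least one set bit are stored.

```latex
N = n \cdot n \cdot n = n^{3}, \qquad n = 8 \;\Rightarrow\; N = 8^{3} = 512
B_{\max} = 1 + n + n^{2} = 1 + 8 + 64 = 73, \qquad B_{\min} = 1 + 1 + 1 = 3
```

The minimum applies when every record serial number present in the range falls within the same bottom level byte; the maximum applies when every bottom level byte contains at least one set bit.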

5. A method of storing data in a data base comprising the steps of:
providing a plurality of data tables, each data table including a plurality of records;

serial numbers within the range and a sparse array bit map associated with each range value and representative of which record serial numbers within the associated range occur in the inverted list table;
providing within a first data table from among said plurality of data tables a designated field representative of a relationship between each of the records in the first data table and selected records from a second data table; and storing said relationship in the data base by storing in the designated field data representative of the record serial numbers of said selected records from the second data table, said representative data including the range values of the record serial numbers of the selected records and a sparse array bit map associated with each range value and representative of the record serial numbers of the selected records.

6. A method of storing an inverted list of record serial numbers in a data base system including the steps of:
dividing the list of possible record serial numbers into ranges having a predetermined number of record serial numbers, each range being assigned a consecutive range value;
selecting the range values for each range which contains at least one record serial number which occurs in the inverted list; and encoding the position in each selected range of each record serial number in the inverted list by means of a sparse array.

7. A data base system comprising:
a plurality of data tables, each data table having a unique record index value to identify each data table;
each data table including a plurality of records;
each record within a data table being identified by a record serial number unique to that record within the data table;
the records in each data table including a plurality of fields wherein each field within a data table is identified by a field index value and wherein each field within a data table contains data values of a selected type;
means for dividing the record serial numbers of each data table into ranges, each range including a predetermined number of record serial numbers, and each range being assigned a consecutive range value;
a plurality of inverted list tables for providing a means of rapid access to selected data values, equal in number to the number of data tables, each inverted list table being associated with a respective one of the data tables, including a plurality of keys, each key being associated with a particular field and representing the occurrence of a particular data value in that field;
one or more pointers associated with each key and representative of the record serial numbers of the records which contain the data value represented by the associated key, each pointer including a range value and a sparse array bit map representative of record serial numbers.

8. The data base system of claim 7 wherein the data values are stored in the data tables in the form of a B-tree index having a plurality of levels and wherein the data values stored in the data base are stored within the B-tree as entries in the bottom level of the B-tree;
and wherein each data entry includes a key part and an associated data value, the key part including the record index value, the record serial number, and the field index value of the associated data value; and wherein each data value in the data base is stored immediately following the associated key, whereby each key provides a logical address of its associated data value.
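
By way of illustration only, the logical addressing recited in claim 8 may be sketched with a sorted list standing in for the bottom level of the B-tree. The tuple layout (record index value, record serial number, field index value) follows the order in which the claim lists the key parts; the sorted-list representation, the function names, and every value in `entries` other than the patient table's record index of 17 are hypothetical.

```python
import bisect

# Bottom level entries kept in key order, as a stand-in for B-tree leaves.
# Each entry is a key (RI, RSN, FI) followed by its data value.
entries = sorted([
    ((17, 165, 1), "Smith"),        # hypothetical name field
    ((17, 165, 2), "penicillin"),   # hypothetical allergy field
    ((17, 166, 1), "Jones"),
    ((23, 4, 1), "Freud"),          # hypothetical second table
])
keys = [key for key, _ in entries]


def fetch_field(ri, rsn, fi):
    """Return the data value stored with the exact key, or None if absent."""
    i = bisect.bisect_left(keys, (ri, rsn, fi))
    if i < len(keys) and keys[i] == (ri, rsn, fi):
        return entries[i][1]
    return None


def fetch_record(ri, rsn):
    """Return every field of one record; its entries are adjacent in key order."""
    lo = bisect.bisect_left(keys, (ri, rsn))
    hi = bisect.bisect_left(keys, (ri, rsn + 1))
    return {key[2]: value for key, value in entries[lo:hi]}


print(fetch_field(17, 165, 2))  # penicillin
print(fetch_record(17, 165))    # {1: 'Smith', 2: 'penicillin'}
```

Because the key itself identifies the table, record, and field, a lookup needs no knowledge of where the value is physically stored.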

9. The data base system of claim 7, wherein the sparse array bit map includes:
a plurality of bottom level bytes, each byte having an equal plurality of bottom level bits, wherein the number of bottom level bits in said bottom level bytes is equal to the number of record serial numbers in each range and wherein each bottom level bit is associated with a respective one of said record serial numbers in each range;
wherein the data base system includes means for encoding the presence of each record serial number within a range by setting the bottom level bit which is associated with each such record serial number;

a plurality of upper level bytes, including a top level byte and a plurality of bytes on one or more intermediate levels such that the number of bits in the bytes in each level is equal to the number of bytes in the next lower level, each bit in each upper level byte being associated with a respective one of the bytes on the next lower level;
and wherein the data base system further includes means for setting the bits in the upper level bytes whose associated byte in the next lower level contains at least one set bit; and means for storing the sparse array bit map bytes which contain one or more set bytes.

10. The data base system of claim 9 wherein the plurality of upper level bytes include:
a top level byte having n bits; and one intermediate level having n bytes including n² bits, whereby n³ record serial numbers within a range are represented by the sparse array.

11. The data base system of claim 9 wherein the plurality of upper level bytes include a top level byte having eight bits and one intermediate level having eight bytes of eight bits each.

12. A data base system comprising:
means for providing a plurality of data tables, each data table including a plurality of records;
means for identifying each data table by assigning a unique record index variable to each data table;
means for assigning each record within a data table a record serial number unique to that record within the data table;
means for dividing the record serial numbers of each data table into ranges, each range including a predetermined number of record serial numbers, and each range being assigned a consecutive range value;
means for dividing the records in each data table into a plurality of fields wherein each field within a data table is identified by a field index variable and wherein each field within a data table contains data values of a selected type;
means for providing a plurality of inverted list tables equal in number to the number of data tables, each inverted list table being associated with a respective one of the data tables, each inverted list table including:
a plurality of keys, each key being associated with a particular field and representing the occurrence of a particular data value in the associated field; and one or more pointers associated with each key and representative of the record serial numbers of the records which contain the data value represented by the associated key, each pointer including a range value representative of the occurrence in the inverted list table of one or more record serial numbers within the range and a sparse array bit map associated with each range value and representative of which record serial numbers within the associated range occur in the inverted list table;
means for providing within a first data table from among said plurality of data tables a designated field representative of a relationship between each of the records in the first data table and selected records from a second data table; and means for storing said relationship in the data base by storing in the designated field data representative of the record serial numbers of said selected records from the second data table, said representative data including the range values of the record serial numbers of the selected records and a sparse array bit map associated with each range value and representative of the record serial numbers of the selected records.

13. The data base system of claim 12, further comprising:
a B-tree index having a plurality of levels and wherein the data values stored in the data base system are stored within the B-tree as data entries in the bottom level of the B-tree;
wherein each data entry includes a key part and an associated data value, the key part including the record identifier value, the record serial number, and the field index value of the associated data value; and wherein each data value in the data base is stored immediately following the associated key, whereby each key provides a logical address of its associated data value.

14. A system of storing an inverted list representing record serial numbers in a data base system, comprising:
means for dividing the list of possible record serial numbers into ranges having a predetermined number of record serial numbers, each range being assigned a consecutive range value;
means for selecting the range values for each range which contains at least one record serial number which occurs in the inverted list; and means for encoding the position in each selected range of each record serial number in the inverted list by means of a sparse array.