WO1984002022A1

WO1984002022A1 - Dynamic data base representation

Info

Publication number: WO1984002022A1
Application number: PCT/US1983/001696
Authority: WO
Inventors: James Edward Weythman
Original assignee: Western Electric Co
Priority date: 1982-11-15
Filing date: 1983-11-02
Publication date: 1984-05-24
Also published as: EP0126123A1; CA1200015A

Abstract

A data base management system in which the internal record format is different from the external record format. Interface programs to translate between the two formats dynamically as records are needed are also disclosed. The internal formats are selected to correspond to data structures provided for in the programming language used while the external format is selected to minimize the difficulty of data base creation and maintenance.

Description

DYNAMIC DATA BASE REPRESENTATION

Technical Field

This invention relates to computerized data bases and, more particularly, to dynamically alterable representations of data base information for use in a computerized information retrieval system.

Background of the Invention

It has become customary to represent large volumes of information as records in a computer-accessible data base. Such data bases are typically stored on large volume storage facilities such as magnetic disk systems. In the prior art, each record in such a storage medium has been organized and formatted to take advantage of the accessing capability of the particular computer hardware and the particular computer software available for accessing and manipulating such information.

When the data base accessing programs must be changed to take advantage of, or to respond to, new circumstances, it is often necessary to reorganize the data base so as to reflect new data formats. Adding or deleting a data item, for example, may require a reorganization of the entire data base to reflect this change. If the data base is very large, the size of the data base itself may become the limiting factor in the evolution of the data base system. That is, at some size of the data base, it becomes uneconomical to reformat the data in the data base in response to changes in the outside environment. At this point in time, the data base ceases to evolve and becomes a static representation of the outside environment at an unchanging point in time. The usefulness of the data base therefore tends to decrease as that point in time becomes more remote in history.

Summary of the Invention

In accordance with the illustrative embodiment of the present invention, these and other deficiencies are overcome by separating the data representation technique used outside of the processing computer from the data representation used inside the processing computer. By separating these two data representations, the limiting effects of the data base size on the development of the information retrieval system are removed.

More particularly, a basic record definition is initially devised. From this generalized record definition, two different and distinct, but logically equivalent, data representations are derived. A first data representation, which can be called the external record, exists in the data base itself, but outside of the computer program utilizing that data base. The second data representation is called the internal record, since it exists within the computer program which manipulates the data base and thus shares the data structure and format which is native to the programming language of the computer program itself. Input and output transfer functions are provided to convert information between the internal and the external data format. Moreover, the input transfer function can be used to reconcile structural differences between the external records and any desired new internal format which new programming techniques make desirable. The internal and external record formats can therefore be independent of one another and each take advantage of the peculiar requirements for that environment.

For example, the external data format can be constructed so as to be meaningful in and of itself, independent of any processing program. Thus, the external data can be formatted as a simple hierarchy of alphanumeric character strings which define pairs of data items and data values in a form which can be read and understood by human beings. That is, each line of a record comprises a name of an information entity and a value for that entity. The lines are indented appropriately to reflect the hierarchical relationships between the data items. The internal data format, on the other hand, can be structured in such a fashion as to take advantage of all of the expressive power of the native programming language used to create the processing program.

With this arrangement, it can be seen that the external data base representation can be created and maintained as simple alphanumeric text using standard word-processing techniques. Additions, deletions, and modifications to data items can be made using these word-processing techniques. The internal data representation required to process the information in such a data base is created dynamically from the external data representation on request by the processing program. The computer program which accesses the external record transforms it into an internal record format which is most suitable for processing by the requesting program. It can be seen that different requesting programs, written in different native programming languages (NPLs), may well be provided with separate and distinct internal representations of the same data to take advantage of the peculiar capabilities of the native programming languages and the peculiar physical characteristics of the hardware system being used. Evolution to entirely new hardware systems is therefore possible without changing the data base whatsoever.

Brief Description of the Drawing

FIG. 1 is a general block diagram of a data base storage and retrieval system utilizing the dual data representation in accordance with the present invention;

FIG. 2 is a more specific block diagram of the off-line set-up used to implement the system of FIG. 1;

FIG. 3 is a more specific block diagram of the run time process used to implement the system of FIG. 1;

FIG. 4 is a flow chart of the GETREC routine used in FIG. 3; and

FIG. 5 is a flow chart of the PUTREC routine used in FIG. 3.

Detailed Description

Referring more particularly to FIG. 1, there is shown a generalized block diagram of a data base system illustrating the use of different data representations inside and outside of the computer. This system can, for convenience, be termed a Generalized Record Interfacing Technique (GRIT). In particular, data base 10 includes a large plurality of records, each of which is represented in an alphanumeric form directly accessible and editable with a standard word processing system 11. That is, all characters, alphabetic, numeric and line and space controls, are in a standard code such as ASCII which can be displayed or printed directly, without processing, on the word processing terminal 11. The word processing terminal 11 may therefore be used to add new records to data base 10, to modify existing records in data base 10 or to remove records from data base 10.

The format (i.e., written arrangement) of each of the records in data base 10 is described in a canonical form in record definitions 12. In general, these formats can be entirely arbitrary, but preferably are arranged to render the data and data relationships obvious to the casual user on inspection. In the record definitions 12, for example, a label for each data field may be supplied, along with the maximum length, in characters, for the data to be entered into that field. These record definitions 12, in machine-readable form, are supplied to a record compiler 13. Compiler 13 is a computer program (called a General Record Interfacing Processor, GRIP) which takes the record definitions 12 and converts each of them into a set of structure definitions 14 in the native programming language (NPL) and an initialized NPL data structure 15, called a record descriptor.

GRIT record definitions are processed by the GRIT compiler 13. For each record, two files 14 and 15 are produced, one (file 14) containing the definition of a data structure, and the other (file 15) containing a record descriptor.

The data structure definition 14 represents the GRIT record definition in the native programming language or NPL (e.g., the C programming language). This definition will be used by the application programmer to create various instances of the data structure as needed for the application.

The record descriptor 15 is a symbol table with sufficient information to relate names in the form of ASCII strings to particular offsets and limits in a given data structure. This descriptor is used by the basic on-line I/O functions getrec() and putrec(), to be described, in mapping between the internal program format and the external file GRIT format. The record descriptor is in the form of an initialized data structure in the NPL. A common data structure type is used for all record descriptors, each record having its own initialized instance.
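The symbol-table idea can be illustrated with a minimal C sketch. The patent text does not reproduce the generated descriptor, so everything named here is an assumption made for illustration: the types `struct info`, `struct addr` and `struct field_entry`, the dotted path names such as "addr.loc", the extra byte reserved per field for a terminator, and the helpers `find_field` and `set_field`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed internal layout for the "info" record: each field holds its
   maximum length in characters plus one terminating '\0'. */
struct addr { char loc[3]; char rm[7]; char ext[5]; };
struct info { char name[41]; struct addr addr; };

/* One symbol-table entry: an ASCII name tied to an offset and a limit. */
struct field_entry {
    const char *name;   /* field name as an ASCII string (hypothetical dotted path) */
    size_t      offset; /* byte offset of the field inside struct info */
    size_t      limit;  /* maximum number of characters in the field */
};

#define ADDR_FIELD(m) (offsetof(struct info, addr) + offsetof(struct addr, m))

/* Initialized instance of the descriptor for record type "info". */
static const struct field_entry info_fields[] = {
    { "name",     offsetof(struct info, name), 40 },
    { "addr.loc", ADDR_FIELD(loc),             2  },
    { "addr.rm",  ADDR_FIELD(rm),              6  },
    { "addr.ext", ADDR_FIELD(ext),             4  },
};

/* Look a field up by name; returns NULL when the name is unknown. */
const struct field_entry *find_field(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof info_fields / sizeof info_fields[0]; i++)
        if (strcmp(info_fields[i].name, name) == 0)
            return &info_fields[i];
    return NULL;
}

/* Store a value by name, using only the offset and limit from the table. */
int set_field(struct info *rec, const char *name, const char *value)
{
    const struct field_entry *f = find_field(name);
    if (f == NULL)
        return -1;                    /* no such field in this record type */
    char *dst = (char *)rec + f->offset;
    memset(dst, 0, f->limit + 1);     /* null-fill the whole field */
    strncpy(dst, value, f->limit);    /* copy at most `limit` characters */
    return 0;
}
```

Because the generic routine touches the record only through offsets and limits, the same code could serve any record type for which such a table has been generated.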

GRIT record definitions 12 are written in a record definition language. This is a simple notation in the form of an indented outline. For instance, the example

info
	name [40]
	addr
		loc [2]
		rm [6]
		ext [4]

defines a GRIT record called "info" which contains two fields: "name", a field up to 40 characters long, and "addr" (short for "address"), a field which itself consists of three fields, "loc", "rm", and "ext", whose maximum lengths, in characters, are 2, 6, and 4, respectively.

There are no intrinsic limits to the length or depth (indentation) of a GRIT record definition. A formal description of the record definition language is given below in a Backus-Naur Form (BNF) grammar.

record-definition: field-definition

field-definition: user-name object-definition

object-definition: [ integer ]
object-definition: → definition-list ←

definition-list:
definition-list: definition-list field-definition
definition-list: definition-list array-definition

array-definition: user-name [ integer ] object-definition

User-name is supplied by the user and must conform to the name syntax of the NPL. The arrows, → and ←, denote the increase and decrease, respectively, of indentation by one tab character. Thus, the phrase → definition-list ← denotes an indented list of definitions. Integer is an integer number which, in a field-definition, denotes the maximum number of characters in the field and, in an array-definition, denotes the number of objects in the array.

The grammar may be interpreted as follows: A record-definition is a field-definition. A field-definition is a user-name followed by an object-definition. An object-definition is either an integer within brackets or an indented definition-list. A definition-list is zero or more field-definitions or array-definitions. An array-definition is a user-name, followed by a bracketed integer, followed by an object-definition.

Observe that the grammar is recursive. For example, a field-definition contains an object-definition, which contains a definition-list, which contains a field-definition, which is where we started. This aspect of the definition gives rise to the general tree-like nature of GRIT record definitions where each field may, in turn, be composed of other fields.

The GRIP compiler 13 uses a parser, which implements the above BNF grammar to construct an internal tree-like representation of the GRIT record definition. This tree is then used by two subroutines; one produces a file 14 containing the NPL structure definition, and the other produces the file 15 containing the record descriptor, an initialized NPL structure. The NPL compiler 16 therefore compiles the record descriptors 15 into object code format 17 suitable for directly loading into a computer memory.
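As a rough illustration of how a recursive walk over the outline could yield an NPL structure definition, the C sketch below parses tab-indented definition lines and emits a one-line C struct declaration. It is not the GRIP compiler: the names `def_line`, `parse_def_line`, `emit_field` and `compile_outline` are hypothetical, array-definitions are not handled, a space is assumed between a name and its bracketed length, and reserving one extra character per field for a terminator is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN  32
#define MAX_LINES 64

/* One parsed outline line: depth, field name, and length (0 for a group). */
struct def_line {
    int  depth;
    char name[NAME_LEN];
    int  length;
};

/* Parse e.g. "\t\tloc [2]" into depth 2, name "loc", length 2. */
static int parse_def_line(const char *text, struct def_line *out)
{
    out->depth = 0;
    while (*text == '\t') { out->depth++; text++; }
    out->length = 0;   /* stays 0 when no bracketed integer follows the name */
    return sscanf(text, "%31s [%d]", out->name, &out->length) >= 1 ? 0 : -1;
}

/* Emit the field at lines[*pos], recursing into more deeply indented lines. */
static void emit_field(const struct def_line *lines, int count, int *pos,
                       char *out, size_t cap)
{
    const struct def_line *f = &lines[(*pos)++];
    size_t used = strlen(out);
    if (f->length > 0) {               /* elementary field: [ integer ] */
        snprintf(out + used, cap - used, "char %s[%d]; ", f->name, f->length + 1);
        return;
    }
    snprintf(out + used, cap - used, "struct { ");   /* indented definition-list */
    while (*pos < count && lines[*pos].depth > f->depth)
        emit_field(lines, count, pos, out, cap);
    used = strlen(out);
    snprintf(out + used, cap - used, "} %s; ", f->name);
}

/* Turn an outline (one string per line) into C declaration text in `out`. */
int compile_outline(const char *const *text, int count, char *out, size_t cap)
{
    struct def_line lines[MAX_LINES];
    int pos = 0;
    assert(out != NULL && cap > 0);
    if (count < 1 || count > MAX_LINES)
        return -1;
    for (int i = 0; i < count; i++)
        if (parse_def_line(text[i], &lines[i]) != 0)
            return -1;
    out[0] = '\0';
    emit_field(lines, count, &pos, out, cap);
    return 0;
}
```

The recursion in `emit_field` mirrors the grammar: a field is either elementary or a group whose body is a list of further fields.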

Application programs 18, written in the same source language, are compiled in the NPL compiler 19, together with the NPL structure definitions 14. Compilers 16 and 19 may be the same compiler program used at different times for the two compilations (record descriptors 15 and application programs 18). Once compiled into object code, the record descriptor(s) 17 and the object application programs from compiler 19 are passed to loader program 20, which loads these modules into the internal memory 21 of a general purpose digital computer.

The memory 21 is illustrated in FIG. 1 as a memory map divided into four sections 22, 23, 24 and 25. The user application programs, in object code, are loaded into section 22. The record descriptor(s) are loaded into memory 21 in section 23. Two other programs are always resident in memory 21: a "GETREC" program in section 24 and a "PUTREC" program in section 25. GETREC program 24 is an input transfer function which loads an externally formatted data base record from data base 10 into internally formatted program memory 21. PUTREC program 25 is an output transfer function which copies an internally formatted record in program memory 21 to the data base 10 in external format. GETREC and PUTREC are generalized programs which use the information in a designated record descriptor 23 to direct the internal/external format transformations. There is, of course, a unique record descriptor 23 for each distinct record definition 12.

Data base 10, of course, includes the circuitry and mechanisms necessary to convert a record address into the physical motions or electrical signals actually required to access the record. In a magnetic disc storage system, for example, the address must be converted into a disc number, a sector number and a track number to permit the read/record head to move to the proper track in the appropriate sector of the identified disc in the disc pack.

Both the GETREC and the PUTREC programs require three pieces of information (parameters) for each access of data base 10. They require the address in memory 21 at which the record is stored (PUTREC) or where the record is to be stored (GETREC). Each of programs 24 and 25 also requires the address of the record descriptor 23 to direct the transformation between internal and external formats. Finally, programs 24 and 25 require the address in data base 10 to which or from which a record is to be moved.

It will be noted that all elements and procedures above dotted line 26 are prepared off-line, i.e., when the computer is not actually engaged in processing data from data base 10 and, indeed, may be carried out on a totally different computer. Moreover, the procedures above dotted line 26 need be performed only once for each new set of application programs. Once these programs are loaded in object form into memory 21, these programs may be executed innumerable times by simple user requests without recompiling the object code. Thus, the procedures identified below dotted line 26 are considered to be "on-line" in that they can be invoked and utilized time after time and dynamically in response to the particular needs of the application for which application programs 18 were written.

It will be noted that other application programs similar to programs 18, and other record definitions similar to definition 12, may be compiled together at a different time and by a different person to run in the same computer 21, and use the same data base 10. Indeed, these programs can be written in a totally different programming language utilizing totally different data structures. In this case, compilers 16 and 19 would comprise compilers for the language used and the record compiler 13 would compile the record definitions 12 into appropriate record descriptors 15 for the language used. The GETREC program 24 and the PUTREC program 25 have the ability, as will be described, to expand or contract data fields by truncation and concatenation to accommodate the different record definitions.

It can be seen that, in accordance with one feature of the present invention, the format of data records in data base 10 has been separated from and is totally independent of the representation of the record in computer memory 21. Record interfacing programs 24 and 25 serve as translating mechanisms between the external record representation in data base 10 and the internal record representation in memory 21. Records are converted on the fly, as needed. The record descriptor(s) 23 contain all of the information necessary to make these translations.

The records in data base 10 can therefore be formatted in such a fashion as to simplify addition, deletion and modification, perhaps by a simple and standard word processor 11. New fields can be added to records and new types of records added as straight text. All that is required is that a new record definition 12, corresponding to the new or modified records, be written and the new record descriptors 15 and structure definitions 14 be recompiled for loading into memory 21. No structural changes are required in data base 10. It is therefore possible to program new applications for the data base, requiring new data items, without having to reconstruct the entire data base 10. Data base applications can, therefore, grow gracefully without the enormous burden of rewriting the entire data base every time a field or record changes.

FIG. 2 is a more detailed diagram of one example of the off-line set up taking place in the upper portion of FIG. 1. A specific, but simplified, data structure is defined and a specific, but simple, user operation is specified. The high level programming language used in the example of FIG. 2 is the "C" language, described in detail in "The C Programming Language" by B. W. Kernighan and D. M. Ritchie, Prentice-Hall, 1978.

More specifically, the record definition 12 is shown in FIG. 2 as a simple six line list. The first line, starting at the left margin, contains the record identifier, the name "info" in FIG. 2. The second line, indented one level from the margin, is the name or label of the first field of the record and, in square brackets, the maximum number of characters permitted for that field. In FIG. 2, the first field is given the label "name" and is a maximum of forty characters in length.

The third line of record definition 12 has the label "addr," standing for an address. Rather than having a field length, the field name "addr" actually identifies a plurality of subfields, shown in FIG. 2 as being double indented from the margin. Thus, the field "addr" consists of three subfields named "loc," "rm" and "ext," having maximum character length of two, six and four characters, respectively. The label "loc" stands for "location" which is a two-character indication of a geographical location (e.g., MH stands for Murray Hill, New Jersey). The label "rm" stands for "room number" and, thus, contains the room number at the location for the person identified by "name". The label "ext," of course, is the telephone extension number of the identified person.

It will be noted that the record definition 12 contains information concerning a large class of data base records (e.g., all of the employees of a particular company). Moreover, the definition format of record definition 12 in FIG. 2 allows all fields of the record to be specified by a name or a label and indicates the maximum storage space which will be required for each field in the record. Finally, record definition 12 permits the representation of hierarchical relationships among data fields by the level of indentation. The number of fields and the level of indentation is unlimited and hence the definition format of FIG. 2 is suitable for almost all data base records.

Using the record definition 12, the GRIP program 13 generates a record descriptor 14 and a C language data structure definition 15. The record descriptor 14 merely repeats the information in record definition 12 in a format suitable for compilation by C compiler 19. Record descriptor 14 contains the field names as literals along with minimum and maximum field delimiters. The record descriptor 14 will be used during run time to identify the fields by name and hierarchical position, as well as the field size limits.

Data structure 15 is a definition of the general data structure in the particular programming language to be used by the application programmer. In the illustrative example, this is the C language and the conventions followed in data structure 15 are the conventions of the C language. Other programming languages could, of course, have been chosen, in which case data structure 15 would have followed the conventions of that language (e.g., FORTRAN, COBOL, BASIC, etc.).

Block 18 represents a particular illustrative user application program, also written in the C language. The function of user program 18 is simply to print the names and the extension numbers of the individuals represented by the data records defined in box 12. Other, more complicated, procedures are possible, including adding, deleting or modifying data records. In box 18, the first line is a non-executable comment, "/* . . . */" being the comment delimiters in the C language. This comment identifies the procedure as one for printing the name and extension of all persons having a record in the data base, i.e., producing a telephone directory listing.

The second line in box 18 indicates that this program must be linked to the data structure definition 15 ("info.h"). The data structure definitions 15 must be available to program 18 in order properly to interpret, store and return the data records recovered from the data base. Line 3 reserves a memory area, called "buf", in which the "info" records are to be temporarily stored during processing.

The fourth line of the program 18 is the start of the program proper and, together with the balance of the lines, comprises the entire program. The fifth line establishes a loop in which the procedure "getrec" is called for each record in the data base file. The three parameters in parentheses after "getrec" are 1) the address of the temporary storage for the record (buf), 2) the address of the record descriptor for record type "info", and 3) the name of the file from which the record is to be returned (stdin). The print command in the next line includes the formatting information for printing each line, and the last two lines specify the locations in the buffer where the data to be printed is stored.

The user source code 18 and the record descriptor 14 are compiled together by compiler program 19 and stored as an object code executable program 22 in a storage file called "a.out". When the program is to be executed, the a.out user program 20 is loaded into the internal memory of a general purpose digital computer, and control of the computer is transferred to the location "a.out".

It should be noted that all of the procedures described in connection with FIG. 2 can be done "off-line". That is, these procedures can be carried out long before the actual telephone directory listing is needed. The object code 22 can then be executed at the time the telephone directory is needed. Indeed, since the information in the data base continually changes due to moves, new lines, retirements and promotions, the object code 22 can be executed many times, possibly on a periodic basis.

In FIG. 3, there is shown the nature of the process which takes place each time the object code 22 of FIG. 2 is executed. The record descriptor 30 corresponds exactly to record descriptor 14 in FIG. 2, but is now in object code format rather than source code. That is, the information content of descriptor 14 is represented in the internal binary form suitable for direct retrieval by the computer.

The computer program 24, called "getrec", also resides in the computer in object form. The data base 10 includes at least one record of type "info" and resides on an external storage medium such as a magnetic disc. The storage space 31 is a portion of the internal memory of the computer set aside as a temporary storage location for records from data base 10. Storage space 31 is identified by the label "buf". The print name and extension procedure 18 also resides, in object code, in the memory of the computer. From time to time, procedure 18 calls upon (transfers control to) the "getrec" program 24. Program 24 reads the external record "info" from data base 10 and transforms it into an internal record in "buf" storage space 31 in accordance with the format information in record descriptor 30. As shown in storage space 31, the values (contents) of the data fields from the file "info" in data base 10 are stored in buffer storage 31 as a plurality of character codes. The unused but reserved storage space in buffer 31 is filled with "end-of-field" characters (\0s) and each field is terminated with an end-of-field character. The end-of-field character may be any character or character string not found in the values of any of the fields.

Printing routine 18, using the record descriptor 30 to locate desired values, takes the desired field values (name and ext.) from buffer 31 and passes them to an output printing medium 32. Medium 32 can, for example, be a standard printer from which printed pages of the telephone directory are taken.

It will be noted that the system of FIGS. 2 and 3 separates the internal and the external format of the data base records. The external record, in data base 10, is in a format and uses storage conventions readily creatable and editable by standard word processing systems. The internal record in buffer memory 31, on the other hand, is in the form best suited for processing by the computer, and in the form prescribed by the programming language chosen. The programs "grip" 13 (FIG. 1), and "getrec" 24 and "putrec" 25 (FIG. 1) permit the programming language, the internal record format and the external record format to be chosen independently in accordance with available skills, abilities and resources. The data base can grow gracefully without rewriting large portions of the processing software, and the processing software can be changed radically without reformatting the data base.

In FIG. 4 there is shown a flow diagram of the programmed procedure "getrec". In response to a request 40 to execute the getrec program, the getrec program first clears buffer 31 (FIG. 3) in box 41. In box 42, the symbolic name "info" for the data record to be retrieved is translated into the physical address of that record in data base 10.

In box 43, the external record in data base 10 is retrieved using the physical address obtained in box 42. This information is processed before final storage in buffer 31. In box 44, for example, if a field of data in the external record does not have a corresponding name in record descriptor 30 (FIG. 3), that field is discarded since no mechanism is available for storing it in or retrieving it from the internal buffer 31.

Correspondingly, in box 45, if a record field is specified in descriptor 30, but has no corresponding field in the external record, the space reserved for this field value in buffer 31 is filled with nulls (blanks). Attempts to print this value will result in printing blanks.

If the field value in the external record is longer than that specified by descriptor 30, that field value is truncated to the length specified in descriptor 30 in box 46. (As previously noted, shorter field values are filled with end-of-field delimiters.) In box 47, if the record descriptor 30 specifies a single-valued field, and the external record includes a multiple-valued field (plural indented subfields), the plural field values are simply concatenated to the permissible length of the single-valued field descriptor. On the other hand, if the descriptor 30 specifies a multiple-valued field and the external record has only a single-valued field, box 48 repeats the same single field value in each of the multiple subfields.

Finally, in box 49, the resulting internal record is stored in buffer register 31. At this time, the execution of the getrec procedure is completed and, in box 50, control is passed back to the place in the calling program where the getrec program was originally invoked. It will be noted that the steps 44 through 48 of the flow diagram of FIG. 4 permit the use of newer versions of user programs with older versions of the data base and vice versa. Thus, only the newer user programs, requiring additional record fields, need be recompiled with new record descriptors. The older user programs can continue to be executed and unused information simply discarded.
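The reconciliation steps of boxes 46 through 48 can be sketched in C as three small helpers. This is an illustration under stated assumptions, not code from the patent: the function names `fit_field`, `concat_subfields` and `repeat_value` are hypothetical, '\0' is taken as the end-of-field character, and each destination is assumed to reserve `limit + 1` bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Box 46 (sketch): copy a value into a field of `limit` characters,
   truncating long values and null-filling short ones. */
void fit_field(char *dst, size_t limit, const char *value)
{
    assert(dst != NULL && value != NULL);
    memset(dst, 0, limit + 1);      /* null-fill, including the terminator */
    strncpy(dst, value, limit);     /* copy at most `limit` characters */
}

/* Box 47 (sketch): the descriptor wants one field but the external record
   supplies several subfield values; concatenate them up to the limit. */
void concat_subfields(char *dst, size_t limit,
                      const char *const *values, size_t n)
{
    size_t used = 0;
    memset(dst, 0, limit + 1);
    for (size_t i = 0; i < n && used < limit; i++) {
        size_t len = strlen(values[i]);
        if (len > limit - used)
            len = limit - used;     /* stop at the permissible length */
        memcpy(dst + used, values[i], len);
        used += len;
    }
}

/* Box 48 (sketch): the descriptor wants several subfields but the external
   record supplies one value; repeat it in every subfield. */
void repeat_value(char *const *subfields, const size_t *limits, size_t n,
                  const char *value)
{
    for (size_t i = 0; i < n; i++)
        fit_field(subfields[i], limits[i], value);
}
```

No separator is inserted when concatenating, following the statement that the values are "simply concatenated"; a real system might choose otherwise.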

The reconciliation steps 44 through 48 also allow graceful growth of the data base system without recoding existing programs each time the data base records are augmented. A pseudocode implementation of the "getrec" program is shown in Table I.

Table I

getrec ( buffer, rec_desc, file )
	clear buffer
	call getfield()
	return

getfield()
	get name from file
	look up name in rec_desc
	use rec_desc entry to set receiving offset and limit in buffer
	if field is elementary
		copy value from file into buffer observing limit
	else
		for each subfield
			call getfield()
	return
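Table I can be fleshed out as a minimal C sketch. Several details are assumptions made only for this illustration: the descriptor layout `struct rec_desc` (with the root node's `limit` holding the record size), one tab per indentation level, whitespace between a field name and its value, no blank lines in the file, and a seekable file so that the look-ahead line can be un-read with `fseek`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct addr { char loc[3]; char rm[7]; char ext[5]; };
struct info { char name[41]; struct addr addr; };

/* Hypothetical descriptor node; for the root, limit is the record size. */
struct rec_desc {
    const char *name;
    size_t offset, limit;
    const struct rec_desc *sub;   /* subfields of a group, else NULL */
    size_t nsub;
};

#define ADDR_FIELD(m) (offsetof(struct info, addr) + offsetof(struct addr, m))
static const struct rec_desc addr_sub[] = {
    { "loc", ADDR_FIELD(loc), 2, NULL, 0 },
    { "rm",  ADDR_FIELD(rm),  6, NULL, 0 },
    { "ext", ADDR_FIELD(ext), 4, NULL, 0 },
};
static const struct rec_desc info_sub[] = {
    { "name", offsetof(struct info, name), 40, NULL, 0 },
    { "addr", offsetof(struct info, addr), 0, addr_sub, 3 },
};
static const struct rec_desc info_desc =
    { "info", 0, sizeof(struct info), info_sub, 2 };

/* One line of look-ahead, with its indentation depth and file position. */
struct reader { FILE *fp; long pos; char line[128]; int depth; int have; };

static int next_line(struct reader *r)
{
    r->pos = ftell(r->fp);
    r->have = fgets(r->line, sizeof r->line, r->fp) != NULL;
    r->depth = 0;
    while (r->have && r->line[r->depth] == '\t')
        r->depth++;
    return r->have;
}

/* Read every field at `depth`, recursing into groups (Table I, getfield). */
static void getfield(struct reader *r, const struct rec_desc *d, size_t n,
                     char *buf, int depth)
{
    while (r->have && r->depth == depth) {
        char name[32] = "", value[96] = "";
        const struct rec_desc *hit = NULL;
        sscanf(r->line + depth, "%31s %95[^\n]", name, value);
        for (size_t i = 0; i < n; i++)          /* look up name in rec_desc */
            if (strcmp(d[i].name, name) == 0)
                hit = &d[i];
        next_line(r);
        if (hit != NULL && hit->nsub > 0)
            getfield(r, hit->sub, hit->nsub, buf, depth + 1);
        else if (hit != NULL)                   /* elementary: observe limit */
            strncpy(buf + hit->offset, value, hit->limit);
        while (r->have && r->depth > depth)     /* discard unknown subfields */
            next_line(r);
    }
}

/* Returns 1 when a record was read into buf, 0 at end of file. */
int getrec(void *buf, const struct rec_desc *desc, FILE *fp)
{
    struct reader r = { .fp = fp };
    assert(buf != NULL && desc != NULL && fp != NULL);
    memset(buf, 0, desc->limit);                /* clear buffer */
    if (!next_line(&r))
        return 0;
    next_line(&r);                              /* skip the record-name line */
    getfield(&r, desc->sub, desc->nsub, (char *)buf, 1);
    if (r.have)
        fseek(fp, r.pos, SEEK_SET);             /* un-read next record's line */
    return 1;
}
```

Because the buffer is cleared first, fields absent from the external record stay null, long values are cut at the limit, and names missing from the descriptor are skipped, which corresponds to boxes 44 through 46. The concatenation and repetition of boxes 47 and 48 are not covered by this sketch.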

In FIG. 5 there is shown a flow diagram for the computer program "putrec" which returns data records to the data base after processing. In FIG. 3, putrec was not invoked because no changes were made in the data and, hence, the version already stored in the data base could remain. In this sense, the retrieval of records from data base 10 is nondestructive.

A request 50 to put a record into the data base has the same three parameters as the getrec procedure: the internal buffer location ("buf"), the external record name ("info"), and the name of the file in which the record is stored ("stdin"). In box 51, any null fields in the buffer 31 are discarded, since no effort need be expended to store a null field. In box 52, the values in buffer 31 and the information in record descriptor 30 are used to reformat the internal record in a form suitable for storage in data base 10, adding the field labels from the literals in the record descriptor. In box 53, the symbolic address of the data base record is translated to a physical address. In box 54, the external record is loaded into data base 10 at the physical address obtained in box 53. In box 55, the procedure "putrec" is completed, and control is returned to the calling program.

A pseudocode implementation of the putrec program is shown in Table II.

Table II

putrec ( buffer, rec_desc, file )
	call putfield ( rec_desc )
	return

putfield ( rec_desc )
	if rec_desc is a group node
		for each sub_rec_desc
			call putfield ( sub_rec_desc )
	else if field is non-null (in the buffer)
		write field name[s] to file as necessary
		copy value from program buffer to file
	return
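Table II can likewise be sketched in C. The descriptor layout `struct rec_desc`, the tab-per-level indentation and the tab between a field name and its value are assumptions made for this illustration only. Unlike Table II, which writes group names "as necessary", this simplified version writes every group name even when all of its subfields are null.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct addr { char loc[3]; char rm[7]; char ext[5]; };
struct info { char name[41]; struct addr addr; };

/* Hypothetical descriptor node: a group has nsub > 0, a leaf has a limit. */
struct rec_desc {
    const char *name;
    size_t offset, limit;
    const struct rec_desc *sub;
    size_t nsub;
};

#define ADDR_FIELD(m) (offsetof(struct info, addr) + offsetof(struct addr, m))
static const struct rec_desc addr_sub[] = {
    { "loc", ADDR_FIELD(loc), 2, NULL, 0 },
    { "rm",  ADDR_FIELD(rm),  6, NULL, 0 },
    { "ext", ADDR_FIELD(ext), 4, NULL, 0 },
};
static const struct rec_desc info_sub[] = {
    { "name", offsetof(struct info, name), 40, NULL, 0 },
    { "addr", offsetof(struct info, addr), 0, addr_sub, 3 },
};
static const struct rec_desc info_desc =
    { "info", 0, sizeof(struct info), info_sub, 2 };

/* Write one tab per level of hierarchy. */
static void indent(FILE *fp, int depth)
{
    while (depth-- > 0)
        fputc('\t', fp);
}

/* Recursive walk of the descriptor tree (Table II, putfield). */
static void putfield(const char *buf, const struct rec_desc *d, int depth,
                     FILE *fp)
{
    if (d->nsub > 0) {                       /* group node */
        indent(fp, depth);
        fprintf(fp, "%s\n", d->name);
        for (size_t i = 0; i < d->nsub; i++)
            putfield(buf, &d->sub[i], depth + 1, fp);
    } else if (buf[d->offset] != '\0') {     /* non-null elementary field */
        indent(fp, depth);
        fprintf(fp, "%s\t%.*s\n", d->name, (int)d->limit, buf + d->offset);
    }
}

/* Returns 0 on success, -1 if the stream reports a write error. */
int putrec(const void *buf, const struct rec_desc *desc, FILE *fp)
{
    assert(buf != NULL && desc != NULL && fp != NULL);
    putfield((const char *)buf, desc, 0, fp);
    return ferror(fp) ? -1 : 0;
}
```

The field labels come from the literals in the descriptor rather than from the buffer, so the internal record needs to carry only values.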

Claims

1. A data base system comprising means for representing data base records in a first external format in a data base storage medium, means for representing each of said data base records in a second different format in the internal storage medium of a digital computer, and means for translating one of said data base records between said first and second formats only when said one record is being processed.

2. The data base system according to claim 1 wherein said first format includes only word processing alphanumeric, symbolic and control characters.

3. The data base system according to claim 1 wherein said second format corresponds to a data structure of a programming language to be used in said data base system.

4. The data base system according to claim 1 wherein said translating means further comprises means for contracting or expanding data field lengths by truncation or null filling, respectively.

5. The data base system according to claim 1 wherein said translating means further comprises means for repeating or concatenating data field values when said internal and external formats prescribe different numbers of subfields.

6. The data base system according to claim 1 further comprising means for utilizing said data base records.

7. The method of utilizing data base records comprising the steps of storing data base records on a data base storage medium in a format suitable for use by word processing equipment, storing data base records in the internal memory of a digital computer in a format corresponding to a data structure of a programming language, and translating one record at a time between said data base storage medium format and said internal memory format only in connection with the processing of said one record by said digital computer.

8. The method of claim 7 wherein said step of translating further comprises the steps of contracting data field lengths in one of said formats to match the corresponding but shorter data field lengths in the other of said formats, and expanding data field lengths in one of said formats to match a corresponding but longer data field length in the other of said formats.

9. The method of claim 7 wherein said step of translating further comprises the steps of repeating data field values occurring singularly in one of said formats to match multiple data field value positions in the other of said formats, and concatenating multiple data field values in one of said formats to match a single field position in the other of said formats.

10. The method of claim 7 further comprising the step of adding, deleting or modifying data field values in said data base records.