WO1991016682A1 - A method of structuring or storing data within a file

Info

Publication number: WO1991016682A1
Application number: PCT/GB1991/000666
Authority: WO
Inventors: Sydney Reading Hall
Original assignee: International Union Of Crystallography
Priority date: 1990-04-26
Filing date: 1991-04-26
Publication date: 1991-10-31
Also published as: EP0526516A1; GB2243467B; AU7763591A; JPH05509183A; GB2243467A; GB9009447D0

Abstract

A method of structuring or storing data within a file has the following steps: (i) arranging the file into a plurality of data blocks each preceded by a respective data block code; and (ii) arranging the data within each block into a plurality of data items each preceded by a respective data name; wherein the data block codes are taken from a first predetermined set, may occur in any order, and have a first common feature, and wherein the data names are taken from a second predetermined set, may occur in any order, and have a second common feature, the first and second common features being readily distinguishable. The file is visually readable as text in addition to being machine readable.

Description

A method of structuring or storing data within a file.

The present invention relates to handling data and more particularly to a method of structuring or storing text data within a file and to a file containing such data.

Many existing procedures for computer archiving use a 'fixed format' file in which the data structure is determined by specific data requirements. A fixed format file is simple and fast to access but the data structure cannot be modified without reformatting existing files.

Other archival files are based on 'pre-defined free formats'. This approach does not restrict data to specific positions in the file. Data 'keys' are often used to aid in data recognition and this permits fewer restrictions on the ordering of data lines and items. This is an important advantage over the fixed format files. Access to free format files currently in use still requires some advance knowledge of the expected data types and the data structure. The addition of any new data types or structures also requires that processing software be modified. This means that existing data processing software must be altered to provide common access to files which pre- and post-date the file changes. The term 'free format' is therefore misleading because it really refers to an improved flexibility within a relatively restricted data structure.

The inflexibility of the two traditional archival approaches described above restricts the exchange of data, even within the same discipline, especially if the number and nature of data types changes rapidly and continually. This is the case in many data processing fields and as a result a vast repertoire of specialized and 'local' file formats has evolved over the years. A diversity of file formats is tolerable when electronic data transfer is infrequent and processing speeds require that file formats be finely tuned to specific applications. Rapid increases in computing power and in computer networks have signalled an end to this rationale. In the era of widespread data exchange, global data bases, electronic mail and electronic publication submission, the critical need is for a general, flexible and extensible file format. The present invention seeks to provide an improved file format and an associated method of handling data which overcome one or more of the above problems.

According to a first aspect of the present invention there is provided a method of structuring or storing data within a file comprising the following steps:

(i) arranging the file into a plurality of data blocks each preceded by a respective data block code; and

(ii) arranging the data within each block into a plurality of data items each preceded by a respective data name; wherein the data block codes are taken from a first predetermined set, may occur in any order, and have a first common feature, and wherein the data names are taken from a second predetermined set, may occur in any order, and have a second common feature, the first and second common features being readily distinguishable.

The nature of the file is preferably such that it is visually readable as text in addition to being machine readable. Each text line contains up to a pre-set maximum number of visible ascii characters. The limit will normally be set at eighty. Each data item may be directly preceded by the respective data name. Alternatively, a plurality of data names in a group may be followed by a like plurality of data items repeated a desired number of times.

The first common feature may be the text string 'data_', and the members of the first set are of the form 'data_blockcode' where 'blockcode' is a unique block code in each case. The second common feature may be just an underline '_', and the members of the second set are of the form '_name' where 'name' is a respective data name.
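As a minimal sketch of how the two common features can be told apart by machine, the following Python function sorts a whitespace-delimited string into one of three categories by its leading characters. The function name `classify` and the category labels are illustrative assumptions and do not appear in the specification.

```python
def classify(token: str) -> str:
    """Classify a token by its leading characters.

    The labels returned here are hypothetical names chosen for illustration.
    """
    if token.startswith("data_"):
        # First common feature: the text string 'data_' introduces a block code.
        return "block_code"
    if token.startswith("_"):
        # Second common feature: a leading underline introduces a data name.
        return "data_name"
    # Anything else is treated as a data item in this sketch.
    return "data_item"


if __name__ == "__main__":
    for tok in ("data_crystal_structure", "_cell_volume", "2310(2)"):
        print(tok, "->", classify(tok))
```

Because one feature begins with the letter 'd' and the other with an underline, the two tests cannot both succeed for the same string, which is one way of reading "readily distinguishable".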

The data handled may relate to any desired subject, but the method is especially suitable for crystallographic data. Another suitable use is for chemical data (especially molecular data) in the chemical and pharmaceutical fields.

The method is especially suitable for the archiving of data and for inputting data to data-bases because of its facility for upwards compatibility and flexibility.

The method is also particularly advantageous for the electronic transport of text and data, via computer networks or magnetic media. It is particularly well-suited for submitting publications to technical journals.

An advantage of the method is that each data item being stored in the file as a text, character or numerical quantity is uniquely identified by a text name. Thus the text name serves as an identifier which can be interpreted visually as well as by machine. Furthermore, the data items may appear in any order.

It will be seen that the method relates to a process for handling text data; although it is primarily intended for computer application it does not itself relate to a computer program. Nor does it relate to a method of presenting information because its format of presentation is arbitrary.

According to a second aspect of the present invention there is provided means for structuring or storing data within a file comprising:

(i) means for arranging the file into a plurality of data blocks each preceded by a respective data block code; and

(ii) means for arranging the data within each block into a plurality of data items each preceded by a respective data name; wherein the data block codes are taken from a first predetermined set, may occur in any order, and have a first common feature, and wherein the data names are taken from a second predetermined set, may occur in any order, and have a second common feature, the first and second common features being readily distinguishable.

According to a third aspect of the invention there is provided a data file comprising a plurality of data blocks, each preceded by a respective data block code, and, within each block, a plurality of data items each preceded by a respective data name, wherein the data block codes are taken from a first predetermined set, may occur in any order, and have a first common feature, and wherein the data names are taken from a second predetermined set, may occur in any order, and have a second common feature, the first and second common features being readily distinguishable.

The many types of information to which the file is well suited include crystallographic data.

According to a fourth aspect of the present invention there is provided a method of retrieving data from a file of the above type, comprising listing the requested data items and outputting the requested data items in the order requested, the output file having the same format as the accessed file.

A preferred embodiment of the present invention will now be described, by way of example. First of all it will be of assistance to review three existing "pre-defined free format" files.

The BCCAB archive file is used by the Cambridge Data Centre (U.K.) to prepare the packed crystallographic organic structural data base file ASER. Appendix 1 shows an extract from one entry of the BCCAB file. The format is "free" in the sense that many lines have an identifying code (e.g. #Author) which provides flexibility in the order of lines, and for optional line input. Certain data items are "free" in that they are separated either by a single blank or comma. However, all line identifying codes, and many data sequences, are predefined and have a fixed function within the BCCAB definitions. Software processing this format expects predefined protocols to be observed. Violations of this protocol, or the presence of foreign data, will of necessity be treated as a processing error and terminate data access.

The second example of a "pre-defined free format" file is that used by the XTAL3.0 Crystallographic Program System (Hall & Stewart, 1990), as shown in Appendix 2. It is classed as a "free format" file because every line, and many individual data items, are tagged with an identification code. This provides for variations in the order of line input but only within strict guidelines. In this file the program initiation lines (those with the line codes in upper case letters) may be in any order but the optional control lines (codes in lower case letters) are specific to a particular program. Data items, and data codes, are also specific to a line. Violation of an input rule will terminate data processing of this file. These types of restrictions are typical of those placed on many "pre-defined free format" files.

The last example of a "pre-defined free format" file is the Standard Crystallographic File Structure (Brown, 1988) as shown in Appendix 3. This is an archival file structure which is more restrictive than the previous two examples. There is some flexibility in the order of data sequences (note the end-of-sequence code *EOS) but the data items and the character positions within a sequence are fixed. The addition of extra data types to an SCFS file is almost impossible without invalidating the format of previously archived data.

Turning now to the present invention, a Self-defining Text Archive and Retrieval (STAR) file is proposed, especially for the computer archiving and electronic transmission of text and numerical data. This file contains standard ASCII text which defines both the data structure (i.e. the arrangement of the data) and the data items. Each data item is explicitly identified by a name and these may be stored in any order. Simple syntactical rules applied to the data names provide access to each data item in a STAR file. No other knowledge of the data items is required.

A STAR file is normal text data that can be edited and read with a text editor. Its contents are intelligible as text and can be stored or transmitted electronically without conversion. The structure of a STAR file is simple. Each file is divided into a sequence of data blocks which contain individual data items. The identity of each data item is determined by a preceding data name. It is possible to repeat data items by placing them within simple looping structures.

It should be noted that a STAR file can be defined by only a few simple rules. This ensures maximum flexibility in data storage and its widest possible applicability. No assumptions are made about the order of the data blocks or data items, other than the requirement that identifying names be unique. There are no rules regarding the placement of data names or data items within a data block, other than the requirement that the name must precede the item. Access to data in a STAR file is made simply by requesting a specific data name within a specific data block. No prior knowledge is needed about the data type, whether the item is looped, or whether an item exists in the file.

As an introduction to STAR file concepts, here are some examples of data syntax. A data block is identified by a unique string with the construction 'data_blockcode'. An example follows.

data_crystal_structure

A data item is identified by a unique data name which starts with an underline '_'. Three examples of data names followed by their associated data items, follow.

_cell_volume 2310(2)

_chemical_formula 'C23 H36 O7'

_publication_author_address

; Prof Barry O'Connell

Department of Chemistry

University of Kalamazoo

Michigan U.S.A.

;

A data item may be repeated individually or in a group. These are referred to as looped data items and are specified with a 'loop_' string. Here is an example of looped data items.

loop_

_exptl_crystal_face_h

_exptl_crystal_face_k

_exptl_crystal_face_l

_exptl_crystal_face_distance

0 0 1 0.012

0 0 1 0.012

1 0 0 0.023

-1 0 0 0.023
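The looped layout above can be read back into rows by counting the data names and then slicing the values into groups of that size. The Python sketch below illustrates this under the assumption that the text has already been split into strings and that the 'loop_' string itself has been consumed; the function name `read_loop` and the shortened data names in the usage example are hypothetical.

```python
from itertools import takewhile


def read_loop(tokens: list[str]) -> list[dict[str, str]]:
    """Turn the strings that follow 'loop_' into one dict per repeated row."""
    # The leading run of strings starting with '_' are the data names.
    names = list(takewhile(lambda t: t.startswith("_"), tokens))
    values = tokens[len(names):]
    width = len(names)
    if width == 0 or len(values) % width != 0:
        raise ValueError("loop values do not fill a whole number of rows")
    # Slice the values into consecutive groups, one group per row.
    return [
        dict(zip(names, values[i:i + width]))
        for i in range(0, len(values), width)
    ]


if __name__ == "__main__":
    rows = read_loop("_h _k _l _distance 1 0 0 0.023 -1 0 0 0.023".split())
    for row in rows:
        print(row)
```

Negative values such as -1 are not mistaken for data names because only a leading underline marks a name.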

A STAR file is a formatted sequential file containing text lines of standard visible ascii characters. It may be viewed or edited with any standard text editor. A STAR file is divided into any number of sequential data blocks. The information within a data block defines the data structure (i.e. the data order), and the data items. All of this information is intelligible as text.

The "save frame" command will now be described. The principal purpose of the "save frame" command is to define a block of data items that can be internally referenced within a data block via a single code. This code is the "save frame" code which is used within the data block as a character string preceded by a "$" character.

The save frame command enables data definitions to be repeated within a data block, and yet these definitions are insulated from one another. A save frame definition may precede or follow its reference as a $<frame-code>.

Frame codes may also be referenced within other save frames. Recursive references to save frames are not permitted.

The following nine syntax rules provide the specifications for a STAR file.

1. A text string is defined as either a sequence of non-blank characters, a sequence of characters bounded by matching single or double quotes (i.e. <'> or <">), or a sequence of lines bounded by a semicolon <;> as the first character of a line. A text string must not span more than one line, except if bounded by semicolons.

2. A data name is a text string starting with an underline '_'.

3. A data item is a text string not starting with an underline '_', and preceded by the identifying data name.

4. A data loop is a list of data names, followed by a repeated list of data items, and preceded by the text string 'loop_'.

5. A save frame is a sequence of data names, data items and data loops preceded by the text string 'save_framecode' where 'framecode' is a unique identifying code within a data block. A save frame sequence is closed by another save frame command, by the text string 'stop_' or by a data block command.

6. A data block is a sequence of data names, data items, data loops and save frames preceded by the text string 'data_blockcode' where 'blockcode' is a unique identifying code within a STAR file. The data block sequence is closed by another data block command or the end of the STAR file.

7. A data name must be unique within each save frame sequence and a data block sequence. A save frame declaration must be unique within a data block sequence. The save frame code may be referred to within a data block as the data item '$framecode'.

8. Except if contained within a text string, a sequence of blank or tab characters is used only to separate text strings.

9. Except if contained within a text string, a single sharp '#' signals that the characters following on a line are used for comment only.
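Rules 1, 8 and 9 together describe how a file is split into text strings. The Python sketch below is one possible reading of those rules, not a reference implementation. It makes several simplifying assumptions: a quoted string ends at the next matching quote character on the same line, anything following the closing semicolon on its line is ignored, and a '#' inside an unquoted run of non-blank characters is kept as part of that string. The function name `tokenize` is hypothetical.

```python
def tokenize(text: str) -> list[str]:
    """Split STAR-style text into text strings (simplified sketch)."""
    tokens: list[str] = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(";"):
            # Rule 1: a semicolon as the first character of a line opens a
            # multi-line text string that runs to the next such line.
            body = [line[1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                body.append(lines[i])
                i += 1
            tokens.append("\n".join(body).strip())
            i += 1  # step over the closing semicolon line
            continue
        pos = 0
        while pos < len(line):
            ch = line[pos]
            if ch in " \t":
                pos += 1  # Rule 8: blanks and tabs only separate strings
            elif ch == "#":
                break  # Rule 9: the rest of the line is a comment
            elif ch in "'\"":
                end = line.find(ch, pos + 1)  # assumption: next matching quote
                if end == -1:
                    raise ValueError(f"unterminated quote on line {i + 1}")
                tokens.append(line[pos + 1:end])
                pos = end + 1
            else:
                end = pos
                while end < len(line) and line[end] not in " \t":
                    end += 1
                tokens.append(line[pos:end])
                pos = end
        i += 1
    return tokens


if __name__ == "__main__":
    sample = "data_x\n_cell_volume 2310(2)  # comment\n_formula 'C23 H36 O7'\n"
    print(tokenize(sample))
```

This sketch discards the quote characters, so a later stage that needs to distinguish quoted from unquoted items would have to record that information separately.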

The key to accessing a STAR file is the data name. It is essential that the data names needed for a given application be defined carefully and precisely in a distributed Glossary. Data names and their definitions must not be changed in the lifetime of the archive file, but new names and definitions may be added as needed. A glossary does not restrict the data that can be stored in a STAR file; it is only to provide information about data items in general use.

One application of the STAR file is as a basis for a Crystallographic Information File (CIF). This application will be used to illustrate the STAR file concepts.

Since the CIF is intended only for crystallographic data and text, this application has imposed some formatting constraints, other than those of the STAR syntax, which simplify data handling but do not inhibit flexibility. These constraints involve certain data typing and text string limitations which may be of use in other scientific applications and are cited here.

1. Lines may not exceed 80 characters in length.

2. Data names and block codes may not exceed 32 characters in length.

3. A data item is assumed to be of type number if it is not bounded by matching single or double quotes, and starts with a digit 0-9, a plus '+', a minus '-', or a period '.'. A number may be in integer, real or scientific format. If a number is concatenated with another number bounded by parentheses, the latter is taken to be the standard deviation [e.g. nn.nnn(m)].

4. A data item is assumed to be of type text if it extends over more than one line.

5. A data item is assumed to be of type character if it is surrounded by matching single or double quotes and is not of type number or type text.

6. Only one level of loop_ data is permitted. Additional levels of repeated data must be stored as lists within a single text string.
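The typing constraints 3 to 5 can be sketched as a small classifier. The Python code below is illustrative only: the names `classify_item` and `split_esd` are hypothetical, the item is assumed to be passed in with any surrounding quotes still attached, and treating an unquoted item that does not look like a number as type character is an assumption, since the constraints above do not cover that case. The parenthesised digits are returned as written, without interpreting their scale.

```python
import re
from typing import Optional

# A value followed by digits in parentheses, e.g. 18.757(8).
_ESD = re.compile(r"([^()\s]+)\((\d+)\)")


def classify_item(raw: str) -> str:
    """Return 'text', 'character' or 'number' for a raw data item."""
    if not raw:
        raise ValueError("empty data item")
    if "\n" in raw:
        return "text"  # constraint 4: spans more than one line
    if len(raw) >= 2 and raw[0] in "'\"" and raw[-1] == raw[0]:
        return "character"  # constraint 5: matching quotes
    if raw[0] in "0123456789+-.":
        return "number"  # constraint 3: leading digit, sign or period
    return "character"  # assumption: unquoted, non-numeric items


def split_esd(raw: str) -> tuple[float, Optional[int]]:
    """Split 'nn.nnn(m)' into the value and the parenthesised digits."""
    match = _ESD.fullmatch(raw)
    if match:
        return float(match.group(1)), int(match.group(2))
    return float(raw), None


if __name__ == "__main__":
    for item in ("18.757(8)", "'C23 H36 O7'", "90"):
        print(item, "->", classify_item(item))
    print(split_esd("18.757(8)"))
```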

Appendix 4 shows an example of a CIF file containing two data blocks, 'manuscript' and 'crystal_structure'.

Data is retrieved from a STAR file by locating its data name. This would normally be done by 'parsing' the file and locating a request list of data names. Existing software called QUASAR uses this approach to access a STAR file. Data items and data blocks are output by QUASAR in the order requested. The QUASAR output file is also in STAR format. For a given data block the same data item may be requested up to 5 times. The STAR file is always checked for logical integrity. The names of the archive file (i.e. the input STAR file) and output file are specified as the strings 'star_arc' and 'star_out', respectively. These are entered at the start of the request list. In the example request list shown in Appendix 5 these file names are 'qtest.arc' and 'qtest.out'.
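The request-list approach can be sketched as follows. This is not the QUASAR software, whose internals are not described here; it is a minimal Python illustration that assumes the archive has already been parsed into a dictionary of blocks, ignores looped items, and uses the hypothetical function name `retrieve`. Items absent from the archive are written with the '??' flag that appears in the example output.

```python
def retrieve(
    archive: dict[str, dict[str, str]],
    requests: list[tuple[str, list[str]]],
) -> str:
    """Write requested items, in the order requested, as STAR-style text.

    `archive` maps a block code to its {data name: data item} pairs.
    `requests` is a list of (block code, [data names]) in request order.
    """
    lines: list[str] = []
    for block_code, names in requests:
        lines.append(f"data_{block_code}")
        items = archive.get(block_code, {})
        for name in names:
            # A requested item missing from the archive is flagged with '??'.
            lines.append(f"{name} {items.get(name, '??')}")
    return "\n".join(lines)


if __name__ == "__main__":
    archive = {"crystal_structure": {"_cell_a": "18.757(8)", "_cell_b": "7.282(2)"}}
    requests = [("crystal_structure", ["_cell_b", "_exptl_radiation_type", "_cell_a"])]
    print(retrieve(archive, requests))
```

Because the output keeps the 'data_' and '_' prefixes, it has the same format as the input and could itself be processed by a further request list.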

Appendices 6A and 6B show the file 'qtest.out' which is output after entering the request list of Appendix 5. The output is itself a STAR file that can also be processed by a request list. Note that requested items missing from the archive file are flagged with '??'.

Appendices 7A and 7B show examples of save frame commands relating to a standard molecular data format.
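Since save frames may reference one another through '$framecode' items but recursive references are not permitted, a processing program might check for reference cycles. The Python sketch below shows one way to do so with a depth-first search. It assumes each save frame has already been reduced to a list of its strings; the function names `references` and `has_recursion` are hypothetical.

```python
def references(tokens: list[str]) -> list[str]:
    """Return the frame codes referenced as '$framecode' in a frame body."""
    return [t[1:] for t in tokens if t.startswith("$")]


def has_recursion(frames: dict[str, list[str]]) -> bool:
    """Report whether any save frame refers, directly or indirectly, to itself."""
    state: dict[str, str] = {}  # frame code -> "visiting" or "done"

    def visit(code: str) -> bool:
        if state.get(code) == "done":
            return False
        if state.get(code) == "visiting":
            return True  # reached a frame that is still being expanded
        state[code] = "visiting"
        for ref in references(frames.get(code, [])):
            if visit(ref):
                return True
        state[code] = "done"
        return False

    return any(visit(code) for code in frames)


if __name__ == "__main__":
    frames = {
        "methyl": ["1", "C"],
        "R1": ["1", "$methyl", "1"],
        "carboxylic_acid": ["1", "$R1", "2", "C"],
    }
    print(has_recursion(frames))
```

The check does not depend on the order in which frames are defined, which fits the statement that a save frame definition may precede or follow its reference.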

The above-described file formats and the associated method of handling data have the advantage of generality, upwards compatibility and flexibility. The file is machine-independent and portable so that data items are accessible quite independently of their point of origin. It is fundamental that the file allows for future data to be incorporated without the need to modify existing files.

The STAR file format meets the requirements of a "universal" archival file. It may be used for archiving all types of text and numerical data, in any order. It is particularly suited to electronic transmission purposes.

Upwards compatibility and flexibility are two very desirable properties for any new archive system. These properties are especially important for fields, such as crystallography, where there is a wide diversity of data types, and where the archival requirements may vary from site to site. It is essential that data files written in one laboratory can be read easily in another, independent of the software with which they were generated. It is also important that these files can be easily "viewed" without the need for sophisticated archival software.

Also important for the long term is the flexibility and the eye-readable nature of the STAR format. Because a CIF may contain "local" as well as "global" data items, it is ideal for internal as well as external data communication purposes. Existing program systems, such as XTAL, currently use self-defining binary files internally because these are faster and more compact than character files. As computer technology improves, the value of a flexible, eye-readable, and easily editable character format outweighs speed and disc considerations.

If parts of a file are lost, e.g. during electronic communication, the whole file is not corrupted; thus the file format has the advantage of being robust. With a data-base such loss of characters might cause corruption.

APPENDIX 5

star_arc_qtest.arc

star_out_qtest.out

data_manuscript

_manuscript_summary

data_crystal_structure

_chemical_name

_publication_title

_publication_author_name

_publication_author_address

_cell_a

_cell_b

_cell_c

_cell_alpha

_cell_beta

_cell_gamma

_chemical_name

_symmetry_space_group

_symmetry_pos_in_XYZ

_atom_site_label

_atom_site_x/a

_atom_site_y/b

_atom_site_z/c

_atom_site_U_iso

_atom_site_label

_exptl_radiation_wave_length

_exptl_radiation_type

_exptl_crystal_face_distance

_exptl_dummy

_exptl_crystal_face_h

_exptl_crystal_face_k

_exptl_crystal_face_l

_atom_site_label

_atom_site_U_iso

_publication_author_name

data_manuscript

_manuscript_summary

APPENDIX 6A

data_manuscript

_manuscript_summary

;

This is some dummy text to show how a multiple data-block STAR file works !

;

# -----end-of-data-block------

data_crystal_structure

_chemical_name

;

3-(2,5-dihydro-4-hydroxy-5-oxo-3-phenyl-2-furyl)propionic acid

;

_publication_title

;

Structure of WF-3681,

3-(2,5-Dihydro-4-hydroxy-5-oxo-3-phenyl-2-furyl)propionic Acid.

;

loop_

_publication_author_name

_publication_author_address

"O'Connell, Barry"

; Department of Chemistry

University of Kalamazoo

Michigan U.S.A.

;

'Clark, Joan I.'

; University of Washington

Seattle WA 98195

U.S.A.

;

_cell_a 18.757(8)

_cell_b 7.282(2)

_cell_c 17.511(8)

_cell_alpha 90

_cell_beta 91.20(3)

_cell_gamma 90

_chemical_name

;

3-(2,5-dihydro-4-hydroxy-5-oxo-3-phenyl-2-furyl)propionic acid

;

_symmetry_space_group '-C 2yc'

loop_

_symmetry_pos_in_xyz

'x,y,z'

'-x,-y,-z'

'-x,y,1/2-z'

'x,-y,1/2+z'

'1/2+x,1/2+y,z'

'1/2-x,1/2-y,-z'

'1/2-x,1/2+y,1/2-z'

'1/2+x,1/2-y,1/2+z'

APPENDIX 6B

loop_

_atom_site_label

_atom_site_x/a

_atom_site_y/b

_atom_site_z/c

_atom_site_u_iso

_atom_site_label

C1 .6237(1) -.2055(4) -.3119(2) .053 C1

C2 .6022(2) -.2468(6) -.2322(2) .059 C2

O5' .7504(1) .0454(3) .0417(1) .056 O5'

_exptl_radiation_wave_length 1.54179

_exptl_radiation_type ?? # requested item not present

loop_

_exptl_crystal_face_distance

_exptl_dummy # ?? requested item not present

_exptl_crystal_face_h

_exptl_crystal_face_k

_exptl_crystal_face_l

0.012 ?? 0 0 -1

0.012 ?? 0 0 1

0.023 ?? 1 0 0

0.023 ?? -1 0 0

loop_

_atom_site_label

_atom_site_u_iso

C1 .053

C2 .059

O5' .056

loop_

_publication_author_name

"O'Connell, Barry"

'Clark, Joan I.'

# -----end-of-data-block------

data_manuscript

_manuscript_summary

;

This is some dummy text to show how a multiple data-block STAR file works !

;

# -----end-of-data-block------

APPENDIX 7A

data_SMD_Example_3

#---------------------------------

_table_of_contents

;

This example illustrates the description of a simple chemical reaction in which one of the reactants and the product are expressed as generic structures.

;

_atom_bond_order_convention simple

save_methyl

loop_

_atom_identity_node

_atom_identity_symbol

1 C

2 C

loop_

_atom_bond_node_1

_atom_bond_node_2

_atom_bond_order

1 2 sin

loop_

_attached_hydrogen_node

_attached_hydrogen_count

1 3

save_ethyl

loop_

_atom_identity_node

_atom_identity_symbol

1 C

2 C

3 C

loop_

_atom_bond_node_1

_atom_bond_node_2

_atom_bond_order

1 2 sin

1 3 sin

loop_

_attached_hydrogen_node

_attached_hydrogen_count

1 3

2 3

save_R1

loop_

_variable_alternative_number

_variable_identifier_symbol

_variable_node

1 $methyl 1

2 $ethyl 1

save_carboxylic_acid

loop_

_atom_identity_node

_atom_identity_symbol

1 $R1

2 C

3 O

4 O

loop_

_atom_bond_node_1

_atom_bond_node_2

_atom_bond_order

1@1 2 sin

2 3 dou

2 4 sin

loop_

_attached_hydrogen_node

_attached_hydrogen_count

2 0

3 0

4 1

APPENDIX 7B

save_alcohol

loop_

_atom_identity_node

_atom_identity_symbol

1 C

2 C

3 O

loop_

_atom_bond_node_1

_atom_bond_node_2

_atom_bond_order

1 2 sin

2 3

loop_

_attached_hydrogen_node

_attached_hydrogen_count

1 3

2 2

3 1

save_ester

loop_

_atom_identity_node

_atom_identity_symbol

1 $R1

2 C

3 O

4 O

5 C

6 C

loop_

_atom_bond_node_1

_atom_bond_node_2

_atom_bond_order

1@1 2 sin

2 3 dou

2 4 sin

4 5 sin

5 6 sin

loop_

_attached_hydrogen_node

_attached_hydrogen_count

2 0

3 0

4 0

5 2

6 3

stop_

loop_

_reaction_component_number

_reaction_component_symbol

_reaction_component_type

1 $carboxylic_acid reactant

2 $alcohol reactant

3 $ester product

loop_

_reaction_pathway_reactant

_reaction_pathway_product

1.1 .1

1.2 .2

1.3 .3

1.4, 2.3 .4

2.1 .6

2.2 .5

Claims

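1. A method of structuring or storing data within a file comprising the following steps:

(i) arranging the file into a plurality of data blocks each preceded by a respective data block code; and

(ii) arranging the data within each block into a plurality of data items each preceded by a respective data name; wherein the data block codes are taken from a first predetermined set, may occur in any order, and have a first common feature, and wherein the data names are taken from a second predetermined set, may occur in any order, and have a second common feature, the first and second common features being readily distinguishable.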

2. A method of structuring or storing data within a file according to claim 1, wherein the file is readable as text in addition to being machine readable.

3. A method of structuring or storing data within a file according to claim 2, wherein each text line contains up to a pre-set maximum number of visible ascii characters.

4. A method of structuring or storing data within a file according to any preceding claim, wherein each data item is directly preceded by the respective data name.

5. A method of structuring or storing data within a file according to any of claims 1 to 3, wherein a plurality of data names in a group are followed by a like plurality of data items repeated a desired number of times.

6. A method of structuring or storing data within a file according to any preceding claim, wherein the first common feature is the text string 'data_', and the members of the first set are of the form 'data_blockcode' where 'blockcode' is a unique block code in each case.

7. A method of structuring or storing data within a file according to any preceding claim, wherein the second common feature is an underline '_', and the members of the second set are of the form '_name' where 'name' is a respective data name.

8. A method of structuring or storing data within a file according to any preceding claim, wherein the data handled is crystallographic data.

9. Means for structuring or storing data within a file comprising:

(i) means for arranging the file into a plurality of data blocks each preceded by a respective data block code; and

(ii) means for arranging the data within each block into a plurality of data items each preceded by a respective data name; wherein the data block codes are taken from a first predetermined set, may occur in any order, and have a first common feature, and wherein the data names are taken from a second predetermined set, may occur in any order, and have a second common feature, the first and second common features being readily distinguishable.

10. A data file comprising a plurality of data blocks, each preceded by a respective data block code, and, within each block, a plurality of data items each preceded by a respective data name, wherein the data block codes are taken from a first predetermined set, may occur in any order, and have a first common feature, and wherein the data names are taken from a second predetermined set, may occur in any order, and have a second common feature, the first and second common features being readily distinguishable.

11. A data file according to claim 10 which is readable as text in addition to being machine readable.

12. A data file according to claim 10 or 11, wherein the file relates to crystallographic data.

13. A method of retrieving data from a data file according to any of claims 10 to 12, comprising listing the requested data items and outputting the requested data items in the order requested, the output file having the same format as the accessed data file.