WO1999056226A1

WO1999056226A1 - Alerting user-processing sites as to the availability of information

Info

Publication number: WO1999056226A1
Application number: PCT/GB1999/001241
Authority: WO
Inventors: Rachel Hammond; Llewelyn Ignazio Fernandes
Original assignee: The Dialog Corporation Plc
Priority date: 1998-04-24
Filing date: 1999-04-22
Publication date: 1999-11-04
Also published as: GB2336697A; GB9808801D0; EP1073976A1

Abstract

Data files (251) are received by a central processing system (104) and these files analysed to determine whether they contain information which is relevant to user-specified characteristics. On detecting such a condition, an alert signal is supplied to the respective user (115). The incoming data files are analysed with respect to common data characteristics (252) to generate common category associations (253). The data files are then processed with respect to user-specific data characteristics (256). The user-specific data characteristics include examples of the common data characteristics (260) and the specific processing procedures make use of the previously defined common category associations.

Description


Alerting User-Processing Sites As To The Availability Of Information

Field of the Invention

The present invention relates to the alerting of user-processing sites as to the availability of data associated with user-specific characteristics.

Introduction to the Invention

Search engines are known for identifying particular text files of interest from large, often distributed, databases. These known processes operate by performing free text searching in which a user specifies words which they believe are contained within the target file.

A problem with this known technique is that a simple enquiry can generate thousands of hits, many of which are totally irrelevant to the user's needs. Furthermore, many relevant files may be missed because they do not actually contain the specific words chosen.

Procedures for classifying volumes of data so as to facilitate subsequent searching are known but the classification process often involves manual intervention thereby making it time consuming and prone to human error. Procedures are known for processing data files so as to determine whether the file should be associated with a particular information category. The known processes require machine readable association files (or outline files) which are used as a basis for analysing an incoming data file. The processing of a data file in combination with an outline file results in a numerical score value being produced, defining an extent to which the data file is relevant to a particular category. Thereafter, a decision may be made as to whether the file should be included in the category by performing a threshold comparison.
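By way of illustration only, the threshold comparison described above may be sketched as follows. The score values, the category names and the cut-off of 0.40 are assumptions made for the example, as is the choice of an inclusive comparison; the source gives no figures.

```python
def belongs_to_category(score: float, threshold: float) -> bool:
    """Return True when a file's score for a category meets the threshold.

    The inclusive comparison (>=) is an assumption; the source only says
    that a threshold comparison is performed.
    """
    return score >= threshold


# Hypothetical scores, as if one data file had been processed in
# combination with three outline files.
scores = {"oil industry": 0.72, "confectionery": 0.05, "medicine": 0.40}
THRESHOLD = 0.40  # assumed cut-off, not taken from the source

categories = sorted(
    name for name, score in scores.items()
    if belongs_to_category(score, THRESHOLD)
)
print(categories)
```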

In practical systems, thousands of such outline files would be required in order to provide a useful level of categorisation. The applicant's co-pending British patent application GB 98 08 808.1, along with the present applicant's co-pending European patent application (DGC-P11-EP) and the present Assignee's co-pending United States patent application (DGC-P11-US), describe a procedure for automating the generation of outline files by making reference to files that have already been placed in the category and by making reference to files that are not appropriate to the category. In this way, new categories may be identified and appropriate outline files constructed.

Outline files work well if the size of incoming data files is similar to the size of the files referred to during the outline file generation process. If an incoming data file is smaller than the preferred size, score values may be adjusted as described in the applicant's co-pending British patent application GB 98 08 807.3, along with the present applicant's European patent application (DGC-P12-EP) and the present Assignee's co-pending United States patent application (DGC-P12-US). Alternatively, if an incoming data file is much larger than the preferred size, the file may be divided into a plurality of file sections whereafter the categorisation process is performed for each of the sections, as described in the applicant's co-pending British patent application GB 98 08 805.7, along with the applicant's co-pending European application (DGC-P13-EP) and the present Assignee's co-pending United States patent application (DGC-P13-US).

Some categories may be considered as being particularly important to users and users may wish to receive immediate notification upon particular files being detected. Although many categories may be included within a system, these categories may be less than perfectly adequate in terms of defining the highly important issues of interest. Under these circumstances, specific outline files could be generated for a particular application. However, it is appreciated that such outline files are probably only of interest to particular users and the duration over which these outline files are required may be relatively short. Thus, it is difficult to justify the generation of outline files, in commercial terms, for specific user applications.

Summary of the Invention

According to a first aspect of the present invention, there is provided apparatus for analysing incoming data files to associate said files with predetermined categories, comprising data file input means, processing means and alerting means, wherein said data file input means supplies data files to said processing means; said processing means is configured to: associate data files with common categories, and to associate data files against user-specified categories; and said alerting means is configured to alert a user on detecting a user-specified association, wherein said processing means is also configured to associate data files against user-specified categories that include references to said selected ones of said common category associations. In a preferred embodiment, storage means are included and the processing means is configured to write a table to said storage means identifying files associated to each of the common categories.

According to a second aspect of the present invention, there is provided a method of analysing incoming data files to associate said files with predetermined categories, comprising the steps of receiving data files; associating said data files with common categories; associating said data files against user-specified categories; and alerting users on detecting a user-specified association, wherein associations for user-specified categories include references to selected ones of the common category associations.

Brief Description of the Drawings

Figure 1 shows a data distribution environment in which data is received from a plurality of data sources;

Figure 2A details a data processing, storage and retrieval system shown in Figure 1, including a central processing system, a user-specific processor and a plurality of subsidiary processors;

Figure 2B illustrates underlying principles of operation for the preferred embodiment;

Figure 3 identifies procedures performed by the data processing, storage and retrieval system shown in Figure 1;

Figure 4 details the process for generating common characteristics for association with data files identified in Figure 3;

Figure 5 details the process for generating or modifying an outline file identified in Figure 4;

Figure 6 shows a terminal display of outline files represented graphically;

Figure 7 details an outline file from which the display shown in Figure 6 is generated;

Figure 8 shows a diagrammatic representation of the file data shown in Figure 7;

Figure 9 details process 302 for the generation of user-specific characteristics;

Figure 10 details the process identified in Figure 9 for generating an alert outline file;

Figure 11 shows a visual display at a user terminal, inviting a user to provide input information;

Figure 12 shows an example of an outline file representing user alert specifications;

Figure 13 represents a structure derived from the file shown in Figure 12;

Figure 14 shows an example of a source data file;

Figure 15 details a subsidiary process shown in Figure 2;

Figure 16 details operations performed by the subsidiary process detailed in Figure 15;

Figure 17 shows a plurality of rulebases produced by the process shown in Figure 16 and stored in the memory identified in Figure 15;

Figure 18 details procedures performed by the data processing system 104 in response to receiving a new data file;

Figure 19 details procedures for the processing of data to determine associated preferred terms shown in Figure 18;

Figure 20 details a triggering phase identified in Figure 19;

Figure 21 details a scoring phase identified in Figure 19;

Figure 22 details a list generation phase identified in Figure 19;

Figure 23 details a table constructed by the central processing system shown in Figure 2;

Figure 24 details a linked list;

Figure 25 details procedures for performing a search in response to a user request;

Figure 26 shows an example of a common data associated file.

Detailed Description of the Preferred Embodiments

A data distribution environment is illustrated in Figure 1 in which data, received from a plurality of data sources 101, 102, 103 is supplied to a data processing, storage and retrieval system 104. Data sources 101 and 102 supply data directly to processing system 104 while data source 103 supplies data via a local area network 105, thereby allowing user terminals 106 and 107 to gain direct access to their local data source 103.

The processing system 104 provides access to a plurality of users, such as users 111, 112, 113, 114, 115, 116 and 117. User 111 has direct access to the processing system 104 while users 112, 113 and 114 gain access to the processing system 104 via the Internet 118. Users 115, 116 and 117 exist within a more sophisticated environment in which they have access, via a local area network 119, to their own local database system 120 in addition to a connection, via an interface 121, to the data processing system 104.

All incoming data from data sources 101 to 103 is classified with a key word in seven separate fields, comprising "market sector", "location", "company name", "publisher", "publication date" and "scope". A user, such as users 112 to 117, may specify almost any term as the basis for a search and the user is then prompted with equivalent words or phrases that constitute more preferred search parameters. For example, a user may specify a search word such as "confectionery" and the system will prompt the user to consider narrower terms such as "chocolate" along with related terms such as "cakes" or "desserts", or broader terms such as "food". From a simple request, a user is given an option of focusing further or of taking a broader overview of the subject under consideration.
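The prompting of narrower, related and broader terms may be pictured, purely as a sketch, as a lookup in a small thesaurus. The `TermEntry` structure, the `THESAURUS` mapping and the `suggest` function are hypothetical names introduced for the example; only the "confectionery" terms come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    """Hypothetical thesaurus entry for one search term."""
    narrower: list = field(default_factory=list)
    related: list = field(default_factory=list)
    broader: list = field(default_factory=list)


# Only this single entry is drawn from the description's example.
THESAURUS = {
    "confectionery": TermEntry(
        narrower=["chocolate"],
        related=["cakes", "desserts"],
        broader=["food"],
    ),
}


def suggest(term: str) -> dict:
    """Return the terms a user might be prompted with for a search word."""
    entry = THESAURUS.get(term.lower(), TermEntry())
    return {
        "narrower": entry.narrower,
        "related": entry.related,
        "broader": entry.broader,
    }


print(suggest("Confectionery"))
```

An unknown term simply yields empty suggestion lists in this sketch; how the actual system responds to such a term is not stated.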

The scope of an article refers to the context in which the document or article was written. For example, the scope field may consider questions as to whether the article concerns "mergers and acquisitions" or "seasonal trends" et cetera. Such categories are useful in gathering related information from a wide variety of industries and markets and may prove invaluable for particular applications.

Processing system 104 is detailed in Figure 2A. Data signals from data sources 101 to 103 are supplied to input interfaces 201 via data input lines 202. Similarly, output data signals are supplied to users 111 to 117 via an output interface 203 and output wires 204. Input interface 201 and output interface 203 communicate with a central processing system 205 based on DEC Alpha integrated circuitry. The central processing system 205 also communicates with other processing systems in a distributed processing architecture. Processing system 104 includes eight Intel chip based processing systems 211 to 218, each implementing instructions under the control of conventional operating systems such as Windows NT. An operator communicates with the processing system 104 by means of an operator terminal, having a visual display unit 221 and a manually operable keyboard 222.

Data files received from sources 101 to 103 are written to bulk storage devices 223 in the form of large magnetic disk arrays. Data files are written to disk arrays 223 after these files have been associated with categories, as illustrated at step 203. These association processes are performed by the subsidiary processors 211 to 218 and the central processing system 205 is mainly concerned with the switching and transferring of data between the interface circuits 201, 203 and the disk arrays 223. The central processing system 205 communicates with the subsidiary processors 211 to 218 via an Ethernet connection 206 and processing requirements are distributed between processors 211 to 218. Once one of the subsidiary processors 211 to 218 has been addressed, data is transferred to the addressed processor. Each individual incoming data file is supplied exclusively to one of the subsidiary processors. The selected subsidiary processor is then responsible for performing the association process, to identify preferred terms relevant to that particular data file.

Thereafter, the associated data file is returned to the central processing system 205, over connection 206 and the central processing system 205 is then responsible for writing the associated data file to the disk array 223. In this way, it is possible to scale the degree of processing capacity provided by system 104 in dependence upon the volume of data files to be processed in this way.
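The description does not say how a subsidiary processor is selected for each file. As one possible sketch, assuming a simple rotation through the pool, each file can be paired with exactly one processor as follows; the `assign` function and the rotation policy are assumptions, and the integers merely reuse the reference numerals 211 to 218.

```python
from itertools import cycle

# Reference numerals 211 to 218 stand in for the eight subsidiary processors.
SUBSIDIARY_PROCESSORS = list(range(211, 219))


def assign(file_ids, processors=SUBSIDIARY_PROCESSORS):
    """Pair each file with exactly one processor, rotating through the pool.

    Round-robin selection is an assumed policy, not taken from the source.
    """
    return dict(zip(file_ids, cycle(processors)))


assignment = assign([f"file-{n}" for n in range(1, 11)])
print(assignment["file-1"], assignment["file-9"])
```

Under this assumed policy, scaling the processing capacity amounts to passing a longer list of processors.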

A new incoming data file is supplied to central processing system 205 from input interface 201. The central processing system 205 supplies the new data file to one of the subsidiary processors 211 to 218 over network connection 206. The selected subsidiary processor performs a first processing step of analysing the incoming data file with respect to common data characteristics to generate common category associations. These common category associations, which may be identified by preferred categories, effectively associate the file with particular categories thereby allowing the file to be identified with reference to these categories.

The associated common data characteristics are added to the file which is then returned back to the central processing system 205. Central processing system 205 also maintains a table 228 recording details of particular associated files for each of the common categories. Thus, given a particular common category it is possible to identify all associated files and given a particular file it is possible to identify the particular common categories under which that file has been associated.
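A table permitting look-up in both directions might be sketched as a pair of mappings kept in step. The `AssociationTable` class and its method names are hypothetical; the layout of table 228 itself is not given in this part of the description.

```python
from collections import defaultdict


class AssociationTable:
    """Hypothetical two-way index between common categories and files."""

    def __init__(self):
        self._files_by_category = defaultdict(set)
        self._categories_by_file = defaultdict(set)

    def record(self, file_id: str, category: str) -> None:
        """Record that a file has been associated with a common category."""
        self._files_by_category[category].add(file_id)
        self._categories_by_file[file_id].add(category)

    def files_for(self, category: str) -> list:
        """Given a common category, identify all associated files."""
        return sorted(self._files_by_category.get(category, ()))

    def categories_for(self, file_id: str) -> list:
        """Given a file, identify the categories it is associated with."""
        return sorted(self._categories_by_file.get(file_id, ()))


table = AssociationTable()
table.record("file-253", "oil industry")
table.record("file-253", "mergers and acquisitions")
table.record("file-300", "oil industry")
print(table.files_for("oil industry"))
print(table.categories_for("file-253"))
```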

In addition to associating the files to common categories, the system shown in Figure 2 is also capable of associating files to user-specific data characteristics. Such characteristics are defined by users, such as users 112 to 117 and the association process is performed by user-specific processor 226. After identifying common data characteristics, as a first process, the central processing system 205 supplies data files to the user-specific processor 226 so as to allow said processor to perform a second processing step. Under the second processing step the incoming data file is analysed with respect to user-specific data characteristics to generate user-specific associations. Such associations, when identified, are brought to the attention of the central processing system 205. The central processing system is then prompted to generate an alerting signal to the effect that user-specific associations have been generated.

In order to obtain maximum benefit from the first processing step performed by the subsidiary processors 211 to 218 and in order to minimise the burden placed on the user-specific processor 226, user-specific data characteristics may include examples of the common data characteristics. Furthermore, the second processing step makes use of these specified common data characteristics and relies upon processing procedures performed as part of the first processing step by the subsidiary processors 211 to 218. Thus, an association process is performed only once, either in accordance with the common category associations, in response to operations performed by subsidiary processors 211 to 218, or in response to user-specific operations under control of user-specific processor 226.

The facility includes a CD ROM reader 225 arranged to read CD ROM's, such as ROM 226. In this way, it is possible to install executable instructions for computer system 205 and for computer systems 211 to 226.

Underlying principles of operation for the preferred embodiment are illustrated in Figure 2B. User A, user B, and user N have specified categories of interest. The process analyses incoming data files 251 and the system includes nine association files 252 each relating to a single common category. Each incoming data file is processed in combination with each of the association files 252, resulting in some files, such as data file 253, being associated with one of the common categories while some files, such as file 254 are not associated with any of the common categories. Files associated with common categories are stored in an associated file store 255, that also includes a table 256 listing all files stored as being associated for each of the nine common categories.

User-specific association files 256, similar to the common category association files 252, are established for each of users A, B and N. After a data file has been considered for association against the common categories, each data file, such as files 253 and 254, is considered for association against user-specified categories, by means of respective processes 257, 258 and 259. In addition to having unique category definitions, specified by means of association files 256, the user-related processes 257 to 259 also include references to selected ones of the common category associations, represented by regions 260.
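The way a user-related process can combine references to common categories with its own definitions may be sketched as below. The `UserSpec` structure and `is_of_interest` function are hypothetical, and plain keyword matching stands in for the scored association files, which is a deliberate simplification.

```python
import re
from dataclasses import dataclass, field


@dataclass
class UserSpec:
    """Hypothetical user-specific definition."""
    user: str
    common_refs: set = field(default_factory=set)  # references to common categories
    keywords: set = field(default_factory=set)     # the user's own terms


def is_of_interest(spec: UserSpec, text: str, common_categories: set) -> bool:
    """Decide whether a file should be alerted to a user.

    Referenced common categories are looked up in the associations already
    made for the file, so they are not re-analysed here. Only the user's own
    keywords require the text to be examined (simplified to word matching).
    """
    if spec.common_refs & common_categories:
        return True
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(spec.keywords & words)


user_a = UserSpec("A", common_refs={"oil industry"})
user_b = UserSpec("B", keywords={"pipeline"})

# Associations assumed to have been produced earlier for this file.
already_associated = {"oil industry"}
text = "Refinery output rose sharply last quarter."

alerts = [s.user for s in (user_a, user_b)
          if is_of_interest(s, text, already_associated)]
print(alerts)
```

In this sketch User A is alerted without the text being examined at all, which is the saving the reference to a common category is intended to provide.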

When a file of interest has been identified for a specific user, details of the file are held in a user buffer 261, which is also arranged to generate an alerting signal, possibly in the form of an e-mail 262, to the respective user (User A in this example). Thus, sub-process 260 considers files that have already been associated by association files 252 so as to identify files of interest to User A, where User A has (in their user-specific associations) made reference to a common category. This association has already been made, as illustrated by file 253. In addition, user-specific categories, different from the common categories, may be detected by user-specific association files 256. In either event, a file is categorised as being of interest to a specific user, as illustrated by file 263 and an alert is generated by process 261. In this example, file 263 has been alerted to User A as a result of User A specifying a common category. This common categorisation has been identified as a result of association files 252 and the relevant file has been written to storage 255 at 264. This categorisation is identified by sub-process 260, resulting in the file being alerted to User A. This is the same file as that stored at 264, as indicated by arrow 265. In this way, user-specific categorisation may be made, while encouraging users to make use of the common categories. In this way, maximum benefit can be obtained from the common category associations, thereby allowing user-specific processing to be minimised.

Procedures performed by the data processing system 104 are summarised in Figure 3. Steps 301 and 302 represent set-up procedures performed prior to receiving incoming data files. Steps 303 to 307 represent the on-line procedures configured to respond as incoming data files are received. Furthermore, it should be appreciated that other procedures are performed in a multi-tasking environment, possibly in response to incoming data files, although not essential to the present invention.

At step 301 common characteristics are generated for association with data files. These common characteristics are determined by the service provider and will be established in an attempt to anticipate the demands of users.

At step 302 user-specific characteristics are generated for association with data files. These user-specific characteristics will be determined by the specific requirements of a particular user. Therefore, in a working environment, many user-specific characteristic sets will be created, enabling the requirements of many users to be satisfied.

After generating common characteristics and user-specific characteristics, the system enters its on-line mode of operation initiated by step 303. At step 303 a question is asked as to whether a source file has been received and when answered in the negative the system enters a short wait state at 304 before addressing the question again at step 303. When a source file is received the question asked at step 303 is answered in the affirmative and control is directed to step 305.

At step 305 common characteristics are associated with the incoming file and a question is then asked at step 306 as to whether any associations have been made at step 305. If this question is answered in the affirmative, the associations identified at step 305 are written to an association table at step 307 and the file is stored by storage device 223 with the details of the associations.

At step 308 a file of user characteristics is selected and at step 309 the user characteristics selected at step 308 are associated to the received file. At step 310 a question is asked whether any associations have been made and if answered in the affirmative an alert signal to this effect is generated at step 311. Alternatively, step 311 is bypassed to direct control to step 312.

At step 312 a question is asked as to whether another set of user characteristics is to be considered and when answered in the affirmative control is returned to step 308. Thus, in this way, all of the user sets are considered and alert signals are generated where appropriate. Eventually, all of the user characteristics will have been considered and control will be directed to step 313.

At step 313 a question is asked as to whether characteristics are to be set up and when answered in the affirmative, control is returned to step 301, effectively taking the system off-line and allowing common characteristics to be modified at step 301 or user-specific characteristics to be modified at step 302. However, in a multi-tasking environment, it should be appreciated that it would be possible to perform the off-line and on-line functionality simultaneously. If the question asked at step 313 is answered in the negative, to the effect that on-line processing is to continue, control is returned to step 303 to await the next incoming file.
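The on-line procedures of steps 303 to 312 may be summarised by the following sketch. The function `run_online` and the two stand-in association functions are hypothetical; the wait state of step 304 and the set-up question of step 313 are omitted, and a queue of files already received takes the place of the live input.

```python
from collections import deque


def run_online(incoming, associate_common, user_sets, associate_user):
    """Process received files in the order of steps 303 to 312.

    `associate_common` and `associate_user` are placeholders for the
    association processes, which are not reproduced here.
    """
    table = {}    # stands in for the association table written at step 307
    alerts = []   # (user, file) pairs for which step 311 would raise an alert
    queue = deque(incoming)
    while queue:                                   # step 303: file received?
        file_id, text = queue.popleft()
        common = associate_common(text)            # step 305
        if common:                                 # step 306
            table[file_id] = common                # step 307
        for user, spec in user_sets.items():       # steps 308 and 312
            if associate_user(spec, text, common): # steps 309 and 310
                alerts.append((user, file_id))     # step 311
    return table, alerts


# Trivial stand-ins, assumed purely so that the sketch can be run.
def associate_common(text):
    return {c for c in ("oil", "medicine") if c in text.lower()}


def associate_user(spec, text, common):
    return bool(spec & common)


table, alerts = run_online(
    [("f1", "Oil prices rise"), ("f2", "Weather report")],
    associate_common,
    {"A": {"oil"}, "B": {"medicine"}},
    associate_user,
)
print(table, alerts)
```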

Process 301 for specifying categories for association with data files is detailed in Figure 4. At step 401 a category is selected and at step 402 an outline (OTL) file is generated or modified. At step 403 a question is asked as to whether another term is to be processed and when answered in the affirmative control is returned to step 401, allowing the next term to be processed at step 402. Eventually, all of the terms will have been processed resulting in appropriate generations or modifications to their related outline files. Consequently, the question asked at step 403 is answered in the negative whereafter at step 404 data structures are initialised by parsing the OTL files generated at step 402.

Step 402 for the generation or modification of outline files is detailed in Figure 5. At step 501 a visual OTL editor is opened resulting in the editor's visual interface being displayed on VDU 321. At step 502 a question is asked as to whether an existing file is to be loaded for modification and if answered in the negative a new OTL file is created at step 503. If the question asked at step 502 is answered in the affirmative, step 503 is bypassed and at step 504 modifications or additions are made to the OTL definition. At step 505 the OTL modifications created at step 504 are tested on a sample of test data and at step 506 a question is asked as to whether another modification is to be made. When answered in the affirmative, control is returned to step 504 resulting in further modifications or additions being made to the OTL definitions. When answered in the negative at step 506, the new or modified OTL file is saved at step 507.

When performing modifications or additions at step 504, a graphical representation of the OTL file data is presented to an operator via the visual display unit 321. An example of a display of this type is illustrated in Figure 6, representing a graphical illustration of a specific OTL file.

The OTL file stores definitions in a hierarchical tree structure and this structure is represented in the graphical view as shown in Figure 6. A representation of the tree may be contracted or expanded and the possibility of expanding a particular branch is identified by a plus sign on a particular line, as shown at 601. Similarly, when a particular branch has been fully expanded, the line is identified by a minus sign as shown at 602. Definitions within the file consist of rules, words and labels. The labels allow relationships to be defined between various parts of the file and between individual files themselves. The words identify specific words within an input file of interest and the rules define how and what weights are to be attributed to these words. Each rule line includes, at its beginning, a weight value 603 representing the score that will be attributed when a particular rule condition is met. Rules may also have leaves and the rule defines the way in which scores generated from leaves are combined.

OTL file data represented graphically in the form shown in Figure 6 is actually stored in a data file having a format of the type shown in Figure 7. The actual data file shown in Figure 7 corresponds to the data display in Figure 6 but in Figure 7 all of the data, some of which has been rolled up in Figure 6, is present. The data contained within the file shown in Figure 7 is manipulated interactively by an operator in response to the graphical interface displayed as illustrated in Figure 6. Score values 603 are also identified in the data file shown in Figure 7.

Displayed line 601 in Figure 6 is generated from line 701 of the actual stored data. The syntax of the language used for recording the data, as illustrated in Figure 7, may vary and the example shown is specific to this particular application. However, the underlying functionality of the language may be considered with reference to the diagrammatic representation shown in Figure 8.

Purely to provide a specific example, this particular outline file is concerned with the topic of the oil industry and therefore the purpose of the OTL file is to identify words and phrases within an input file so as to provide an indication as to how relevant that input data is to users having an interest in the oil industry. Thus, the purpose of procedures exploiting these OTL files is to generate evidence showing that a particular data file conveys information which may be of interest to those studying the oil industry.

The outlines analyse data files in order to produce numerical evidence as to the relevance of a particular file with relation to a particular topic. The OTL definitions and structures are determined empirically and would be modified and upgraded over a period of time. The system does more than merely register the existence of a particular word item: it places the word items within an interacting structure, the nature of which is illustrated in Figure 8. The particular entry, given label "oil-industry-mkt", relates to marketing aspects of the oil industry and as such can contribute to an overall score as to the pertinence of incoming data to this particular topic. The first line 801 shows that this particular contribution may provide a total score of forty percent. This total of forty percent is then subdivided such that at line 802 the presence of the phrase "buying oil from" has a score of fifty percent. Thus, the contribution made by the presence of this phrase consists of fifty percent of forty percent, that is, twenty percent of the overall score. Similarly, as shown at line 803 and below, particular words may be identified which result in contributions of sixty percent of thirty percent of forty percent, that is, 7.2 percent. Thus, a complete OTL file is structured in this way with particular words and phrases making contributions to an overall score value. These words and phrases may also be specified in the rules as making single contributions or being allowed to accrue.
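The arithmetic of nested weights can be checked with a short sketch. The `contribution` function is a hypothetical helper that simply multiplies the percentages along one path of the tree; exact fractions are used to avoid rounding, and the accrual of repeated matches is not modelled.

```python
from fractions import Fraction
from math import prod


def contribution(weights_percent):
    """Multiply nested percentage weights along one path of the tree.

    Returns the result as a percentage of the overall score.
    """
    return 100 * prod(Fraction(w, 100) for w in weights_percent)


# The phrase at line 802: fifty percent of forty percent.
phrase = contribution([40, 50])

# A word at line 803 and below: sixty percent of thirty percent of forty percent.
word = contribution([40, 30, 60])

print(float(phrase), float(word))
```

The two results are twenty percent and 7.2 percent of the overall score respectively.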

Process 302 for the generation of user-specific characteristics is detailed in Figure 9. At step 901 a user is invited to select common categories of the type specified at step 301. At step 902 a user is invited to define user-specific data characteristics which may be in the form of key words or free text. At step 903 a user is invited to define a specific file title and at step 904 a user is invited to specify a particular country of origin.

At step 905 the user is invited to define a particular alert format, specifying the way in which the user is alerted when a new data file has been received which satisfies the user's data characteristics. At step 906 the user definitions are processed to generate an alert outline (OTL) file.

The user's alert criteria include components, defined at step 902, which require extensive searching of new material as it is received. Searching of this type places a significant burden upon the information supplying resource. In addition, the characteristics also include reference to the categories that will be associated automatically upon receiving each data file by means of the subsidiary processors 211 to 218. In accordance with the present preferred embodiment, the user-specific characteristics include examples of common data characteristics, specified at step 901, and references to these characteristics are included in the user definitions generated at step 906. However, when implementing these definitions, use is made of the previously processed common category associations, thereby significantly reducing the processing overhead placed on the user-specific processor 226.

Process 906 for generating an alert OTL file is detailed in Figure 10. At step 1001 preferred terms are identified and labels are constructed. At step 1002 free text entries are extracted and logical inferred rule structures are constructed. At step 1003 titles are identified and at step 1004 an OTL file is generated representing the user's alert specifications. Thus, these specifications may include references to common data characteristics in combination with references to user-specific data characteristics. User 117 communicates with the data processing station 104 via a terminal including a Visual Display Unit (VDU) 221 and a manually operable keyboard 222.

VDU 221 is shown in Figure 11, having received an initial screen of data from the data processing station 104, inviting the user to provide input information in accordance with the procedures identified in Figure 9. Common categories may be entered within displayed boxes 1101, 1102, 1103, 1104, 1105 and 1106. Box 1101 allows an industry or market sector to be selected, while box 1102 allows a particular country of interest to be selected. Items entered at boxes 1101 and 1102 represent common categories and allow information to be supplied back to the central system 104 in response to prompt 901.

Keywords or free text are entered, as user-specific data characteristics, in box 1103; a specific title, as prompted by step 903, may be entered in box 1105; and an alert format is defined by box 1104. In this example, a user may receive an alert as an e-mail message or, alternatively, a user maintains a continuous connection with the system and is continually updated with alerts in a manner similar to known ticker tapes. In addition, at box 1106, a user may identify a particular watch name for the particular characteristics being defined, allowing a plurality of searching procedures to run simultaneously.

After supplying information into the boxes of the display shown in Figure 11, the information is supplied back to the central system 104, thereby allowing processes 1001 to 1003 to be performed as detailed in Figure 10. This is then followed by the generation of the OTL file at step 1004, a process performed by central processing system 205.

Operation of step 1004 results in the production of an OTL file and an example of such a file is given in Figure 12. OTL file 1201 has been generated in response to the input data illustrated in Figure 11. A common data characteristic, such as the characteristic "medicine" entered at box 1101, is recorded in the OTL file as a label, as illustrated at line 1202.

Asterisks beneath this show levels of nesting and effectively represent the importance of a particular phrase or relationship within the structure of the definition. Thus, below the top level label, five levels of nesting are included before a specific word is defined at line 1203. Word texts are derived from the free text entered at box 1103 and, in this example, result in three lines being included: the first being the word "bacteria" at line 1203, the second being the word "disease" at line 1204 and the third being the word "virus" at line 1205.

Figure 12 represents an example of an OTL file for a specific user's application. It is used to associate particular text files as being relevant and consistent with the search criteria supplied by the user. The file includes reference to common data characteristics in combination with reference to user-specific data characteristics. Each common data characteristic has its own OTL file, of the type illustrated in Figure 7. Thus, when implemented, OTL file 1201 directly performs an association process with respect to the three word-text words shown at lines 1203, 1204 and 1205.

The OTL file also includes examples of the common data characteristics and as such it effectively calls an existing OTL file generated for those specific common characteristics. Thus, in this way, it is not necessary to generate new OTL files for the common characteristics and it is not necessary to perform an additional search based on these characteristics, given that association processes will have already taken place. Thus, OTL file 1201 provides a sophisticated level of functionality without being required to generate significant amounts of OTL structuring because it refers to the existing OTL files for the common category associations.
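The saving described above can be illustrated with a minimal Python sketch. It is not the OTL mechanism itself: the class `AlertDefinition`, the function `matches` and the rule that all selected common categories plus at least one free text word must be present are assumptions made purely for illustration. The point shown is that the common categories are looked up from associations already attached to the file, and only the free text words are tested afresh.

```python
from dataclasses import dataclass, field


@dataclass
class AlertDefinition:
    """Hypothetical stand-in for a user's alert definition (not real OTL)."""
    watch_name: str
    common_categories: set = field(default_factory=set)  # e.g. {"medicine"}
    free_text_words: set = field(default_factory=set)    # e.g. {"virus"}


def matches(alert, precomputed_categories, body_text):
    """Return True if a data file satisfies the alert.

    precomputed_categories holds the common category associations that
    were attached to the file earlier; they are only looked up here,
    never recomputed. The AND/OR combination is an assumption.
    """
    if not alert.common_categories <= set(precomputed_categories):
        return False
    # Only the user-specific free text words need a fresh comparison.
    words = {w.strip('.,;:"()').lower() for w in body_text.split()}
    return bool(alert.free_text_words & words)


alert = AlertDefinition("watch1", {"medicine"}, {"bacteria", "disease", "virus"})
print(matches(alert, ["medicine", "france"], "A new virus was found."))
```

Under these assumptions the user-specific pass performs one set comparison against stored categories and one word lookup, rather than re-running the categorisation for "medicine".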

The outline structure defined by file 1201 is illustrated in Figure 13. This structure is substantially similar to the structure of common category associations, as illustrated in Figure 8. Source data files are received at step 303 and an example of a source data file is shown in Figure 14. All incoming data files are converted into a standard format of the type shown for file 1401. The file includes a title identifier at 1402 taking the form "XXTITLE". This is followed by the actual title of the file followed by a delimiter "THESTART" at 1403. The end of the body text is identified at 1404 by the string "EOR=ENDRECORD".
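As a rough illustration of the standard format, the following Python sketch splits a converted file into its title and body text using the three marker strings quoted above. The function name `parse_source_file` and the exact whitespace layout around the markers are assumptions; Figure 14 is not reproduced here, so the sample text is hypothetical.

```python
def parse_source_file(text):
    """Split a standard-format file into title and body.

    Assumes the layout: XXTITLE <title> THESTART <body> EOR=ENDRECORD.
    Raises ValueError if any marker is missing.
    """
    title_marker = "XXTITLE"
    start_marker = "THESTART"
    end_marker = "EOR=ENDRECORD"

    title_begin = text.index(title_marker) + len(title_marker)
    start_pos = text.index(start_marker, title_begin)
    end_pos = text.index(end_marker, start_pos)

    title = text[title_begin:start_pos].strip()
    body = text[start_pos + len(start_marker):end_pos].strip()
    return {"title": title, "body": body}


# Hypothetical sample, not taken from Figure 14.
sample = "XXTITLE Oil prices rise\nTHESTART\nCrude oil rose today.\nEOR=ENDRECORD\n"
print(parse_source_file(sample))
```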

Upon receiving file 1401, a central processing system 204 supplies this file to a subsidiary processor, such as subsidiary processor 211. The subsidiary processor analyses the file with respect to common data characteristics to generate common category associations. These are added to the file itself and also recorded in table 328 before the data file is then written to storage 223.

Subsidiary processor 211 is detailed in Figure 15. The processor includes an Intel Pentium processing unit 1501 connected to sixty-four megabytes of randomly accessible memory 1502 via a PCI bus 1503. In addition, a local disk drive 1504 and an interface circuit 1505 are connected to bus 1503. Interface circuit 1505 communicates with the TCP/IP network.

Random access memory 1502 stores instructions executable by the processing unit 1501, in addition to storing input data files received from the data sources 101 to 103 and intermediate data. Operations performed on processing unit 1501, in response to instructions read from memory 1502, are identified in Figure 16. At step 1601 temporary memory structures are cleared and at step 1602 an OTL description file is selected. At step 1603 an item in the OTL file is identified and at step 1604 a question is asked as to whether the item selected at step 1603 is a rule definition. If this question is answered in the affirmative, a rule object is defined at step 1605. Alternatively, if the question asked at step 1604 is answered in the negative, to the effect that the item is not a rule definition, a question is asked at step 1606 as to whether the item is a word definition. If this question is answered in the affirmative, a dictionary link is created at step 1607.

At step 1608 a question is asked as to whether the item is a label and, when answered in the affirmative, a new entry is created in a label list at step 1609, whereafter at step 1610 a question is asked as to whether another item is present. After executing step 1605 or after executing step 1607, control is directed to step 1610.

When the question asked at step 1610 is answered in the affirmative, to the effect that another item is present, control is returned to step 1603 and the next item is identified in the OTL file. Eventually, all of the items will have been identified resulting in the question asked at step 1610 being answered in the negative. Thereafter, at step 1611 a question is asked as to whether another OTL file is present and when answered in the affirmative control is returned to step 1602 allowing the next OTL description file to be selected. Thus, this process continues until all of the OTL files have been considered resulting in the question asked at step 1611 being answered in the negative.
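The loop of Figure 16 may be sketched in Python as follows. The OTL syntax is not given in this passage, so each OTL file is assumed to have been tokenised already into `(kind, value)` pairs; the function `load_outlines`, the pair representation and the choice to link each word to the most recently defined rule are illustrative assumptions only.

```python
def load_outlines(otl_files):
    """Build rule bases, a dictionary and a label list from OTL items.

    otl_files: list of OTL files, each a list of (kind, value) pairs where
    kind is "rule", "word" or "label" (an assumed, pre-tokenised form).
    """
    rule_bases = []   # one rule base per OTL file
    dictionary = {}   # word -> list of (rule_base_index, rule_index)
    label_list = {}   # label -> rule_base_index

    for base_index, items in enumerate(otl_files):       # step 1602 / 1611
        rules = []
        for kind, value in items:                         # step 1603 / 1610
            if kind == "rule":                            # step 1604
                rules.append({"name": value})             # step 1605
            elif kind == "word":                          # step 1606
                # Assumption: a word links to the most recent rule, if any.
                rule_index = len(rules) - 1 if rules else None
                dictionary.setdefault(value, []).append(
                    (base_index, rule_index))             # step 1607
            elif kind == "label":                         # step 1608
                label_list[value] = base_index            # step 1609
        rule_bases.append(rules)
    return rule_bases, dictionary, label_list


otl = [
    [("label", "medicine"), ("rule", "r1"), ("word", "bacteria"), ("word", "virus")],
    [("label", "oil"), ("rule", "r2"), ("word", "crude")],
]
rule_bases, dictionary, label_list = load_outlines(otl)
print(dictionary["virus"], label_list)
```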

For each OTL file considered, by being selected at step 1602, a rule base is generated and a plurality of such rule bases is illustrated in Figure 17. Thus, a first OTL file processed in accordance with the procedures shown in Figure 10 results in the generation of a first rule base 1701. Similarly, further iterations of the procedures shown in Figure 7 result in the generation of rule bases 1702 to 1709. Typically, for a specific installation, in the order of three thousand rule bases would be generated by execution of the procedures illustrated in Figure 10.

Rule bases 1701 to 1709 are stored in memory 1502, which also provides storage space for a dictionary 1721, a label list 1722 and a data buffer 1723. The dictionary stores a list of words which have importance in any of the stored rule bases. Associated with each word in the dictionary, there is at least one pointer, and possibly many pointers, to specific entries in specific rule bases 1701 to 1709. Thus, the words identified at 803 in Figure 8 would all be included in dictionary 1721. Entries within the dictionary 1721 are implemented upon execution of step 1607 in Figure 16.

Similarly, execution of step 1609, creating a new entry in the label list, allows a label to relate to rules that are elsewhere in the tree structure.

Processes performed by the data processing system 104 for associating preferred terms with the source files are detailed in Figure 18. At step 1801 central processor 205 obtains access to one of the subsidiary processors 211 to 218. The central processor then expects to receive authorisation so that communication may be effected with one of the subsidiary processors. After a connection has been established, the source file is supplied to the selected subsidiary processor at step 1803 and at step 1804 the data is processed to determine associated preferred terms.

After performing the processing at step 1804, the results are transmitted back to the central processing system at step 1805 and at step 1806 data with associated common categories is stored and data pointers associated with the categories are updated at step 1807.

Step 1804 for the processing of data to determine associated categories is detailed in Figure 19. The overall processing is broken down into three major phases, consisting of a triggering phase at 1901, followed by a scoring phase at 1902, followed finally by a list generation phase at step 1903.

Triggering phase 1901 is detailed in Figure 20. At step 2001 a section of the data, such as its title, market sector or main body of text, is identified and at step 2002 an item of the identified section is selected. At step 2003 a question is asked as to whether the item indicates a new context, which may be considered as a grammatical marker in the form of a full stop, capital, start of a sentence or quotation marks et cetera. When answered in the affirmative new context information is supplied to all rule bases 1701 to 1709 at step 2004 and control is then directed to step 2007.

If the question asked at step 2003 is answered in the negative, step 2004 is bypassed and a look-up address is obtained for rule objects in rule bases from the dictionary at step 2005. Thereafter, at step 2006 all addressed objects are triggered and a multiplication of scores is effected by a score weighting factor. Thereafter, at step 2007 a question is asked as to whether another item is present and when answered in the affirmative control is returned to step 2002.

Eventually all of the items for a selected section will have been considered resulting in the question asked at step 2007 being answered in the negative. Thereafter, at step 2008 a question is asked as to whether another section is to be considered and when answered in the affirmative control is returned to step 2001.

At step 2001 the next section is identified and steps 2002 to 2008 are repeated. Eventually, all of the sections will have been considered and the question asked at step 2008 will be answered in the negative.
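The triggering phase can be sketched in Python under several stated assumptions: each section is supplied as a list of tokens, the set `CONTEXT_MARKERS` stands in for the grammatical markers mentioned above, the score weighting factor is taken to be a per-section weight, and a triggered object starts from a provisional score of one hundred. None of these representations is confirmed by the text; the function `trigger` is hypothetical.

```python
CONTEXT_MARKERS = {".", '"'}  # assumed stand-ins for grammatical markers


def trigger(sections, dictionary, section_weights):
    """Trigger rule objects addressed by the dictionary.

    sections: dict of section name -> list of tokens.
    dictionary: word -> list of pointers to rule objects.
    section_weights: assumed per-section score weighting factors.
    Returns (triggered scores keyed by pointer, number of context changes).
    """
    triggered = {}
    context_changes = 0
    for name, tokens in sections.items():              # steps 2001 / 2008
        weight = section_weights.get(name, 1.0)
        for token in tokens:                            # steps 2002 / 2007
            if token in CONTEXT_MARKERS:                # step 2003
                context_changes += 1                    # step 2004 (simplified)
                continue
            for pointer in dictionary.get(token.lower(), []):  # step 2005
                score = 100.0 * weight                  # step 2006
                # Keep the strongest trigger seen for each rule object.
                triggered[pointer] = max(triggered.get(pointer, 0.0), score)
    return triggered, context_changes


dictionary = {"virus": [(0, 0)], "crude": [(1, 0)]}
sections = {
    "title": ["New", "virus", "found", "."],
    "body": ["The", "virus", "spreads", "."],
}
print(trigger(sections, dictionary, {"title": 1.0, "body": 0.5}))
```

Because every word is resolved through one dictionary look-up, the cost per item does not grow with the number of rule bases, which appears to be the purpose of the dictionary pointers.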

Scoring phase 1902 is detailed in Figure 21. At step 2101 a rule base is selected and at step 2102 a score variable is re-set to zero. At step 2103 a branch is identified for score accumulation and at step 2104 scores are accumulated from triggered rules attached to the branch. At step 2105 a question is asked as to whether another branch is to be considered and when answered in the affirmative control is returned to step 2103. A next branch is selected at step 2103 with procedure 2104 being repeated. Eventually all of the branches will have been considered resulting in the question asked at step 2105 being answered in the negative.

At step 2106 an overall score in the range of zero to one hundred is stored for the rule base and at step 2107 a question is asked as to whether another rule base is present. When answered in the affirmative control is returned to step 2101 and steps 2101 to 2107 are repeated. Eventually, all of the rule bases will have been considered and the question asked at step 2107 will be answered in the negative.

The operations illustrated in Figure 21 may be considered with reference to the illustration of the structure in Figure 8. Thus, if any of the defined words at 803 are identified within the file, a provisional score of one hundred will be allocated. However, the process shown in Figure 21 must then ascend the branches so that any scores lower down will be modified in response to scores higher up the structure.
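One possible reading of this scoring phase is sketched below in Python. The accumulation rule is not specified in the text, so the sketch assumes a tree in which a matched word contributes a provisional score of one hundred, each branch takes the best score of its children, and every node applies its own weight as the score ascends. The functions `score_node` and `score_rule_bases` and the nested-dictionary layout are hypothetical.

```python
def score_node(node, triggered_words):
    """Score one node of an assumed rule tree.

    node: dict with an optional "word" (leaf), an optional "weight"
    between 0 and 1, and optional "children".
    """
    weight = node.get("weight", 1.0)
    if "word" in node:
        # Provisional score of one hundred for a matched word.
        base = 100.0 if node["word"] in triggered_words else 0.0
    else:
        # Assumption: a branch takes the best score found beneath it.
        base = max(
            (score_node(child, triggered_words)
             for child in node.get("children", [])),
            default=0.0,
        )
    # Scores from lower down are modified by the weight higher up.
    return min(100.0, base * weight)


def score_rule_bases(rule_bases, triggered_words):
    """Return an overall score in the range 0 to 100 per rule base."""
    return {name: score_node(root, triggered_words)
            for name, root in rule_bases.items()}


rule_bases = {
    "medicine": {"children": [
        {"weight": 0.5, "children": [{"word": "virus"}, {"word": "bacteria"}]},
    ]},
    "oil": {"children": [{"word": "crude"}]},
}
print(score_rule_bases(rule_bases, {"virus"}))
```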

Phase 1903 for the generation of a list of associated preferred terms is detailed in Figure 22. At step 2201 a rule base is identified having a score greater than a predetermined threshold. Thus, for a particular application a threshold may be set at forty-eight percent. At step 2202 additional triggered preferred data characteristics are identified by associating successful rule bases with parent categorisations by rule base links.

At step 2203 lists of successful and inferred rule bases are combined to form overall lists of preferred data characteristics. Step 2203 results in data being generated by a subsidiary processor, such as processor 211, which is then supplied back to the central processing system 205.
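Steps 2201 to 2203 may be sketched in Python as below. The threshold of forty-eight is the example value given above; the function `preferred_terms`, the `parent_links` mapping used to represent rule base links, and the category names in the example are assumptions for illustration.

```python
def preferred_terms(scores, parent_links, threshold=48.0):
    """Combine successful and inferred rule bases into one list.

    scores: rule base name -> overall score (0 to 100).
    parent_links: rule base name -> parent categorisation (assumed form
    of the rule base links).
    """
    # Step 2201: rule bases scoring above the threshold.
    successful = [name for name, score in scores.items() if score > threshold]

    # Step 2202: infer parent categorisations of each successful rule base.
    inferred = []
    for name in successful:
        seen = {name}  # guards against cyclic links
        parent = parent_links.get(name)
        while parent is not None and parent not in seen:
            seen.add(parent)
            if parent not in successful and parent not in inferred:
                inferred.append(parent)
            parent = parent_links.get(parent)

    # Step 2203: combined list of preferred data characteristics.
    return successful + inferred


scores = {"oil_industry_netherlands": 80.0, "medicine": 30.0, "shipping": 48.0}
links = {"oil_industry_netherlands": "oil_industry", "oil_industry": "energy"}
print(preferred_terms(scores, links))
```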

Central processing system 205 is responsible for constructing a table of the type shown in Figure 23 in which an entry is present for each common category. The specific categories are stored in column 2301 and, for each of these terms, column 2302 defines a specific pointer to a position in memory associated with the central processing system 205. Specific data files are identified by file names and the number of files associated with each category is variable, depending on the nature and the amount of input data being considered. Thus, in order for this data to be accessible quickly while optimising use of the storage capacity within the central processing system 205, an indication of the file name is stored in the form of a linked list as illustrated in Figure 24.

The preferred term "OIL_INDUSTRY" has been associated with a pointer 0F8912, as shown in Figure 23. Address 0F8912 is the first in column 2401 of the linked list. Column 2402 identifies a particular file name and column 2403 identifies the next pointer in the list. Thus, entry 0F8912 points to a particular file with the file name "OIL_INDUSTRY_NETHERLAND_3" with a further pointer to memory location 0F8A20. At memory location 0F8A20 a new file name is provided, illustrated at column 2402, and again a new pointer is present at column 2403. Eventually, all relevant files will have been considered and the end of the list is identified by address 000000 at the pointer location in column 2403.
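The table of Figure 23 and the linked list of Figure 24 can be modelled in a few lines of Python. The addresses 0F8912, 0F8A20 and 000000 and the first file name come from the description above; the second file name, the dictionary representation and the function `files_for_category` are hypothetical.

```python
END_OF_LIST = 0x000000  # address marking the end of a list

# Figure 23 (sketch): category -> pointer to the first list entry.
category_table = {"OIL_INDUSTRY": 0x0F8912}

# Figure 24 (sketch): address -> (file name, next pointer).
linked_list = {
    0x0F8912: ("OIL_INDUSTRY_NETHERLAND_3", 0x0F8A20),
    0x0F8A20: ("EXAMPLE_FILE_2", END_OF_LIST),  # hypothetical file name
}


def files_for_category(category, table, nodes):
    """Follow the linked list for a category and collect its file names."""
    names = []
    address = table.get(category, END_OF_LIST)
    while address != END_OF_LIST:
        name, address = nodes[address]
        names.append(name)
    return names


print(files_for_category("OIL_INDUSTRY", category_table, linked_list))
```

A linked list lets each category hold a variable number of file names without reserving a fixed-size block per category, which matches the storage concern stated for Figure 24.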

Procedures for performing a search in response to a user request are detailed in Figure 25. At step 2501 a user logs onto the system and at step 2502 a search method is identified. At step 2503 search criteria are defined and at step 2504 search criteria are processed to determine preferred terms. At step 2505 a list of preferred terms is supplied to the central processing system 205.

At step 2506 a question is asked as to whether the host has responded and when answered in the affirmative titles of associated data files are displayed at step 2507. At step 2508 a question is asked as to whether the user wishes to view identified data and when answered in the affirmative the data is viewed at step 2509, after being downloaded over the communication channel. At step 2510 a question is asked as to whether another search is to be performed and when answered in the affirmative control is returned to step 2502.

Common data associated files are supplied to storage device 223 and an example of such a file is shown in Figure 26. File 2601 is a processed version of file 1401 and includes all the information present in file 1401. In addition, references to common categories have been added to the top of the file, as shown at 2602 before title line 1403.

Claims

What We Claim Is:

1. Apparatus for analysing incoming data files to associate said files with predetermined categories, comprising data file input means, processing means and alerting means, wherein said data file input means supplies data files to said processing means; said processing means is configured to: (a) associate data files with common categories, and to (b) associate data files against user-specified categories; and said alerting means is configured to alert a user on detecting a user-specified association, wherein said processing means is also configured to associate data files against user-specified categories that include references to said selected ones of said common category associations.

2. Apparatus according to claim 1, including storage means, wherein said processing means is configured to write a table to said storage means identifying files associated to each of said common categories.

3. Apparatus according to claim 1 or claim 2, wherein said processing means is configured to provide a user-selection environment, inviting users to select common categories.

4. Apparatus according to any of claims 1 to 3, wherein said processing means is configured to produce a user-selection environment, inviting users to define free text.

5. A method of analysing incoming data files to associate said files with predetermined categories, comprising the steps of receiving data files; associating said data files with common categories; associating said data files against user-specified categories; and alerting users on detecting a user-specified association, wherein associations for user-specified categories include references to selected ones of the common category associations.

6. A method according to claim 5, wherein a table identifying files associated to each of the common categories is written to storage means.

7. A method according to claim 5 or claim 6, wherein a user-selection environment invites users to select common categories.

8. A method according to any of claims 5 to 7, wherein a user-selection environment is produced, inviting users to define free text.

9. A computer system programmed to execute stored instructions such that in response to said stored instructions said system is configured to associate data files with common categories; associate data files against user-specified categories; and alert a user on detecting a user-specified association, wherein said user-specified categories include references to selected ones of said common category associations.

10. A computer system programmed to execute stored instructions according to claim 9, configured to write a table to storage means identifying files associated to each of said common categories.

11. A computer system programmed to execute stored instructions according to claim 9 or claim 10, configured to provide a user-selection environment inviting users to select common categories.

12. A computer system programmed to execute stored instructions according to any of claims 9 to 11, configured to produce a user-selection environment, inviting users to define free text.

13. A computer-readable medium having computer-readable instructions executable by a computer such that, when executing said instructions, a computer will perform the steps of: receiving data files; associating said data files with common categories; associating said data files against user-specified categories; and alerting users on detecting a user-specified association, wherein associations for user-specified categories include references to selected ones of the common category associations.

14. A computer-readable medium having computer-readable instructions according to claim 13, such that when executing said instructions a computer will also perform the step of writing a table identifying files associated to each of the common categories to storage means.

15. A computer-readable medium having computer-readable instructions according to claim 13 or claim 14, such that when executing said instructions a computer will also perform the step of generating a user-selection environment inviting users to select common categories.

16. A computer-readable medium having computer-readable instructions according to any of claims 13 to 15, such that when executing said instructions a computer will also perform the step of generating a user-selection environment and inviting users to define free text.