WO1999008226A1 - Retrospective conversion

Info

Publication number: WO1999008226A1
Application number: PCT/NL1998/000452
Authority: WO
Inventors: Johannes Van Gent; Rudie Ekkelenkamp
Original assignee: Nederlandse Organisatie voor Toegepast-Natuurwetenschappelijk Onderzoek TNO
Priority date: 1997-08-11
Filing date: 1998-08-07
Publication date: 1999-02-18
Also published as: EP1004090A1; NL1006757C2; AU8752598A

Abstract

The present invention relates to a method and device for linking written or typed information in a document to database information from a database, comprising: converting the written or typed information into a form suitable for a computer; selecting keywords from the converted written or typed information; drawing up a key part-word list consisting in each case of a number of successive characters of the keywords; drawing up a database word list of the database words occurring in the database; drawing up a database part-word list consisting in each case of the above stated number of successive characters of words from the database word list, wherein each database part-word contains a reference to the database word of which it forms part; comparing the key part-words from the document with the database part-word list; selecting on the basis of the comparison the database words corresponding with the keywords; and linking a document to a record of the external database on the basis of the selection of database words corresponding with the keywords.

Description

RETROSPECTIVE CONVERSION

The present invention relates to a method and device for converting information arranged on paper to an electronically structured data file or database. This conversion is also referred to as retrospective conversion.

Paper data files have been built up in the past for different purposes such as for instance library catalogues, birth registers, criminal records and the like, wherein large quantities of information are rendered on paper. The information is herein often arranged in handwritten or typed form or in the form of stamps and the like.

The drawback of written or typed data files is that they are difficult and time-consuming to access. With the advent of automatic data processing using computers, it has become much easier to gain access to large data files and to search through the large quantities of information stored in them.

The problem of automatic data processing, however, is that all information written or typed in the past must first be converted to an electronic database. This can be done, for instance, by having all the information on paper retyped manually and thus entered into an electronic data file. The drawback of this, however, is that it is very expensive and time-consuming.

There is also the option of automating the conversion by using scanners to convert the written or typed information into digital images which are stored in the computer. The stored digital images are subsequently converted, using character recognition techniques (Optical Character Recognition, OCR), into characters which can be read by the computer.

According to the known art, automation of this conversion, or retrospective conversion, is not possible since the image quality of the digital images required for a good character recognition is generally not achieved. And even if the character recognition is good enough to provide a text which could be properly read by the computer, a correct conversion into a database format is often not possible since the structure of the written or typed information is not consistent and cannot therefore be understood by the computer. This is understood to mean that for instance the position of particular information in a document is not fixed, so that automatic conversion of this information to the fields of the database is not possible.

The object of the present invention is therefore to provide a method and device with which written or typed information can be stored, correctly structured, in an electronic database.

The invention therefore relates to a method for linking written or typed information in a document to database information from a database, comprising:

- converting the written or typed information into a form suitable for a computer, for instance in the form of an ASCII text;

- selecting keywords from the converted document;

- drawing up a key part-word list consisting in each case of a number of successive characters of the keywords;

- drawing up a database word list of the database words occurring in the database;

- drawing up a database part-word list consisting in each case of the above stated number of successive characters of words from the database word list, wherein each database part-word contains a reference to the database word of which it forms part;

- comparing the key part-words from the document with the database part-word list;

- selecting on the basis of the comparison the database words corresponding with the keywords; and

- linking a document to a record of the external database on the basis of the selection of database words corresponding with the keywords.

The present invention also comprises a device for linking written or typed information in a document to database information from a database, comprising:

- converting means for converting the written information into a form suitable for a computer;

- keyword selection means for selecting keywords from the converted document;

- keyword list means for drawing up a key part-word list consisting in each case of a number of successive characters of the keywords;

- database word list means for drawing up a database word list of the database words occurring in the database;

- database part-word means for drawing up a database part-word list consisting in each case of the above stated number of successive characters of words from the database word list, wherein each database part-word contains a reference to the database word of which it forms part;

- comparing means for comparing the key part-words from the document with the database part-word list;

- database word selection means for selecting, on the basis of the comparison, the database words corresponding with the keywords; and

- linking means for linking a document to a record of the external database on the basis of the selection of database words corresponding with the keywords.

A preferred embodiment of the present invention will be described hereinbelow with reference to figures, in which:

- figure 1 is a block diagram of a device for converting written or typed information into a structured electronic data file;

- figure 2 is a block diagram of the method for converting written or typed information into a structured text, without the text having been corrected for errors;

- figure 3 is an example of a library card on which retrospective conversion must be performed;

- figure 4 is an example of the same library card after performing segmentation of the fore- and background;

- figures 5a and 5b show another example of performing segmentation of the fore- and background;

- figure 6 is an example of the same library card after performing image processing steps a to j;

- figure 7 is a block diagram showing the method for indexing an external database;

- figure 8 is a block diagram which shows schematically the method for selecting keywords;

- figure 9 is a block diagram showing the method of direct linking of documents to records of a comparison database;

- figure 10 is an example of the comparison of trigrams, i.e. comparison of combinations of three successive characters;

- figure 11 is a block diagram showing the method of indirect linking of documents to records of a comparison database.

The preferred embodiment of the present invention relates to the conversion of data on library cards to an electronic database. In view of the large quantity of library cards to be converted (in the Royal Library in The Hague, for instance, about five million), manual conversion to a database is too costly and too time-consuming. Automatic conversion, on the other hand, is very difficult in view of the wide variety in structure: information such as title, author, location and the like does not occupy a fixed position on the library card. The variety in handwriting, the variety in typefaces (fonts) used and possible soiling of the card, for instance by coffee stains, can also hamper automatic conversion.

Figure 1 shows a preferred embodiment of a device with which the conversion can be effected. The device comprises inter alia a scanner 1, with which the library cards can be read in, a connection 2 to an external database 5, a storage medium 3 for storing data and a computer 4 for controlling the data conversion.

Figure 2 shows a block diagram representing the method for converting written or typed information into a form readable by the computer. Linking to an external database has not yet taken place at this stage.

An example of a library card is shown in figure 3. A library card is characterized by the following three characteristics:

- a title description containing information about the title, author and edition of the relevant book, which information relates only to the document to which it refers and not to the specific location where the document can be found;

- a signature containing information about the specific location in the library where a copy of the document can be found; and

- a logical structure which indicates that the title description of the library card is divided into segments designating the different types of bibliographical information.

a. Reading in the documents

The documents or library cards are converted into digital images using a scanner. These digital images can be binary, i.e. black-and-white values only, or can contain colour and/or grey tones. In order to carry out effective image processing at a later stage, digital images with grey tones and/or colour tones are preferably generated. This is however not essential for the method of the present invention.

b. Foreground-background segmentation

The quality of the digital image is subsequently enhanced by means of image enhancement techniques, such as foreground-background segmentation, wherein the text (foreground) on a card is separated from other information (background), such as for instance coffee stains, colours and patterns in the background and the like. Figure 4 shows the result of the image enhancement by foreground-background segmentation. Background information in the form of edges of the stamp of the signature is herein removed from the digital image of the library card. Figures 5a and 5b show another example, wherein the background pattern of the image is removed from the original image by segmentation.

The digital image is then subjected to automatic extraction and marking of logical components in the library card. Logical components are parts of the image which a normal user regards as a meaningful unit or entity. Logical components in a scientific book are for instance notes, chapter titles, paragraph titles, footnotes and the like. The purpose is to code each library card, hereafter also referred to as document, in accordance with a given document type definition (DTD), as used in SGML coding, which describes the logical structure of the document, in this case the library card itself. A distinction is made between micro and macro objects. A micro object can be seen as a single spot of black, connected pixels surrounded by "white space". A macro object is built up of various micro objects or macro objects: for instance, the macro object word is built up of various characters, the macro object line is built up of various words, etc. A series of image processing steps is performed to extract the connected components and their characteristics, followed by marking steps in which components are grouped into macro objects and are marked as SGML elements, as will be described hereinbelow.

c. Micro object extraction

After scanning and optionally segmenting the image, the resulting binary image is read into the memory of the computer and the micro objects of the digital image are extracted. As already stated above, a micro object can be seen as a single spot of connected pixels surrounded by "open" space (white space). An example of a micro object is a single letter "e". The letter "i" is built up of two objects, i.e. the dot and the rest of the letter. Each object is described by a frame formed by the left and top coordinate of the object and the width and height of the object. The result of micro object extraction is a list of micro objects.

d. Histogram analysis

Since the documents do not have a uniform layout, only very general knowledge, also referred to as "domain-independent" knowledge, is present a priori concerning the layout of the library cards. Domain-independent knowledge is for instance the fact, known a priori, that words and sentences in a particular language are written from left to right. It is therefore important to extract the relevant information from the document itself. Histogram analysis can be used for this purpose.

All objects determined during the micro object extraction are used in making histograms for the coverage, which is defined as the percentage of black pixels within the frame of an object, the height, which is defined by the height of the frame of an object, and the surface area, defined by the surface area of the frame of an object. Histograms are also made of the entire document (thus, all objects together).

e. Marking of micro objects

This process is based on algorithms each making use of the statistical data originating from the histogram analysis. A distinction is herein made between text objects and other objects, such as for instance photo, table or graphic objects.
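The statistical data used here come from the frames found in step c and the measurements of step d. A minimal sketch of both is given below, under assumptions that are not in the text: the binary image is a list of rows with 1 for black, pixels are joined by 8-connectivity, and coverage is returned as a fraction rather than a percentage. The function name `extract_micro_objects` is illustrative only.

```python
from collections import deque

def extract_micro_objects(image):
    """Return one frame (with measurements) per connected spot of black pixels."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over 8-connected black pixels (assumption).
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            top = min(y for y, _ in pixels)
            left = min(x for _, x in pixels)
            height = max(y for y, _ in pixels) - top + 1
            width = max(x for _, x in pixels) - left + 1
            area = width * height
            objects.append({
                "left": left, "top": top, "width": width, "height": height,
                "area": area,
                # Black pixels inside the frame divided by all pixels in the frame.
                "coverage": len(pixels) / area,
            })
    return objects

# A tiny binary image of a letter "i": the dot and the stem are separate objects.
IMAGE = [
    [0, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
]
micro_objects = extract_micro_objects(IMAGE)
for obj in micro_objects:
    print(obj)
```

Histograms of coverage, height and area can then be built from the returned list, per object type or for the whole card.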

The detection of these objects is performed with a decision tree which makes use of various parameters. For each object type it is described which parameters are important and how these can be used to separate the different objects. The parameters used are for instance: left-hand position of the object, top position of the object, width of the object, height of the object, width/height ratio of the object, area of the object, number of object pixels, and coverage (number of object pixels within the frame divided by the total number of pixels within the frame). Photographs can for instance be detected because the surface area of the object is large, the coverage is high and the width/height ratio is about 0.05 to 5. Text can for instance be detected by its small size and width. The heights are compared with the average height of all objects on the library card. Width is not a very reliable parameter because letters may still be joined together after having been scanned.

f. Straightening documents

When the library cards are fed askew into the scanning equipment they will also be stored askew. Obliqueness of the library cards can make further processing more difficult. The digital image of the library cards is therefore straightened out. Obliqueness correction is based on rotating the frames of the objects through the angle detected in the obliqueness detection phase. Obliqueness detection can be implemented in various ways.

g. Macro object extraction

After marking of the micro objects, micro objects such as letters are grouped into macro objects, such as paragraphs. In this process knowledge about the logical structure of the library card is used together with the results of the histogram analyses. Documents are usually written in horizontal and vertical direction. By projecting all frames onto a horizontal and a vertical axis, areas running horizontally or vertically through the document can be found in which no micro objects are situated (so-called "white rivers"). If these areas are wide or high enough, a title or column can for instance be detected.

h. Marking of macro objects

Once the macro objects have been determined they are marked, for instance with HTML markings.

i. Optical character recognition

After the objects have been marked, they can be converted into a form readable by a computer, such as for instance ASCII format, by making use of standard optical character recognition techniques (optical character recognition is the process of reading a collection of pixels and converting it into letters). In addition to an ASCII text, the optical character recognition technique also provides knowledge concerning the typeface (font), the character size and the style, which information can be used again at a later stage.

j. DTD-controlled analysis

In addition to the above mentioned domain-independent prior knowledge, domain-dependent prior knowledge may be present, i.e. foreknowledge depending on specific characteristics of the document to be recognized. If for instance the basic layout of a library card or the character size of certain parts of the library card is known, this information can be described in a DTD (document type definition), which specifies the objects to be recognized, the relations between them and also knowledge about the objects themselves. All objects can if necessary be reclassified on the basis of the document type definition of the document used and the results of the optical character recognition. It is noted that the DTD analysis is the only step in which prior information concerning specific characteristics of the document to be recognized is used.

Figure 6 shows the result of the image processing steps a-j, together also known as "Layout Semantics Discovery (LSD)", on the library card of figure 3. As can be seen from the figure, a distinction has been made between the title description and the signature.
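As an illustration of step g, the "white rivers" can be located by projecting the object frames onto the vertical axis and looking for runs of rows that no frame touches. The sketch below is one possible way to do this, not necessarily the one used in the embodiment; the function name `find_white_rivers`, the frame dictionaries and the `min_gap` threshold are all assumptions made for the example.

```python
def find_white_rivers(frames, page_height, min_gap):
    """Return (start, end) row ranges, end exclusive, that no frame touches."""
    occupied = [False] * page_height
    for frame in frames:
        for y in range(frame["top"], frame["top"] + frame["height"]):
            occupied[y] = True
    rivers, start = [], None
    # Iterate one row past the page so a river reaching the bottom edge is closed.
    for y in range(page_height + 1):
        blank = y < page_height and not occupied[y]
        if blank and start is None:
            start = y
        elif not blank and start is not None:
            if y - start >= min_gap:
                rivers.append((start, y))
            start = None
    return rivers

# Hypothetical frames: a title line followed by two closely spaced body lines.
frames = [
    {"top": 2, "height": 3},
    {"top": 10, "height": 3},
    {"top": 14, "height": 3},
]
rivers = find_white_rivers(frames, page_height=20, min_gap=4)
print(rivers)
```

With these values only the gap between the title and the body is high enough to count as a river; the one-row gap between the two body lines is ignored. The same projection on the horizontal axis would reveal column boundaries.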

Certain errors have occurred in the conversion of the pixels into letters readable by the computer (OCR). The phrase "Universiteitsbibliotheek Amsterdam" in the signature has thus been read as "©terdam" and "Amsterdam" in the title description has been read as "Anisterdarn". The title description is subsequently compared with an external database which includes a large number of correct title descriptions.

The method for comparing the results of steps of the LSD analysis to the external database depends on the possibility of performing standard database operations on the external database. If this is not the case, no direct comparison can therefore be made. The comparison can however take place in indirect manner. On the basis of a list of all words occurring in the external database, a library lexicon is made whereby the LSD results are compared via the library lexicon instead of the external database itself.

INDEXING OF THE COMPARISON DATABASE

Figure 7 shows in a block diagram the method for indexing the external database, or comparison database. The words occurring in the records of the comparison database are first lemmatized 5. This means that the words present in the database records are reduced to their lemma. This takes place for instance by comparing each word of the comparison database to a word list in which relations between words and their lemmas are recorded. Table 1 below gives an example of a comparison database of two records.

Record I:

@Vondel# Joost van den# 's Gebroeders# Treurspel# 5 bedrijven# Amsterdam# Dominique van der Stichel# Unger 322@

Record II:

@Vondel# Joost van den# Gijsbregt van Aemstel# Treurspel# ???# Amsterdam# Bezige Bij# ???@

Table 1 Example of a comparison database of two records before lemmatizing

Lemmatizing of these two records provides the two records of table 2.

Record I:

@Vondel# Joost van den# 's broeder# Treurspel# 5 bedrijf# Amsterdam# Dominique van der Stichel# Unger 322@

Record II:

Table 2 Example of a comparison database of two records after lemmatizing

The word "gebroeders" is lemmatized to "broeder", the word "bedrijven" to "bedrijf". This lemmatizing step is optional and can be omitted without departing from the present invention.
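A word-list lookup of this kind can be sketched as follows. Only the two word/lemma pairs come from the text; the dictionary name `LEMMA_LIST`, the lowercasing of words and the fallback of returning an unknown word unchanged are assumptions made for the example.

```python
# Hypothetical word-to-lemma list; only these two pairs are taken from the text.
LEMMA_LIST = {
    "gebroeders": "broeder",
    "bedrijven": "bedrijf",
}

def lemmatize(word, lemma_list=LEMMA_LIST):
    """Reduce a word to its lemma; words not in the list are kept as they are."""
    key = word.lower()
    return lemma_list.get(key, key)

words = ["Vondel", "Gebroeders", "Treurspel", "bedrijven"]
lemmas = [lemmatize(word) for word in words]
print(lemmas)
```

Because unknown words fall through unchanged, omitting the lemmatizing step amounts to using an empty word list.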

The records of the comparison database, whether or not lemmatized, are then indexed in the following manner:

A. By file inversion 6, wherein an inverted file is created on the basis of the records of the comparison database, whether lemmatized or not. The inverted file is referred to as the lemma-based inverted file (LBIF) 7 and contains an alphabetically ordered list of lemmas, wherein each lemma contains a reference to all database records in which the words from which it is derived occur.

Table 3 shows the result of inverting the two lemmatized records of the comparison database of table 2.

1-aemstel: 2
2-amsterdam: 1,2
3-bedrijf: 1
4-bezige: 2
5-bij: 2
6-broeder: 1
7-den: 1,2
8-der: 1
9-dominique: 1
10-gijsbregt: 2
11-joost: 1,2
12-stichel: 1
13-unger: 1
14-treurspel: 1,2
15-van: 1,2
16-vondel: 1,2

Table 3 Lemma-based-inverted-file (LBIF) of the comparison database of table 2
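The file inversion can be sketched in a few lines. The sketch below makes several assumptions that the text does not state: fields are split on any non-letter character, words are lowercased, one-letter tokens and numbers are dropped, and Record II (which has no lemmatized form printed above) is taken unchanged from table 1. The function name `build_lbif` is illustrative.

```python
import re

# Record I as lemmatized in table 2; Record II taken from table 1 (assumption).
RECORDS = {
    1: "Vondel# Joost van den# 's broeder# Treurspel# 5 bedrijf# "
       "Amsterdam# Dominique van der Stichel# Unger 322",
    2: "Vondel# Joost van den# Gijsbregt van Aemstel# Treurspel# ???# "
       "Amsterdam# Bezige Bij# ???",
}

def build_lbif(records):
    """Map each lemma to the sorted list of record numbers in which it occurs."""
    inverted = {}
    for record_id, text in records.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) >= 2:  # drop fragments such as the "s" of "'s"
                inverted.setdefault(word, set()).add(record_id)
    # Alphabetical order of lemmas, as in the inverted file.
    return {lemma: sorted(ids) for lemma, ids in sorted(inverted.items())}

lbif = build_lbif(RECORDS)
for number, (lemma, record_ids) in enumerate(lbif.items(), start=1):
    print(f"{number}-{lemma}: {','.join(map(str, record_ids))}")
```

Under these assumptions the sketch yields the same sixteen lemmas and record references as table 3, although strict alphabetical sorting places "treurspel" before "unger".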

The LBIF contains all lemmas of the comparison database in alphabetical order, wherein each lemma contains a reference to each record of the database in which it occurs.

B. By trigram indexing 8, wherein an additional index (called trigram index 9) is made on the basis of this inverted file. The trigram index contains an alphabetically ordered list of trigrams, wherein each trigram consists of three letters originating from the lemmas of the lemma-based inverted file and wherein references are included to all locations of the trigram in the lemma-based inverted file. Table 4 shows the trigram index on the basis of the LBIF of table 3.

*ae: 1
*am: 2
*de: 7, 8
edr: 3
el*: 1, 12, 14, 16
*vo: 16

Table 4 Trigram index
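The entries of table 4 can be reproduced with the short sketch below. It treats "*" simply as a literal marker placed before and after each lemma, which is an assumption about how entries such as "*ae" and "el*" are formed; the function names are illustrative, and the lemma numbering follows table 3.

```python
# Lemmas in the numbering of table 3 (position 1 = "aemstel").
LEMMAS = [
    "aemstel", "amsterdam", "bedrijf", "bezige", "bij", "broeder", "den",
    "der", "dominique", "gijsbregt", "joost", "stichel", "unger",
    "treurspel", "van", "vondel",
]

def trigrams(word):
    """Split a word into overlapping three-character part-words."""
    padded = "*" + word.lower() + "*"  # "*" marks the word boundaries (assumption)
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_trigram_index(lemmas):
    """Map each trigram to the positions of the lemmas that contain it."""
    index = {}
    for position, lemma in enumerate(lemmas, start=1):
        for tri in trigrams(lemma):
            index.setdefault(tri, set()).add(position)
    return {tri: sorted(positions) for tri, positions in sorted(index.items())}

trigram_index = build_trigram_index(LEMMAS)
for tri in ("*ae", "*am", "*de", "edr", "el*", "*vo"):
    print(tri, trigram_index[tri])
```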

C. By vector space modelling 10, wherein a vector space index (vector space model VSM index 11) is made on the basis of the records, whether or not lemmatized, of the comparison database.

Vector spaces are described in chapter 14 of the book entitled "Information Retrieval", by William B. Frakes and Ricardo Baeza-Yates, 1992. The records of the comparison database are provided with coordinates. The value of a coordinate is a measure of the relevance of a particular word (or lemma) for the relevant record in the comparison database.

Starting from the lemmas of table 3, it is possible to designate for both records of table 2, for instance with a "1", that a particular lemma is found in the record and with a "0" that it is not. Since in this example there are 16 lemmas in a comparison database of two records, the VSM index 11 contains 16 coordinates per record to describe its characteristics. The 16 coordinates together define a 16-dimensional vector space. Table 5 shows the VSM index 11 for the present example, with the coordinates in the order of the lemma numbers of table 3.

Record I = (0,1,1,0,0,1,1,1,1,0,1,1,1,1,1,1)
Record II = (1,1,0,1,1,0,1,0,0,1,1,0,0,1,1,1)

Table 5 VSM index

On the basis of a vector operation, which is based on the cosine relation between the vectors, the degree of similarity between the VSM index 11 (originating from the comparison database) and the title description of the library card can then be determined. Instead of giving the values "1" and "0" for the presence or absence of a lemma in a record, a weighting can also be performed which for instance takes into account the number of times a lemma occurs in a record. Weighting factors can also be determined on other grounds and vary for instance from 0 to 1 depending on the degree to which lemmas identify the records. In the example it is assumed for the sake of simplicity that no weighting is performed on the coordinates, which does not however detract from the method according to the present invention.
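The difference between unweighted and weighted coordinates can be illustrated as follows. The sketch derives the coordinates of each record from its list of lemmas, in the lemma order of table 3; the lemma list of Record II is taken from table 1, and the occurrence-count weighting is only one of the weightings the text allows. The names `vsm_vector` and `RECORD_LEMMAS` are assumptions.

```python
# Lemmas in the numbering of table 3.
LEMMAS = [
    "aemstel", "amsterdam", "bedrijf", "bezige", "bij", "broeder", "den",
    "der", "dominique", "gijsbregt", "joost", "stichel", "unger",
    "treurspel", "van", "vondel",
]

# Lemmas per record, in reading order ("van" occurs twice in each record).
RECORD_LEMMAS = {
    1: ["vondel", "joost", "van", "den", "broeder", "treurspel", "bedrijf",
        "amsterdam", "dominique", "van", "der", "stichel", "unger"],
    2: ["vondel", "joost", "van", "den", "gijsbregt", "van", "aemstel",
        "treurspel", "amsterdam", "bezige", "bij"],
}

def vsm_vector(record_lemmas, lemmas=LEMMAS, weighted=False):
    """Return one coordinate per lemma: presence (0/1) or occurrence count."""
    if weighted:
        return [record_lemmas.count(lemma) for lemma in lemmas]
    return [1 if lemma in record_lemmas else 0 for lemma in lemmas]

vsm_index = {rid: vsm_vector(words) for rid, words in RECORD_LEMMAS.items()}
weighted_index = {rid: vsm_vector(words, weighted=True)
                  for rid, words in RECORD_LEMMAS.items()}
print(vsm_index[1])
print(weighted_index[1])
```

In the weighted variant the coordinate for "van" becomes 2 in both records, so a frequent but uninformative lemma gains influence; this is why weightings that reflect how strongly a lemma identifies a record may be preferable.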

In respect of vector spaces it is otherwise noted that for a correct operation of vector space modelling the (key-) words of the title description input must be spelled correctly. This aspect will be further described at a later stage.

DETERMINATION OF THE SIGNIFICANT KEYWORDS IN ALL TITLE DESCRIPTIONS

Figure 8 is a block diagram in which the method for selecting the keywords from all title descriptions is shown. For efficiency reasons only a limited number of words are selected for comparing with the comparison database. All input records, or title descriptions, are inverted 12, i.e. all words from all title descriptions are placed in an alphabetically ordered list 13 with reference to all title descriptions in which these words occur. The words of the title descriptions are preferably lemmatized herein, although this is not essential. The frequency and distribution in the different title descriptions of each word in the alphabetically ordered list is then determined 14. Determining of the frequency and distribution can take place in very many different ways (see for instance § 14.5 of said book by Frakes and Baeza-Yates). The result of the method is an alphabetically ordered list 15 of keywords occurring in all title descriptions, wherein a numeric value is added to each word which is a measure of its significance.

Since the example is based on only one title description for recognizing, the alphabetically ordered keyword list 15 only contains words of this one title description. For the more general case with a large number of title descriptions, performing of statistical operations is more useful. As a result of the statistical operations, frequently occurring words, which are less suitable for functioning as keyword, will automatically acquire a lower degree of significance.

In addition to the use of statistical information, linguistic information can also be used in assigning the degree of significance. Thus the Dutch word "bij" is quite common as a preposition, but the noun "bij" (bee) is not. The length of the word can also be taken into consideration in assigning the degree of significance.
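One simple statistical measure, chosen here purely for illustration, scores a word by how few title descriptions contain it. The formula, the function name `keyword_significance` and the four sample title descriptions are all assumptions; the text leaves the choice of measure open and also allows linguistic information and word length to be taken into account.

```python
def keyword_significance(title_descriptions):
    """Score each word from 1.0 (occurs in one description) down to 1/n."""
    n = len(title_descriptions)
    doc_freq = {}
    for words in title_descriptions:
        # Count each word once per title description (distribution, not raw count).
        for word in {w.lower() for w in words}:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: round(1 - (df - 1) / n, 2)
            for word, df in sorted(doc_freq.items())}

# Hypothetical title descriptions, already split into words.
TITLE_DESCRIPTIONS = [
    ["Vondel", "Treurspel", "Anisterdarn"],
    ["Vondel", "Treurspel", "Amsterdam"],
    ["Hooft", "Treurspel", "Amsterdam"],
    ["Bredero", "Treurspel", "Leiden"],
]
significance = keyword_significance(TITLE_DESCRIPTIONS)
for word, value in significance.items():
    print(word, value)
```

A word such as "treurspel", which occurs in every sample description, receives the lowest value, while a misread word such as "anisterdarn" is rare and therefore scores highest.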

Table 6 shows such a list for the present example, wherein for each keyword a degree of significance is also given. In the present example the term "Anisterdarn" is the most infrequent and is therefore selected as the first, most significant keyword. The words of the input records, or title descriptions, are lemmatized, which results in the terms "bedrijven" and "Gebroeders" being lemmatized to "bedrijf" and "broeder" respectively.

keyword          significance value
- an             0.8
- Abr            0.8
- Anisterdarn    0.9
- bedrijf        0.1
- denl           0.8
- Domin          0.7
- broeder        0.1
- Jfoostl        0.8
- Stichel        0.7
- titelvign      0.7
- Treurspel      0.1
- Unger          0.8
- Vondel         0.8
- Wees           0.2

Table 6 Keywords

COMPARISON OF TITLE DESCRIPTION WITH EXTERNAL COMPARISON DATABASE

Figure 9 is a block diagram showing the method for comparing the title description with the external database and linking the title description to a record from the comparison database. The title descriptions in the input records are compared one by one with the information from the comparison database. From the title description the most infrequent n keywords are selected 16 on the basis of the keyword list with statistical information, wherein the number of keywords n can be varied and is determined in practice.

The keywords are then input into the trigram comparison module 17. In the present example the keyword "Anisterdarn" is chosen as first input into trigram comparison module 17. The keyword is divided into trigrams consisting of three letters (wherein the character "*" represents a random character from the alphabet). The trigrams of the keyword are subsequently compared to those of the above described trigram index 9. Figure 10 indicates how the comparison is effected for the present example. Four of the ten trigrams of "Anisterdarn" correspond with trigrams from trigram index 9 which have a reference to the second word of the LBIF 7. The second word in LBIF 7 is in this case "Amsterdam". This number is greater than the number of corresponding trigrams in the other words of LBIF 7, so that in this case it is decided that the word "Anisterdarn" in the title description to be recognized corresponds with the word "Amsterdam".
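The comparison can be sketched as a vote over the trigram index: every trigram of the keyword casts one vote for each lemma that contains it, and the lemma with the most votes is chosen. In the sketch the "*" is treated as a literal boundary marker and ties are broken in favour of the lower lemma number; both are assumptions, as are the function names.

```python
from collections import Counter

# Lemmas in the numbering of table 3.
LEMMAS = [
    "aemstel", "amsterdam", "bedrijf", "bezige", "bij", "broeder", "den",
    "der", "dominique", "gijsbregt", "joost", "stichel", "unger",
    "treurspel", "van", "vondel",
]

def trigrams(word):
    padded = "*" + word.lower() + "*"  # boundary markers (assumption)
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def build_trigram_index(lemmas):
    index = {}
    for position, lemma in enumerate(lemmas, start=1):
        for tri in trigrams(lemma):
            index.setdefault(tri, set()).add(position)
    return index

def best_match(keyword, lemmas, index):
    """Return (lemma, shared trigram count) for the best-matching lemma."""
    votes = Counter()
    for tri in set(trigrams(keyword)):
        for position in index.get(tri, ()):
            votes[position] += 1
    if not votes:
        return None, 0
    # Most shared trigrams wins; ties go to the lower lemma number.
    position, shared = max(votes.items(), key=lambda item: (item[1], -item[0]))
    return lemmas[position - 1], shared

INDEX = build_trigram_index(LEMMAS)
print(best_match("Anisterdarn", LEMMAS, INDEX))
print(best_match("Jfoostl", LEMMAS, INDEX))
```

For "Anisterdarn" the shared trigrams are "ste", "ter", "erd" and "rda", so "amsterdam" is selected with four votes, in line with the example of figure 10.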

The decision can also be taken on other grounds. A higher significance of particular trigrams relative to other trigrams can for instance be taken into consideration. The trigram "tje" for instance occurs so frequently in Dutch that a lower significance can be assigned to this trigram than to another less usual trigram.

The trigram comparison 17 is repeated for all keywords. The result is a list of keywords (or lemmas) from the LBIF corresponding to the keywords, wherein these keywords from the LBIF in turn contain a reference to the records in the comparison database in which they are situated.

It is possible to take account of how often the same record is referred to. This additional information is stored for use at a later stage. Thus, the number of times reference is made to a particular word can for instance be included in a relevance list, wherein the records to which most reference is made have the highest relevance (and vice versa) . Table 7 shows the result of the trigram comparison.

Keyword from title description    Keyword from LBIF    Reference to record
- Vondel                          vondel               1,2
- Jfoostl                         joost                1,2
- an                              van                  1,2
- denl                            den                  1,2
- Gebroeders                      broeder              1
- Treurspel                       treurspel            1,2
- bedrijven                       bedrijf              1,2
- Anisterdarn                     Amsterdam            1,2
- Domin                           dominique            1
- Stichel                         stichel              1
- Abr
- Wees
- titelvign
- Unger

Table 7 Result of the trigram comparison
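The relevance list mentioned above can be derived from table 7 by counting how often each record is referred to. The sketch below uses the references exactly as printed in table 7; the names `MATCHES` and `relevance_list` are illustrative.

```python
from collections import Counter

# Matched LBIF keywords and their record references, as listed in table 7.
MATCHES = {
    "vondel": [1, 2],
    "joost": [1, 2],
    "van": [1, 2],
    "den": [1, 2],
    "broeder": [1],
    "treurspel": [1, 2],
    "bedrijf": [1, 2],
    "amsterdam": [1, 2],
    "dominique": [1],
    "stichel": [1],
}

def relevance_list(matches):
    """Return (record, reference count) pairs, most referenced record first."""
    counts = Counter(record for records in matches.values() for record in records)
    return counts.most_common()

print(relevance_list(MATCHES))
```

Record 1 is referred to ten times and record 2 seven times, so record 1 has the highest relevance in this example.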

It has already been mentioned above that linking of words to the vector space index 11 using the vector space model is only useful if the words are spelled correctly. However, now that the keywords, whether spelled incorrectly or not, are linked using the trigram comparison module to correctly spelled keywords from LBIF 7, wherein these words are spelled correctly since LBIF 7 is built up of words from the comparison database, it has become possible to use the keywords from LBIF 7 corresponding with the keywords as input to the VSM comparison module.

The keywords (lemmas) from LBIF 7 corresponding with the keywords from the title description are "matched" in the VSM module 18 with the previously constructed VSM index 11 of the comparison database. For this purpose the title description is provided with coordinates in analogous manner as specified above in respect of the VSM index 11.

The degree of similarity between a record j of VSM index 11 and title description can be determined in various ways, for instance as follows:

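$$\mathrm{similarity}_j = \frac{\sum_i x_i \, v_{ij}}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i v_{ij}^2}}$$

in which $x_i$ = the i-th coordinate of the title description; $v_{ij}$ = the i-th coordinate of record j from the VSM index.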

The calculation of the similarity is repeated for all records of VSM index 11. The record 19 from VSM index 11 with the greatest similarity is then selected as being the record which most probably corresponds with the title description of the library card. The library card is therefore linked to this record of the comparison database.
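A cosine-based similarity and the selection of the best record can be sketched as follows. The record coordinates are derived here from the references of table 3 and the title description coordinates from the matched lemmas of table 7, without weighting; the function names are assumptions.

```python
import math

# Lemmas with their record references, in the numbering of table 3.
LBIF = [
    ("aemstel", [2]), ("amsterdam", [1, 2]), ("bedrijf", [1]), ("bezige", [2]),
    ("bij", [2]), ("broeder", [1]), ("den", [1, 2]), ("der", [1]),
    ("dominique", [1]), ("gijsbregt", [2]), ("joost", [1, 2]), ("stichel", [1]),
    ("unger", [1]), ("treurspel", [1, 2]), ("van", [1, 2]), ("vondel", [1, 2]),
]

# LBIF keywords matched to the title description (table 7).
MATCHED = {"vondel", "joost", "van", "den", "broeder", "treurspel",
           "bedrijf", "amsterdam", "dominique", "stichel"}

def record_vector(record_id):
    return [1 if record_id in refs else 0 for _, refs in LBIF]

def title_vector(matched_lemmas):
    return [1 if lemma in matched_lemmas else 0 for lemma, _ in LBIF]

def cosine_similarity(x, v):
    """Cosine of the angle between two coordinate lists of equal length."""
    dot = sum(a * b for a, b in zip(x, v))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

title = title_vector(MATCHED)
similarities = {rid: cosine_similarity(title, record_vector(rid)) for rid in (1, 2)}
best_record = max(similarities, key=similarities.get)
print(similarities, best_record)
```

Under these assumptions record 1 scores about 0.91 and record 2 scores 0.6, so the library card would be linked to record 1.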

COMPARISON OF TITLE DESCRIPTION WITH INDEX ON THE LIBRARY LEXICON

When the external database is not directly accessible but only indirectly via its own interface, the title description cannot be compared in the above described manner with the external database since the LBIF 7 cannot be constructed. Figure 11 shows in a block diagram the method applied for indirect comparison with the comparison database.

In this case all words are extracted from the external database, whereafter they are placed in an alphabetically ordered list, wherein each word occurs only once in the list. This ordered list is called the library lexicon 20. On the basis of this library lexicon 20 a trigram index 9 is created in the above described manner using a trigram indexing process 8.

All input records, or title descriptions, are inverted, i.e. all words of the title description are placed in an alphabetical list with reference to all input records, or title descriptions, in which these words occur, whereby a file index results. On the basis of the frequency and distribution information the significance of each word and of each term in the file index is then determined.
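The inversion of the input records can be sketched as follows. The way significance is derived from frequency and distribution is not restated here, so the sketch substitutes a simple inverse-document-frequency weight as an assumption; the function names are likewise hypothetical.

```python
import math
import re
from collections import defaultdict

def build_file_index(records):
    """Alphabetical word list; each word refers to the records containing it."""
    index = defaultdict(set)
    for record_id, text in enumerate(records):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(record_id)
    return dict(sorted(index.items()))

def significance(file_index, n_records):
    """Assumed weight: log(N / number of records containing the word).

    Words occurring in every record get weight 0; rare words score highest.
    """
    return {word: math.log(n_records / len(ids))
            for word, ids in file_index.items()}

records = ["the history of rome", "the sea", "the rome guide"]
file_index = build_file_index(records)
weights = significance(file_index, len(records))
```

With these three toy records, "the" occurs everywhere and receives weight 0, while "history" occurs in one record only and outranks "rome", which occurs in two.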

The input records are subsequently fed one by one to a keyword selection module 16. On the basis of the previously obtained significance of the words, the keyword selection module selects in the manner as described above the n most relevant keywords of the input record. Via a trigram comparison 17 on the basis of the trigram index 9 these keywords are compared with said library lexicon 20. The keywords resulting from this comparison, which are spelled correctly owing to the trigram comparison, are subsequently used to retrieve 21 the corresponding database records 19 of the comparison database 5 via the interface 22 of the external comparison database 5.
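The indirect comparison for a single input record can be sketched as follows. Several points are assumptions made for illustration: the trigram overlap is scored with a Dice coefficient, the lexicon is scanned linearly instead of through the trigram index 9, and the external interface 22 is represented by a hypothetical callable `search` that takes a list of words and returns record ids.

```python
def trigrams(word):
    """Set of all runs of three successive characters of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def select_keywords(words, significance, n):
    """The n most significant distinct words (ties broken alphabetically)."""
    ranked = sorted(set(words), key=lambda w: (-significance.get(w, 0.0), w))
    return ranked[:n]

def closest_lexicon_word(keyword, lexicon):
    """Lexicon word sharing the most trigrams with keyword (Dice score)."""
    key_tris = trigrams(keyword)
    best, best_score = None, 0.0
    for candidate in lexicon:
        cand_tris = trigrams(candidate)
        if not key_tris or not cand_tris:
            continue  # words shorter than three characters have no trigrams
        score = 2 * len(key_tris & cand_tris) / (len(key_tris) + len(cand_tris))
        if score > best_score:
            best, best_score = candidate, score
    return best

def link_record(words, significance, lexicon, search, n=3):
    """Select keywords, correct them against the lexicon, query the interface."""
    keywords = select_keywords(words, significance, n)
    corrected = [closest_lexicon_word(k, lexicon) for k in keywords]
    corrected = [w for w in corrected if w is not None]
    return corrected, search(corrected)

# Toy data; "histroy" stands for a misspelled or misrecognized word.
lexicon = ["history", "roman", "rome", "sea"]
weights = {"the": 0.0, "of": 0.1, "histroy": 2.0, "rome": 1.0}
database = {1: "history of rome", 2: "the sea"}

def search(words):
    """Hypothetical stand-in for the interface of the external database."""
    return [rid for rid, text in database.items()
            if all(w in text.split() for w in words)]

corrected, hits = link_record(["the", "histroy", "of", "rome"],
                              weights, lexicon, search, n=2)
```

Here the misspelled keyword "histroy" shares the trigrams "his" and "ist" with "history", so the correctly spelled lexicon word is what gets sent to the interface.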

Claims

1. Method for linking written or typed information in a document to database information from a database, comprising of:

- converting the written or typed information into a form suitable for a computer;

- selecting keywords from the converted written or typed information;

- drawing up a key part-word list consisting in each case of a number of successive characters of the keywords;

- drawing up a database part-word list consisting in each case of the above stated number of successive characters of words from the database word list, wherein each database part-word contains a reference to the database word of which it forms part;

- comparing the key part-words from the document with the database part-word list;

- selecting on the basis of the comparison the database words corresponding with the keywords; and

- linking a document to a record of the external database on the basis of the selection of database words corresponding with the keywords.

2. Method as claimed in claim 1, wherein the selection of keywords from the converted document is performed by the steps of:

- drawing up a document collection word list of words from a collection of documents for linking;

- determining the frequency and the distribution with which each word occurs in the document collection word list;

- selecting a preset number of keywords from the document for linking, which keywords in the document collection word list have the highest identifying value on the basis of the frequency and distribution.

3. Method as claimed in claim 1, wherein each database word contains a reference to the records of the external database of which it forms part.

4. Method as claimed in claims 1-3, wherein linking of a document to a record of the external database comprises the steps of:

- providing records of the external database with database record coordinates, wherein each database record coordinate indicates the extent to which a database word from the database word list is present in the relevant record of the external database;

- providing the document for linking with document coordinates, wherein each document coordinate indicates the extent to which a database word corresponding with a keyword is present in the document;

- determining the angle between the vectors defined by the document coordinates and the record coordinates;

- determining the combination of document and database record with the smallest angle between their corresponding vectors;

- linking document and database record of the determined combination.

5. Method as claimed in claim 4, wherein the values of the database record coordinates and document coordinates are determined from statistical information from respectively the database and the document.

6. Method as claimed in any of the foregoing claims, wherein document words and database words are shortened to their characteristic part.

7. Device for linking written or typed information in a document to database information from a database, comprising:

- converting means for converting the written information into a form suitable for a computer;

- comparing means for comparing the key part-words from the document with the database part-word list;

- database word selection means for selecting, on the basis of the comparison, the database words corresponding with the keywords; and

- linking means for linking a document to a record of the external database on the basis of the selection of database words corresponding with the keywords.

8. Device as claimed in claim 7, wherein selection of keywords from the converted document is performed by the steps of:

- document collecting means for drawing up a document collection word list of words from a collection of documents for linking;

- frequency means for determining the frequency and distribution with which each word occurs in the document collection word list;

- keyword selection means for selecting a preset number of keywords from the document for linking, which keywords in the document collection word list have the highest identifying value on the basis of the frequency and the distribution.

9. Device as claimed in claim 7 or 8, also comprising means with which the methods as claimed in claims 1-6 are performed.

10. Device as claimed in claims 7-9, wherein said means are implemented by software control of computing means.

11. Method as claimed in claim 1, wherein conversion of the written information into a form suitable for a computer takes place with at least one of the following steps of:

- converting the written or typed information into a digital image;

- foreground-background segmentation;

- micro object extraction;

- histogram analysis;

- marking of micro objects;

- straightening the document;

- macro object extraction;

- marking of macro objects;

- optical character recognition of objects; and/or

- reclassification of objects determined by document type definition.

12. Method for linking written or typed information to database information from a database as claimed in at least one of the foregoing claims, wherein the document is a library card.

13. Method for recognizing objects of written or typed information in a document wherein prior information not dependent on the document type as well as prior information dependent on the document type can be used in the recognition of objects.

14. Method as claimed in claim 13, wherein the prior information dependent on the document type can be specified separately for each document.