US7240003B2 - Database annotation and retrieval - Google Patents

Database annotation and retrieval

Info

Publication number
US7240003B2
US10/363,752 US36375203A US7240003B2
Authority
US
United States
Prior art keywords
data
block
phoneme
nodes
word
Prior art date
Legal status
Expired - Fee Related, expires
Application number
US10/363,752
Other versions
US20030177108A1 (en)
Inventor
Jason Peter Andrew Charlesworth
Philip Neil Garner
Current Assignee
Canon Inc
Original Assignee
Canon Inc
Priority date
Filing date
Publication date
Application filed by Canon Inc
Assigned to CANON KABUSHIKI KAISHA. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CHARLESWORTH, JASON PETER ANDREW; GARNER, PHILIP NEIL
Publication of US20030177108A1
Assigned to CANON KABUSHIKI KAISHA. CORRECTIVE ASSIGNMENT TO CORRECT THE ASSIGNEE ADDRESS, PREVIOUSLY RECORDED AT REEL 014113 FRAME 0321. Assignors: CHARLESWORTH, JASON PETER; GARNER, PHILIP NEIL
Application granted
Publication of US7240003B2
Adjusted expiration
Current status: Expired - Fee Related

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/685Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using automatically derived transcript of audio data, e.g. lyrics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3343Query execution using phonetics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63Querying
    • G06F16/632Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data

Definitions

  • the present invention relates to the annotation of data files which are to be stored in a database for facilitating their subsequent retrieval.
  • the present invention is also concerned with a system for generating the annotation data which is added to the data file and to a system for searching the annotation data in the database to retrieve a desired data file in response to a user's input query.
  • the invention also relates to a system for translating an unordered list of nodes and links into an ordered and blocked list of nodes and links.
  • Databases of information are well known and suffer from the problem of how to locate and retrieve the desired information from the database quickly and efficiently.
  • Existing database search tools allow the user to search the database using typed keywords. Whilst this is quick and efficient, this type of searching is not suitable for various kinds of databases, such as video or audio databases.
  • the present invention aims to provide a data structure for the annotation of data files within a database which will allow a quick and efficient search to be carried out in response to a user's input query.
  • the present invention provides data defining a phoneme and word lattice for use as annotation data for annotating data files to be stored within a database.
  • the data defines a plurality of nodes and a plurality of links connecting the nodes, and further data associates a plurality of phonemes with a respective plurality of links and further data associates at least one word with at least one of said links, and further data defines a block arrangement for the nodes such that the links may only extend over a given maximum number of blocks. It is further preferred that the links may only extend into a following block.
  • the present invention provides an apparatus for searching a database which employs the annotation data discussed above for annotating data files stored therein.
  • the apparatus is arranged to generate phoneme data in response to a user's query or input, and to search the database using the generated phoneme data. It is further preferred that word data is also generated from the user's input or query.
  • the present invention provides an apparatus for generating a phoneme and word lattice corresponding to received phoneme and word data, comprising means for defining a plurality of links and a plurality of nodes between which the links extend, means for associating the links with phonemes or words, and means for arranging the nodes in a sequence of time ordered blocks in which the links only extend up to a maximum given number of blocks later in the sequence.
  • the maximum extension allowed for a link is to extend into a following block.
  • the apparatus is arranged to add nodes or links incrementally as it forms the lattice, and to split an existing block of nodes into at least two blocks of nodes.
  • the present invention provides an apparatus for adding phonemes or words to a phoneme and word lattice of any of the types discussed above, and arranged to analyse which data defining the current phoneme and word lattice needs to be modified in dependence upon the extent to which the links are permitted to extend from one block to another.
  • this analysis is further dependent upon the location within the lattice of a point identifying the latest node in each block to which any link originating in the preceding block extends and a point identifying the earliest node in each block from which a link extends into the succeeding block.
  • the present invention provides a method of adding phonemes or words to a phoneme and word lattice of any of the types discussed above, comprising analysing which data defining the current phoneme and word lattice needs to be modified in dependence upon the extent to which the links are permitted to extend from one block to another. Preferably, this analysis is further dependent upon the location within the lattice of respective points identifying the latest node in each block to which any link originating in the preceding block extends.
  • a method and apparatus are provided for converting an unordered list of nodes and links into an ordered and blocked list of nodes and links.
  • the blocks are formed by filling and splitting: successive nodes are inserted into a block until it is full, then a new block is begun. If new nodes would overfill an already full block, that block is split into two or more blocks. Constraints on the links regarding which block they can lead to are used to speed up the block splitting process, and identify which nodes remain in the old block and which go into the new block.
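  • By way of illustration, the following minimal Python sketch models nodes grouped into blocks, labelled links between the nodes, and a check of the constraint that no link extends more than a given number of blocks ahead. The class, function and variable names are assumptions made purely for illustration and are not taken from the patent.
```python
from dataclasses import dataclass

@dataclass
class Link:
    label: str   # the phoneme or word associated with the link
    src: int     # index of the node the link starts from
    dst: int     # index of the node the link ends at

def obeys_block_constraint(node_block, links, max_span=1):
    """True if every link ends in the same block as its source node, or at
    most max_span blocks later (the preferred case is max_span = 1)."""
    return all(node_block[l.dst] - node_block[l.src] <= max_span for l in links)

# Six nodes split into two blocks of three nodes each.
node_block = [0, 0, 0, 1, 1, 1]   # block index of each node
links = [Link("/n/", 0, 1), Link("NOW", 0, 3), Link("/ih/", 3, 4)]
print(obeys_block_constraint(node_block, links))   # True: no link skips a block
```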
  • FIG. 1 is a schematic view of a computer which is programmed to operate an embodiment of the present invention
  • FIG. 2 is a block diagram showing a phoneme and word annotator unit which is operable to generate phoneme and word annotation data for appendage to a data file;
  • FIG. 3 is a block diagram illustrating one way in which the phoneme and word annotator can generate the annotation data from an input video data file;
  • FIG. 4 a is a schematic diagram of a phoneme lattice for an example audio string from the input video data file
  • FIG. 4 b is a schematic diagram of a word and phoneme lattice embodying one aspect of the present invention, for an example audio string from the input video data file;
  • FIG. 5 is a schematic block diagram of a user's terminal which allows the user to retrieve information from the database by a voice query;
  • FIG. 6 is a schematic diagram of a pair of word and phoneme lattices, for example audio strings from two speakers;
  • FIG. 7 is a schematic block diagram illustrating a user terminal which allows the annotation of a data file with annotation data generated from an audio signal input from a user;
  • FIG. 8 is a schematic diagram of phoneme and word lattice annotation data which is generated for an example utterance input by the user for annotating a data file;
  • FIG. 9 is a schematic block diagram illustrating a user terminal which allows the annotation of a data file with annotation data generated from a typed input from a user;
  • FIG. 10 is a schematic diagram of phoneme and word lattice annotation data which is generated for a typed input by the user for annotating a data file;
  • FIG. 11 is a block schematic diagram showing the form of a document annotation system
  • FIG. 12 is a block schematic diagram of an alternative document annotation system
  • FIG. 13 is a block schematic diagram of another document annotation system
  • FIG. 14 is a schematic block diagram illustrating the way in which a phoneme and word lattice can be generated from script data contained within a video data file;
  • FIG. 15 a is a schematic diagram of a word and phoneme lattice showing relative timings of the nodes of the lattice;
  • FIG. 15 b is a schematic diagram showing the nodes of a word and phoneme lattice divided into blocks.
  • FIG. 16 a is a schematic diagram illustrating the format of data corresponding to one node of a word and phoneme lattice
  • FIG. 16 b is a schematic diagram illustrating a data stream defining a word and phoneme lattice
  • FIG. 17 is a flow diagram illustrating a process of forming a word and phoneme lattice according to one embodiment of the present invention.
  • FIGS. 18 a to 18 h are schematic diagrams illustrating the build-up of a word and phoneme lattice
  • FIGS. 19 a to 19 h are schematic diagrams illustrating the build-up of a data stream defining a word and phoneme lattice
  • FIGS. 20 a to 20 c are schematic diagrams showing the updating of a word and phoneme lattice on insertion of a long link;
  • FIGS. 21 a to 21 d are schematic diagrams illustrating the updating of a word and phoneme lattice on insertion of additional nodes
  • FIG. 22 is a flow diagram illustrating a procedure of adjusting off-sets
  • FIGS. 23 a and 23 b are schematic diagrams illustrating the application of a block splitting procedure to a word and phoneme lattice.
  • FIG. 24 is a block diagram illustrating one way in which the phoneme and word annotator can generate the annotation data from an input video data file.
  • Embodiments of the present invention can be implemented using dedicated hardware circuits, but the embodiment to be described is implemented in computer software or code, which is run in conjunction with processing hardware such as a personal computer, work station, photocopier, facsimile machine, personal digital assistant (PDA) or the like.
  • FIG. 1 shows a personal computer (PC) 1 which is programmed to operate an embodiment of the present invention.
  • a keyboard 3 , a pointing device 5 , a microphone 7 and a telephone line 9 are connected to the PC 1 via an interface 11 .
  • the keyboard 3 and pointing device 5 enable the system to be controlled by a user.
  • the microphone 7 converts acoustic speech signals from the user into equivalent electrical signals and supplies them to the PC 1 for processing.
  • An internal modem and speech receiving circuit (not shown) is connected to the telephone line 9 so that the PC 1 can communicate with, for example, a remote computer or with a remote user.
  • the programme instructions which make the PC 1 operate in accordance with the present invention may be supplied for use with an existing PC 1 on, for example, a storage device such as a magnetic disc 13 , or by downloading the software from the Internet (not shown) via the internal modem and telephone line 9 .
  • FIG. 2 is a block diagram illustrating the way in which annotation data 21 for an input data file 23 is generated in this embodiment by a phoneme and word annotating unit 25 .
  • the generated phoneme and word annotation data 21 is then combined with the data file 23 in the data combination unit 27 and the combined data file output thereby is input to the database 29 .
  • the annotation data 21 comprises a combined phoneme (or phoneme like) and word lattice which allows the user to retrieve information from the database by a voice query.
  • the data file 23 can be any kind of data file, such as, a video file, an audio file, a multimedia file etc.
  • It has been proposed to generate N-Best word lists for an audio stream as annotation data by passing the audio data from a video data file through an automatic speech recognition unit.
  • word-based systems suffer from a number of problems. These include (i) that state of the art speech recognition systems still make basic mistakes in recognition; (ii) that state of the art automatic speech recognition systems use a dictionary of perhaps 20,000 to 100,000 words and cannot produce words outside that vocabulary; and (iii) that the production of N-Best lists grows exponentially with the number of hypotheses at each stage, therefore resulting in the annotation data becoming prohibitively large for long utterances.
  • the first of these problems may not be that significant if the same automatic speech recognition system is used to generate the annotation data and to subsequently retrieve the corresponding data file, since the same decoding error could occur.
  • this is particularly significant in video data applications, since users are likely to use names and places (which may not be in the speech recognition dictionary) as input query terms. In place of these names, the automatic speech recognition system will typically replace the out of vocabulary words with a phonetically similar word or words within the vocabulary, often corrupting nearby decodings. This can also result in the failure to retrieve the required data file upon subsequent request.
  • the phoneme and word lattice is an acyclic directed graph with a single entry point and a single exit point. It represents different parses of the audio stream within the data file. It is not simply a sequence of words with alternatives since each word does not have to be replaced by a single alternative, one word can be substituted for two or more words or phonemes, and the whole structure can form a substitution for one or more words or phonemes.
  • the density of data within the phoneme and word lattice essentially remains linear throughout the audio data, rather than growing exponentially as in the case of the N-Best technique discussed above.
  • phoneme data is more robust, because phonemes are dictionary independent and allow the system to cope with out of vocabulary words, such as names, places, foreign words etc.
  • the use of phoneme data is also capable of making the system future proof, since it allows data files which are placed into the database to be retrieved even when the words were not understood by the original automatic speech recognition system.
  • the video data file 31 comprises video data 31 - 1 , which defines the sequence of images forming the video sequence and audio data 31 - 2 , which defines the audio which is associated with the video sequence.
  • the audio data 31 - 2 is time synchronised with the video data 31 - 1 so that, in use, both the video and audio data are supplied to the user at the same time.
  • the audio data 31 - 2 is input to an automatic speech recognition unit 33 , which is operable to generate a phoneme lattice corresponding to the stream of audio data 31 - 2 .
  • automatic speech recognition units such as the unit 33 are commonly available in the art and will not be described in further detail. The reader is referred to, for example, the book entitled ‘Fundamentals of Speech Recognition’ by Lawrence Rabiner and Biing-Hwang Juang and, in particular, to pages 42 to 50 thereof, for further information on this type of speech recognition system.
  • FIG. 4 a illustrates the form of the phoneme lattice data output by the speech recognition unit 33 , for the input audio corresponding to the phrase '. . . now is the winter of our . . . '.
  • the automatic speech recognition unit 33 identifies a number of different possible phoneme strings which correspond to this input audio utterance. For example, the speech recognition system considers that the first phoneme in the audio string is either an /m/ or an /n/. For clarity, only the alternatives for the first phoneme are shown. As is well known in the art of speech recognition, these different possibilities can have their own weighting which is generated by the speech recognition unit 33 and is indicative of the confidence of the speech recognition unit's output.
  • the phoneme /n/ may be given a weighting of 0.9 and the phoneme /m/ may be given a weighting of 0.1, indicating that the speech recognition system is fairly confident that the corresponding portion of audio represents the phoneme /n/, but that it still may be the phoneme /m/.
  • the phoneme lattice data 35 output by the automatic speech recognition unit 33 is input to a word decoder 37 which is operable to identify possible words within the phoneme lattice data 35 .
  • the words identified by the word decoder 37 are incorporated into the phoneme lattice data structure. For example, for the phoneme lattice shown in FIG. 4 a , the word decoder 37 identifies the words “NOW”, “IS”, “THE”, “WINTER”, “OF” and “OUR”. As shown in FIG. 4 b ,
  • these identified words are added to the phoneme lattice data structure output by the speech recognition unit 33 , to generate a phoneme and word lattice data structure which forms the annotation data 31 - 3 .
  • This annotation data 31 - 3 is then combined with the video data file 31 to generate an augmented video data file 31 ′ which is then stored in the database 29 .
  • the annotation data 31 - 3 is also time synchronised and associated with the corresponding video data 31 - 1 and audio data 31 - 2 , so that a desired portion of the video and audio data can be retrieved by searching for and locating the corresponding portion of the annotation data 31 - 3 .
  • annotation data 31 - 3 stored in the database 29 has the following general form:
  • the time of start data in the header can identify the time and date of transmission of the data. For example, if the video file is a news broadcast, then the time of start may include the exact time of the broadcast and the date on which it was broadcast.
  • the flag identifying if the annotation data is word annotation data, phoneme annotation data or if it is mixed is provided since not all the data files within the database will include the combined phoneme and word lattice annotation data discussed above, and in this case, a different search strategy would be used to search this annotation data.
  • the annotation data is divided into blocks in order to allow the search to jump into the middle of the annotation data for a given audio data stream.
  • the header therefore includes a time index which associates the location of the blocks of annotation data within the memory to a given time offset between the time of start and the time corresponding to the beginning of the block.
  • the header also includes data defining the word set used (i.e. the dictionary), the phoneme set used and the language to which the vocabulary pertains.
  • the header may also include details of the automatic speech recognition system used to generate the annotation data and any appropriate settings thereof which were used during the generation of the annotation data.
  • the phoneme probability data defines the probability of insertions, deletions, misrecognitions and decodings for the system, such as an automatic speech recognition system, which generated the annotation data.
  • the blocks of annotation data then follow the header and identify, for each node in the block, the time offset of the node from the start of the block, the phoneme links which connect that node to other nodes by phonemes and word links which connect that node to other nodes by words.
  • Each phoneme link and word link identifies the phoneme or word which is associated with the link. They also identify the offset of the linked node relative to the current node. For example, if node N 50 is linked to node N 55 by a phoneme link, then the offset for that link is 5. As those skilled in the art will appreciate, using an offset indication like this allows the division of the continuous annotation data into separate blocks.
  • where an automatic speech recognition unit outputs weightings indicative of the confidence of the speech recognition unit's output,
  • these weightings or confidence scores would also be included within the data structure.
  • a confidence score would be provided for each node which is indicative of the confidence of arriving at the node and each of the phoneme and word links would include a transition score depending upon the weighting given to the corresponding phoneme or word. These weightings would then be used to control the search and retrieval of the data files by discarding those matches which have a low confidence score.
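  • For concreteness, the header-plus-blocks layout and the optional confidence scores described above could be modelled along the following lines. This is a hedged Python sketch; the field names are inferred from the description and are not the patent's actual encoding.
```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LinkEntry:
    label_index: int               # entry in the word or phoneme index
    node_offset: int               # how many nodes later the link ends
    score: Optional[float] = None  # optional confidence weighting for the link

@dataclass
class NodeEntry:
    time_offset: int               # off-set from the start of the block, 1/100 s
    phoneme_links: List[LinkEntry] = field(default_factory=list)
    word_links: List[LinkEntry] = field(default_factory=list)
    score: Optional[float] = None  # optional confidence of arriving at the node

@dataclass
class Header:
    time_of_start: str             # e.g. date and time of the broadcast
    annotation_type: str           # "word", "phoneme" or "mixed"
    time_index: dict               # time off-set -> location of the block in memory
    word_set: List[str]            # the dictionary used
    phoneme_set: List[str]
    language: str
    phoneme_probabilities: dict    # insertion/deletion/misrecognition/decoding probabilities

@dataclass
class AnnotationData:
    header: Header
    blocks: List[List[NodeEntry]]  # each block lists its nodes in time order
```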
  • FIG. 5 is a block diagram illustrating the form of a user terminal 59 which can be used to retrieve the annotated data files from the database 29 .
  • This user terminal 59 may be, for example, a personal computer, hand held device or the like.
  • the user terminal 59 comprises the database 29 of annotated data files, an automatic speech recognition unit 51 , a search engine 53 , a control unit 55 and a display 57 .
  • the automatic speech recognition unit 51 is operable to process an input voice query from the user 39 received via the microphone 7 and the input line 61 and to generate therefrom corresponding phoneme and word data.
  • This data may also take the form of a phoneme and word lattice, but this is not essential.
  • This phoneme and word data is then input to the control unit 55 which is operable to initiate an appropriate search of the database 29 using the search engine 53 .
  • the results of the search, generated by the search engine 53 are then transmitted back to the control unit 55 which analyses the search results and generates and displays appropriate display data to the user via the display 57 . More details of the search techniques which can be performed are given in co-pending applications PCT/GB00/00718 and GB9925561.4, the contents of which are incorporated herein by reference.
  • this type of phonetic and word annotation of data files in a database provides a convenient and powerful way to allow a user to search the database by voice.
  • a single audio data stream was annotated and stored in the database for subsequent retrieval by the user.
  • the audio data within the data file will usually include audio data for different speakers.
  • separate phoneme and word lattice annotation data can be generated for the audio data of each speaker.
  • FIG. 6 illustrates the form of the annotation data in such an embodiment, where a first speaker utters the words “. . . this so” and the second speaker replies “yes”.
  • the annotation data for the different speakers' audio data are time synchronised, relative to each other, so that the annotation data is still time synchronised to the video and audio data within the data file.
  • the header information in the data structure should preferably include a list of the different speakers within the annotation data and, for each speaker, data defining that speaker's language, accent, dialect and phonetic set, and each block should identify those speakers that are active in the block.
  • a speech recognition system was used to generate the annotation data for annotating a data file in the database.
  • other techniques can be used to generate this annotation data. For example, a human operator can listen to the audio data and generate a phonetic and word transcription to thereby manually generate the annotation data.
  • FIG. 7 illustrates the form of a user terminal 59 which allows a user to input voice annotation data via the microphone 7 for annotating a data file 91 which is to be stored in the database 29 .
  • the data file 91 comprises a two dimensional image generated by, for example, a camera.
  • the user terminal 59 allows the user 39 to annotate the 2D image with an appropriate annotation which can be used subsequently for retrieving the 2D image from the database 29 .
  • the input voice annotation signal is converted, by the automatic speech recognition unit 51 , into phoneme and word lattice annotation data which is passed to the control unit 55 .
  • the control unit 55 retrieves the appropriate 2D file from the database 29 and appends the phoneme and word annotation data to the data file 91 .
  • the augmented data file is then returned to the database 29 .
  • the control unit 55 is operable to display the 2D image on the display 57 so that the user can ensure that the annotation data is associated with the correct data file 91 .
  • the automatic speech recognition unit 51 generates the phoneme and word lattice annotation data by (i) generating a phoneme lattice for the input utterance; (ii) then identifying words within the phoneme lattice; and (iii) finally by combining the two.
  • FIG. 8 illustrates the form of the phoneme and word lattice annotation data generated for the input utterance “picture of the Taj-Mahal”. As shown, the automatic speech recognition unit identifies a number of different possible phoneme strings which correspond to this input utterance. As shown in FIG. 8 , the words which the automatic speech recognition unit 51 identifies within the phoneme lattice are incorporated into the phoneme lattice data structure.
  • the automatic speech recognition unit 51 identifies the words “picture”, “of”, “off”, “the”, “other”, “ta”, “tar”, “jam”, “ah”, “hal”, “ha” and “al”.
  • the control unit 55 is then operable to add this annotation data to the 2D image data file 91 which is then stored in a database 29 .
  • this embodiment can be used to annotate any kind of image such as x-rays of patients, 3D videos of, for example, NMR scans, ultrasound scans etc. It can also be used to annotate one-dimensional data, such as audio data or seismic data.
  • FIG. 9 illustrates the form of a user terminal 59 which allows a user to input typed annotation data via the keyboard 3 for annotating a data file 91 which is to be stored in a database 29 .
  • the typed input is converted, by the phonetic transcription unit 75 , into the phoneme and word lattice annotation data (using an internal phonetic dictionary (not shown)) which is passed to the control unit 55 .
  • the control unit 55 retrieves the appropriate 2D file from the database 29 and appends the phoneme and word annotation data to the data file 91 .
  • the augmented data file is then returned to the database 29 .
  • the control unit 55 is operable to display the 2D image on the display 57 so that the user can ensure that the annotation data is associated with the correct data file 91 .
  • FIG. 10 illustrates the form of the phoneme and word lattice annotation data generated for the typed input “picture of the Taj-Mahal”.
  • the phoneme and word lattice is an acyclic directed graph with a single entry point and a single exit point. It represents different parses of the user's input.
  • the phonetic transcription unit 75 identifies a number of different possible phoneme strings which correspond to the typed input.
  • FIG. 11 is a block diagram illustrating a document annotation system.
  • a text document 101 is converted into an image data file by a document scanner 103 .
  • the image data file is then passed to an optical character recognition (OCR) unit 105 which converts the image data of the document 101 into electronic text.
  • This electronic text is then supplied to a phonetic transcription unit 107 which is operable to generate phoneme and word annotation data 109 which is then appended to the image data output by the scanner 103 to form a data file 111 .
  • the data file 111 is then stored in the database 29 for subsequent retrieval.
  • the annotation data 109 comprises the combined phoneme and word lattice described above which allows the user to subsequently retrieve the data file 111 from the database 29 by a voice query.
  • FIG. 12 illustrates a modification to the document annotation system shown in FIG. 11 .
  • the difference between the system shown in FIG. 12 and the system shown in FIG. 11 is that the output of the optical character recognition unit 105 is used to generate the data file 113 , rather than the image data output by the scanner 103 .
  • the rest of the system shown in FIG. 12 is the same as that shown in FIG. 11 and will not be described further.
  • FIG. 13 shows a further modification to the document annotation system shown in FIG. 11 .
  • the input document is received by a facsimile unit 115 rather than a scanner 103 .
  • the image data output by the facsimile unit is then processed in the same manner as the image data output by the scanner 103 shown in FIG. 11 , and will not be described again.
  • a phonetic transcription unit 107 was used for generating the annotation data for annotating the image or text data.
  • other techniques can be used. For example, a human operator can manually generate this annotation data from the image of the document itself.
  • the audio data from the data file 31 was passed through an automatic speech recognition unit in order to generate the phoneme annotation data.
  • a transcript of the audio data will be present in the data file.
  • Such an embodiment is illustrated in FIG. 14 .
  • the data file 81 represents a digital video file having video data 81 - 1 , audio data 81 - 2 and script data 81 - 3 which defines the lines for the various actors in the video film.
  • the script data 81 - 3 is passed through a text to phoneme converter 83 , which generates phoneme lattice data 85 using a stored dictionary which translates words into possible sequences of phonemes.
  • This phoneme lattice data 85 is then combined with the script data 81 - 3 to generate the above described phoneme and word lattice annotation data 81 - 4 .
  • This annotation data is then added to the data file 81 to generate an augmented data file 81 ′ which is then added to the database 29 .
  • this embodiment facilitates the generation of separate phoneme and word lattice annotation data for the different speakers within the video data file, since the script data usually contains indications of who is talking.
  • the synchronisation of the phoneme and word lattice annotation data with the video and audio data can then be achieved by performing a forced time alignment of the script data with the audio data using an automatic speech recognition system (not shown).
  • a phoneme or phoneme-like and word lattice was used to annotate a data file.
  • the word “phoneme” in the description and claims is not limited to its linguistic meaning but includes the various sub-word units that are identified and used in standard speech recognition systems, such as phonemes, syllables, Katakana (Japanese alphabet) etc.
  • FIG. 15 a shows the timing of each node of the lattice relative to a common zero time, which in the present example is set such that the first node occurs at a time of 0.10 seconds. It is noted that FIG. 15 a is merely schematic and as such the time axis is not represented linearly.
  • the nodes are divided into three blocks as shown in FIG. 15 b .
  • demarcation of the nodes into blocks is implemented by block markers or flags 202 , 204 , 206 and 208 .
  • Block markers 204 , 206 and 208 are located immediately after the last node of a block, but are shown slightly spaced therefrom in FIG. 15 b for the sake of clarity of the illustration.
  • Block marker 204 marks the end of block 0 and the start of block 1
  • similarly block marker 206 marks the end of block 1 and the start of block 2 .
  • Block marker 208 is at the end of the lattice and hence only indicates the end of block 2 .
  • block 0 has five nodes
  • block 1 also has five nodes
  • block 2 has seven nodes.
  • the time of each node is provided relative to the start of its respective block. This does not affect the timings of the nodes in block 0 . However, for the further blocks the new off-set timings are different from each node's absolute timing shown in FIG. 15 a .
  • the start time for each of the blocks other than block 0 is taken to be the time of the last node of the preceding block. For example, in FIG. 15 a it can be seen that the node between the phonemes /ih/ and /z/ occurs at 0.71 seconds, and is the last node of block 0 . From FIG. 15 a it can also be seen that the next node, i.e. the first node of block 1 , occurs at 0.94 seconds. Hence
  • the off-set time of the first node of block 1 is 0.23 seconds.
  • having the time off-sets determined relative to the start of each block, rather than from the start of the whole lattice, provides advantages with respect to dynamic range, as follows.
  • if the time off-sets were instead all measured from the start of the whole lattice then, as the length of the data file increases, the dynamic range of the data type used to record the timing values in the lattice structure will need to increase accordingly, which will consume large amounts of memory. This becomes exacerbated when the lattice structure is being provided for a data file of unknown length, for example if a common lattice structure is desired to be usable for annotating either a one minute television commercial or a film or television programme lasting a number of hours.
  • the dynamic range of the corresponding data type for the lattice structure divided into blocks is significantly reduced by only needing to accommodate a maximum expected time off-set of a single block, and moreover this remains the same irrespective of the total duration of the data file.
  • the data type employed provides integer values where each value of the integer represents the off-set time measured in hundredths of a second.
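  • The block-relative integer timing can be sketched as follows. Only the 0.10, 0.41, 0.71 and 0.23 second values are taken from the example of FIGS. 15 a and 15 b ; the remaining node times, and the function name, are assumptions made for illustration.
```python
def block_offsets(block_times, block_start):
    """Convert absolute node times (seconds) to integer off-sets in 1/100 s."""
    return [round((t - block_start) * 100) for t in block_times]

# Block 0 starts at the common zero time; block 1 starts at the time of the
# last node of block 0 (0.71 seconds in the example of FIGS. 15a and 15b).
block0 = [0.10, 0.41, 0.56, 0.64, 0.71]    # 0.56 and 0.64 are assumed times
block1 = [0.94, 1.10, 1.31, 1.39, 1.42]    # assumed, except 0.94 = 0.71 + 0.23

print(block_offsets(block0, 0.0))          # [10, 41, 56, 64, 71]
print(block_offsets(block1, block0[-1]))   # [23, 39, 60, 68, 71]
```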
  • FIG. 15 b also shows certain parts of the lattice structure identified as alpha ( ⁇ ) and beta ( ⁇ ). The significance of these items will be explained later.
  • FIG. 16 a shows by way of example the format of the data for the first node of the lattice.
  • the data for this particular node is in the form of seven data components 210 , 212 , 214 , 216 , 218 , 220 and 222 .
  • the first data component 210 specifies the time off-set of the node from the start of the block.
  • the value is 0.10 seconds, and is implemented by means of the integer data type described earlier above.
  • the second data component 212 represents the word link “NOW”, which is shown in FIGS. 15 a and 15 b extending from the first node.
  • the third data component specifies the nodal off-set of the preceding link, i.e. the word link “NOW”, by which is meant the number of nodes the preceding link extends by. Referring to FIGS. 15 a and 15 b , it can be seen that the node to which the word link “NOW” extends is the third node along from the node from which the link extends, hence the nodal off-set is 3, as represented illustratively in FIG. 16 a by the value 003.
  • the data type employed to implement the nodal off-set values is again one providing integer values.
  • the fourth data component 216 represents the phoneme /n/ which extends from the first node to the second node, entailing therefore a nodal off-set of one which leads directly to the value 001 for the fifth data component 218 as shown in FIG. 16 a .
  • the sixth data component 220 represents the phoneme link /m/
  • the seventh data component 222 shows the nodal off-set of that link which is equal to 1 and represented as 001.
  • the way in which the data components 212 , 216 and 220 represent the respective word or phoneme associated with their link can be implemented in any appropriate manner.
  • the data components 212 , 216 and 220 consist of an integer value which corresponds to a word index entry value (in the case of a word link) or a phoneme index entry value (in the case of a phoneme link).
  • the index entry value serves to identify an entry in a corresponding word or phoneme index containing a list of words or phonemes as appropriate.
  • the corresponding word or phoneme index is held in the header part of the annotation data 31 - 3 described earlier.
  • the header may itself only contain a further cross-reference identification to a separate database storing one or more word or phoneme indices.
  • the different links corresponding to a given node can be placed in the data format of FIG. 16 a in any desired relative order.
  • a preferred order is employed in which the word or phoneme link with the largest nodal off-set, i.e. the “longest” link, is placed first in the sequence.
  • the “longest” link is the word link “NOW” with a nodal off-set of three nodes, and it is therefore placed before the “shorter” phoneme links /n/ and /m/ which each only have a nodal off-set of 1.
  • the data for each node in the form shown in FIG. 16 a , is arranged in a time ordered sequence to form a data stream defining the whole lattice (except for the header).
  • the data stream for the lattice shown in FIG. 15 b is shown in FIG. 16 b .
  • the data stream additionally includes data components 225 to 241 serving as node flags to identify that the data components following them refer to the next respective node.
  • the data stream also includes further data components 244 , 246 , 248 and 250 implementing respectively the block markers 202 , 204 , 206 and 208 described earlier above with respect to FIG. 15 b.
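  • The following Python sketch shows how node entries, their links (longest nodal off-set first) and block flags might be laid out in such a stream. The token names, the serialise function and the second node's time off-set are illustrative assumptions rather than the patent's own encoding.
```python
def serialise(blocks):
    """blocks: list of blocks, each block a list of (time_offset, links) node
    entries, where links is a list of (label, nodal_offset) pairs stored with
    the node the link extends from."""
    stream = []
    for block in blocks:
        stream.append("BLOCK")                        # block flag / marker
        for time_offset, links in block:
            stream.append("NODE")                     # node flag
            stream.append(time_offset)                # integer 1/100 s off-set
            # longest link first, as preferred in the present embodiment
            for label, nodal_offset in sorted(links, key=lambda l: -l[1]):
                stream.extend([label, nodal_offset])
    stream.append("BLOCK")                            # final block marker
    return stream

# First two nodes of the example lattice: node 1 carries the word link "NOW"
# (off-set 3) and the phoneme links /n/ and /m/ (off-set 1 each); node 2
# carries the phoneme link /oh/.  The second node's time off-set is assumed.
block0 = [(10, [("/n/", 1), ("/m/", 1), ("NOW", 3)]),
          (26, [("/oh/", 1)])]
print(serialise([block0]))
```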
  • the header, also described with reference to FIG. 4 b , includes a time index which associates the location of the blocks of annotation data within the memory to a given time offset between the time of start and the time corresponding to the beginning of the block.
  • the time corresponding to the beginning of a given block is, in the present embodiment, the time of the last node of the block which precedes the given block.
  • the block arrangement shown in FIG. 15 b displays however further characteristics and advantages, which will now be described.
  • the blocks are determined according to an extent to which word or phoneme links are permitted to extend between blocks.
  • the block positions implement the criterion that no link may extend into any block other than its directly neighbouring block.
  • the phoneme links /n/, /m/, /oh/, /w/ and /ih/ and word link “NOW” only extend within the same block in which their source nodes are located, which is allowed by the criteria
  • the phoneme link /z/ and the word link “IS” each extend from block 0 into block 1 , i.e. into the directly neighbouring block, which is also allowed by the criteria.
  • there are no links extending from block 0 into block 2 , because such links would have to extend beyond the directly neighbouring block of block 0 (i.e. block 1 ) and hence are not allowed by the criterion.
  • any existing link “passing over” a newly inserted node will require its nodal off-set to be increased by one, as the newly inserted node will need to be included in the count of the number of nodes over which the existing link extends. For example, if a new node were inserted at a time of 0.50 seconds into block 2 , then it can be seen from FIG. 15 b which of the existing links would pass over that new node and would therefore need their nodal off-sets to be increased by one.
  • An advantage of the blocks of the lattice data structure being arranged according to the present criterion is that it reduces the number of earlier existing nodes that need to be analysed. More particularly, it is only necessary to analyse those nodes which precede the inserted node within the same block, plus the nodes in the block directly preceding the block in which the new node has been inserted.
  • the advantage becomes increasingly beneficial as the length of the lattice increases and the number of blocks formed increases. Furthermore, the advantage not only applies to the insertion of new nodes into an otherwise complete lattice, it also applies to the ongoing procedure of constructing the lattice, which may occur when nodes are not necessarily inserted into a lattice in strict time order.
  • the particular choice of the criteria to only allow links to extend into a neighbouring block may be varied, for example the criteria may allow links extending only as far as four blocks away, it then being necessary to search back only a maximum of four blocks.
  • any appropriate number of blocks can be chosen as the limit in the criteria, it merely being necessary to commensurately adapt the number of blocks that are searched back through.
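  • The limited back-search can be sketched as follows: when a new node is inserted, only source nodes earlier in the same block and in the directly preceding block can hold links that pass over the new node, and only those links need their nodal off-sets increased. The flat list-of-dictionaries representation and the function name are assumptions for illustration.
```python
def insert_node(nodes, p, new_node, max_span_blocks=1):
    """nodes: flat, time ordered list; each node is a dict with a 'block' index
    and a 'links' list of [label, nodal_offset] entries stored with the source
    node.  Insert new_node at index p and bump the off-sets of links that pass
    over it, searching back no further than the directly preceding block."""
    for i in range(p - 1, -1, -1):
        if nodes[i]['block'] < new_node['block'] - max_span_blocks:
            break                            # the block constraint lets us stop here
        for link in nodes[i]['links']:
            if i + link[1] >= p:             # link passes over the insertion point
                link[1] += 1
    nodes.insert(p, new_node)

# Nodes 1-6 of the example lattice ("NOW IS ..."), with node 6 in block 1.
nodes = [
    {'block': 0, 'links': [['NOW', 3], ['/n/', 1], ['/m/', 1]]},
    {'block': 0, 'links': [['/oh/', 1]]},
    {'block': 0, 'links': [['/w/', 1]]},
    {'block': 0, 'links': [['IS', 2], ['/ih/', 1]]},
    {'block': 0, 'links': [['/z/', 1]]},
    {'block': 1, 'links': []},
]
insert_node(nodes, 2, {'block': 0, 'links': []})
print(nodes[0]['links'])   # [['NOW', 4], ['/n/', 1], ['/m/', 1]] - "NOW" now spans one more node
```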
  • the lattice data structure of the present embodiment contains a further preferred refinement which is also related to the extension of the word or phoneme links into neighbouring blocks.
  • the lattice data structure further includes data specifying two characteristic points of each block.
  • the two characteristic points for each block are shown as alpha ( ⁇ ) and beta ( ⁇ ) in FIG. 15 b.
  • Beta for a given block is defined as the time of the latest node in the given block to which any link originating from the previous block extends.
  • for block 1 , beta is at the first node in the block (i.e. the node to which the phoneme link /z/ and the word link “IS” extend), since there are no links originating in block 0 that extend further than the first node of block 1 .
  • for block 2 , beta is at the third node, since the word link “WINTER” extends to that node from block 1 .
  • for the first block of the lattice structure, i.e. block zero, there are intrinsically no links extending into that block. Therefore, beta for this block is defined as occurring before the start of the lattice.
  • Alpha for a given block is defined as the time of the earliest node in the given block from which a link extends into the next block.
  • two links extend into block 1 , namely word link “IS” and the phoneme link /z/.
  • the node from which the word link “IS” extends is earlier in block 0 than the node from which the phoneme link /z/ extends, hence alpha is at the node from which the word link “IS” extends.
  • alpha for block 1 is located at the node where the word link “WINTER” originates from.
  • for the final block of the lattice, alpha is specially defined as being at the last node in the block.
  • beta represents the latest point in a block before which there are nodes which interact with the previous block
  • alpha represents the earliest point in a block after which there are nodes which interact with the next block.
  • each alpha and beta can be specified by identification of a particular node or by specification in terms of time. In the present embodiment identification is specified by nodes.
  • the data specifying alpha and beta within the lattice data structure can be stored in a number of different ways. For example, data components of the type shown in FIG. 16 b can be included containing flags or markers at the relevant locations within the data stream. However, in the present embodiment the points are specified by storing the identities of the respective nodes in a look-up table in the header part of the lattice data structure.
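  • A sketch of how alpha and beta might be derived from the link structure is given below, reusing the illustrative flat-list representation from the earlier sketch; None stands in for the beta of block 0 , which is defined as lying before the start of the lattice. All names are assumptions, not the patent's code.
```python
def alphas_and_betas(nodes, n_blocks):
    """nodes: flat list of {'block': b, 'links': [[label, nodal_offset], ...]}.
    Returns (alpha, beta) lists of node indices, one entry per block."""
    alpha = [None] * n_blocks
    beta = [None] * n_blocks
    for i, node in enumerate(nodes):
        for _, off in node['links']:
            j = i + off                              # node the link ends at
            src_blk, dst_blk = node['block'], nodes[j]['block']
            if dst_blk == src_blk + 1:
                # beta: latest node in dst_blk reached from the previous block
                beta[dst_blk] = j if beta[dst_blk] is None else max(beta[dst_blk], j)
                # alpha: earliest node in src_blk with a link into the next block
                alpha[src_blk] = i if alpha[src_blk] is None else min(alpha[src_blk], i)
    alpha[n_blocks - 1] = len(nodes) - 1             # special case: last block
    return alpha, beta                               # beta[0] stays None ("before the lattice")

nodes = [
    {'block': 0, 'links': [['NOW', 3], ['/n/', 1]]},
    {'block': 0, 'links': [['/oh/', 1]]},
    {'block': 0, 'links': [['/w/', 1]]},
    {'block': 0, 'links': [['IS', 2], ['/ih/', 1]]},
    {'block': 0, 'links': [['/z/', 1]]},
    {'block': 1, 'links': []},
]
print(alphas_and_betas(nodes, 2))   # ([3, 5], [None, 5])
```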
  • the provision of data specifying alpha and beta for each block firstly provides certain advantages with respect to analysing the nodal off-sets of previous nodes in a lattice when a new node is inserted.
  • if a new node is inserted at a location after beta in a given block, it follows that it is only necessary to analyse the preceding nodes in the given block, and it is no longer necessary to analyse the nodes in the block preceding the given block.
  • This is because it is already known that by virtue of the new inserted node being after beta within the given block, there can by definition be no links that extend from the previous block beyond the newly inserted node, since the position of beta defines the greatest extent which any links extend from the previous block.
  • the provision of data specifying alpha and beta for each block secondly provides certain advantages with respect to employing alpha and beta in procedures to re-define blocks within an existing lattice so as to provide smaller or more evenly arranged blocks whilst maintaining compliance with the earlier mentioned criterion that no link may extend further than one block.
  • existing blocks are essentially split, according to the relative position of alpha and beta within an existing block.
  • provided alpha occurs after beta within a given block, the given block can be divided into two blocks by splitting it somewhere between beta and alpha.
  • the data specifying beta and alpha is advantageously employed to determine when existing blocks can be split into smaller blocks in the course of a preferred procedure for constructing the lattice data structure.
  • the longest link from a given node is positioned first in the sequence of data components for any given node as shown in FIG. 16 a . This is advantageous during the procedure of inserting a new node into the lattice data structure, wherein previous nodes must be analysed to determine whether any links originate from them that extend beyond the newly inserted node.
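  • As a small illustration of that advantage, because each node's links are stored longest first, only the first link entry of an earlier node needs to be examined to decide whether any of that node's links reach the insertion point. The helper below is an assumed sketch, not the patent's code.
```python
def node_needs_examination(node_index, links, insert_pos):
    """links: the node's [label, nodal_offset] entries, longest off-set first."""
    if not links:
        return False
    # If the longest link does not reach the insertion point, none of the
    # shorter links can, so the remaining entries need not be examined.
    return node_index + links[0][1] >= insert_pos

print(node_needs_examination(0, [['NOW', 3], ['/n/', 1], ['/m/', 1]], 2))  # True
print(node_needs_examination(1, [['/oh/', 1]], 3))                         # False
```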
  • each set of data components consists of either:
  • FIG. 17 is a flow diagram which illustrates the process steps employed in the preferred method.
  • the application of the steps to the construction of the lattice of FIG. 15 b will be demonstrated, and will thus serve to show how the method operates when applied to input data in which the nodes are already fully time sequentially ordered.
  • the way in which the process steps are applied (be it to the construction of a new lattice or to the alteration of an existing lattice) when additional nodes are to be inserted into an existing time ordered sequence of nodes will be described by describing various different additions of data to the lattice data structure of FIG. 15 b.
  • FIGS. 18 a to 18 h show the build up of the lattice structure in the graphical representation form of FIG. 15 b . Additional reference will be made to FIGS. 19 a to 19 h which show the progress of the construction of the data stream defining the lattice, corresponding to the form of FIG. 16 b.
  • the automatic speech recognition unit 33 defines the start of the first block, i.e. block zero.
  • the block marker defining the start of the first block is indicated by reference number 202 . This is implemented in the data stream by insertion of data component 244 (see FIG. 19 a ) consisting of a block flag.
  • at step S 63 , the automatic speech recognition unit 33 sets an incremental counter n equal to 1.
  • at step S 65 , the automatic speech recognition unit 33 inserts the first set of data components into the data stream defining the lattice data structure. More particularly, the automatic speech recognition unit 33 collects the data corresponding to the first two nodes of the lattice and any direct phoneme links therebetween (in this case phoneme links /n/ and /m/). It then additionally collects any words that have been identified by the word decoder 37 as being associated with a link between these two nodes, although in the case of the first two nodes no such word has been identified. It then inserts the corresponding data components into the data stream. In particular, referring again to FIG. 19 a ,
  • data 260 defining the first node of the lattice structure, and being made up of a data component consisting of a node flag and a data component indicating the time of the node, is inserted.
  • data 262 comprising the data component consisting of the phoneme link /n/ and the nodal off-set value of 001 is inserted, followed by data 264 comprising a data component consisting of the phoneme /m/ and nodal off-set value 001.
  • data 266 comprising the data component consisting of a node flag and the data component consisting of the time of that second node is inserted.
  • at step S 67 , the automatic speech recognition unit 33 determines whether any new nodes have been included in the newly inserted set of data components. The answer in the present case is yes, so the process moves on to step S 69 where the automatic speech recognition unit determines whether any of the new nodes are now positioned at the end of the current data lattice structure. The answer in the present case is again yes. In fact, when the method shown in the flow chart of FIG. 17 is applied, as here, to input data in which the nodes are already in time sequential order, each newly inserted node will be positioned at the end of the current lattice structure.
  • at step S 71 , the automatic speech recognition unit 33 defines the end of the last block to be immediately after the newly inserted node which is at the end of the lattice.
  • since at this stage there is only one block, it is the end of the sole block that is in fact defined.
  • This newly defined current end of the block is shown as item 203 in FIG. 18 a , and is implemented in the data stream as data component 245 consisting of a block flag, as shown in FIG. 19 a.
  • at step S 75 , the automatic speech recognition unit 33 determines all of the alpha and beta points. At the present stage there is only one block so only one alpha and one beta is determined. The procedure for determining alpha and beta in the first block was described earlier above. The resulting positions are shown in FIG. 18 a . With respect to the data stream, the alpha and beta positions are entered into the header data, as was described earlier above.
  • at step S 79 , the automatic speech recognition unit 33 determines whether any of the alpha and beta values are “invalid”, in the sense of being either indeterminate or positioned such as to contravene the earlier described criterion that no link may extend further than into a directly neighbouring block. At the present stage of building up the lattice this determination step obviously determines that there is no such invalidity, and hence the process moves to step S 81 .
  • at step S 81 , the automatic speech recognition unit determines whether the number of nodes in any blocks that have just had nodes inserted in them has reached or exceeded a predetermined critical number.
  • the predetermined critical number is set for the purpose of defining a minimum number of nodes that must be in a block before the block structure will be analysed or altered for the purposes of giving smaller block sizes or more even block spacings.
  • block division for blocks containing less than the critical number of nodes would tend to be counter productive.
  • the choice of the value of the critical number will depend on the particular characteristics of the lattice or data file being considered. As mentioned above, in the present embodiment the number is set at nine. Hence at the present stage of the process, where only two nodes have been inserted in total, the answer to the determination step S 81 is no.
  • at step S 89 , the automatic speech recognition unit determines that more sets of data components are to be added, and hence at step S 91 increments the value of n by one, and the process steps beginning at step S 65 are repeated for the next set of data components.
  • the next set of data components consists of data (item 270 in FIG. 19 b ) specifying the third node of the lattice and its time of 0.41 seconds and data (item 268 in FIG. 19 b ) specifying the phoneme link /oh/ plus its nodal off-set value of 001.
  • the phoneme link /oh/ and third node are shown having been inserted in FIG. 18 b also.
  • at step S 71 , the end 203 of the block, being defined as after the last node, is therefore now positioned as shown in FIG. 18 b , and is implemented in the data stream by the data component 245 , consisting of a block flag, now being positioned after the newly inserted data 268 and 270 .
  • the new position of alpha, now at the new end node, as determined at step S 75 is shown in FIG. 18 b .
  • at step S 79 , it is again determined that there is no invalid alpha or beta, and because the number of nodes is only three (i.e. less than nine) processing of this latest set of data components is now complete, so that the lattice and data stream are currently as shown in FIGS. 18 b and 19 b.
  • the next set of data components consists of the fourth node and the two links which end at that node, namely the phoneme link /w/ and the word link “NOW”.
  • the process steps from S 65 onwards are followed as described for the previous sets of data components, resulting in the lattice structure shown in FIG. 18 c and the data stream shown in FIG. 19 c .
  • It can be seen in FIG. 19 c that the data 272 corresponding to the phoneme link /w/ and the data 274 corresponding to the latest node are just before the last block flag at the end of the data stream, whereas the data 276 corresponding to the word link “NOW” is placed in the data stream with the node from which that link extends, i.e. the first node.
  • Moreover, the data 276 is placed before the other links that extend from the first node, namely the phoneme links /n/ and /m/, because their nodal off-set values are 001 , which is less than the value of 003 for the word link “NOW”.
  • the procedure continues as described above without variation for the insertion of the fifth, sixth, seventh and eighth nodes providing the lattice structure and data stream shown in FIGS. 18 d and 19 d respectively.
  • the set of data components inserted is the ninth node and the phoneme link /w/ ending at that node.
  • the lattice arrangement is as shown in FIG. 18 e - 1 , with the end 203 of the block located after the newly inserted ninth node, and alpha located at that ninth node.
  • at step S 79 , the automatic speech recognition unit determines that there is no invalidity of the alpha and beta values and so the process moves on to step S 81 .
  • the procedure to this point has been the same as for the previous sets of data components. However, since this time the newly inserted node brings the total number of nodes in the sole block up to nine, when the automatic speech recognition unit carries out the determination step S 81 it determines for the first time that the number of nodes in the block is indeed greater than or equal to nine. Consequently, this time the procedure moves to step S 83 , where the automatic speech recognition unit determines whether alpha is greater than beta, i.e. whether alpha occurs later in the block than beta. This is determined in the present example to be the case (in fact this will always be the case for the first block of a lattice due to the way beta is defined for the first block).
  • the basic approach of the present method is that when the number of nodes in a block reaches nine or more, the block will be divided into two blocks, provided that alpha is greater than beta.
  • The reason for waiting until a certain number of nodes has been reached is the cost in overhead resource, as was explained earlier above.
  • the reason for the criteria that alpha be greater than beta is to ensure that each of the two blocks formed by the division of an original block will obey the earlier described criteria that no link is permitted to extend into any block beyond a directly neighbouring block.
  • The procedure therefore moves to step S 85 , in which the automatic speech recognition unit splits the sole block of FIG. 18 e - 1 into two blocks (a minimal code sketch of this splitting step is given after this list).
  • This is carried out by defining a new end of block 205 which is positioned according to any desired criteria specifying a position somewhere between beta and alpha.
  • the criteria is to insert the new end of block equally spaced (in terms of the number of nodes, rounded up where necessary) between beta and alpha.
  • the block is split by insertion of a new end of block 205 immediately after the fifth node, as shown in FIG. 18 e - 2 .
  • This is implemented in the data stream by the insertion of data component 298 , consisting of a block flag, as shown in FIG. 19 e .
  • the automatic speech recognition unit 33 recalculates the times of all of the nodes in the newly formed second block as off-sets from the start time of that block, which is the time of the fifth node of the whole lattice (0.71 seconds).
  • The resulting data stream, shown in FIG. 19 e , now contains the newly inserted data component 298 , newly inserted data 300 relating to the phoneme link /w/ and newly inserted data 302 relating to the end node. Moreover, the data components 304 , 306 , 308 and 310 have had their time values changed to new off-set values.
  • At step S 87 updated values of alpha and beta are determined by the automatic speech recognition unit. Given that there are now two blocks, there are two betas and two alphas to be determined. The new locations of these alphas and betas are shown in FIG. 18 e - 2 .
  • The procedure of FIG. 17 thereafter continues as described above for the insertion of the tenth through to thirteenth nodes of the overall lattice, without the critical number of 9 nodes yet being reached in block 1 .
  • This provides the lattice structure and data stream shown in FIGS. 18 f and 19 f respectively.
  • the next set of data components inserted consists of the fourteenth node and the phoneme link /oh/ ending at that node.
  • the situation after steps S 65 to S 79 are implemented for this set of data components is shown in FIG. 18 g - 1 . Insertion of this latest set of data components has brought the number of nodes in the second block up to nine, and alpha is after beta. Consequently, the automatic speech recognition unit 33 carries out step S 85 in which it inserts a new end of block 207 immediately after the fifth node of the block to be split, as shown in FIG. 18 g - 2 . This is implemented in the data stream by insertion of data component 330 consisting of a new block flag, as shown in FIG. 19 g .
  • the automatic speech recognition unit 33 also calculates the adjusted off-set times ( 334 , 336 , 338 , 340 in FIG. 19 g ) of the nodes in the newly formed third block. Thereafter, at step S 87 , the automatic speech recognition unit determines updated values of the alphas and betas, which provides a new alpha for what is now the second block and a new beta for what is now the third block, both of which are also shown in FIG. 18 g - 2 .
  • the automatic speech recognition unit 33 determines at step S 89 that no more sets of data components are available to be inserted, and hence the current lattice data structure is complete, and indeed corresponds to the lattice shown in FIGS. 15 b and 16 b.
  • the insertion of the earlier timed link is essentially part of the original on-going construction of the lattice, although the data component consisting of the additional link is processed separately at the end because it constitutes a word recognised by the automatic speech recognition unit 33 when passing the phoneme data through a second speech recognition vocabulary.
  • The second vocabulary consists of a specialised place name vocabulary that has been optionally selected by a user.
  • the data is inserted at step S 65 .
  • the data consists of the word link “ESTONIA” and extends from the fourth node of block 0 to the third node of block 2 , as shown in FIG. 20 a.
  • At step S 67 the automatic speech recognition unit 33 recognises that no new node has been inserted, hence the process moves to step S 75 where it determines updated locations of alpha and beta.
  • Because the newly inserted link extends from block 0 right over block 1 to end in block 2 , it contravenes the earlier described criteria barring link extensions beyond directly neighbouring blocks, and moreover does not produce a valid alpha or beta for block 1 .
  • This is represented in FIG. 20 a by the indication that any alpha for block 1 would in fact need to appear in block 0 , and any beta for block 1 would need to appear in block 2 . Consequently, at the next step S 79 , it is determined that alpha and beta are indeed invalid.
  • The procedure therefore moves to step S 77 , which consists of merging blocks.
  • Any suitable criteria can be used to choose which blocks should be merged together, for example the criteria can be based on providing the most evenly spaced blocks, or could consist of merging the offending block with its preceding block. However, in the present embodiment the choice is always to merge the offending block with its following block, i.e. in the present example block 1 will be merged with block 2 .
  • At step S 79 the automatic speech recognition unit 33 determines that alpha and beta are now valid, so the procedure moves to step S 81 .
  • the procedure moves to step S 85 and block 1 is split using the same procedure as described earlier above.
  • The earlier employed criteria specifying where to locate the new block division, namely half way in terms of nodes between beta and alpha, contains in the present example a refinement that when the block to be split has greater than nine nodes, splitting should, where possible, leave the earlier of the two resulting blocks with no more than eight nodes. This is to avoid inefficient repetitions of the block splitting process.
  • the new block marker is inserted immediately after the eighth node of the block being split, as shown in FIG. 20 c .
  • the alphas and betas are again determined, the new positions being shown in FIG. 20 c . It is noted that alpha and beta both occur at the same node of block 1 . In the present example it is determined at step S 89 that no more sets of data components are to be added, and hence the procedure is completed.
  • The step S 77 of merging the two blocks is implemented by removal of the relevant data component 248 containing the original block flag dividing the original blocks 1 and 2 .
  • A further example demonstrating the processing of data according to the procedure laid out in the flow chart of FIG. 17 will now be described with reference to FIGS. 21 a to 21 d .
  • In this further example, additional data components are added immediately after the seventeenth node has been added to the lattice of FIG. 15 b . Therefore at step S 89 of FIG. 17 further components are indeed to be added and the procedure returns again via increment step S 91 to insertion step S 65 .
  • the method steps employed to add the additional data components in the following example also constitute a stand alone method of updating or revising any suitable original lattice irrespective of how the original lattice itself was formed.
  • additional data is added via a keyboard and a phonetic transcription unit, of the same form as the keyboard 3 and phonetic transcription unit 75 shown in FIG. 9 .
  • the output of the phonetic transcription unit is connected to the automatic speech recognition unit 33 .
  • the user uses this arrangement to enter annotation data which he intends to correspond to a specific portion of the video data 31 - 1 .
  • Such data is sometimes referred to in the art as “metadata”.
  • the specific portion of the video data may show, for example, a number of profile shots of an actor, which the user wishes to be able to locate/retrieve at a later date as desired by using the annotation data.
  • At step S 65 data component (i) as described above is inserted by the automatic speech recognition unit 33 into the lattice of FIG. 15 b , in the position shown in FIG. 21 a .
  • At step S 67 the automatic speech recognition unit 33 determines that new nodes have been inserted.
  • The automatic speech recognition unit then determines that neither of the new nodes has been inserted at either the start or the end of the lattice. In other words, the new nodes have been inserted within an existing lattice, and hence it will probably be necessary to adjust the nodal off-sets of one or more existing nodes of the lattice.
  • The procedure therefore moves to step S 73 , in which the automatic speech recognition unit 33 carries out such necessary adjustment of the nodal off-sets of existing nodes.
  • Any appropriate method of adjusting the off-sets can be employed at step S 73 . In the present embodiment a preferred method is employed, and this will be described in detail later below with reference to the flow chart of FIG. 22 .
  • FIG. 21 b shows the stage reached when data components (i), (ii) and (iii) have been inserted and the procedure has reached step S 81 .
  • At step S 85 the automatic speech recognition unit 33 splits the block and at step S 87 determines the new alphas and betas, resulting in the new block structure shown in FIG. 21 c .
  • the criteria employed for locating the new block end is one in which the size of the newly formed second block is made as large as possible except that placing the end of the block at alpha itself is not allowed.
  • When step S 81 is next reached, the lattice is of the form shown in FIG. 21 d , i.e. nine nodes are now located in present block 2 , and hence the outcome of step S 81 is that the procedure again moves to step S 83 .
  • the present example has thrown up a situation in present block 2 where beta occurs after alpha, in other words the longest link extended into block 2 extends beyond the start of the earliest link exiting that block 2 , as can be seen in FIG. 21 d .
  • The step S 73 of adjusting the off-sets will now be described with reference to the flow chart of FIG. 22 , which shows the procedure followed for each newly inserted node (a corresponding code sketch is also given after this list).
  • the preferred method uses the fact that the location of alpha and beta in each block is known.
  • The automatic speech recognition unit 33 analyses nodes preceding the newly inserted node, to determine any links that extend from those nodes beyond the location of the newly inserted node. If any such link is found, then it needs to have its nodal off-set value increased by one, to accommodate the fact that the newly inserted node is present under its span.
  • If the newly inserted node is positioned after beta within a given block, then only those nodes before the newly inserted node and within the same given block need be analysed, since there are inherently no links extending from the previous block beyond beta.
  • If, on the other hand, the newly inserted node is positioned before beta, then the nodes before the newly inserted node in that given block need to be analysed, plus the nodes in the preceding block, but only so far back as to include the node corresponding to alpha.
  • the nodes positioned before alpha of the preceding block do not need to be analysed because inherently there are no links extending from before alpha into the block in which the new node has been inserted.
  • An increment counter i is used to control repeated application, as required, of the procedure to consecutive earlier nodes on a node-by-node basis.
  • At step S 103 the node which is positioned one place before the inserted node is identified. Referring to FIG. 21 a , in the case of the newly inserted node from which the word link “PROFILE” extends, the identified node one position before it is the node from which the word link “THE” extends.
  • all the links extending from the identified node are identified, being here the word link “THE” and the phoneme link /dh/.
  • the automatic speech recognition unit 33 determines the nodal off-set value of these links, which is 002 for the word link “THE” and 001 for the phoneme link /dh/, and hence at step S 107 increases each of these nodal off-set values by one, to the new values of 003 and 002 respectively.
  • At step S 109 it is determined whether the newly inserted node was positioned before beta. In the present case it was actually positioned after beta, hence analysis of the nodes need only continue back to the first node of the present block, and hence at step S 111 it is determined whether the currently identified node, i.e. the node that has just had its nodal off-sets changed, is the first node of the present block.
  • If it is, as in the present case, the procedure ends. If, however, further nodes remained to be processed in the present block, then the procedure would continue to step S 113 where the value of i is incremented, and then the procedure would be repeated for the next previous node starting from step S 103 . Also, if in the above example the newly inserted node was in fact located before beta, then the procedure would be continued on until each node up to the node corresponding to alpha in the preceding block had been processed. In order to achieve this, when the inserted node is indeed before beta then the procedure moves to step S 115 where the automatic speech recognition unit determines whether the identified node is at the position of alpha of the preceding block. If it is then the procedure is complete. If it is not, then the procedure moves to step S 117 where the value of i is incremented, and then the procedure is repeated from step S 103 .
  • FIG. 23 a shows a sequence of nodes within a lattice, linked by phoneme links for example phoneme link 412 , the end part of a word link 414 and a further word link 416 .
  • the nodes are divided into blocks by block markers 402 , 404 and 406 , forming blocks n and (n+1) of the lattice.
  • FIG. 23 a shows the state of the lattice after the data represented by phoneme link 413 and the two nodes between which it extends has been inserted.
  • the number of nodes in block (n+1) has now reached nine, and since also alpha is later than beta, block rearrangement is now implemented.
  • the two blocks of FIG. 23 a are replaced by three blocks, namely block n, block (n+1) and block (n+2), as shown in FIG. 23 b . This is implemented by deleting the block divider 404 , and replacing it with two new block dividers 408 and 410 placed immediately after beta of block n and beta of block (n+1) respectively.
  • Alpha and beta for each block are thereafter re-calculated and the new positions are shown in FIG. 23 b .
  • This procedure for rearranging the blocks provides particularly evenly spaced blocks. This is particularly the case when a given block has the required number of nodes for splitting and its alpha is after beta, yet in the block preceding it beta is positioned after alpha. It is noted that this was indeed the case in FIG. 23 a . Because of this, in the preferred embodiment, block splitting is carried out by this procedure of forming a new block between the two beta positions when beta is positioned after alpha in the relevant preceding block, but block splitting follows the originally described procedure of dividing the present block between alpha and beta when beta is positioned before alpha in the preceding block.
  • the two new block dividers may be positioned at nodes relatively close, compared to the number of nodes in each block, to the position of beta of block n and beta of block (n+1) respectively, instead of at those two beta positions as such.
  • the timing of each node of the lattice is provided, prior to arrangement in blocks, relative to a common zero time set such that the first node occurs at a time of 0.10 seconds.
  • the start time for the first block is set equal to the common zero time.
  • the start time for each of the other blocks is the time of the last node of the preceding block.
  • the timing of each node may be provided in an absolute form, and the block marker demarcating the start of each block may be given a Universal Standard Time (UST) time stamp, corresponding to the absolute time of the first node of that block rounded down to the nearest whole second.
  • The UST time stamp may be implemented as a 4-byte integer representing a count of the number of seconds since 1 Jan. 1970.
  • Because each block time is rounded to the nearest second, if block durations of less than 1 second were to be permitted, then two or more blocks could be allocated the same time stamp value. Therefore, when UST time stamps are employed, block durations less than 1 second are not permitted. This is implemented by specifying a predetermined block duration, e.g. 1 second, that a current block must exceed before splitting of the current block is performed. This requirement will operate in addition to the earlier described requirement that the current block must contain greater than a predetermined number of nodes before splitting of the current block is performed. Alternatively, shorter block durations may be accommodated by employing a time stamp convention other than UST and then rounding down the block marker times more precisely than the minimum allowed duration of a block.
  • In the above embodiments the phoneme and word lattice structure was determined and generated by the automatic speech recognition unit 33 , configured with the requisite functionality.
  • Alternatively, a standard automatic speech recognition unit can be used, in conjunction with a separate lattice creation unit comprising the functionality for determining and generating the above described phoneme and word lattice structure.
  • An embodiment employing a standard automatic speech recognition unit 40 which outputs a sequence of phonemes is shown in FIG. 24 .
  • the word decoder 37 identifies words from the phoneme data 35 .
  • the identified words are added to the phoneme data to form phoneme and word data 42 .
  • a lattice creation unit 44 determines and generates the above described phoneme and word lattice structure which forms the phoneme and word annotation data 31 - 3 .
  • a word to phoneme dictionary can be used to generate phonemes, and then the words and phonemes are combined and formed into the above described phoneme and word lattice structure by a lattice creation unit (not shown).
  • In the above embodiments the phoneme and word data was associated with the links of the lattice.
  • Alternatively, the word and/or the phoneme data can be associated with the nodes instead.
  • the data associated with each node would preferably include a start and an end time for each word or phoneme associated therewith.
  • a technique has been described above for organising an unordered list of nodes and links into an ordered and blocked list.
  • the technique has been described for the particular application of the ordering of an unordered list of phonemes and words.
  • this technique can be applied to other types of data lattices.
  • the technique can be applied to a lattice which only has phonemes or a lattice which only has words.
  • It can also be applied to a lattice generated from a handwriting recognition system which produces a lattice of possible characters as a result of a character recognition process.
  • In that case, the nodes and links would not be ordered in time, but would be spatially ordered so that the characters appear in the ordered lattice at a position which corresponds to the character's position on the page relative to the other characters.
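To make the block handling steps walked through above easier to follow, a minimal illustrative sketch is given below. It is not the implementation described herein: the Python representation (a list of blocks, each a list of node records holding a block-relative time in hundredths of a second and a list of (label, nodal off-set) link pairs), the function names and the hard-coded nine-node threshold are assumptions chosen only to mirror the splitting of step S 85, the merging of step S 77 and the nodal off-set adjustment of step S 73 and FIG. 22.

```python
# Illustrative sketch only (assumed data layout): "blocks" is a list of blocks,
# each block a list of node dicts of the form
#   {"t": <off-set in 1/100 s from the block start>,
#    "links": [(label, nodal_offset), ...]}
# broadly following the layout of FIGS. 16a and 16b.

NODE_THRESHOLD = 9  # a block is considered for splitting once it holds nine nodes


def split_block(blocks, i, alpha, beta):
    """Step S 85: split block i between beta and alpha (node positions within
    the block).  One reading of the 'equally spaced, rounded up' rule is used
    here: the new block flag goes immediately after the node half way between
    beta and alpha.  Times of nodes moved into the new block are re-expressed
    relative to its start, i.e. the time of the last node left in the first
    block.  The special alpha/beta conventions used for the first and last
    blocks in the worked example are not modelled."""
    if len(blocks[i]) < NODE_THRESHOLD or alpha <= beta:
        return
    mid = beta + (alpha - beta + 1) // 2              # half way, rounded up
    first, second = blocks[i][:mid + 1], blocks[i][mid + 1:]
    new_start = first[-1]["t"]
    for node in second:
        node["t"] -= new_start
    blocks[i:i + 1] = [first, second]


def merge_with_following(blocks, i):
    """Step S 77: merge block i with the block that follows it, i.e. remove
    the block flag between them and re-express the second block's node times
    relative to the start of block i."""
    shift = blocks[i][-1]["t"]                        # start time of old block i+1
    for node in blocks[i + 1]:
        node["t"] += shift
    blocks[i:i + 2] = [blocks[i] + blocks[i + 1]]


def adjust_offsets(nodes, insert_pos, scan_start):
    """Step S 73 / FIG. 22: after a node has been inserted at position
    insert_pos (counting nodes across the whole lattice), every link that
    starts at a node in [scan_start, insert_pos) and spans over the new node
    has its nodal off-set increased by one.  scan_start is the first node of
    the block when the new node lies after beta, or the node at alpha of the
    preceding block when it lies before beta."""
    for pos in range(insert_pos - 1, scan_start - 1, -1):
        nodes[pos]["links"] = [
            (label, off + 1) if pos + off >= insert_pos else (label, off)
            for (label, off) in nodes[pos]["links"]
        ]
```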

Abstract

A data structure is provided for annotating data files within a database. The annotation data comprises a phoneme and word lattice which allows the quick and efficient searching of data files within the database, in response to a user's input query for desired information. The phoneme and word lattice comprises a plurality of time-ordered nodes, and a plurality of links extending between the nodes. Each link has a phoneme or word associated with it. The nodes are arranged in a sequence of time-ordered blocks such that further data can be conveniently added to the lattice.

Description

This application is a National Stage Filing Under 35 U.S.C. 371 of International Application No. PCT/GB01/04331, filed Sep. 28, 2001, and published in English as International Publication No. WO 02/27546 A2, on Apr. 4, 2002.
The present invention relates to the annotation of data files which are to be stored in a database for facilitating their subsequent retrieval. The present invention is also concerned with a system for generating the annotation data which is added to the data file and to a system for searching the annotation data in the database to retrieve a desired data file in response to a user's input query. The invention also relates to a system for translating an unordered list of nodes and links into an ordered and blocked list of nodes and links.
Databases of information are well known and suffer from the problem of how to locate and retrieve the desired information from the database quickly and efficiently. Existing database search tools allow the user to search the database using typed keywords. Whilst this is quick and efficient, this type of searching is not suitable for various kinds of databases, such as video or audio databases.
According to one aspect, the present invention aims to provide a data structure for the annotation of data files within a database which will allow a quick and efficient search to be carried out in response to a user's input query.
According to another aspect, the present invention provides data defining a phoneme and word lattice for use as annotation data for annotating data files to be stored within a database. Preferably, the data defines a plurality of nodes and a plurality of links connecting the nodes, and further data associates a plurality of phonemes with a respective plurality of links and further data associates at least one word with at least one of said links, and further data defines a block arrangement for the nodes such that the links may only extend over a given maximum number of blocks. It is further preferred that the links may only extend into a following block.
According to another aspect, the present invention provides an apparatus for searching a database which employs the annotation data discussed above for annotating data filed therein. Preferably, the apparatus is arranged to generate phoneme data in response to a user's query or input, and to search the database using the generated phoneme data. It is further preferred that word data is also generated from the user's input or query.
According to another aspect, the present invention provides an apparatus for generating a phoneme and word lattice corresponding to received phoneme and word data, comprising means for defining a plurality of links and a plurality of nodes between which the links extend, means for associating the links with phonemes or words, and means for arranging the nodes in a sequence of time ordered blocks in which the links only extend up to a maximum given number of blocks later in the sequence. Preferably, the maximum extension allowed for a link is to extend into a following block. It is further preferred that the apparatus is arranged to add nodes or links incrementally as it forms the lattice, and to split an existing block of nodes into at least two blocks of nodes.
According to another aspect, the present invention provides an apparatus for adding phonemes or words to a phoneme and word lattice of any of the types discussed above, and arranged to analyse which data defining the current phoneme and word lattice needs to be modified in dependence upon the extent to which the links are permitted to extend from one block to another. Preferably, this analysis is further dependent upon the location within the lattice of a point identifying the latest node in each block to which any link originating in the preceding block extends and a point identifying the earliest node in each block from which a link extends into the succeeding block.
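By way of illustration only, the following sketch shows one way of locating those two points for every block; the Python representation (each block a list of nodes carrying (label, nodal off-set) link pairs) and the function name are assumptions rather than a definitive implementation.

```python
def alpha_beta(blocks):
    """For each block return (alpha, beta) as node positions within the block:
    alpha - the latest node reached by any link originating in the preceding
            block; beta - the earliest node from which any link extends into
            the succeeding block.  A value is None when no such link exists.
    Links are held on their start node as (label, nodal_offset) pairs and are
    assumed to extend at most into the directly following block."""
    result = []
    for i, block in enumerate(blocks):
        alpha = None
        if i > 0:
            prev = blocks[i - 1]
            for pos, node in enumerate(prev):
                for _label, off in node["links"]:
                    target = pos + off - len(prev)    # position within block i
                    if target >= 0:
                        alpha = target if alpha is None else max(alpha, target)
        beta = None
        for pos, node in enumerate(block):
            if any(pos + off >= len(block) for _label, off in node["links"]):
                beta = pos
                break                                  # earliest such node found
        result.append((alpha, beta))
    return result
```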
According to another aspect, the present invention provides a method of adding phonemes or words to a phoneme and word lattice of any of the types discussed above, comprising analysing which data defining the current phoneme and word lattice needs to be modified in dependence upon the extent to which the links are permitted to extend from one block to another. Preferably, this analysis is further dependent upon the location within the lattice of respective points identifying the latest node in each block to which any link originating in the preceding block extends.
According to another aspect, a method and apparatus are provided for converting an unordered list of nodes and links into an ordered and blocked list of nodes and links. The blocks are formed by filling and splitting: successive nodes are inserted into a block until it is full, then a new block is begun. If new nodes would overfill an already full block, that block is split into two or more blocks. Constraints on the links regarding which block they can lead to are used to speed up the block splitting process, and identify which nodes remain in the old block and which go into the new block.
Exemplary embodiments of the present invention will now be described with reference to the accompanying figures, in which:
FIG. 1 is a schematic view of a computer which is programmed to operate an embodiment of the present invention;
FIG. 2 is a block diagram showing a phoneme and word annotator unit which is operable to generate phoneme and word annotation data for appendage to a data file;
FIG. 3 is a block diagram illustrating one way in which the phoneme and word annotator can generate the annotation data from an input video data file;
FIG. 4 a is a schematic diagram of a phoneme lattice for an example audio string from the input video data file;
FIG. 4 b is a schematic diagram of a word and phoneme lattice embodying one aspect of the present invention, for an example audio string from the input video data file;
FIG. 5 is a schematic block diagram of a user's terminal which allows the user to retrieve information from the database by a voice query;
FIG. 6 is a schematic diagram of a pair of word and phoneme lattices, for example audio strings from two speakers;
FIG. 7 is a schematic block diagram illustrating a user terminal which allows the annotation of a data file with annotation data generated from an audio signal input from a user;
FIG. 8 is a schematic diagram of phoneme and word lattice annotation data which is generated for an example utterance input by the user for annotating a data file;
FIG. 9 is a schematic block diagram illustrating a user terminal which allows the annotation of a data file with annotation data generated from a typed input from a user;
FIG. 10 is a schematic diagram of phoneme and word lattice annotation data which is generated for a typed input by the user for annotating a data file;
FIG. 11 is a block schematic diagram showing the form of a document annotation system;
FIG. 12 is a block schematic diagram of an alternative document annotation system;
FIG. 13 is a block schematic diagram of another document annotation system;
FIG. 14 is a schematic block diagram illustrating the way in which a phoneme and word lattice can be generated from script data contained within a video data file;
FIG. 15 a is a schematic diagram of a word and phoneme lattice showing relative timings of the nodes of the lattice;
FIG. 15 b is a schematic diagram showing the nodes of a word and phoneme lattice divided into blocks.
FIG. 16 a is a schematic diagram illustrating the format of data corresponding to one node of a word and phoneme lattice;
FIG. 16 b is a schematic diagram illustrating a data stream defining a word and phoneme lattice;
FIG. 17 is a flow diagram illustrating a process of forming a word and phoneme lattice according to one embodiment of the present invention;
FIGS. 18 a to 18 h are schematic diagrams illustrating the build-up of a word and phoneme lattice;
FIGS. 19 a to 19 h are schematic diagrams illustrating the build-up of a data stream defining a word and phoneme lattice;
FIGS. 20 a to 20 c are schematic diagrams showing the updating of a word and phoneme lattice on insertion of a long link;
FIGS. 21 a to 21 d are schematic diagrams illustrating the updating of a word and phoneme lattice on insertion of additional nodes;
FIG. 22 is a flow diagram illustrating a procedure of adjusting off-sets;
FIGS. 23 a and 23 b are schematic diagrams illustrating the application of a block splitting procedure to a word and phoneme lattice; and
FIG. 24 is a block diagram illustrating one way in which the phoneme and word annotator can generate the annotation data from an input video data file.
Embodiments of the present invention can be implemented using dedicated hardware circuits, but the embodiment to be described is implemented in computer software or code, which is run in conjunction with processing hardware such as a personal computer, work station, photocopier, facsimile machine, personal digital assistant (PDA) or the like.
FIG. 1 shows a personal computer (PC) 1 which is programmed to operate an embodiment of the present invention. A keyboard 3, a pointing device 5, a microphone 7 and a telephone line 9 are connected to the PC 1 via an interface 11. The keyboard 3 and pointing device 5 enable the system to be controlled by a user. The microphone 7 converts acoustic speech signals from the user into equivalent electrical signals and supplies them to the PC 1 for processing. An internal modem and speech receiving circuit (not shown) is connected to the telephone line 9 so that the PC 1 can communicate with, for example, a remote computer or with a remote user.
The programme instructions which make the PC 1 operate in accordance with the present invention may be supplied for use with an existing PC 1 on, for example, a storage device such as a magnetic disc 13, or by downloading the software from the Internet (not shown) via the internal modem and telephone line 9.
Data File Annotation
FIG. 2 is a block diagram illustrating the way in which annotation data 21 for an input data file 23 is generated in this embodiment by a phoneme and word annotating unit 25. As shown, the generated phoneme and word annotation data 21 is then combined with the data file 23 in the data combination unit 27 and the combined data file output thereby is input to the database 29. In this embodiment, the annotation data 21 comprises a combined phoneme (or phoneme like) and word lattice which allows the user to retrieve information from the database by a voice query. As those skilled in the art will appreciate, the data file 23 can be any kind of data file, such as, a video file, an audio file, a multimedia file etc.
A system has been proposed to generate N-Best word lists for an audio stream as annotation data by passing the audio data from a video data file through an automatic speech recognition unit. However, such word-based systems suffer from a number of problems. These include (i) that state of the art speech recognition systems still make basic mistakes in recognition; (ii) that state of the art automatic speech recognition systems use a dictionary of perhaps 20,000 to 100,000 words and cannot produce words outside that vocabulary; and (iii) that the production of N-Best lists grows exponentially with the number of hypotheses at each stage, therefore resulting in the annotation data becoming prohibitively large for long utterances.
The first of these problems may not be that significant if the same automatic speech recognition system is used to generate the annotation data and to subsequently retrieve the corresponding data file, since the same decoding error could occur. However, with advances in automatic speech recognition systems being made each year, it is likely that in the future the same type of error may not occur, resulting in the inability to be able to retrieve the corresponding data file at that later date. With regard to the second problem, this is particularly significant in video data applications, since users are likely to use names and places (which may not be in the speech recognition dictionary) as input query terms. In place of these names, the automatic speech recognition system will typically replace the out of vocabulary words with a phonetically similar word or words within the vocabulary, often corrupting nearby decodings. This can also result in the failure to retrieve the required data file upon subsequent request.
In contrast, with the proposed phoneme and word lattice annotation data, a quick and efficient search using the word data in the database 29 can be carried out and, if this fails to provide the required data file, then a further search using the more robust phoneme data can be performed. The phoneme and word lattice is an acyclic directed graph with a single entry point and a single exit point. It represents different parses of the audio stream within the data file. It is not simply a sequence of words with alternatives since each word does not have to be replaced by a single alternative, one word can be substituted for two or more words or phonemes, and the whole structure can form a substitution for one or more words or phonemes. Therefore, the density of data within the phoneme and word lattice essentially remains linear throughout the audio data, rather than growing exponentially as in the case of the N-Best technique discussed above. As those skilled in the art of speech recognition will realise, the use of phoneme data is more robust, because phonemes are dictionary independent and allow the system to cope with out of vocabulary words, such as names, places, foreign words etc. The use of phoneme data is also capable of making the system future proof, since it allows data files which are placed into the database to be retrieved even when the words were not understood by the original automatic speech recognition system.
The way in which this phoneme and word lattice annotation data can be generated for a video data file will now be described with reference to FIG. 3. As shown, the video data file 31 comprises video data 31-1, which defines the sequence of images forming the video sequence and audio data 31-2, which defines the audio which is associated with the video sequence. As is well known, the audio data 31-2 is time synchronised with the video data 31-1 so that, in use, both the video and audio data are supplied to the user at the same time.
As shown in FIG. 3, in this embodiment, the audio data 31-2 is input to an automatic speech recognition unit 33, which is operable to generate a phoneme lattice corresponding to the stream of audio data 31-2. Such an automatic speech recognition unit 33 is commonly available in the art and will not be described in further detail. The reader is referred to, for example, the book entitled ‘Fundamentals of Speech Recognition’ by Lawrence Rabiner and Biing-Hwang Juang and, in particular, to pages 42 to 50 thereof, for further information on this type of speech recognition system.
FIG. 4 a illustrates the form of the phoneme lattice data output by the speech recognition unit 33, for the input audio corresponding to the phrase '. . . now is the winter of our . . . '. The automatic speech recognition unit 33 identifies a number of different possible phoneme strings which correspond to this input audio utterance. For example, the speech recognition system considers that the first phoneme in the audio string is either an /m/ or an /n/. For clarity, only the alternatives for the first phoneme are shown. As is well known in the art of speech recognition, these different possibilities can have their own weighting which is generated by the speech recognition unit 33 and is indicative of the confidence of the speech recognition unit's output. For example, the phoneme /n/ may be given a weighting of 0.9 and the phoneme /m/ may be given a weighting of 0.1, indicating that the speech recognition system is fairly confident that the corresponding portion of audio represents the phoneme /n/, but that it still may be the phoneme /m/.
In this embodiment, however, this weighting of the phonemes is not performed.
As shown in FIG. 3, the phoneme lattice data 35 output by the automatic speech recognition unit 33 is input to a word decoder 37 which is operable to identify possible words within the phoneme lattice data 35. In this embodiment, the words identified by the word decoder 37 are incorporated into the phoneme lattice data structure. For example, for the phoneme lattice shown in FIG. 4 a, the word decoder 37 identifies the words “NOW”, “IS”, “THE”, “WINTER”, “OF” and “OUR”. As shown in FIG. 4 b, these identified words are added to the phoneme lattice data structure output by the speech recognition unit 33, to generate a phoneme and word lattice data structure which forms the annotation data 31-3. This annotation data 31-3 is then combined with the video data file 31 to generate an augmented video data file 31′ which is then stored in the database 29. As those skilled in the art will appreciate, in a similar way to the way in which the audio data 31-2 is time synchronised with the video data 31-1, the annotation data 31-3 is also time synchronised and associated with the corresponding video data 31-1 and audio data 31-2, so that a desired portion of the video and audio data can be retrieved by searching for and locating the corresponding portion of the annotation data 31-3.
In this embodiment, the annotation data 31-3 stored in the database 29 has the following general form:
    • Header
      • time of start
      • flag if word if phoneme if mixed
    • time index associating the location of blocks of annotation data within memory to a given time point.
      • word set used (i.e. the dictionary)
      • phoneme set used
      • phoneme probability data
      • the language to which the vocabulary pertains
    • Block(i) i=0,1,2, . . .
      • node Nj j=0,1,2, . . .
        • time offset of node from start of block
        • phoneme links (k), k=0, 1, 2, . . . : offset to node Nk = Nk−Nj (Nk is the node to which link k extends), or, if Nk is in block (i+1), offset to node Nk = Nk+Nb−Nj (where Nb is the number of nodes in block (i)); phoneme associated with link (k)
        • word links (l), l=0, 1, 2, . . . : offset to node Nl = Nl−Nj (Nl is the node to which link l extends), or, if Nl is in block (i+1), offset to node Nl = Nl+Nb−Nj (where Nb is the number of nodes in block (i)); word associated with link (l)
The time of start data in the header can identify the time and date of transmission of the data. For example, if the video file is a news broadcast, then the time of start may include the exact time of the broadcast and the date on which it was broadcast.
The flag identifying if the annotation data is word annotation data, phoneme annotation data or if it is mixed is provided since not all the data files within the database will include the combined phoneme and word lattice annotation data discussed above, and in this case, a different search strategy would be used to search this annotation data.
In this embodiment, the annotation data is divided into blocks in order to allow the search to jump into the middle of the annotation data for a given audio data stream. The header therefore includes a time index which associates the location of the blocks of annotation data within the memory to a given time offset between the time of start and the time corresponding to the beginning of the block.
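A minimal sketch of one possible form of such a time index is given below; the list-of-pairs representation, the function names and the use of a binary search are assumptions, included only to illustrate how a search can jump directly to the block covering a given time offset.

```python
import bisect


def build_time_index(block_start_offsets, block_locations):
    """block_start_offsets[i]: time offset (seconds) of the start of block i
    from the time of start given in the header; block_locations[i]: where
    block i is held in memory (for example a byte offset into the annotation
    data).  The index is kept sorted by time."""
    return sorted(zip(block_start_offsets, block_locations))


def locate_block(time_index, query_offset):
    """Return the stored location of the block whose time range covers
    query_offset (seconds from the time of start)."""
    starts = [start for start, _ in time_index]
    i = bisect.bisect_right(starts, query_offset) - 1
    return time_index[max(i, 0)][1]
```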
The header also includes data defining the word set used (i.e. the dictionary), the phoneme set used and the language to which the vocabulary pertains. The header may also include details of the automatic speech recognition system used to generate the annotation data and any appropriate settings thereof which were used during the generation of the annotation data.
The phoneme probability data defines the probability of insertions, deletions, misrecognitions and decodings for the system, such as an automatic speech recognition system, which generated the annotation data.
The blocks of annotation data then follow the header and identify, for each node in the block, the time offset of the node from the start of the block, the phoneme links which connect that node to other nodes by phonemes and word links which connect that node to other nodes by words. Each phoneme link and word link identifies the phoneme or word which is associated with the link. They also identify the offset of the linked node relative to the current node. For example, if node N50 is linked to node N55 by a phoneme link, then the offset relative to node N50 is 5. As those skilled in the art will appreciate, using an offset indication like this allows the division of the continuous annotation data into separate blocks.
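The off-set rule set out in the block format above, and the N50/N55 example, can be summarised in the short sketch below; the function name and argument layout are assumptions made purely for illustration.

```python
def link_offset(n_j, n_k, in_following_block=False, nodes_in_block=None):
    """Off-set stored with a link from node Nj to node Nk, where n_j and n_k
    are the node numbers counted within their own blocks.  When the end node
    lies in the following block the off-set is Nk + Nb - Nj, Nb being the
    number of nodes in Nj's block."""
    if in_following_block:
        return n_k + nodes_in_block - n_j
    return n_k - n_j


# e.g. a phoneme link from node N50 to node N55 of the same block is stored
# with an off-set of 5
assert link_offset(50, 55) == 5
```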
In an embodiment where an automatic speech recognition unit outputs weightings indicative of the confidence of the speech recognition units output, these weightings or confidence scores would also be included within the data structure. In particular, a confidence score would be provided for each node which is indicative of the confidence of arriving at the node and each of the phoneme and word links would include a transition score depending upon the weighting given to the corresponding phoneme or word. These weightings would then be used to control the search and retrieval of the data files by discarding those matches which have a low confidence score.
Data File Retrieval
FIG. 5 is a block diagram illustrating the form of a user terminal 59 which can be used to retrieve the annotated data files from the database 29. This user terminal 59 may be, for example, a personal computer, hand held device or the like. As shown, in this embodiment, the user terminal 59 comprises the database 29 of annotated data files, an automatic speech recognition unit 51, a search engine 53, a control unit 55 and a display 57. In operation, the automatic speech recognition unit 51 is operable to process an input voice query from the user 39 received via the microphone 7 and the input line 61 and to generate therefrom corresponding phoneme and word data. This data may also take the form of a phoneme and word lattice, but this is not essential. This phoneme and word data is then input to the control unit 55 which is operable to initiate an appropriate search of the database 29 using the search engine 53. The results of the search, generated by the search engine 53, are then transmitted back to the control unit 55 which analyses the search results and generates and displays appropriate display data to the user via the display 57. More details of the search techniques which can be performed are given in co-pending applications PCT/GB00/00718 and GB9925561.4, the contents of which are incorporated herein by reference.
ALTERNATIVE EMBODIMENTS
As those skilled in the art will appreciate, this type of phonetic and word annotation of data files in a database provides a convenient and powerful way to allow a user to search the database by voice. In the illustrated embodiment, a single audio data stream was annotated and stored in the database for subsequent retrieval by the user. As those skilled in the art will appreciate, when the input data file corresponds to a video data file, the audio data within the data file will usually include audio data for different speakers. Instead of generating a single stream of annotation data for the audio data, separate phoneme and word lattice annotation data can be generated for the audio data of each speaker. This may be achieved by identifying, from the pitch or from another distinguishing feature of the speech signals, the audio data which corresponds to each of the speakers and then by annotating the different speaker's audio separately. This may also be achieved if the audio data was recorded in stereo or if an array of microphones were used in generating the audio data, since it is then possible to process the audio data to extract the data for each speaker.
FIG. 6 illustrates the form of the annotation data in such an embodiment, where a first speaker utters the words “. . . this so” and the second speaker replies “yes”. As illustrated, the annotation data for the different speakers' audio data are time synchronised, relative to each other, so that the annotation data is still time synchronised to the video and audio data within the data file. In such an embodiment, the header information in the data structure should preferably include a list of the different speakers within the annotation data and, for each speaker, data defining that speaker's language, accent, dialect and phonetic set, and each block should identify those speakers that are active in the block.
In the above embodiments, a speech recognition system was used to generate the annotation data for annotating a data file in the database. As those skilled in the art will appreciate, other techniques can be used to generate this annotation data. For example, a human operator can listen to the audio data and generate a phonetic and word transcription to thereby manually generate the annotation data.
In the above embodiments, the annotation data was generated from audio stored in the data file itself. As those skilled in the art will appreciate, other techniques can be used to input the annotation data. FIG. 7 illustrates the form of a user terminal 59 which allows a user to input voice annotation data via the microphone 7 for annotating a data file 91 which is to be stored in the database 29. In this embodiment, the data file 91 comprises a two dimensional image generated by, for example, a camera. The user terminal 59 allows the user 39 to annotate the 2D image with an appropriate annotation which can be used subsequently for retrieving the 2D image from the database 29. In this embodiment, the input voice annotation signal is converted, by the automatic speech recognition unit 51, into phoneme and word lattice annotation data which is passed to the control unit 55. In response to the user's input, the control unit 55 retrieves the appropriate 2D file from the database 29 and appends the phoneme and word annotation data to the data file 91. The augmented data file is then returned to the database 29. During this annotating step, the control unit 55 is operable to display the 2D image on the display 57 so that the user can ensure that the annotation data is associated with the correct data file 91.
The automatic speech recognition unit 51 generates the phoneme and word lattice annotation data by (i) generating a phoneme lattice for the input utterance; (ii) then identifying words within the phoneme lattice; and (iii) finally by combining the two. FIG. 8 illustrates the form of the phoneme and word lattice annotation data generated for the input utterance “picture of the Taj-Mahal”. As shown, the automatic speech recognition unit identifies a number of different possible phoneme strings which correspond to this input utterance. As shown in FIG. 8, the words which the automatic speech recognition unit 51 identifies within the phoneme lattice are incorporated into the phoneme lattice data structure. As shown, for the example phrase, the automatic speech recognition unit 51 identifies the words “picture”, “of”, “off”, “the”, “other”, “ta”, “tar”, “jam”, “ah”, “hal”, “ha” and “al”. The control unit 55 is then operable to add this annotation data to the 2D image data file 91 which is then stored in a database 29.
As those skilled in the art will appreciate, this embodiment can be used to annotate any kind of image such as x-rays of patients, 3D videos of, for example, NMR scans, ultrasound scans etc. It can also be used to annotate one-dimensional data, such as audio data or seismic data.
In the above embodiment, a data file was annotated from a voiced annotation. As those skilled in the art will appreciate, other techniques can be used to input the annotation. For example, FIG. 9 illustrates the form of a user terminal 59 which allows a user to input typed annotation data via the keyboard 3 for annotating a data file 91 which is to be stored in a database 29. In this embodiment, the typed input is converted, by the phonetic transcription unit 75, into the phoneme and word lattice annotation data (using an internal phonetic dictionary (not shown)) which is passed to the control unit 55. In response to the user's input, the control unit 55 retrieves the appropriate 2D file from the database 29 and appends the phoneme and word annotation data to the data file 91. The augmented data file is then returned to the database 29. During this annotating step, the control unit 55 is operable to display the 2D image on the display 57 so that the user can ensure that the annotation data is associated with the correct data file 91.
FIG. 10 illustrates the form of the phoneme and word lattice annotation data generated for the input utterance “picture of the Taj-Mahal”. As shown in FIG. 10, the phoneme and word lattice is an acyclic directed graph with a single entry point and a single exit point. It represents different parses of the user's input. As shown, the phonetic transcription unit 75 identifies a number of different possible phoneme strings which correspond to the typed input.
FIG. 11 is a block diagram illustrating a document annotation system. In particular, as shown in FIG. 11, a text document 101 is converted into an image data file by a document scanner 103. The image data file is then passed to an optical character recognition (OCR) unit 105 which converts the image data of the document 101 into electronic text. This electronic text is then supplied to a phonetic transcription unit 107 which is operable to generate phoneme and word annotation data 109 which is then appended to the image data output by the scanner 103 to form a data file 111. As shown, the data file 111 is then stored in the database 29 for subsequent retrieval. In this embodiment, the annotation data 109 comprises the combined phoneme and word lattice described above which allows the user to subsequently retrieve the data file 111 from the database 29 by a voice query.
FIG. 12 illustrates a modification to the document annotation system shown in FIG. 11. The difference between the system shown in FIG. 12 and the system shown in FIG. 11 is that the output of the optical character recognition unit 105 is used to generate the data file 113, rather than the image data output by the scanner 103. The rest of the system shown in FIG. 12 is the same as that shown in FIG. 11 and will not be described further.
FIG. 13 shows a further modification to the document annotation system shown in FIG. 11. In the embodiment shown in FIG. 13, the input document is received by a facsimile unit 115 rather than a scanner 103. The image data output by the facsimile unit is then processed in the same manner as the image data output by the scanner 103 shown in FIG. 11, and will not be described again.
In the above embodiment, a phonetic transcription unit 107 was used for generating the annotation data for annotating the image or text data. As those skilled in the art will appreciate, other techniques can be used. For example, a human operator can manually generate this annotation data from the image of the document itself.
In the first embodiment, the audio data from the data file 31 was passed through an automatic speech recognition unit in order to generate the phoneme annotation data. In some situations, a transcript of the audio data will be present in the data file. Such an embodiment is illustrated in FIG. 14. In this embodiment, the data file 81 represents a digital video file having video data 81-1, audio data 81-2 and script data 81-3 which defines the lines for the various actors in the video film. As shown, the script data 81-3 is passed through a text to phoneme converter 83, which generates phoneme lattice data 85 using a stored dictionary which translates words into possible sequences of phonemes. This phoneme lattice data 85 is then combined with the script data 81-3 to generate the above described phoneme and word lattice annotation data 81-4. This annotation data is then added to the data file 81 to generate an augmented data file 81′ which is then added to the database 29. As those skilled in the art will appreciate, this embodiment facilitates the generation of separate phoneme and word lattice annotation data for the different speakers within the video data file, since the script data usually contains indications of who is talking. The synchronisation of the phoneme and word lattice annotation data with the video and audio data can then be achieved by performing a forced time alignment of the script data with the audio data using an automatic speech recognition system (not shown).
In the above embodiments, a phoneme (or phoneme-like) and word lattice was used to annotate a data file. As those skilled in the art of speech recognition and speech processing will realise, the word “phoneme” in the description and claims is not limited to its linguistic meaning but includes the various sub-word units that are identified and used in standard speech recognition systems, such as phonemes, syllables, Katakana (Japanese alphabet) etc.
Lattice Generation
In the above description, generation of the phoneme and word lattice data structure shown in FIG. 4 b was described with reference to FIG. 3. A preferred form of that data structure, including a preferred division of the nodes into blocks, will now be described with reference to FIGS. 15 to 17. Thereafter, one way of generating the preferred data structure will be described with reference to FIGS. 18 to 22.
FIG. 15 a shows the timing of each node of the lattice relative to a common zero time, which in the present example is set such that the first node occurs at a time of 0.10 seconds. It is noted that FIG. 15 a is merely schematic and as such the time axis is not represented linearly.
In the present embodiment, the nodes are divided into three blocks as shown in FIG. 15 b. In the present embodiment, demarcation of the nodes into blocks is implemented by block markers or flags 202, 204, 206 and 208. Block markers 204, 206 and 208 are located immediately after the last node of a block, but are shown slightly spaced therefrom in FIG. 15 b for the sake of clarity of the illustration. Block marker 204 marks the end of block 0 and the start of block 1, similarly block marker 206 marks the end of block 1 and the start of block 2. Block marker 208 is at the end of the lattice and hence only indicates the end of block 2. Block marker 202 is implemented at time t=0.00 seconds in order to provide the demarcation of the start of block 0. In the present embodiment, block 0 has five nodes, block 1 also has five nodes and block 2 has seven nodes.
The time of each node is provided relative to the time of the start of its respective block. This does not affect the timings of the nodes in block 0. However, for the further blocks the new off-set timings are different from each node's absolute timing as per FIG. 15 a. In the present embodiment the start time for each of the blocks other than block 0 is taken to be the time of the last node of the preceding block. For example, in FIG. 15 a it can be seen that the node between the phonemes /ih/ and /z/ occurs at 0.71 seconds, and is the last node of block 0. From FIG. 15 a it can be seen that the next node, i.e. that between the phoneme /z/ and the phoneme /dh/ occurs at a time of 0.94 seconds, which is 0.23 seconds after the time of 0.71 seconds. Consequently, as can be seen in FIG. 15 b, the off-set time of the first node of block 1 is 0.23 seconds.
The use of time off-sets determined relative to the start of each block rather than from the start of the whole lattice provides advantages with respect to dynamic range as follows. As the total time of a lattice increases, the dynamic range of the data type used to record the timing values in the lattice structure will need to increase accordingly, which will consume large amounts of memory. This will become exacerbated when the lattice structure is being provided for a data file of unknown length, for example if a common lattice structure is desired to be usable for annotating either a one minute television commercial or a film or television programme lasting a number of hours. In contrast, the dynamic range of the corresponding data type for the lattice structure divided into blocks is significantly reduced by only needing to accommodate a maximum expected time off-set of a single block, and moreover this remains the same irrespective of the total duration of the data file. In the present embodiment the data type employed provides integer values where each value of the integer represents the off-set time measured in hundredths of a second.
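A small sketch of this off-set encoding is given below; the function name and layout are assumptions, used only to show how absolute node times become the integer hundredth-of-a-second off-sets held within each block.

```python
def block_relative_offsets(node_times, block_start_time):
    """Convert the absolute times (seconds) of the nodes of one block into
    integer off-sets, in hundredths of a second, from the start of the block.
    For every block after the first, block_start_time is the absolute time of
    the last node of the preceding block."""
    return [round((t - block_start_time) * 100) for t in node_times]


# e.g. the first node of block 1, at an absolute 0.94 s in a block whose start
# time is 0.71 s (the last node of block 0), is stored as 23, i.e. 0.23 s
assert block_relative_offsets([0.94], 0.71) == [23]
```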
FIG. 15 b also shows certain parts of the lattice structure identified as alpha (α) and beta (β). The significance of these items will be explained later.
The format in which the data is held for each respective node in the preferred form of the phoneme and word lattice data structure will now be explained with reference to FIG. 16 a, which shows by way of example the format of the data for the first node of the lattice. The data for this particular node is in the form of seven data components 210, 212, 214, 216, 218, 220 and 222.
The first data component 210 specifies the time off-set of the node from the start of the block. In the present example, the value is 0.10 seconds, and is implemented by means of the integer data type described earlier above.
The second data component 212 represents the word link “NOW”, which is shown in FIGS. 15 a and 15 b extending from the first node. The third data component 214 specifies the nodal off-set of the preceding link, i.e. the word link “NOW”, which is the number of nodes over which that link extends. Referring to FIGS. 15 a and 15 b, it can be seen that the node to which the word link “NOW” extends is the third node along from the node from which the link extends, hence the nodal off-set is 3, as represented illustratively in FIG. 16 a by the value 003. In the present embodiment the data type employed to implement the nodal off-set values is again one providing integer values.
The fourth data component 216 represents the phoneme /n/ which extends from the first node to the second node, and therefore has a nodal off-set of one, giving the value 001 for the fifth data component 218 as shown in FIG. 16 a. Similarly, the sixth data component 220 represents the phoneme link /m/, and the seventh data component 222 gives the nodal off-set of that link, which is equal to 1 and is represented as 001.
The data components 212, 216 and 220 can represent the respective word or phoneme associated with their link in any appropriate manner. In the present embodiment the data components 212, 216 and 220 consist of an integer value which corresponds to a word index entry value (in the case of a word link) or a phoneme index entry value (in the case of a phoneme link). The index entry value serves to identify an entry in a corresponding word or phoneme index containing a list of words or phonemes as appropriate. In the present embodiment the corresponding word or phoneme index is held in the header part of the annotation data 31-3 described earlier. In other embodiments the header may itself contain only a cross-reference to a separate database storing one or more word or phoneme indices.
Generally, the different links corresponding to a given node can be placed in the data format of FIG. 16 a in any desired relative order. In the present embodiment, however, a preferred order is employed in which the word or phoneme link with the largest nodal off-set, i.e. the “longest” link, is placed first in the sequence. Thus, in the present case, the “longest” link is the word link “NOW” with a nodal off-set of three nodes, and it is therefore placed before the “shorter” phoneme links /n/ and /m/ which each only have a nodal off-set of 1. Advantages of this preferred arrangement will be explained later below.
The data for each node, in the form shown in FIG. 16 a, is arranged in a time ordered sequence to form a data stream defining the whole lattice (except for the header). The data stream for the lattice shown in FIG. 15 b is shown in FIG. 16 b. As shown, the data stream additionally includes data components 225 to 241 serving as node flags to identify that the data components following them refer to the next respective node. The data stream also includes further data components 244, 246, 248 and 250 implementing respectively the block markers 202, 204, 206 and 208 described earlier above with respect to FIG. 15 b.
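As an illustration only (not part of the described embodiment), the data components of FIG. 16 b might be laid out as in the following sketch; the flag values and the word/phoneme index entry values used here are invented for the example.

```python
NODE_FLAG, BLOCK_FLAG = "NODE", "BLOCK"   # illustrative marker values only

def node_components(time_offset, links):
    """links: (index_entry, nodal_offset) pairs, index_entry being an integer that
    identifies a word or phoneme in the index held in the header. The link with
    the largest nodal off-set is emitted first, so that a later search back
    through the stream can stop as soon as the first (longest) link of a node
    is found not to span a newly inserted node."""
    out = [NODE_FLAG, time_offset]
    for index_entry, nodal_offset in sorted(links, key=lambda l: l[1], reverse=True):
        out.extend([index_entry, nodal_offset])
    return out

# First node of the example lattice: the word link "NOW" (off-set 3) precedes the
# phoneme links /n/ and /m/ (off-set 1 each); the index values 57, 12 and 13 are made up.
print([BLOCK_FLAG] + node_components(10, [(12, 1), (57, 3), (13, 1)]))
# -> ['BLOCK', 'NODE', 10, 57, 3, 12, 1, 13, 1]
```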
Earlier, with reference to FIG. 4 b, a first advantage of the block arrangement of the present lattice data structure was described, namely that it allows the search to jump into the middle of the annotation data for a given audio data stream. For this reason the header, also described with reference to FIG. 4 b, includes a time index which associates the location in memory of each block of annotation data with the time off-set, measured from the start of the data, of the beginning of that block. As described above with respect to FIG. 15 b, the time corresponding to the beginning of a given block is, in the present embodiment, the time of the last node of the block which precedes the given block.
The block arrangement shown in FIG. 15 b however exhibits further characteristics and advantages, which will now be described. The blocks are determined according to the extent to which word or phoneme links are permitted to extend between blocks. For example, in the present embodiment, the block positions implement a criteria that no link may extend into any block other than its directly neighbouring block. Considering the nodes of block 0, for example, it can be seen from FIG. 15 b that the phoneme links /n/, /m/, /oh/, /w/ and /ih/ and the word link “NOW” only extend within the block in which their source nodes are located, which is allowed by the criteria, and the phoneme link /z/ and the word link “IS” each extend from block 0 into block 1, i.e. into the directly neighbouring block, which is also allowed by the criteria. However, no links extend from block 0 into block 2, because such links would extend beyond the directly neighbouring block of block 0 (i.e. block 1) and hence are not allowed by the criteria.
By virtue of the blocks being implemented so as to obey the above described criteria, the following advantages are achieved. If further data is later to be inserted into the phoneme and word lattice structure, this may involve the insertion of one or more additional nodes. In this event, any existing link “passing over” a newly inserted node will require its nodal off-set to be increased by one, as the newly inserted node will need to be included in the count of the number of nodes over which the existing link extends. For example, if a new node were inserted at a time of 0.50 seconds into block 2, then it can be seen from FIG. 15 b that the phoneme link /v/ extending from the node at 0.47 seconds to the node at 0.55 seconds would then acquire a nodal off-set value of 2, rather than its original value of 1, and similarly the word link “OF” extending from the node at 0.34 seconds to the node at 0.55 seconds would have its original nodal off-set value of 2 increased to a nodal off-set of 3. Expressed in terms of the data stream shown in FIG. 16 b, the data component 252 originally showing a value of 001 would need to be changed to a value of 002, and the data component 254 whose original value is 002 would need to have its value changed to 003.
During insertion of such additional nodes and processing of the consequential changes to the nodal off-sets, it is necessary to search back through the lattice data structure from the point of the newly inserted node in order to analyse the earlier existing nodes and determine which of them have links with a nodal off-set sufficiently large to extend beyond the newly inserted node. An advantage of the blocks of the lattice data structure being arranged according to the present criteria is that it reduces the number of earlier existing nodes that need to be analysed. More particularly, it is only necessary to analyse those nodes which precede the inserted node within the same block, plus the nodes in the block directly preceding the block in which the new node has been inserted. For example, if a new node is to be inserted at 0.50 seconds in block 2, it is only necessary to analyse the four existing nodes in block 2 that precede the newly inserted node plus the five nodes of block 1. It is not necessary to search any of the nodes in block 0, in view of the block criteria discussed above.
This advantage becomes increasingly beneficial as the length of the lattice and the number of blocks increase. Furthermore, the advantage not only applies to the insertion of new nodes into an otherwise complete lattice; it also applies to the ongoing procedure of constructing the lattice, in which nodes are not necessarily inserted in strict time order.
Yet further, it is noted that the particular choice of the criteria to only allow links to extend into a neighbouring block may be varied; for example, the criteria may allow links extending as far as four blocks away, in which case it is only necessary to search back a maximum of four blocks. This still provides a significant advantage in terms of reducing the level of processing required in the case of large lattices, particularly lattices with hundreds or thousands of blocks. As those skilled in the art will appreciate, any appropriate number of blocks can be chosen as the limit in the criteria, it merely being necessary to commensurately adapt the number of blocks that are searched back through.
The lattice data structure of the present embodiment contains a further preferred refinement which is also related to the extension of the word or phoneme links into neighbouring blocks. In particular the lattice data structure further includes data specifying two characteristic points of each block. The two characteristic points for each block are shown as alpha (α) and beta (β) in FIG. 15 b.
Beta for a given block is defined as the time of the latest node in the given block to which any link originating from the previous block extends. Thus, in the case of block 1, beta is at the first node in the block (i.e. the node to which the phoneme link /z/ and the word link “IS” extend), since there are no links originating in block 0 that extend further than the first node of block 1. In the case of block 2, beta is at the third node, since the word link “WINTER” extends to that node from block 1. In the case of the first block of the lattice structure i.e. block zero, there are intrinsically no links extending into that block. Therefore, beta for this block is defined as occurring before the start of the lattice.
Alpha for a given block is defined as the time of the earliest node in the given block from which a link extends into the next block. In the case of block 0, two links extend into block 1, namely word link “IS” and the phoneme link /z/. Of these, the node from which the word link “IS” extends is earlier in block 0 than the node from which the phoneme link /z/ extends, hence alpha is at the node from which the word link “IS” extends. Similarly, alpha for block 1 is located at the node where the word link “WINTER” originates from. In the case of the last block of the lattice, in this case block 2, there are intrinsically no links extending into any further block, hence alpha is specially defined as being at the last node in the block. Thus it can be appreciated that conceptually beta represents the latest point in a block before which there are nodes which interact with the previous block, and alpha represents the earliest point in a block after which there are nodes which interact with the next block.
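The following sketch, offered only as an illustration of the definitions just given, determines alpha and beta for each block from a list of nodes; the representation of a node by its block number and the nodal off-sets of its outgoing links is an assumption made for the example.

```python
def alpha_beta(node_blocks, node_links, num_blocks):
    """node_blocks[i]: block number of node i (nodes indexed in time order).
    node_links[i]: nodal off-sets of the links leaving node i.
    Returns (alpha, beta) as lists of node indices per block; None stands for
    "before the start of the lattice" (beta of block 0)."""
    alpha = [None] * num_blocks
    beta = [None] * num_blocks
    for i, offsets in enumerate(node_links):
        b = node_blocks[i]
        for off in offsets:
            dest = i + off
            if node_blocks[dest] == b + 1:              # link reaching into the next block
                if alpha[b] is None:
                    alpha[b] = i                        # earliest such source node in block b
                if beta[b + 1] is None or dest > beta[b + 1]:
                    beta[b + 1] = dest                  # latest node reached in block b+1
    # the last block has no following block, so its alpha is defined as its last node
    alpha[num_blocks - 1] = max(i for i, blk in enumerate(node_blocks) if blk == num_blocks - 1)
    return alpha, beta

# Toy lattice: four nodes in two blocks; node 1 has links reaching one and two nodes ahead.
print(alpha_beta([0, 0, 1, 1], [[1], [2, 1], [1], []], 2))   # -> ([1, 3], [None, 3])
```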
As those skilled in the art will appreciate, each alpha and beta can be specified by identification of a particular node or by specification in terms of time. In the present embodiment identification by node is used. The data specifying alpha and beta within the lattice data structure can be stored in a number of different ways. For example, data components of the type shown in FIG. 16 b can be included containing flags or markers at the relevant locations within the data stream. However, in the present embodiment the points are specified by storing the identities of the respective nodes in a look-up table in the header part of the lattice data structure.
The specification of alpha and beta for each block firstly provides certain advantages with respect to analysing the nodal off-sets of previous nodes in a lattice when a new node is inserted. In particular, when a new node is inserted at a location after beta in a given block, it follows that it is only necessary to analyse the preceding nodes in the given block, and it is no longer necessary to analyse the nodes in the block preceding the given block. This is because it is already known that by virtue of the new inserted node being after beta within the given block, there can by definition be no links that extend from the previous block beyond the newly inserted node, since the position of beta defines the greatest extent which any links extend from the previous block. Thus the need to search and analyse any of the nodes of the preceding block has been avoided, which becomes particularly advantageous as the average size of blocks increases. If alternatively a new node is inserted into a given block at a location before beta of the given block, then it is now necessary to consider links originating from the preceding block as well, but only those nodes at or after alpha in the preceding block. This is due to the fact that from the definition of alpha, it is already known that none of the nodes in the preceding block that come before the preceding block's alpha have links which extend into the given block. Thus processing is again reduced, and the reduction will again become more marked as the size of individual blocks is increased. Moreover, the position of alpha in any given block will tend to be towards the end of that block, so that in the case of long blocks the majority of the processing resource that would otherwise have been used analysing the whole of the preceding block is saved.
The specification of alpha and beta for each block secondly provides certain advantages when re-defining blocks within an existing lattice so as to provide smaller or more evenly arranged blocks whilst maintaining compliance with the earlier mentioned criteria that no link may extend further than one block. In these procedures, existing blocks are essentially split according to the relative positions of alpha and beta within an existing block. In one approach, provided alpha occurs after beta within a given block, the given block can be divided into two blocks by splitting it somewhere between beta and alpha. Similarly, the data specifying beta and alpha is advantageously employed to determine when existing blocks can be split into smaller blocks in the course of a preferred procedure for constructing the lattice data structure.
It was mentioned earlier above that in the present embodiment the longest link from a given node is positioned first in the sequence of data components for any given node, as shown in FIG. 16 a. This is advantageous during the procedure of inserting a new node into the lattice data structure, wherein previous nodes must be analysed to determine whether any links originate from them that extend beyond the newly inserted node. Because the longest link extending from any given node is always placed at a particular place in the sequence of data components for that node, in the present case at the earliest place within the sequence, if that link is found not to extend over the newly inserted node then it is not necessary to analyse any of the remaining links in the sequence of data components for that node, since they will by definition be of shorter span than the already analysed longest link. Hence further processing economy is achieved.
A preferred method of generating the above described lattice data structure will now be described with reference to FIGS. 17 to 19. In this preferred method the constituent data is organised into sets of data components, and the sets of data components are added one at a time to the lattice structure as it is built up. Each set of data components consists of one of the following:
  • (i) two new nodes plus any links directly therebetween (in the case of adding nodes to the lattice which are not to be connected to nodes already in the lattice); or
  • (ii) a new node plus each of the links that end at that node; or
  • (iii) a link between existing nodes within the lattice.
FIG. 17 is a flow diagram which illustrates the process steps employed in the preferred method. In the following explanation of the process steps of FIG. 17, the application of the steps to the construction of the lattice of FIG. 15 b will be demonstrated, and will thus serve to show how the method operates when applied to input data in which the nodes are already fully time sequentially ordered. Thereafter, the way in which the process steps are applied (be it to the construction of a new lattice or to the alteration of an existing lattice) when additional nodes are to be inserted into an existing time ordered sequence of nodes will be described by describing various different additions of data to the lattice data structure of FIG. 15 b.
In overview, as each set of data components is added to the lattice, the various ends of blocks, alphas and betas are updated. When the number of nodes in a block reaches a critical value, in this example nine, the locations of alpha and beta are analysed and, if suitable, the block is split into two smaller blocks. The various alphas and betas are again updated, and the process then continues in the same manner with the addition of further data components.
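The overall control flow just summarised can be pictured as in the following sketch. It is purely illustrative: the lattice object and every method named in it are hypothetical stand-ins for the operations described in the text, not an actual interface of the embodiment.

```python
CRITICAL_NODES = 9   # minimum block size before a split is considered (step S81)

def build_lattice(lattice, sets_of_components):
    lattice.start_first_block()                               # step S61
    for component_set in sets_of_components:                  # steps S63/S91: one set at a time
        lattice.insert_set(component_set)                     # step S65
        if component_set.has_new_nodes():                     # step S67
            if lattice.new_nodes_at_end(component_set):       # step S69
                lattice.update_block_end()                    # step S71
            else:
                lattice.adjust_nodal_offsets(component_set)   # step S73 (FIG. 22)
        lattice.recompute_alpha_beta()                        # step S75
        while not lattice.alpha_beta_valid():                 # step S79
            lattice.merge_invalid_block()                     # step S77: merge with following block
            lattice.recompute_alpha_beta()                    # back to step S75
        blk = lattice.block_just_modified()
        if (lattice.num_nodes(blk) >= CRITICAL_NODES          # step S81
                and lattice.alpha_after_beta(blk)):           # step S83
            lattice.split_block(blk)                          # step S85
            lattice.recompute_alpha_beta()                    # step S87
    return lattice                                            # step S89: no more sets to add
```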
The process steps laid out in FIG. 17 will now be explained in detail. Reference will also be made to FIGS. 18 a to 18 h which show the build up of the lattice structure in the graphical representation form of FIG. 15 b. Additional reference will be made to FIGS. 19 a to 19 h which show the progress of the construction of the data stream defining the lattice, corresponding to the form of FIG. 16 b.
Referring to FIG. 17, at step S61 the automatic speech recognition unit 33 defines the start of the first block, i.e. block zero. In FIG. 18 a the block marker defining the start of the first block is indicated by reference number 202. This is implemented in the data stream by insertion of data component 244 (see FIG. 19 a) consisting of a block flag.
At step S63 the automatic speech recognition unit 33 sets an incremental counter n equal to 1.
At step S65 the automatic speech recognition unit 33 inserts the first set of data components into the data stream defining the lattice data structure. More particularly, the automatic speech recognition unit 33 collects the data corresponding to the first two nodes of the lattice and any direct phoneme links therebetween (in this case phoneme links /n/ and /m/). It then additionally collects any words that have been identified by the word decoder 37 as being associated with a link between these two nodes, although in the case of the first two nodes no such word has been identified. It then inserts the corresponding data components into the data stream. In particular, referring again to FIG. 19 a, data 260 defining the first node of the lattice structure, and being made up of a data component consisting of a node flag and a data component indicating the time of the node, is inserted. Thereafter data 262 comprising the data component consisting of the phoneme link /n/ and the nodal off-set value of 001 is inserted, followed by data 264 comprising a data component consisting of the phoneme /m/ and nodal off-set value 001. Finally, data 266 comprising the data component consisting of a node flag and the data component consisting of the time of that second node is inserted. Thus all of the component parts 260, 262, 264, 266 of the first set of data components are inserted. The first two nodes and the phoneme links /n/ and /m/ therebetween can be seen in FIG. 18 a also.
At step S67 the automatic speech recognition unit 33 determines whether any new nodes have been included in the newly inserted set of data components. The answer in the present case is yes, so the process moves on to step S69 where the automatic speech recognition unit determines whether any of the new nodes are now positioned at the end of the current data lattice structure. The answer in the present case is again yes. In fact, when the method shown in the flow chart of FIG. 17 is used to construct a data lattice from data in which the nodes are ordered in a time sequential manner, as in the present case, the answers to the determination steps S67 and S69 will inherently always be positive. These determination steps are only included in the flow chart to illustrate that the process is capable of accommodating additional nodes or links to be inserted within the lattice when required (examples of these cases will be given later below).
In the present case, the process then moves on to step S71, where the automatic speech recognition unit 33 defines the end of the last block to be immediately after the newly inserted node which is at the end of the lattice. At this stage of the procedure there is only one block, hence in defining the end of the last block, the end of the sole block is in fact defined. This newly defined current end of the block is shown as item 203 in FIG. 18 a, and is implemented in the data stream as data component 245 consisting of a block flag, as shown in FIG. 19 a.
At step S75 the automatic speech recognition unit 33 then determines all of the alpha and beta points. At the present stage there is only one block, so only one alpha and one beta are determined. The procedure for determining alpha and beta in the first block was described earlier above. The resulting positions are shown in FIG. 18 a. With respect to the data stream, the alpha and beta positions are entered into the header data, as was described earlier above.
At step S79 the automatic speech recognition unit 33 determines whether any of the alpha and beta values are “invalid”, in the sense of being either indeterminate or positioned such as to contravene the earlier described criteria that no link may extend further than into a directly neighbouring block. At the present stage of building up the lattice this determination step obviously determines that there is no such invalidity, and hence the process moves to step S81. At step S81 the automatic speech recognition unit determines whether the number of nodes in any block that has just had nodes inserted in it has reached or exceeded a predetermined critical number. The predetermined critical number is set for the purpose of defining a minimum number of nodes that must be in a block before the block structure will be analysed or altered for the purposes of giving smaller block sizes or more even block spacings. There is an effective overhead cost in terms of the resources required for carrying out block division, storing the block flag data, and so on. Hence block division for blocks containing fewer than the critical number of nodes would tend to be counter-productive. The choice of the value of the critical number will depend on the particular characteristics of the lattice or data file being considered. As mentioned above, in the present embodiment the number is set at nine. Hence at the present stage of the process, where only two nodes have been inserted in total, the answer to the determination step S81 is no.
The process steps are thus completed for the first set of data components to be inserted, and the current form of the lattice and data stream is shown in FIGS. 18 a and 19 a.
The procedure then moves to step S89, where the automatic speech recognition unit determines that more sets of data components are to be added, and hence at step S91 increments the value of n by one, and the process steps beginning at step S65 are repeated for the next set of data components. In the present case the next set of data components consists of data (item 270 in FIG. 19 b) specifying the third node of the lattice and its time of 0.41 seconds, and data (item 268 in FIG. 19 b) specifying the phoneme link /oh/ plus its nodal off-set value of 001. The phoneme link /oh/ and the third node are also shown having been inserted in FIG. 18 b. At step S71, the end 203 of the block, being defined as after the last node, is now positioned as shown in FIG. 18 b, and is implemented in the data stream by the data component 245, consisting of a block flag, now being positioned after the newly inserted data 268 and 270. The new position of alpha, now at the new end node, as determined at step S75, is shown in FIG. 18 b. At step S79 it is again determined that there is no invalid alpha or beta, and because the number of nodes is only three (i.e. fewer than nine) processing of this latest set of data components is now complete, so that the lattice and data stream are currently as shown in FIGS. 18 b and 19 b.
As the procedure continues, the fourth node and the two links which end at that node, namely the phoneme link /w/ and the word link “NOW”, representing the next set of data components, are inserted. The process steps from S65 onwards are followed as described for the previous sets of data components, resulting in the lattice structure shown in FIG. 18 c and the data stream shown in FIG. 19 c. It can be seen in FIG. 19 c that the data 272 corresponding to the phoneme link /w/ and the data 274 corresponding to the latest node are placed just before the last block flag at the end of the data stream, whereas the data 276 corresponding to the word link “NOW” is placed in the data stream with the node from which that link extends, i.e. the first node. Moreover, it is placed before the other links that extend from the first node, namely the phoneme links /n/ and /m/, because their nodal off-set values are 001, which is less than the value of 003 for the word link “NOW”.
The procedure continues as described above without variation for the insertion of the fifth, sixth, seventh and eighth nodes, providing the lattice structure and data stream shown in FIGS. 18 d and 19 d respectively. On the next cycle of the procedure starting at step S65, the set of data components inserted is the ninth node and the phoneme link /w/ ending at that node. Following implementation of the steps S67, S69, S71 and S75 in the same manner as above, the lattice arrangement is as shown in FIG. 18 e-1, with the end 203 of the block located after the newly inserted ninth node, and alpha located at that ninth node. At step S79 the automatic speech recognition unit determines that there is no invalidity of the alpha and beta values and so the process moves on to step S81. The procedure to this point is the same as for the previous sets of data components. However, since this time the newly inserted node brings the total number of nodes in the sole block up to nine, when the automatic speech recognition unit carries out the determination step S81 it determines for the first time that the number of nodes in the block is indeed greater than or equal to nine. Consequently, this time the procedure moves to step S83, where the automatic speech recognition unit determines whether alpha is greater than beta, i.e. whether alpha occurs later in the block than beta. This is determined in the present example to be the case (in fact this will always be the case for the first block of a lattice, due to the way beta is defined for the first block).
It can thus be appreciated that the basic approach of the present method is that when the number of nodes in a block reaches nine or more, the block will be divided into two blocks, provided that alpha is greater than beta. The reason for waiting until a certain number of nodes has been reached is the cost in overhead resource, as was explained earlier above. The reason for the criteria that alpha be greater than beta is to ensure that each of the two blocks formed by the division of an original block will obey the earlier described criteria that no link is permitted to extend into any block beyond a directly neighbouring block.
Therefore, in the present case, the procedure moves to step S85, in which the automatic speech recognition unit splits the sole block of FIG. 18 e-1 into two blocks. This is carried out by defining a new end of block 205 which is positioned according to any desired criteria specifying a position somewhere between beta and alpha. In the present embodiment the criteria is to insert the new end of block equally spaced (in terms of the number of nodes, rounded up where necessary) between beta and alpha. Thus, the block is split by insertion of a new end of block 205 immediately after the fifth node, as shown in FIG. 18 e-2. This is implemented in the data stream by the insertion of data component 298, consisting of a block flag, as shown in FIG. 19 e. Additionally, the automatic speech recognition unit 33 recalculates the times of all of the nodes in the newly formed second block as off-sets from the start time of that block, which is the time of the fifth node of the whole lattice (0.71 seconds). Hence the resulting data stream, shown in FIG. 19 e, now contains the newly inserted data component 298, newly inserted data 300 relating to the phoneme link /w/ and newly inserted data 302 relating to the end node. Moreover, the data components 304, 306, 308 and 310 have had their time values changed to new off-set values.
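By way of illustration of this splitting step, the following sketch divides a block half way (in node count, rounded up) between beta and alpha and re-expresses the times of the nodes falling into the new block as off-sets from its start; the off-set values beyond those quoted in the text are invented, and beta is given as -1 to stand for "before the block".

```python
def split_block(offsets, beta_pos, alpha_pos):
    """offsets: off-set times (hundredths of a second) of the nodes of one block.
    beta_pos, alpha_pos: node positions of beta and alpha within that block
    (-1 meaning beta lies before the block). Returns the two resulting blocks."""
    assert alpha_pos > beta_pos, "a block is only split when alpha is later than beta"
    split_after = beta_pos + (alpha_pos - beta_pos + 1) // 2   # half way, rounded up
    first = offsets[:split_after + 1]
    new_start = offsets[split_after]            # last node of the first block starts the second
    second = [t - new_start for t in offsets[split_after + 1:]]
    return first, second

# Nine nodes, beta before the block, alpha at the ninth node: the split falls after the
# fifth node, as in FIG. 18 e-2 (values after 71 are made up for the example).
print(split_block([10, 23, 41, 55, 71, 79, 86, 94, 102], beta_pos=-1, alpha_pos=8))
# -> ([10, 23, 41, 55, 71], [8, 15, 23, 31])
```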
At step S87 updated values of alpha and beta are determined by the automatic speech recognition unit. Given there are now two blocks, there are two betas and two alphas to be determined. The new locations of these alphas and betas are shown in FIG. 18 e-2.
The procedure of FIG. 17 thereafter continues as described above for the insertion of the tenth through to thirteenth node of the overall lattice without the critical number of 9 nodes yet being reached in block 1. This provides the lattice structure and data stream shown in FIGS. 18 f and 19 f respectively.
The next set of data components inserted consists of the fourteenth node and the phoneme link /oh/ ending at that node. The situation after steps S65 to S79 are implemented for this set of data components is shown in FIG. 18 g-1. Insertion of this latest set of data components has brought the number of nodes in the second block up to nine, and alpha is after beta. Consequently, the automatic speech recognition unit 33 carries out step S85 in which it inserts a new end of block 207 immediately after the fifth node of the block to be split, as shown in FIG. 18 g-2. This is implemented in the data stream by insertion of data component 330 consisting of a new block flag, as shown in FIG. 19 g. The automatic speech recognition unit 33 also calculates the adjusted off-set times (334,336,338,340 in FIG. 19 g) of the nodes in the newly formed third block. Thereafter, at step S87, the automatic speech recognition unit determines updated values of the alphas and betas, which provides a new alpha for what is now the second block and a new beta for what is now the third block, both of which are also shown in FIG. 18 g-2.
The procedure shown in FIG. 17 is repeated for the remaining three sets of data components yet to be added, so providing the lattice structure and data stream shown in FIGS. 18 h and 19 h.
At this stage, the automatic speech recognition unit 33 determines at step S89 that no more sets of data components are available to be inserted, and hence the current lattice data structure is complete, and indeed corresponds to the lattice shown in FIGS. 15 b and 16 b.
An example will now be given to demonstrate the merging of two blocks due to the later insertion of a long link that extends beyond a neighbouring block. This situation did not arise in the earlier example because the data was added into the lattice on a fully time ordered sequential basis. In contrast, in the following example, after the lattice of FIG. 15 b has reached the stage described so far, an additional link is required to be inserted between certain existing nodes. There are a number of reasons why this might occur. One possibility is that the lattice has been completed earlier, then employed as annotation data, but at a later date needs revision. Another possibility is that all the phoneme data is processed first, followed by all the word data, or vice-versa. Yet another possibility is that the data from different soundtracks, e.g. different speakers, is separately added to provide a single lattice.
However, in the present example, the insertion of the earlier timed link is essentially part of the original ongoing construction of the lattice, although the data component consisting of the additional link is processed separately at the end because it constitutes a word recognised by the automatic speech recognition unit 33 when passing the phoneme data through a second speech recognition vocabulary. In the present example, the second vocabulary consists of a specialised place-name vocabulary that has been optionally selected by a user. Hence, in the present example, at step S89 it is determined that a further set of data components is to be inserted, and following incrementing of the value of n at step S91, the data is inserted at step S65. The data consists of the word link “ESTONIA” and extends from the fourth node of block 0 to the third node of block 2, as shown in FIG. 20 a.
At step S67 the automatic speech recognition unit 33 recognises that no new node has been inserted, hence the process moves to step S75 where it determines updated locations of alpha and beta. However, because the newly inserted link extends from block 0 right over block 1 to end in block 2, it contravenes the earlier described criteria barring link extensions beyond directly neighbouring blocks, and moreover does not produce a valid alpha or beta for block 1. This is represented in FIG. 20 a by the indication that any alpha for block 1 would in fact need to appear in block 0, and any beta for block 1 would need to appear in block 2. Consequently, at the next step S79, it is determined that alpha and beta are indeed invalid.
The procedure therefore moves to step S77, which consists of merging blocks. Any suitable criteria can be used to choose which blocks should be merged together; for example, the criteria can be based on providing the most evenly spaced blocks, or on merging the offending block with its preceding block. However, in the present embodiment the choice is always to merge the offending block with its following block, i.e. in the present example block 1 will be merged with block 2.
This is implemented by removal of the block marker dividing block 1 from block 2, resulting in two blocks only, as shown in FIG. 20 b. The procedure then returns to step S75, where the alphas and betas are determined again. The resulting positions of alpha and beta are shown in FIG. 20 b.
At step S79 the automatic speech recognition unit 33 determines that alpha and beta are now valid, so the procedure moves to step S81. In the present example, because there are now twelve nodes in block 1 and because alpha is greater than beta, the procedure moves to step S85 and block 1 is split using the same procedure as described earlier above. However, the earlier employed criteria specifying where to locate the new block division, namely half way in terms of nodes between beta and alpha, contains in the present example a refinement that when the block to be split has greater than nine nodes, splitting should, where possible, leave the earlier of the two resulting blocks with no more than eight nodes. This is to avoid inefficient repetitions of the block splitting process. Hence in the present example the new block marker is inserted immediately after the eighth node of the block being split, as shown in FIG. 20 c. At step S87 the alphas and betas are again determined, the new positions being shown in FIG. 20 c. It is noted that alpha and beta both occur at the same node of block 1. In the present example it is determined at step S89 that no more sets of data components are to be added, and hence the procedure is completed.
In the above procedure described with reference to FIGS. 20 a to 20 c, the changes to the lattice are implemented by changes to the data stream of FIG. 16 b in corresponding fashion to the earlier examples. In particular, step S77 of merging the two blocks is implemented by removal of the relevant data component 248 containing the original block flag dividing the original blocks 1 and 2.
A further example demonstrating the processing of data according to the procedure laid out in the flow chart of FIG. 17 will now be described with reference to FIGS. 21 a to 21 d. In this example, additional data components are added immediately after the seventeenth node has been added to the lattice of FIG. 15 b. Therefore at step S89 of FIG. 17 further components are indeed to be added, and the procedure returns again via increment step S91 to insertion step S65. However, the method steps employed to add the additional data components in the following example also constitute a stand-alone method of updating or revising any suitable original lattice, irrespective of how the original lattice itself was formed.
In this further example, additional data is added via a keyboard and a phonetic transcription unit, of the same form as the keyboard 3 and phonetic transcription unit 75 shown in FIG. 9. In this further example the output of the phonetic transcription unit is connected to the automatic speech recognition unit 33. The user uses this arrangement to enter annotation data which he intends to correspond to a specific portion of the video data 31-1. Such data is sometimes referred to in the art as “metadata”. The specific portion of the video data may show, for example, a number of profile shots of an actor, which the user wishes to be able to locate/retrieve at a later date as desired by using the annotation data. Hence, he enters the words “PROFILE A B C D E” and moreover specifies that only word links, not phoneme links, should be transcribed. This provides the following data components to be added:
  • (i) a first new node, a second new node, and a word link “PROFILE” therebetween;
  • (ii) a third new node, and the word link “A” between the new second and third nodes;
  • (iii) a fourth new node, and the word link “B” between the new third and fourth nodes;
  • (iv) a fifth new node and the word link “C” between the new fourth and fifth nodes;
  • (v) a sixth new node and the word link “D” between the new fifth and sixth nodes; and
  • (vi) a seventh new node and the word link “E” between the new sixth and seventh nodes.
Referring again to FIG. 17, at step S65 data component (i) as described above is inserted by the automatic speech recognition unit 33 into the lattice of FIG. 15 b, in the position shown in FIG. 21 a. At step S67, the automatic speech recognition unit 33 determines that new nodes have been inserted. At step S69 the automatic speech recognition unit determines that neither of the new nodes has been inserted at either the start or the end of the lattice. In other words, the new nodes have been inserted within an existing lattice, and hence it will probably be necessary to adjust the nodal off-sets of one or more existing nodes of the lattice. The procedure therefore moves to step S73, in which the automatic speech recognition unit 33 carries out such necessary adjustment of the nodal off-sets of existing nodes. Any appropriate method of adjusting the off-sets can be employed at step S73. In the present embodiment a preferred method is employed, and this will be described in detail later below with reference to the flow chart of FIG. 22.
Following adjustment of the off-sets, the procedure of FIG. 17 is followed in the manner described above for the earlier examples, returning to step S65 for insertion of data component (ii). The procedure described above with respect to data component (i) is then repeated for data components (ii) and (iii). FIG. 21 b shows the stage reached when data components (i), (ii) and (iii) have been inserted and the procedure has reached step S81. At this stage, for the first time during this insertion of additional data components, it is determined that the number of nodes in the second block equals nine. Hence, following the determination at step S83 that alpha is greater than beta, at step S85 the automatic speech recognition unit 33 splits the block and at step S87 determines the new alphas and betas, resulting in the new block structure shown in FIG. 21 c. It is noted that the criteria employed for locating the new block end is one in which the size of the newly formed second block is made as large as possible, except that placing the end of the block at alpha itself is not allowed.
The procedure then continues in the same fashion, resulting in the insertion of data components (iv), (v) and (vi), up to reaching step S81 during processing of data component (vi). At this stage, the lattice is of the form shown in FIG. 21 d, i.e. nine nodes are now located in the present block 2, and hence the outcome of step S81 is that the procedure again moves to step S83. It is noted that the present example has thrown up a situation in the present block 2 where beta occurs after alpha; in other words, the longest link extending into block 2 extends beyond the node from which the earliest link exits block 2, as can be seen in FIG. 21 d. If block 2 were to be split in such circumstances, this would inherently involve forming a new block that contravenes the basic criteria of the present embodiment that no link is allowed to extend into any block other than its directly neighbouring block. Because of this, the method of FIG. 17 does not allow splitting of block 2 despite it having nine nodes, and this is implemented by the outcome of determination step S83 being that alpha is not greater than beta, leading to the procedure moving directly on to step S89. In the present example it is determined at step S89 that no more sets of data components are to be added, and hence the procedure ends.
The above-mentioned preferred procedure for implementing step S73 of adjusting the off-sets will now be described with reference to the flow chart of FIG. 22, which shows the procedure followed for each newly inserted node. The preferred method uses the fact that the location of alpha and beta in each block is known. The automatic speech recognition unit 33 analyses nodes preceding the newly inserted node, to determine any links that extend from those nodes beyond the location of the newly inserted node. If any such link is found, its nodal off-set value is increased by one, to account for the fact that the newly inserted node now lies under its span. If the newly inserted node is positioned after beta within a given block, then only those nodes before the newly inserted node and within the same given block need be analysed, since there are inherently no links extending from the previous block beyond beta. Alternatively, if a newly inserted node is positioned before beta in the given block, then the nodes before the newly inserted node in that given block need to be analysed, plus the nodes in the preceding block, but only so far back as to include the node corresponding to alpha. The nodes positioned before alpha of the preceding block do not need to be analysed because inherently there are no links extending from before alpha into the block in which the new node has been inserted.
The above procedure is implemented by the process steps shown in FIG. 22. At step S101 the automatic speech recognition unit 33 sets an increment counter to the value i=1. The increment counter is used to control repeated application, as required, of the procedure to consecutive earlier nodes on a node-by-node basis. At step S103 the node which is positioned one place before the inserted node is identified. Referring to FIG. 21 a, in the case of the newly inserted node from which the word link “PROFILE” extends, the identified node one position before it is the node from which the word link “THE” extends. At step S105, all the links extending from the identified node are identified, being here the word link “THE” and the phoneme link /dh/. The automatic speech recognition unit 33 determines the nodal off-set value of these links, which is 002 for the word link “THE” and 001 for the phoneme link /dh/, and hence at step S107 increases each of these nodal off-set values by one, to the new values of 003 and 002 respectively. At step S109 it is determined whether the newly inserted node was positioned before beta. In the present case it was actually positioned after, hence analysis of the nodes need only continue back to the first node of the present block, and hence at step S111 it is determined whether the currently identified node, i.e. the node that has just had its nodal off-sets changed, is the first node of the present block. In the present case it is, and since no further nodes need to have their off-sets adjusted, the procedure ends. If, however, further nodes remained to be processed in the present block, then the procedure would continue to step S113 where the value of i is incremented, and then the procedure would be repeated for the next previous node starting from step S103. Also, if in the above example the newly inserted node was in fact located before beta, then the procedure would be continued on until each node up to the node corresponding to alpha in the preceding block had been processed. In order to achieve this, when the inserted node is indeed before beta then the procedure moves to step S115 where the automatic speech recognition unit determines whether the identified node is at the position of alpha of the preceding block. If it is then the procedure is complete. If it is not, then the procedure moves to step S117 where the value of i is incremented, and then the procedure is repeated from step S103.
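The walk just described might be sketched as follows; the representation of the lattice as a flat list of per-node link off-sets, and the parameter names, are assumptions made purely for illustration (the sketch also omits the further economy, noted earlier, of stopping at a node as soon as its first, longest link is found not to span the new node).

```python
def adjust_offsets(node_links, new_node, block_start, beta, prev_alpha):
    """node_links[i]: mutable list of nodal off-sets of the links leaving node i
    (global node indices, the new node already inserted at index new_node).
    block_start: index of the first node of the block containing new_node.
    beta: index of that block's beta node; prev_alpha: index of the preceding block's alpha."""
    stop = block_start if new_node > beta else prev_alpha   # steps S109/S111/S115
    i = new_node - 1                                        # step S103: node one place before
    while i >= stop:
        # steps S105/S107: any link from node i spanning the new node grows by one
        node_links[i] = [off + 1 if i + off >= new_node else off for off in node_links[i]]
        i -= 1                                              # steps S113/S117

# Toy block (global indices 0-3) in which a node has just been inserted at index 2:
links = [[2, 1], [1], [1], []]
adjust_offsets(links, new_node=2, block_start=0, beta=0, prev_alpha=0)
print(links)   # -> [[3, 1], [2], [1], []]
```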
An alternative way of splitting a block will now be described. When the number of nodes in a given block has reached the critical number and alpha is later than beta for the given block, then the given block and the preceding block are adjusted to form three new blocks in place of those two blocks. This procedure will now be described more fully with reference to FIGS. 23 a and 23 b.
FIG. 23 a shows a sequence of nodes within a lattice, linked by phoneme links, for example phoneme link 412, the end part of a word link 414, and a further word link 416. The nodes are divided into blocks by block markers 402, 404 and 406, forming blocks n and (n+1) of the lattice.
The positions of alpha and beta for block n and block (n+1) respectively are also shown. FIG. 23 a shows the state of the lattice after the data represented by phoneme link 413 and the two nodes between which it extends has been inserted. The number of nodes in block (n+1) has now reached nine, and since alpha is also later than beta, block rearrangement is now implemented. The two blocks of FIG. 23 a are replaced by three blocks, namely block n, block (n+1) and block (n+2), as shown in FIG. 23 b. This is implemented by deleting the block divider 404, and replacing it with two new block dividers 408 and 410 placed immediately after beta of block n and beta of block (n+1) respectively. Alpha and beta for each block are thereafter re-calculated and the new positions are shown in FIG. 23 b. This procedure for rearranging the blocks provides particularly evenly spaced blocks. This is particularly the case when a given block has the required number of nodes for splitting and its alpha is after beta, yet in the block preceding it beta is positioned after alpha. It is noted that this was indeed the case in FIG. 23 a. Because of this, in the preferred embodiment, block splitting is carried out by this procedure of forming a new block between the two beta positions when beta is positioned after alpha in the relevant preceding block, but block splitting follows the originally described procedure of dividing the present block between alpha and beta when beta is positioned before alpha in the preceding block.
In an alternative version of the embodiments described in the preceding paragraph, the two new block dividers may be positioned at nodes relatively close, compared to the number of nodes in each block, to the position of beta of block n and beta of block (n+1) respectively, instead of at those two beta positions as such.
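Taking the first of the two variants above, in which the new dividers are placed exactly at the two beta positions, the rearrangement might be sketched as follows; the node labels are arbitrary and serve only to show where the three new blocks begin and end.

```python
def three_way_split(block_n, block_n1, beta_n, beta_n1):
    """block_n, block_n1: the nodes of block n and block (n+1).
    beta_n, beta_n1: positions of beta within each of those blocks.
    The old divider between the blocks is discarded and new dividers are placed
    immediately after each beta, giving three blocks."""
    merged = block_n + block_n1
    first_cut = beta_n + 1
    second_cut = len(block_n) + beta_n1 + 1
    return merged[:first_cut], merged[first_cut:second_cut], merged[second_cut:]

print(three_way_split(list("ABCDEFG"), list("HIJKLMNOP"), beta_n=2, beta_n1=3))
# -> (['A', 'B', 'C'], ['D', 'E', 'F', 'G', 'H', 'I', 'J', 'K'], ['L', 'M', 'N', 'O', 'P'])
```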
In the above embodiments, the timing of each node of the lattice is provided, prior to arrangement in blocks, relative to a common zero time set such that the first node occurs at a time of 0.10 seconds. The start time for the first block is set equal to the common zero time. The start time for each of the other blocks is the time of the last node of the preceding block. However, in an alternative embodiment the timing of each node may be provided in an absolute form, and the block marker demarcating the start of each block may be given a Universal Standard Time (UST) time stamp, corresponding to the absolute time of the first node of that block rounded down to the nearest whole second. The UST time stamp may be implemented as a 4-byte integer representing a count of the number of seconds since 1 January 1970. The times of the nodes in each block are then determined and stored as off-set times relative to the rounded UST time of the start of the block. Because in this embodiment each block time is rounded down to the nearest second, if block durations of less than 1 second were to be permitted, then two or more blocks could be allocated the same time stamp value. Therefore, when UST time stamps are employed, block durations of less than 1 second are not permitted. This is implemented by specifying a predetermined block duration, e.g. 1 second, that a current block must exceed before splitting of the current block is performed. This requirement operates in addition to the earlier described requirement that the current block must contain greater than a predetermined number of nodes before splitting of the current block is performed. Alternatively, shorter block durations may be accommodated by employing a time stamp convention other than UST and then rounding down the block marker times more precisely than the minimum allowed duration of a block.
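A minimal sketch of the UST variant follows, again purely by way of illustration; the packing format and the example times are assumptions, the only fixed points being the 4-byte seconds count since 1 January 1970 and the rounding down of each block marker to a whole second.

```python
import struct

def block_marker_ust(first_node_abs_time):
    """first_node_abs_time: absolute time of the block's first node, in seconds since the epoch.
    Returns the 4-byte time stamp and the rounded-down whole-second value."""
    ust = int(first_node_abs_time)              # rounded down to the nearest whole second
    return struct.pack(">I", ust), ust

def node_offsets_from_ust(ust, node_abs_times):
    """Node times stored as off-sets, in hundredths of a second, from the block's UST stamp."""
    return [round((t - ust) * 100) for t in node_abs_times]

stamp, ust = block_marker_ust(946684800.37)     # hypothetical first-node time
print(len(stamp), node_offsets_from_ust(ust, [946684800.37, 946684800.94]))   # 4 [37, 94]
```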
In the above embodiments the phoneme and word lattice structure was determined and generated by the automatic speech recognition unit 33, configured with the requisite functionality. As will readily be appreciated by those skilled in the art, a standard automatic speech recognition unit can be used instead, in conjunction with a separate lattice creation unit comprising the functionality for determining and generating the above described phoneme and word lattice structure. An embodiment employing a standard automatic speech recognition unit 40, which outputs a sequence of phonemes, is shown in FIG. 24. As was the case for the arrangement shown in earlier FIG. 3, the word decoder 37 identifies words from the phoneme data 35. In the embodiment illustrated in FIG. 24, the identified words are added to the phoneme data to form phoneme and word data 42. This is then passed to a lattice creation unit 44, which determines and generates the above described phoneme and word lattice structure forming the phoneme and word annotation data 31-3. In other embodiments, which include a standard automatic speech recognition unit which only outputs words, a word-to-phoneme dictionary can be used to generate phonemes, and then the words and phonemes are combined and formed into the above described phoneme and word lattice structure by a lattice creation unit (not shown).
In the above embodiments, the phoneme and word data was associated with the links of the lattice. As those skilled in the art will appreciate, the word and/or the phoneme data can be associated with the nodes instead. In this case the data associated with each node would preferably include a start and an end time for each word or phoneme associated therewith.
A technique has been described above for organising an unordered list of nodes and links into an ordered and blocked list. The technique has been described for the particular application of the ordering of an unordered list of phonemes and words. However, as those skilled in the art will appreciate, this technique can be applied to other types of data lattices. For example, the technique can be applied to a lattice which only has phonemes or a lattice which only has words. Alternatively still, it can be applied to a lattice generated from a handwriting recognition system which produces a lattice of possible characters as a result of a character recognition process. In this case, the nodes and links would not be ordered in time, but would be spatially ordered so that the characters appear in the ordered lattice at a position which corresponds to the character's position on the page relative to the other characters.

Claims (51)

1. An apparatus for searching a database, comprising data defining a phoneme and/or word lattice for use in the database, said data comprising data for defining a plurality of time-ordered nodes within the lattice, data for defining a plurality of links within the lattice, each link extending from a first node to a second node, data for associating a phoneme or a word with at least one node or link, and data for arranging the nodes in a sequence of time-ordered blocks so that links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence, in response to an input query by a user, the apparatus comprising:
means for generating phoneme data corresponding to the user's input query;
means for searching the phoneme and word lattice using the phoneme data generated for the input query; and
means for outputting search results in dependence upon the output from said searching means.
2. An apparatus according to claim 1, further comprising means for generating word data corresponding to the user's input query and means for searching the phoneme and word lattice using the word data generated for the input query.
3. A method of searching a database, comprising data defining a phoneme and/or word lattice for use in the database, said data comprising data for defining a plurality of time-ordered nodes within the lattice, data for defining a plurality of links within the lattice, each link extending from a first node to a second node, data for associating a phoneme or a word with at least one node or link, and data for arranging the nodes in a sequence of time-ordered blocks so that links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence, in response to an input query by a user, the method comprising the steps of:
generating phoneme data corresponding to the user's input query;
searching the phoneme and word lattice using the phoneme data generated for the input query; and
outputting search results in dependence upon the results of said searching step.
4. A method according to claim 3, further comprising the steps of generating word data corresponding to the user's input query and searching the phoneme and word lattice using the word data generated for the input query.
5. An apparatus for generating annotation data for use in annotating a data file, the apparatus comprising:
a receiver operable to receive phoneme and/or word data; and
a first generator operable to generate annotation data defining a phoneme and/or word lattice corresponding to the received phoneme and/or word data;
wherein the first generator comprises:
a second generator operable to generate node data defining a plurality of time-ordered nodes within the lattice;
a third generator operable to generate link data defining a plurality of links within the lattice, each link extending from a first node to a second node;
a fourth generator operable to generate association data associating each node or link with a phoneme or word from the phoneme and/or word data; and
a fifth generator operable to generate block data for arranging the nodes in a sequence of time-ordered blocks fulfilling a block criteria in which links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence.
6. An apparatus according to claim 5, wherein the block criteria is that links from nodes in any given block do not extend beyond the nodes in the succeeding block.
7. An apparatus according to claim 5, wherein the first generator comprises a processor operable to form the phoneme and/or word lattice by processing the node data for each node and the link data for each link, the processor comprising:
i) an adder operable to add one or more nodes and associated link or links to a current block of the lattice until the number of nodes in the current block reaches a predetermined number;
ii) a first determiner operable to determine if the current block can be split in accordance with said block criteria; and
iii) a splitter operable to split the current block into at least two blocks of nodes.
8. An apparatus according to claim 7, operable to generate the node data and the link data in correspondence to the phoneme and/or word data separately for each phoneme and/or word.
9. An apparatus according to claim 8, operable to generate all the node data and all the link data prior to forming the phoneme and/or word lattice.
10. An apparatus according to claim 8, operable to add the node data and link data for each phoneme and/or word to the phoneme and/or word lattice incrementally as it is generated for each said phoneme and/or word.
11. An apparatus according to claim 10, operable to add the node data and link data incrementally by:
determining if a node already exists for the start and end times for the current phoneme or word being processed;
adding to the lattice a node or nodes corresponding to the start and/or end time if they do not already exist; and
adding a link between the nodes corresponding to the start and end times for the current phoneme or word being processed.
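A minimal sketch, assuming the dictionary layout of the toy example after claim 5, of the incremental addition recited in claims 10 and 11; the function name add_item is hypothetical and the exact-time comparison stands in for the tolerance a real system would use.

    def add_item(lattice, label, start, end):
        """Reuse a node if one already exists for the start or end time,
        create one otherwise, then link the two nodes (claim 11)."""
        def node_for(t):
            nodes = lattice["nodes"]
            if t in nodes:                   # a node already exists for this time
                return nodes.index(t)
            nodes.append(t)                  # items are assumed to arrive in time
            if not lattice["blocks"]:        # order, so the node list stays ordered
                lattice["blocks"].append({"node_ids": []})
            lattice["blocks"][-1]["node_ids"].append(len(nodes) - 1)
            return len(nodes) - 1

        src = node_for(start)                # node for the start time
        dst = node_for(end)                  # node for the end time
        lattice["links"].append((src, dst, label))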
12. An apparatus according to claim 7, further comprising a second determiner operable to determine a first timing or nodal point (β) for each block identifying the latest node in the block to which any link originating in the preceding block extends and a second timing or nodal point (α) for each block identifying the earliest node in the block from which a link extends into the succeeding block; and
wherein the first determiner is operable to determine that the current block of nodes can be split in accordance with said block criteria by determining that the first timing or nodal point (β) is before the second timing or nodal point (α) and wherein the splitter is operable to split the current block responsive to the first determiner determining that the current block of nodes can be split.
13. An apparatus according to claim 12, wherein the second determiner is operable to update the first timing or nodal point (β) and the second timing or nodal point (α) for each block, on addition of further nodes to the lattice.
14. An apparatus according to claim 12, wherein the splitter is operable to split the current block between the first timing or nodal point (β) and the second timing or nodal point (α).
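The β/α bookkeeping of claims 12 to 14 can be sketched as follows, again against the hypothetical dictionary layout used above: node indices stand in for the "timing or nodal points", and rescanning the link list takes the place of the incremental updating of claim 13. The defaults chosen when a block has no incoming or outgoing cross-block links are conservative placeholders.

    def nodal_points(lattice, block_index):
        """beta: latest node in the block reached by a link from the preceding
        block; alpha: earliest node in the block with a link leaving the block
        (links run forward in time, so such a link ends in a later block)."""
        blocks = lattice["blocks"]
        ids = set(blocks[block_index]["node_ids"])
        prev_ids = set(blocks[block_index - 1]["node_ids"]) if block_index > 0 else set()
        beta = max((dst for src, dst, _ in lattice["links"]
                    if src in prev_ids and dst in ids), default=min(ids))
        alpha = min((src for src, dst, _ in lattice["links"]
                     if src in ids and dst not in ids), default=max(ids))
        return beta, alpha

    def can_split(lattice, block):
        beta, alpha = nodal_points(lattice, lattice["blocks"].index(block))
        return beta < alpha                  # the claim 12/13 test: beta before alpha

    def split_block(lattice, block):
        """Claim 14: divide the block between beta and alpha.  Splitting just
        after beta is the simplest boundary; it lies before alpha whenever
        can_split has returned True."""
        i = lattice["blocks"].index(block)
        beta, _alpha = nodal_points(lattice, i)
        keep = [n for n in block["node_ids"] if n <= beta]
        move = [n for n in block["node_ids"] if n > beta]
        block["node_ids"] = keep
        lattice["blocks"].insert(i + 1, {"node_ids": move})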
15. An apparatus according to claim 14, wherein the sixth generator comprises one of the following:
a) a processor operable to receive and process an input voice annotation signal;
b) a processor operable to receive and process a text annotation; and
c) a processor operable to receive image data representative of a text document and a character recognition unit for converting said image data into text data.
16. An apparatus according to claim 12, wherein the splitter is operable to split the current block by forming a new block starting at or near the first timing or nodal point (β) of the preceding block and ending at or near the first timing or nodal point (β) of the current block.
17. An apparatus according to claim 12, wherein the splitter is operable to split the current block by forming a new block starting at or near the first timing or nodal point (β) of the preceding block and ending at or near the first timing or nodal point (β) of the current block if the first timing or nodal point (β) of the preceding block is later than the second timing or nodal point (α) of the preceding block, whereas the splitter is operable to split the current block between the first timing or nodal point (β) and the second timing or nodal point (α) if the first timing or nodal point (β) of the preceding block is earlier than the second timing or nodal point (α) of the preceding block.
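One possible reading of the two splitting strategies of claims 16 and 17, reusing the nodal_points and split_block helpers sketched after claim 14; exactly which nodes count as "at or near" each β point is an assumption made for the sake of the sketch.

    def split_current_block(lattice, i):
        """If the preceding block's beta falls after its alpha, carve a new block
        running from (near) the preceding block's beta to (near) the current
        block's beta (claim 16); otherwise split the current block between its
        own beta and alpha (claim 14)."""
        blocks = lattice["blocks"]
        cur_beta, _ = nodal_points(lattice, i)
        prev_beta, prev_alpha = nodal_points(lattice, i - 1)
        if prev_beta > prev_alpha:
            new_ids = ([n for n in blocks[i - 1]["node_ids"] if n > prev_beta] +
                       [n for n in blocks[i]["node_ids"] if n <= cur_beta])
            blocks[i - 1]["node_ids"] = [n for n in blocks[i - 1]["node_ids"] if n <= prev_beta]
            blocks[i]["node_ids"] = [n for n in blocks[i]["node_ids"] if n > cur_beta]
            blocks.insert(i, {"node_ids": new_ids})
        else:
            split_block(lattice, blocks[i])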
18. An apparatus according to claim 5, further comprising a sixth generator operable to generate the phoneme and/or word data from input audio or text data.
19. An apparatus according to claim 18, wherein the data file comprises audio data, and the sixth generator comprises an automatic speech recognition system for generating phoneme data for audio data in the data file.
20. An apparatus according to claim 19, wherein the sixth generator further comprises a word decoder for generating word data by identifying possible words within the phoneme data generated by the automatic speech recognition system.
21. An apparatus according to claim 18, wherein the data file comprises text data, and the sixth generator comprises a text-to-phoneme converter for generating phoneme data from text data in the data file.
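Claims 18 to 21 describe where the phoneme and word data may come from. The sketch below stays deliberately neutral: the recogniser, word decoder and text-to-phoneme converter are passed in as placeholder callables, since the claims do not tie the sixth generator to any particular speech-recognition API.

    def generate_phoneme_and_word_data(data_file, asr, word_decoder, text_to_phoneme):
        """data_file is assumed to be a dict holding either "audio" or "text".
        Each callable returns a list of (label, start_time, end_time) tuples."""
        if data_file.get("audio") is not None:
            phones = asr(data_file["audio"])             # claim 19: ASR on audio data
            words = word_decoder(phones)                 # claim 20: words from phonemes
        else:
            phones = text_to_phoneme(data_file["text"])  # claim 21: text-to-phoneme
            words = []
        return phones, words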
22. An apparatus according to claim 5, wherein said first generator is operable to generate data defining time stamp information for each of said nodes.
23. An apparatus according to claim 22, wherein said data file includes a time sequential signal, and wherein said first generator is operable to generate time stamp data which is time synchronised with said time sequential signal.
24. An apparatus according to claim 23, wherein said time sequential signal is an audio and/or video signal.
25. An apparatus according to claim 5, wherein said first generator is operable to generate data which defines each block's location within the database.
26. A method of generating annotation data for use in annotating a data file, the method comprising:
i) receiving phoneme and/or word data; and
ii) generating annotation data defining a phoneme and/or word lattice corresponding to the received phoneme and/or word data;
wherein the step of generating annotation data defining the lattice comprises:
generating node data defining a plurality of time-ordered nodes within the lattice;
generating link data defining a plurality of links within the lattice, each link extending from a first node to a second node;
generating association data associating each link or node with a phoneme or word from the phoneme and/or word data; and
generating block data for arranging the nodes in a sequence of time-ordered blocks fulfilling a block criteria in which links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence.
27. A method according to claim 26, wherein the block criteria is that links from nodes in any given block do not extend beyond the nodes in the succeeding block.
28. A method according to claim 26, wherein the step of generating annotation data defining the lattice comprises the following steps for forming the phoneme and/or word lattice by processing the node data for each node and the link data for each link:
i) adding one or more nodes and associated link or links to a current block of the lattice until the number of nodes in the current block reaches a predetermined number;
ii) determining that the current block can be split in accordance with said block criteria; and
iii) splitting the current block into at least two blocks of nodes.
29. A method according to claim 28, wherein the node data and the link data is generated in correspondence to the phoneme and/or word data separately for each phoneme and/or word.
30. A method according to claim 29, wherein all the node data and all the link data is generated prior to forming the phoneme and/or word lattice.
31. A method according to claim 29, wherein the node data and link data for each phoneme and/or word is added to the phoneme and/or word lattice incrementally as it is generated for each said phoneme and/or word.
32. A method according to claim 31, wherein the node data and link data is added incrementally by:
determining if a node already exists for the start and end times for the current phoneme or word being processed;
adding to the lattice a node or nodes corresponding to the start and/or end time if they do not already exist; and
adding a link between the nodes corresponding to the start and end times for the current phoneme or word being processed.
33. A method according to claim 28, further comprising determining a first timing or nodal point (β) for each block identifying the latest node in the block to which any link originating in the preceding block extends and a second timing or nodal point (α) for each block identifying the earliest node in the block from which a link extends into the succeeding block; and
wherein the step of determining that the current block of nodes can be split in accordance with said block criteria comprises determining that the first timing or nodal point (β) is before the second timing or nodal point (α) and wherein the current block is split into the at least two blocks in response to it being determined that the current block of nodes can be split.
34. A method according to claim 33, further comprising updating the first timing or nodal point (β) and the second timing or nodal point (α) for each block, on addition of further nodes to the lattice.
35. A method according to claim 33, wherein the step of splitting the current block comprises splitting the current block between the first timing or nodal point (β) and the second timing or nodal point (α).
36. A method according to claim 33, wherein the step of splitting the current block comprises forming a new block starting at or near the first timing or nodal point (β) of the preceding block and ending at or near the first timing or nodal point (β) of the current block.
37. A method according to claim 33, wherein the step of splitting the current block comprises forming a new block starting at or near the first timing or nodal point (β) of the preceding block and ending at or near the first timing or nodal point (β) of the current block when the first timing or nodal point (β) of the preceding block is later than the second timing or nodal point (α) of the preceding block, whereas it comprises splitting the current block between the first timing or nodal point (β) and the second timing or nodal point (α) if the first timing or nodal point (β) of the preceding block is earlier than the second timing or nodal point (α) of the preceding block.
38. A method according to claim 26, further comprising the step of generating the phoneme and/or word data from input audio or text data.
39. A method according to claim 38, wherein the data file comprises audio data, and the step of generating the phoneme and word data comprises:
using an automatic speech recognition system to generate phoneme data for audio data in the data file; and
using a word decoder to generate word data by identifying possible words within the phoneme data generated by the automatic speech recognition system.
40. A method according to claim 38, wherein the data file comprises text data, and the step of generating the phoneme and word data comprises using a text-to-phoneme converter to generate phoneme data from text data in the data file.
41. A method according to claim 38, wherein the step of generating the phoneme and/or word data comprises one of the following group:
a) receiving and processing an input voice annotation signal;
b) receiving and processing a text annotation; and
c) receiving image data representative of a text document and converting said image data into text data using a character recognition unit.
42. A method according to claim 26, further comprising generating data defining time stamp information for each of said nodes.
43. A method according to claim 42, wherein said data file includes a time sequential signal, and wherein the generated time stamp data is time synchronised with said time sequential signal.
44. A method according to claim 43, wherein said time sequential signal is an audio and/or video signal.
45. A method according to claim 26, further comprising generating data which defines each block's location within the database.
46. A method according to claim 26, further comprising forming the phoneme and/or word lattice by processing the node data for each node and the link data for each link by:
i) adding node data for two nodes and link data for one or more links therebetween;
ii) adding block data to provide an initial block of nodes constituted by the two added nodes;
iii) adding to the initial block of nodes further node data and/or link data for one or more further nodes and/or links;
iv) repeating (iii) until the number of nodes in the initial block reaches a predetermined number of nodes;
v) determining that the initial block of nodes can be split in accordance with said block criteria;
vi) adding further block data to split the initial block of nodes into at least two current blocks of nodes;
vii) adding to one of the current blocks of nodes further node data and/or link data for one or more further nodes and/or links;
viii) repeating (vii) until the number of nodes in any current block is identified as reaching the predetermined number of nodes;
ix) determining that the identified current block can be split in accordance with said block criteria;
x) adding further block data to split the identified current block into at least two blocks;
xi) repeating (viii), (ix) and (x) if required until the node data and link data for all of the nodes and links generated for the phoneme and/or word data have been added to the phoneme and/or word lattice.
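Read end to end, the eleven steps of claim 46 amount to the following driver loop; this is a sketch reusing the add_item, can_split and split_block helpers above, and the block size of 64 nodes is an assumed value for the "predetermined number".

    def form_lattice(items, max_nodes=64):
        """items yields (label, start_time, end_time) for every phoneme and word
        generated for the data file; returns the populated lattice dictionary."""
        lattice = {"nodes": [], "links": [], "blocks": [{"node_ids": []}]}
        for label, start, end in items:                  # steps i) to iii) and vii)
            add_item(lattice, label, start, end)
            current = lattice["blocks"][-1]
            if len(current["node_ids"]) >= max_nodes:    # steps iv) and viii)
                if can_split(lattice, current):          # steps v) and ix)
                    split_block(lattice, current)        # steps vi) and x)
        return lattice                                   # step xi): every item consumed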
47. An apparatus for generating annotation data for use in annotating a data file, the apparatus comprising:
receiving means for receiving phoneme and/or word data; and
first generating means for generating annotation data defining a phoneme and/or word lattice corresponding to the received phoneme and/or word data;
wherein the first generating means comprises:
second generating means for generating node data defining a plurality of time-ordered nodes within the lattice;
third generating means for generating link data defining a plurality of links within the lattice, each link extending from a first node to a second node;
fourth generating means for generating association data associating each node or link with a phoneme or word from the phoneme and/or word data; and
fifth generating means for generating block data for arranging the nodes in a sequence of time-ordered blocks fulfilling a block criteria in which links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence.
48. A computer readable medium storing computer executable instructions for causing a programmable computer device to carry out a method of searching a database, comprising data defining a phoneme and/or word lattice for use in the database, said data comprising data for defining a plurality of time-ordered nodes within the lattice, data for defining a plurality of links within the lattice, each link extending from a first node to a second node, data for associating a phoneme or a word with at least one node or link, and data for arranging the nodes in a sequence of time-ordered blocks so that links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence, in response to an input query by a user, the instructions comprising:
instructions for generating phoneme data corresponding to the user's input query;
instructions for searching the phoneme and word lattice using the phoneme data generated for the input query; and
instructions for outputting search results in dependence upon the results of said searching step.
49. Computer executable instructions for causing a programmable computer device to carry out a method of searching a database, comprising data defining a phoneme and/or word lattice for use in the database; said data comprising data for defining a plurality of time-ordered nodes within the lattice, data for defining a plurality of links within the lattice, each link extending from a first node to a second node, data for associating a phoneme or a word with at least one node or link, and data for arranging the nodes in a sequence of time-ordered blocks so that links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence, in response to an input query by a user, the instructions comprising:
instructions for generating phoneme data corresponding to the user's input query;
instructions for searching the phoneme and word lattice using the phoneme data generated for the input query; and
instructions for outputting search results in dependence upon the results of said searching step.
50. A computer readable medium storing computer executable instructions for causing a programmable computer device to carry out a method of generating annotation data for use in annotating a data file, the computer executable instructions comprising:
instructions for receiving phoneme and/or word data; and
instructions for generating annotation data defining a phoneme and/or word lattice corresponding to the received phoneme and/or word data;
wherein the instructions for generating annotation data defining the lattice comprise:
instructions for generating node data defining a plurality of time-ordered nodes within the lattice;
instructions for generating link data defining a plurality of links within the lattice, each link extending from a first node to a second node;
instructions for generating association data associating each link or node with a phoneme or word from the phoneme and/or word data; and
instructions for generating block data for arranging the nodes in a sequence of time-ordered blocks fulfilling a block criteria in which links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence.
51. Computer executable instructions for causing a programmable computer device to carry out a method of generating annotation data for use in annotating a data file, the computer executable instructions comprising:
instructions for receiving phoneme and/or word data; and
instructions for generating annotation data defining a phoneme and/or word lattice corresponding to the received phoneme and/or word data;
wherein the instructions for generating annotation data defining the lattice comprise:
instructions for generating node data defining a plurality of time-ordered nodes within the lattice;
instructions for generating link data defining a plurality of links within the lattice, each link extending from a first node to a second node;
instructions for generating association data associating each link or node with a phoneme or word from the phoneme and/or word data; and
instructions for generating block data for arranging the nodes in a sequence of time-ordered blocks fulfilling a block criteria in which links from nodes in any given block do not extend beyond the nodes in a block that is a predetermined number of blocks later in the sequence.
US10/363,752 2000-09-29 2001-09-28 Database annotation and retrieval Expired - Fee Related US7240003B2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
GBGB0023930.1A GB0023930D0 (en) 2000-09-29 2000-09-29 Database annotation and retrieval
GB0023930.1 2000-09-29
PCT/GB2001/004331 WO2002027546A2 (en) 2000-09-29 2001-09-28 Database annotation and retrieval

Publications (2)

Publication Number Publication Date
US20030177108A1 US20030177108A1 (en) 2003-09-18
US7240003B2 true US7240003B2 (en) 2007-07-03

Family

ID=9900403

Family Applications (1)

Application Number Title Priority Date Filing Date
US10/363,752 Expired - Fee Related US7240003B2 (en) 2000-09-29 2001-09-28 Database annotation and retrieval

Country Status (8)

Country Link
US (1) US7240003B2 (en)
EP (1) EP1327206A2 (en)
JP (1) JP2004510256A (en)
KR (1) KR100612169B1 (en)
CN (1) CN1227613C (en)
AU (1) AU2001290136A1 (en)
GB (1) GB0023930D0 (en)
WO (1) WO2002027546A2 (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050001909A1 (en) * 2003-07-02 2005-01-06 Konica Minolta Photo Imaging, Inc. Image taking apparatus and method of adding an annotation to an image
US20060085182A1 (en) * 2002-12-24 2006-04-20 Koninklijke Philips Electronics, N.V. Method and system for augmenting an audio signal
US20060206327A1 (en) * 2005-02-21 2006-09-14 Marcus Hennecke Voice-controlled data system
US20070150273A1 (en) * 2005-12-28 2007-06-28 Hiroki Yamamoto Information retrieval apparatus and method
US7310600B1 (en) * 1999-10-28 2007-12-18 Canon Kabushiki Kaisha Language recognition using a similarity measure
US20090030680A1 (en) * 2007-07-23 2009-01-29 Jonathan Joseph Mamou Method and System of Indexing Speech Data
US20090030894A1 (en) * 2007-07-23 2009-01-29 International Business Machines Corporation Spoken Document Retrieval using Multiple Speech Transcription Indices
US20090112593A1 (en) * 2007-10-24 2009-04-30 Harman Becker Automotive Systems Gmbh System for recognizing speech for searching a database
US20100125458A1 (en) * 2006-07-13 2010-05-20 Sri International Method and apparatus for error correction in speech recognition applications
US20120245936A1 (en) * 2011-03-25 2012-09-27 Bryan Treglia Device to Capture and Temporally Synchronize Aspects of a Conversation and Method and System Thereof
US20120271915A1 (en) * 2005-01-05 2012-10-25 Microsoft Corporation Processing files from a mobile device
US20130006625A1 (en) * 2011-06-28 2013-01-03 Sony Corporation Extended videolens media engine for audio recognition
US20130124203A1 (en) * 2010-04-12 2013-05-16 II Jerry R. Scoggins Aligning Scripts To Dialogues For Unmatched Portions Based On Matched Portions
US20130166303A1 (en) * 2009-11-13 2013-06-27 Adobe Systems Incorporated Accessing media data using metadata repository
US8959071B2 (en) 2010-11-08 2015-02-17 Sony Corporation Videolens media system for feature selection
US20150302848A1 (en) * 2014-04-21 2015-10-22 International Business Machines Corporation Speech retrieval method, speech retrieval apparatus, and program for speech retrieval apparatus
US10452661B2 (en) 2015-06-18 2019-10-22 Microsoft Technology Licensing, Llc Automated database schema annotation
US10572538B2 (en) * 2015-04-28 2020-02-25 Kabushiki Kaisha Toshiba Lattice finalization device, pattern recognition device, lattice finalization method, and computer program product

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6882970B1 (en) 1999-10-28 2005-04-19 Canon Kabushiki Kaisha Language recognition using sequence frequency
US7212968B1 (en) 1999-10-28 2007-05-01 Canon Kabushiki Kaisha Pattern matching method and apparatus
GB0011798D0 (en) 2000-05-16 2000-07-05 Canon Kk Database annotation and retrieval
JP4175093B2 (en) * 2002-11-06 2008-11-05 日本電信電話株式会社 Topic boundary determination method and apparatus, and topic boundary determination program
US7725318B2 (en) * 2004-07-30 2010-05-25 Nice Systems Inc. System and method for improving the accuracy of audio searching
US7912699B1 (en) * 2004-08-23 2011-03-22 At&T Intellectual Property Ii, L.P. System and method of lattice-based search for spoken utterance retrieval
JP4587165B2 (en) * 2004-08-27 2010-11-24 キヤノン株式会社 Information processing apparatus and control method thereof
JP4638726B2 (en) * 2004-12-22 2011-02-23 株式会社アルファジェン Sample set manufacturing method, gene alignment program, and sample set
US8694317B2 (en) * 2005-02-05 2014-04-08 Aurix Limited Methods and apparatus relating to searching of spoken audio data
US7634407B2 (en) 2005-05-20 2009-12-15 Microsoft Corporation Method and apparatus for indexing speech
US7809568B2 (en) 2005-11-08 2010-10-05 Microsoft Corporation Indexing and searching speech with text meta-data
US7831428B2 (en) * 2005-11-09 2010-11-09 Microsoft Corporation Speech index pruning
US7831425B2 (en) 2005-12-15 2010-11-09 Microsoft Corporation Time-anchored posterior indexing of speech
TWI312945B (en) * 2006-06-07 2009-08-01 Ind Tech Res Inst Method and apparatus for multimedia data management
EP2126707A2 (en) * 2007-01-17 2009-12-02 Verbal World, Inc. Methods and apparatus for manipulation of primary audio-optical data content and associated secondary data content
US20080270344A1 (en) * 2007-04-30 2008-10-30 Yurick Steven J Rich media content search engine
US20080270110A1 (en) * 2007-04-30 2008-10-30 Yurick Steven J Automatic speech recognition with textual content input
US7983915B2 (en) * 2007-04-30 2011-07-19 Sonic Foundry, Inc. Audio content search engine
US8504361B2 (en) * 2008-02-07 2013-08-06 Nec Laboratories America, Inc. Deep neural networks and methods for using same
US20110224982A1 (en) * 2010-03-12 2011-09-15 c/o Microsoft Corporation Automatic speech recognition based upon information retrieval methods
US8788434B2 (en) * 2010-10-28 2014-07-22 Google Inc. Search with joint image-audio queries
JP2013025299A (en) * 2011-07-26 2013-02-04 Toshiba Corp Transcription support system and transcription support method
US8849041B2 (en) * 2012-06-04 2014-09-30 Comcast Cable Communications, Llc Data recognition in content
ES2566569T3 (en) * 2012-06-28 2016-04-13 Jajah Ltd System and method to perform textual queries in voice communications
CN104978963A (en) * 2014-04-08 2015-10-14 富士通株式会社 Speech recognition apparatus, method and electronic equipment
US10769495B2 (en) * 2018-08-01 2020-09-08 Adobe Inc. Collecting multimodal image editing requests
CN111354348A (en) * 2018-12-21 2020-06-30 北京搜狗科技发展有限公司 Data processing method and device and data processing device
KR20210033258A (en) 2019-09-18 2021-03-26 삼성전자주식회사 Method and apparatus for processing sequence

Citations (71)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4227176A (en) 1978-04-27 1980-10-07 Dialog Systems, Inc. Continuous speech recognition method
US4736429A (en) 1983-06-07 1988-04-05 Matsushita Electric Industrial Co., Ltd. Apparatus for speech recognition
US4903305A (en) 1986-05-12 1990-02-20 Dragon Systems, Inc. Method for representing word models for use in speech recognition
US4975959A (en) 1983-11-08 1990-12-04 Texas Instruments Incorporated Speaker independent speech recognition process
US4980918A (en) 1985-05-09 1990-12-25 International Business Machines Corporation Speech recognition system with efficient storage and rapid assembly of phonological graphs
US4985924A (en) 1987-12-24 1991-01-15 Kabushiki Kaisha Toshiba Speech recognition apparatus
US5075896A (en) 1989-10-25 1991-12-24 Xerox Corporation Character and phoneme recognition based on probability clustering
US5131043A (en) 1983-09-05 1992-07-14 Matsushita Electric Industrial Co., Ltd. Method of and apparatus for speech recognition wherein decisions are made based on phonemes
US5136655A (en) 1990-03-26 1992-08-04 Hewlett-Packard Company Method and apparatus for indexing and retrieving audio-video data
US5202952A (en) 1990-06-22 1993-04-13 Dragon Systems, Inc. Large-vocabulary continuous speech prefiltering and processing system
EP0597798A1 (en) 1992-11-13 1994-05-18 International Business Machines Corporation Method and system for utilizing audible search patterns within a multimedia presentation
US5333275A (en) 1992-06-23 1994-07-26 Wheatley Barbara J System and method for time aligning speech
US5390278A (en) 1991-10-08 1995-02-14 Bell Canada Phoneme based speech recognition
EP0649144A1 (en) 1993-10-18 1995-04-19 International Business Machines Corporation Automatic indexing of audio using speech recognition
EP0689153A2 (en) 1994-05-30 1995-12-27 Texas Instruments Incorporated Character recognition
US5500920A (en) 1993-09-23 1996-03-19 Xerox Corporation Semantic co-occurrence filtering for speech recognition and signal transcription applications
US5577249A (en) 1992-07-31 1996-11-19 International Business Machines Corporation Method for finding a reference token sequence in an original token string within a database of token strings using appended non-contiguous substrings
GB2302199A (en) 1996-09-24 1997-01-08 Allvoice Computing Plc Text processing
US5594641A (en) 1992-07-20 1997-01-14 Xerox Corporation Finite-state transduction of related word forms for text indexing and retrieval
US5638425A (en) 1992-12-17 1997-06-10 Bell Atlantic Network Services, Inc. Automated directory assistance system using word recognition and phoneme processing method
US5640487A (en) 1993-02-26 1997-06-17 International Business Machines Corporation Building scalable n-gram language models using maximum likelihood maximum entropy n-gram models
EP0789349A2 (en) 1996-02-09 1997-08-13 Canon Kabushiki Kaisha Pattern matching method and apparatus and telephone system
US5675706A (en) 1995-03-31 1997-10-07 Lucent Technologies Inc. Vocabulary independent discriminative utterance verification for non-keyword rejection in subword based speech recognition
US5680605A (en) 1995-02-07 1997-10-21 Torres; Robert J. Method and apparatus for searching a large volume of data with a pointer-based device in a data processing system
US5684925A (en) 1995-09-08 1997-11-04 Matsushita Electric Industrial Co., Ltd. Speech representation by feature-based word prototypes comprising phoneme targets having reliable high similarity
US5708759A (en) 1996-11-19 1998-01-13 Kemeny; Emanuel S. Speech recognition using phoneme waveform parameters
US5721939A (en) 1995-08-03 1998-02-24 Xerox Corporation Method and apparatus for tokenizing text
US5729741A (en) 1995-04-10 1998-03-17 Golden Enterprises, Inc. System for storage and retrieval of diverse types of information obtained from different media sources which includes video, audio, and text transcriptions
US5737489A (en) 1995-09-15 1998-04-07 Lucent Technologies Inc. Discriminative utterance verification for connected digits recognition
US5737723A (en) 1994-08-29 1998-04-07 Lucent Technologies Inc. Confusable word detection in speech recognition
US5752227A (en) 1994-05-10 1998-05-12 Telia Ab Method and arrangement for speech to text conversion
EP0849723A2 (en) 1996-12-20 1998-06-24 ATR Interpreting Telecommunications Research Laboratories Speech recognition apparatus equipped with means for removing erroneous candidate of speech recognition
US5781884A (en) 1995-03-24 1998-07-14 Lucent Technologies, Inc. Grapheme-to-phoneme conversion of digit strings using weighted finite state transducers to apply grammar to powers of a number basis
US5787414A (en) 1993-06-03 1998-07-28 Kabushiki Kaisha Toshiba Data retrieval system using secondary information of primary data to be retrieved as retrieval key
US5799267A (en) 1994-07-22 1998-08-25 Siegel; Steven H. Phonic engine
WO1998047084A1 (en) 1997-04-17 1998-10-22 Sharp Kabushiki Kaisha A method and system for object-based video description and linking
US5835667A (en) 1994-10-14 1998-11-10 Carnegie Mellon University Method and apparatus for creating a searchable digital video library and a system and method of using such a library
US5852822A (en) 1996-12-09 1998-12-22 Oracle Corporation Index-only tables with nested group keys
WO1999005681A1 (en) 1997-07-23 1999-02-04 Siemens Aktiengesellschaft Process for storing search parameters of an image sequence and access to an image stream in said image sequence
US5870740A (en) 1996-09-30 1999-02-09 Apple Computer, Inc. System and method for improving the ranking of information retrieval results for short queries
US5873061A (en) 1995-05-03 1999-02-16 U.S. Philips Corporation Method for constructing a model of a new word for addition to a word model database of a speech recognition system
US5907821A (en) 1995-11-06 1999-05-25 Hitachi, Ltd. Method of computer-based automatic extraction of translation pairs of words from a bilingual text
GB2331816A (en) 1997-10-16 1999-06-02 Imarket Inc Searching a database using a phonetically encoded inverted index
US5983177A (en) 1997-12-18 1999-11-09 Nortel Networks Corporation Method and apparatus for obtaining transcriptions from multiple training utterances
US5999902A (en) 1995-03-07 1999-12-07 British Telecommunications Public Limited Company Speech recognition incorporating a priori probability weighting factors
US6023536A (en) 1995-07-03 2000-02-08 Fujitsu Limited Character string correction system and method using error pattern
US6061679A (en) 1997-11-25 2000-05-09 International Business Machines Corporation Creating and searching a data structure ordered by ranges of key masks associated with the data structure
US6070140A (en) 1995-06-05 2000-05-30 Tran; Bao Q. Speech recognizer
WO2000031723A1 (en) 1998-11-25 2000-06-02 Sony Electronics, Inc. Method and apparatus for very large vocabulary isolated word recognition in a parameter sharing speech recognition system
WO2000054168A2 (en) 1999-03-05 2000-09-14 Canon Kabushiki Kaisha Database annotation and retrieval
US6122613A (en) 1997-01-30 2000-09-19 Dragon Systems, Inc. Speech recognition using multiple recognizers (selectively) applied to the same input sample
GB2349260A (en) 1999-04-23 2000-10-25 Canon Kk Training apparatus
US6172675B1 (en) 1996-12-05 2001-01-09 Interval Research Corporation Indirect manipulation of data using temporally related data, with particular application to manipulation of audio or audiovisual data
US6182039B1 (en) 1998-03-24 2001-01-30 Matsushita Electric Industrial Co., Ltd. Method and apparatus using probabilistic language model based on confusable sets for speech recognition
US6192337B1 (en) 1998-08-14 2001-02-20 International Business Machines Corporation Apparatus and methods for rejecting confusible words during training associated with a speech recognition system
WO2001031627A2 (en) 1999-10-28 2001-05-03 Canon Kabushiki Kaisha Pattern matching method and apparatus
US6236964B1 (en) 1990-02-01 2001-05-22 Canon Kabushiki Kaisha Speech recognition apparatus and method for matching inputted speech and a word generated from stored referenced phoneme data
US6243680B1 (en) 1998-06-15 2001-06-05 Nortel Networks Limited Method and apparatus for obtaining a transcription of phrases through text and spoken utterances
US6272242B1 (en) 1994-07-15 2001-08-07 Ricoh Company, Ltd. Character recognition method and apparatus which groups similar character patterns
US6289140B1 (en) 1998-02-19 2001-09-11 Hewlett-Packard Company Voice control input for portable capture devices
US6314400B1 (en) 1998-09-16 2001-11-06 U.S. Philips Corporation Method of estimating probabilities of occurrence of speech vocabulary elements
US6321226B1 (en) 1998-06-30 2001-11-20 Microsoft Corporation Flexible keyboard searching
US20020022960A1 (en) 2000-05-16 2002-02-21 Charlesworth Jason Peter Andrew Database annotation and retrieval
US6389395B1 (en) 1994-11-01 2002-05-14 British Telecommunications Public Limited Company System and method for generating a phonetic baseform for a word and using the generated baseform for speech recognition
US6463413B1 (en) 1999-04-20 2002-10-08 Matsushita Electrical Industrial Co., Ltd. Speech recognition training for small hardware devices
US6487532B1 (en) 1997-09-24 2002-11-26 Scansoft, Inc. Apparatus and method for distinguishing similar-sounding utterances speech recognition
US6490563B2 (en) 1998-08-17 2002-12-03 Microsoft Corporation Proofreading with text to speech feedback
US6535850B1 (en) 2000-03-09 2003-03-18 Conexant Systems, Inc. Smart training and smart scoring in SD speech recognition system with user defined vocabulary
US6567778B1 (en) 1995-12-21 2003-05-20 Nuance Communications Natural language speech recognition using slot semantic confidence scores related to their word recognition confidence scores
US6567816B1 (en) 2000-03-07 2003-05-20 Paramesh Sampatrai Desai Method, system, and program for extracting data from database records using dynamic code
US6662180B1 (en) 1999-05-12 2003-12-09 Matsushita Electric Industrial Co., Ltd. Method for searching in large databases of automatically recognized text

Patent Citations (75)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4227176A (en) 1978-04-27 1980-10-07 Dialog Systems, Inc. Continuous speech recognition method
US4736429A (en) 1983-06-07 1988-04-05 Matsushita Electric Industrial Co., Ltd. Apparatus for speech recognition
US5131043A (en) 1983-09-05 1992-07-14 Matsushita Electric Industrial Co., Ltd. Method of and apparatus for speech recognition wherein decisions are made based on phonemes
US4975959A (en) 1983-11-08 1990-12-04 Texas Instruments Incorporated Speaker independent speech recognition process
US4980918A (en) 1985-05-09 1990-12-25 International Business Machines Corporation Speech recognition system with efficient storage and rapid assembly of phonological graphs
US4903305A (en) 1986-05-12 1990-02-20 Dragon Systems, Inc. Method for representing word models for use in speech recognition
US4985924A (en) 1987-12-24 1991-01-15 Kabushiki Kaisha Toshiba Speech recognition apparatus
US5075896A (en) 1989-10-25 1991-12-24 Xerox Corporation Character and phoneme recognition based on probability clustering
US6236964B1 (en) 1990-02-01 2001-05-22 Canon Kabushiki Kaisha Speech recognition apparatus and method for matching inputted speech and a word generated from stored referenced phoneme data
US5136655A (en) 1990-03-26 1992-08-04 Hewlett-Packard Company Method and apparatus for indexing and retrieving audio-video data
US5202952A (en) 1990-06-22 1993-04-13 Dragon Systems, Inc. Large-vocabulary continuous speech prefiltering and processing system
US5390278A (en) 1991-10-08 1995-02-14 Bell Canada Phoneme based speech recognition
US5333275A (en) 1992-06-23 1994-07-26 Wheatley Barbara J System and method for time aligning speech
US5594641A (en) 1992-07-20 1997-01-14 Xerox Corporation Finite-state transduction of related word forms for text indexing and retrieval
US5577249A (en) 1992-07-31 1996-11-19 International Business Machines Corporation Method for finding a reference token sequence in an original token string within a database of token strings using appended non-contiguous substrings
EP0597798A1 (en) 1992-11-13 1994-05-18 International Business Machines Corporation Method and system for utilizing audible search patterns within a multimedia presentation
US5638425A (en) 1992-12-17 1997-06-10 Bell Atlantic Network Services, Inc. Automated directory assistance system using word recognition and phoneme processing method
US5640487A (en) 1993-02-26 1997-06-17 International Business Machines Corporation Building scalable n-gram language models using maximum likelihood maximum entropy n-gram models
US5787414A (en) 1993-06-03 1998-07-28 Kabushiki Kaisha Toshiba Data retrieval system using secondary information of primary data to be retrieved as retrieval key
US5500920A (en) 1993-09-23 1996-03-19 Xerox Corporation Semantic co-occurrence filtering for speech recognition and signal transcription applications
US5649060A (en) 1993-10-18 1997-07-15 International Business Machines Corporation Automatic indexing and aligning of audio and text using speech recognition
EP0649144A1 (en) 1993-10-18 1995-04-19 International Business Machines Corporation Automatic indexing of audio using speech recognition
US5752227A (en) 1994-05-10 1998-05-12 Telia Ab Method and arrangement for speech to text conversion
EP0689153A2 (en) 1994-05-30 1995-12-27 Texas Instruments Incorporated Character recognition
US6272242B1 (en) 1994-07-15 2001-08-07 Ricoh Company, Ltd. Character recognition method and apparatus which groups similar character patterns
US5799267A (en) 1994-07-22 1998-08-25 Siegel; Steven H. Phonic engine
US5737723A (en) 1994-08-29 1998-04-07 Lucent Technologies Inc. Confusable word detection in speech recognition
US5835667A (en) 1994-10-14 1998-11-10 Carnegie Mellon University Method and apparatus for creating a searchable digital video library and a system and method of using such a library
US6389395B1 (en) 1994-11-01 2002-05-14 British Telecommunications Public Limited Company System and method for generating a phonetic baseform for a word and using the generated baseform for speech recognition
US5680605A (en) 1995-02-07 1997-10-21 Torres; Robert J. Method and apparatus for searching a large volume of data with a pointer-based device in a data processing system
US5999902A (en) 1995-03-07 1999-12-07 British Telecommunications Public Limited Company Speech recognition incorporating a priori probability weighting factors
US5781884A (en) 1995-03-24 1998-07-14 Lucent Technologies, Inc. Grapheme-to-phoneme conversion of digit strings using weighted finite state transducers to apply grammar to powers of a number basis
US5675706A (en) 1995-03-31 1997-10-07 Lucent Technologies Inc. Vocabulary independent discriminative utterance verification for non-keyword rejection in subword based speech recognition
US5729741A (en) 1995-04-10 1998-03-17 Golden Enterprises, Inc. System for storage and retrieval of diverse types of information obtained from different media sources which includes video, audio, and text transcriptions
US5873061A (en) 1995-05-03 1999-02-16 U.S. Philips Corporation Method for constructing a model of a new word for addition to a word model database of a speech recognition system
US6070140A (en) 1995-06-05 2000-05-30 Tran; Bao Q. Speech recognizer
US6023536A (en) 1995-07-03 2000-02-08 Fujitsu Limited Character string correction system and method using error pattern
US5721939A (en) 1995-08-03 1998-02-24 Xerox Corporation Method and apparatus for tokenizing text
US5684925A (en) 1995-09-08 1997-11-04 Matsushita Electric Industrial Co., Ltd. Speech representation by feature-based word prototypes comprising phoneme targets having reliable high similarity
US5737489A (en) 1995-09-15 1998-04-07 Lucent Technologies Inc. Discriminative utterance verification for connected digits recognition
US5907821A (en) 1995-11-06 1999-05-25 Hitachi, Ltd. Method of computer-based automatic extraction of translation pairs of words from a bilingual text
US6567778B1 (en) 1995-12-21 2003-05-20 Nuance Communications Natural language speech recognition using slot semantic confidence scores related to their word recognition confidence scores
EP0789349A2 (en) 1996-02-09 1997-08-13 Canon Kabushiki Kaisha Pattern matching method and apparatus and telephone system
GB2302199A (en) 1996-09-24 1997-01-08 Allvoice Computing Plc Text processing
US5870740A (en) 1996-09-30 1999-02-09 Apple Computer, Inc. System and method for improving the ranking of information retrieval results for short queries
US5708759A (en) 1996-11-19 1998-01-13 Kemeny; Emanuel S. Speech recognition using phoneme waveform parameters
US6172675B1 (en) 1996-12-05 2001-01-09 Interval Research Corporation Indirect manipulation of data using temporally related data, with particular application to manipulation of audio or audiovisual data
US5852822A (en) 1996-12-09 1998-12-22 Oracle Corporation Index-only tables with nested group keys
EP0849723A2 (en) 1996-12-20 1998-06-24 ATR Interpreting Telecommunications Research Laboratories Speech recognition apparatus equipped with means for removing erroneous candidate of speech recognition
US6122613A (en) 1997-01-30 2000-09-19 Dragon Systems, Inc. Speech recognition using multiple recognizers (selectively) applied to the same input sample
WO1998047084A1 (en) 1997-04-17 1998-10-22 Sharp Kabushiki Kaisha A method and system for object-based video description and linking
WO1999005681A1 (en) 1997-07-23 1999-02-04 Siemens Aktiengesellschaft Process for storing search parameters of an image sequence and access to an image stream in said image sequence
US6487532B1 (en) 1997-09-24 2002-11-26 Scansoft, Inc. Apparatus and method for distinguishing similar-sounding utterances speech recognition
US6026398A (en) 1997-10-16 2000-02-15 Imarket, Incorporated System and methods for searching and matching databases
GB2331816A (en) 1997-10-16 1999-06-02 Imarket Inc Searching a database using a phonetically encoded inverted index
US6061679A (en) 1997-11-25 2000-05-09 International Business Machines Corporation Creating and searching a data structure ordered by ranges of key masks associated with the data structure
US5983177A (en) 1997-12-18 1999-11-09 Nortel Networks Corporation Method and apparatus for obtaining transcriptions from multiple training utterances
US6289140B1 (en) 1998-02-19 2001-09-11 Hewlett-Packard Company Voice control input for portable capture devices
US6182039B1 (en) 1998-03-24 2001-01-30 Matsushita Electric Industrial Co., Ltd. Method and apparatus using probabilistic language model based on confusable sets for speech recognition
US6243680B1 (en) 1998-06-15 2001-06-05 Nortel Networks Limited Method and apparatus for obtaining a transcription of phrases through text and spoken utterances
US6321226B1 (en) 1998-06-30 2001-11-20 Microsoft Corporation Flexible keyboard searching
US6192337B1 (en) 1998-08-14 2001-02-20 International Business Machines Corporation Apparatus and methods for rejecting confusible words during training associated with a speech recognition system
US6490563B2 (en) 1998-08-17 2002-12-03 Microsoft Corporation Proofreading with text to speech feedback
US6314400B1 (en) 1998-09-16 2001-11-06 U.S. Philips Corporation Method of estimating probabilities of occurrence of speech vocabulary elements
WO2000031723A1 (en) 1998-11-25 2000-06-02 Sony Electronics, Inc. Method and apparatus for very large vocabulary isolated word recognition in a parameter sharing speech recognition system
US6990448B2 (en) * 1999-03-05 2006-01-24 Canon Kabushiki Kaisha Database annotation and retrieval including phoneme data
US20020052740A1 (en) 1999-03-05 2002-05-02 Charlesworth Jason Peter Andrew Database annotation and retrieval
WO2000054168A2 (en) 1999-03-05 2000-09-14 Canon Kabushiki Kaisha Database annotation and retrieval
US6463413B1 (en) 1999-04-20 2002-10-08 Matsushita Electrical Industrial Co., Ltd. Speech recognition training for small hardware devices
GB2349260A (en) 1999-04-23 2000-10-25 Canon Kk Training apparatus
US6662180B1 (en) 1999-05-12 2003-12-09 Matsushita Electric Industrial Co., Ltd. Method for searching in large databases of automatically recognized text
WO2001031627A2 (en) 1999-10-28 2001-05-03 Canon Kabushiki Kaisha Pattern matching method and apparatus
US6567816B1 (en) 2000-03-07 2003-05-20 Paramesh Sampatrai Desai Method, system, and program for extracting data from database records using dynamic code
US6535850B1 (en) 2000-03-09 2003-03-18 Conexant Systems, Inc. Smart training and smart scoring in SD speech recognition system with user defined vocabulary
US20020022960A1 (en) 2000-05-16 2002-02-21 Charlesworth Jason Peter Andrew Database annotation and retrieval

Non-Patent Citations (40)

* Cited by examiner, † Cited by third party
Title
"Template Averaging For Adapting A Dynamic Time Warping Speech Recognizer", IBM Technical Disclosure Bulletin, vol. 32, No. 11 (Apr. 1990), pp. 422-426.
Bahl, L. R., et al., "A Method For The Construction Of Acoustic Markov Models For Words", IEEE Transactions on Speech and Audio Processing, vol. 1, No. 4 (Oct. 1993), pp. 443-452.
Berge, C., "Graphs And Hypergraphs", North Holland Mathematical Library, Amsterdam (1976) p. 175.
Besling, S., "A Statistical Approach To Multilingual Phonetic Transcription", Philips J. Res., vol. 49 (1995), pp. 367-379.
Bird, S. & Liberman, M., "A Formal Framework For Linguistic Annotation", Linguistic Data Consortium, University of Pennsylvania, Tech. Report No. MS-CIS-99-01 (Mar. 1999), pp. 1-48.
Bird, S. & Liberman, M., "Towards A Formal Framework For Linguistic Annotations", Linguistic Data Consortium, University of Pennsylvania, Presented at ICSLP (Sydney, Dec. 1998), 12 pages.
Cassidy, S. & Harrington, J., "EMU: An Enhanced Hierarchical Speech Data Management System", Proc. of the 6th Australian Speech Science and Technology Conference (Adelaide, 1996), pp. 381-386.
F. Schiel et al., "The Partitur Format at BAS", In Proc. of the First Int'l. Conference on Language Resources and Evaluation, Granada, Spain, 1998.
Foote, J. L., et al., "Unconstrained Keyword Spotting Using Phone Lattices With Application to Spoken Document Retrieval", Computer Speech and Language (1997), vol. II, pp. 207-224.
Gagnoulet, C., et al., "Mairievox: A Voice-Activated Information System", Speech Communication, Amsterdam, NL, 10(Feb. 1991), No. 1, pp. 23-31.
Gelin, P. & Wellekens, C. J., "Keyword Spotting For Video Soundtrack Indexing", 1996 IEEE Int. Conf. on Acoustics, Speech, and Sig. Proc., ICASSP-96, Conference Proceedings (May 7-10, 1996), vol. 1, pp. 299-302.
Gerber, C., "A General Approach to Speech Recognition", Proceedings of the Final Workshop on Multimedia Information Retrieval (Miro '95), Glasgow, Scotland (Sep. 18-20, 1995) pp. 1-12.
Haeb-Umbach, R., et al, "Automatic Transcription Of Unknown Words In A Speech Recognition System", IEEE (1995), pp. 840-843.
Jain, N., et al., "Creating Speaker-Specific Phonetic Templates With A Speaker-Independent Phonetic Recognizer: Implications For Voice Dialing", IEEE (1996), pp. 881-884.
James, D.A. & Young, S. J., "A Fast Lattice-Based Approach To Vocabulary Independent Wordspotting", 1994 IEEE Int. Conf. on Acoustics, Speech and Sig. Proc., ICASSP-94, vol. i (Adelaide, Australia, Apr. 19-22, 1994), pp. I-377-380.
Jokinen, P., et al., "A Comparison Of Approximate String Matching Algorithms", Software-Practice and Experience, vol. 26(12) (Dec. 1996), pp. 1439-1458.
Jonathan Harrington, Jane Johnstone: "The Effects of Equivalence Classes on Parsing Phonemes into Words in Continuous Recognition", (1987) 22, Sep./Dec., Nos. 3-4, London, Great Britain, pp. 273-288.
Kobayashi, Y. & Niimi, Y., "Matching Algorithms Between A Phonetic Lattice And Two Types Of Templates-Lattice And Graph", IEEE (1985), pp. 1597-1600.
Lee, K., "Automatic Speech Recognition: The Development Of The SPHINX System", Kluwer Academic Publishers (1989), pp. 28-29.
Markowitz, Judith A., "Using Speech Recognition", Prentice Hall PTR (1996), p. 220-221.
Micca, G., et al., "Three Dimensional DP For Phonetic Lattice Matching", Digital Signal Processing-87, Elsevier Science Publishers B.V., North Holland (1987), pp. 547-551.
Ng, K. & Zue, V., "Phonetic Recognition For Spoken Document Retrieval", Spoken Language Systems Group, MIT Laboratory for Computer Science (1998), 4 pages.
Ng, K. & Zue, V., "Subword Unit Representations For Spoken Document Retrieval", Spoken Language Systems Group, MIT Laboratory for Computer Science (1997), 4 pages.
Ng, K., "Survey Of Approaches To Information Retrieval Of Speech Messages", Spoken Language Systems Group, Laboratory for Computer Science, Massachusetts Institute of Technology (Feb. 16, 1996), pp. 1-34.
Niimi, "Speech Recognition", Information Science Series E-19-3, pp. 135-186 (1979).
Okawa, S. et al., "Automatic Training of Phoneme Dictionary Based on Mutual Information Criterion", IEEE (1994), pp. 241-244.
Rabiner & Juang, "Fundamentals Of Speech Recognition" (1993), pp. 42-50.
Rahim, M. et al., "A Neural Tree Network for Phoneme Classification With Experiments on the Timit Database", IEEE (1992), pp. II-345-II-348.
Sankoff, D., et al, "Time Warps, String Edits, And Macromolecules: The Theory And Practice Of Sequence Comparison", Bell Laboratories and David Sankoff (1983), Ch. One, pp. 1-44; Part Three, pp. 211-214; Ch. Eleven, pp. 311-321, Ch. Sixteen, pp. 359-362.
Sankoff, D., et al., "Time Warps, String Edits, And Macromolecules: The Theory And Practice Of Sequence Comparison", Bell Laboratories and David Sankoff (1983), Ch. One, pp. 1-44; Part Three, pp. 211-214; Ch. Eleven, pp. 311-318, Ch. Sixteen, pp. 359-362.
Schmid, P., et al., "Automatically Generated Word Pronunciations From Phoneme Classifier Output", IEEE (1993), pp. II-223-226.
Skilling, J., "Maximum Entropy And Bayesian Methods", Fundamental Theories of Physics, Kluwer Academic Publishers (1988), pp. 45-52.
Srinivasan, S. & Petkovic, D., "Phonetic Confusion Matrix Based Spoken Document Retrieval", Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Athens, Greece, Jul. 24-28, 2000), pp. 81-87.
Steve Reynolds, David McKelvie, Fergus McInnes: "A Comparative Study of Continuous Speech Recognition using Neural Networks and Hidden Markov Models", 1991, Proceedings IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 369-372, Toronto.
Wang, H., "Retrieval of Mandarin Spoken Documents Based on Syllable Lattice Matching", Pattern Recognition Letters (Jun. 2000), 21, pp. 615-624.
Wechsler, M., "Spoken Document Retrieval Based On Phoneme Recognition", Swiss Federal Institute of Technology (Zurich, 1998), pp. 1-121.
Witbrock, M. J. & Hauptmann, A. G., "Using Words And Phonetic Strings For Efficient Information Retrieval From Imperfectly Transcribed Spoken Documents", School of Computer Science, Carnegie Mellon University (Jul. 23, 1997), 4 pages.
Wold, E. et al., "Content-Based Classification, Search, And Retrieval Of Audio", IEEE Multimedia (Fall 1996), pp. 27-36.
Wright, J. H., et al., "Statistical Models For Topic Identification Using Phoneme Substrings", IEEE (1996), pp. 307-310.
Zobel, J. & Dart, P., "Phonetic String Matching: Lessons From Information Retrieval", Sigir Forum, Assoc. for Computing Machinery (New York, 1996), pp. 166-172.

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7310600B1 (en) * 1999-10-28 2007-12-18 Canon Kabushiki Kaisha Language recognition using a similarity measure
US20060085182A1 (en) * 2002-12-24 2006-04-20 Koninklijke Philips Electronics, N.V. Method and system for augmenting an audio signal
US8433575B2 (en) * 2002-12-24 2013-04-30 Ambx Uk Limited Augmenting an audio signal via extraction of musical features and obtaining of media fragments
US20050001909A1 (en) * 2003-07-02 2005-01-06 Konica Minolta Photo Imaging, Inc. Image taking apparatus and method of adding an annotation to an image
US9106759B2 (en) * 2005-01-05 2015-08-11 Microsoft Technology Licensing, Llc Processing files from a mobile device
US11616820B2 (en) * 2005-01-05 2023-03-28 Microsoft Technology Licensing, Llc Processing files from a mobile device
US10432684B2 (en) 2005-01-05 2019-10-01 Microsoft Technology Licensing, Llc Processing files from a mobile device
US20120271915A1 (en) * 2005-01-05 2012-10-25 Microsoft Corporation Processing files from a mobile device
US20060206327A1 (en) * 2005-02-21 2006-09-14 Marcus Hennecke Voice-controlled data system
US9153233B2 (en) * 2005-02-21 2015-10-06 Harman Becker Automotive Systems Gmbh Voice-controlled selection of media files utilizing phonetic data
US20070150273A1 (en) * 2005-12-28 2007-06-28 Hiroki Yamamoto Information retrieval apparatus and method
US7756710B2 (en) * 2006-07-13 2010-07-13 Sri International Method and apparatus for error correction in speech recognition applications
US20100125458A1 (en) * 2006-07-13 2010-05-20 Sri International Method and apparatus for error correction in speech recognition applications
US8831946B2 (en) * 2007-07-23 2014-09-09 Nuance Communications, Inc. Method and system of indexing speech data
US20090030680A1 (en) * 2007-07-23 2009-01-29 Jonathan Joseph Mamou Method and System of Indexing Speech Data
US20090030894A1 (en) * 2007-07-23 2009-01-29 International Business Machines Corporation Spoken Document Retrieval using Multiple Speech Transcription Indices
US9405823B2 (en) 2007-07-23 2016-08-02 Nuance Communications, Inc. Spoken document retrieval using multiple speech transcription indices
US8380505B2 (en) * 2007-10-24 2013-02-19 Nuance Communications, Inc. System for recognizing speech for searching a database
US20090112593A1 (en) * 2007-10-24 2009-04-30 Harman Becker Automotive Systems Gmbh System for recognizing speech for searching a database
US20130166303A1 (en) * 2009-11-13 2013-06-27 Adobe Systems Incorporated Accessing media data using metadata repository
US8825488B2 (en) * 2010-04-12 2014-09-02 Adobe Systems Incorporated Method and apparatus for time synchronized script metadata
US8447604B1 (en) 2010-04-12 2013-05-21 Adobe Systems Incorporated Method and apparatus for processing scripts and related data
US20130124203A1 (en) * 2010-04-12 2013-05-16 II Jerry R. Scoggins Aligning Scripts To Dialogues For Unmatched Portions Based On Matched Portions
US20130124212A1 (en) * 2010-04-12 2013-05-16 II Jerry R. Scoggins Method and Apparatus for Time Synchronized Script Metadata
US8825489B2 (en) * 2010-04-12 2014-09-02 Adobe Systems Incorporated Method and apparatus for interpolating script data
US20130124213A1 (en) * 2010-04-12 2013-05-16 II Jerry R. Scoggins Method and Apparatus for Interpolating Script Data
US9191639B2 (en) 2010-04-12 2015-11-17 Adobe Systems Incorporated Method and apparatus for generating video descriptions
US9066049B2 (en) * 2010-04-12 2015-06-23 Adobe Systems Incorporated Method and apparatus for processing scripts
US9594959B2 (en) 2010-11-08 2017-03-14 Sony Corporation Videolens media engine
US8966515B2 (en) 2010-11-08 2015-02-24 Sony Corporation Adaptable videolens media engine
US9734407B2 (en) 2010-11-08 2017-08-15 Sony Corporation Videolens media engine
US8959071B2 (en) 2010-11-08 2015-02-17 Sony Corporation Videolens media system for feature selection
US8971651B2 (en) 2010-11-08 2015-03-03 Sony Corporation Videolens media engine
US20120245936A1 (en) * 2011-03-25 2012-09-27 Bryan Treglia Device to Capture and Temporally Synchronize Aspects of a Conversation and Method and System Thereof
US8938393B2 (en) * 2011-06-28 2015-01-20 Sony Corporation Extended videolens media engine for audio recognition
US20130006625A1 (en) * 2011-06-28 2013-01-03 Sony Corporation Extended videolens media engine for audio recognition
US9373328B2 (en) * 2014-04-21 2016-06-21 International Business Machines Corporation Speech retrieval method, speech retrieval apparatus, and program for speech retrieval apparatus
US9378736B2 (en) * 2014-04-21 2016-06-28 International Business Machines Corporation Speech retrieval method, speech retrieval apparatus, and program for speech retrieval apparatus
US20150310860A1 (en) * 2014-04-21 2015-10-29 International Business Machines Corporation Speech retrieval method, speech retrieval apparatus, and program for speech retrieval apparatus
US9626957B2 (en) * 2014-04-21 2017-04-18 Sinoeast Concept Limited Speech retrieval method, speech retrieval apparatus, and program for speech retrieval apparatus
US9626958B2 (en) * 2014-04-21 2017-04-18 Sinoeast Concept Limited Speech retrieval method, speech retrieval apparatus, and program for speech retrieval apparatus
US20150302848A1 (en) * 2014-04-21 2015-10-22 International Business Machines Corporation Speech retrieval method, speech retrieval apparatus, and program for speech retrieval apparatus
US10572538B2 (en) * 2015-04-28 2020-02-25 Kabushiki Kaisha Toshiba Lattice finalization device, pattern recognition device, lattice finalization method, and computer program product
US10452661B2 (en) 2015-06-18 2019-10-22 Microsoft Technology Licensing, Llc Automated database schema annotation

Also Published As

Publication number Publication date
WO2002027546A8 (en) 2002-08-15
EP1327206A2 (en) 2003-07-16
CN1457476A (en) 2003-11-19
CN1227613C (en) 2005-11-16
KR20030072327A (en) 2003-09-13
WO2002027546A3 (en) 2002-06-20
GB0023930D0 (en) 2000-11-15
AU2001290136A1 (en) 2002-04-08
KR100612169B1 (en) 2006-08-14
JP2004510256A (en) 2004-04-02
US20030177108A1 (en) 2003-09-18
WO2002027546A2 (en) 2002-04-04

Similar Documents

Publication Publication Date Title
US7240003B2 (en) Database annotation and retrieval
US6990448B2 (en) Database annotation and retrieval including phoneme data
US7590605B2 (en) Lattice matching
KR101255405B1 (en) Indexing and searching speech with text meta-data
US6873993B2 (en) Indexing method and apparatus
US8694317B2 (en) Methods and apparatus relating to searching of spoken audio data
CN101382937B (en) Multimedia resource processing method based on speech recognition and on-line teaching system thereof
US6172675B1 (en) Indirect manipulation of data using temporally related data, with particular application to manipulation of audio or audiovisual data
WO1998025216A9 (en) Indirect manipulation of data using temporally related data, with particular application to manipulation of audio or audiovisual data
US7177800B2 (en) Method and device for the processing of speech information
US20130080384A1 (en) Systems and methods for extracting and processing intelligent structured data from media files
JP5296598B2 (en) Voice information extraction device
JP2005257954A (en) Speech retrieval apparatus, speech retrieval method, and speech retrieval program
Robert-Ribes et al., "Automatic generation of hyperlinks between audio and transcript"
EP1688915A1 (en) Methods and apparatus relating to searching of spoken audio data
JP3903738B2 (en) Information recording / retrieval apparatus, method, program, and recording medium
EP1688914A1 (en) Method and apparatus relating to searching of spoken audio data
JP6555613B2 (en) Recognition error correction apparatus and program, and caption generation system

Legal Events

Date Code Title Description
AS Assignment

Owner name: CANON KABUSHIKI KAISHA, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CHARLESWORTH, JASON PETER ANDREW;GARNER, PHILIP NEIL;REEL/FRAME:014113/0321

Effective date: 20011108

AS Assignment

Owner name: CANON KABUSHIKI KAISHA, JAPAN

Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE ASSIGNEE ADDRESS. PREVIOUSLY RECORDED AT REEL 014113 FRAME 0321;ASSIGNORS:CHARLESWORTH, JASON PETER;GARNER, PHILIP NEIL;REEL/FRAME:016905/0108

Effective date: 20011108

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

CC Certificate of correction
FPAY Fee payment

Year of fee payment: 4

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20150703