US6640225B1 - Search method using an index file and an apparatus therefor - Google Patents

Publication number
US6640225B1
Authority
US
United States
Prior art keywords
position data
file
key character
character string
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related, expires
Application number
US09/676,803
Inventor
Nobuaki Takishita
Takako Suzuki
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Assignment of assignors' interest (see document for details). Assignors: SUZUKI, TAKAKO; TAKISHITA, NOBUAKI
Application granted
Publication of US6640225B1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10 TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10S TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00 Data processing: database and file management or data structures
    • Y10S707/99931 Database or file accessing
    • Y10S707/99933 Query processing, i.e. searching
    • Y10S707/99935 Query augmenting and refining, e.g. inexact access

Definitions

  • the present invention relates to a search method that uses an index file that is composed of a key file, including key character strings, and a position data file, including position data that correspond to key character strings, and to an apparatus used for the search method.
  • an index is prepared for character strings that appear in documents that are sought, and then an all-sentences search is conducted, based on the index, to examine all available documents for the desired character string, or document.
  • the importance of such a search is acknowledged; however, with time, the amount of data searched for increases, and since the search index is thereby expanded, the required hard disk space may grow until it is almost prohibitively large.
  • all character strings for which search requests are submitted must be included, and a corresponding index be prepared.
  • a stop word system is employed. According to this method, a list is prepared of words such as THIS, A and THE in English, for example, that seem to be most frequently used, and these words are not included in an index file.
  • a compressing scheme is used to reduce the size of the index information.
  • Although the stop word method and the compression method can reduce the size of an index file, the following shortcomings make them less effective.
  • Since stop words are inherent to a language, stop words that are unique to a pertinent language must be selected.
  • Although a stop word may be included in a string for which a search request is submitted, data relative to the stop word is always deleted, and can not be searched for.
  • Although the size of the index can be reduced by data compression, the index information that is not actually required for a search can not be deleted.
  • a search method that uses an index file is a method for using an index file consisting of a key file, which includes key character strings, and a position data file, which includes position data corresponding to the key character strings.
  • a position data delete flag is correlated with a specific key character string, and position data that correspond to the specific key character string are deleted from the position data file.
  • the position data is deleted when the size of the position data corresponding to the specific key character string reaches a specific ratio of the position data file size.
  • the position data is also deleted when the size of the position data reaches a specific value that corresponds to the specific key character string.
  • An apparatus used for the search method of the present invention comprises: a new difference index preparation unit for preparing a new difference index file using a newly registered document; an index merge unit for merging a conventional index file with the new difference index file prepared by the new difference index preparation unit, for determining whether the above described position data file is to be deleted, and for preparing a new index file; and a search unit for beginning a search based on the new index file generated by the index merge unit.
  • the position data delete flag is correlated with a specific key character string, and the position data that corresponds to the specific key character string is deleted from the position data file. Therefore, while a request for an all-sentence search is satisfied, the position data for a character string that can not actually be employed for a search can be deleted, so that a considerable reduction in the size of an index file can be realized.
  • the structure of a key file is constituted by a key character string, the location of the key character string in the position data file, the size of the position data, and the position data delete flag.
  • a group of key character strings is specified in advance, for which position data are not to be deleted, even though it has been determined that the position data are to be deleted.
  • a search is to be performed by a method that uses a search key character string consisting of one word, a method that uses one search key character string when an index is prepared using the N-gram method, or a method that uses a search key character string consisting of a plurality of words. All of these methods can appropriately carry out the present invention.
  • FIG. 1 is a diagram showing the structure of an index file used to carry out a search method according to the present invention.
  • FIG. 2 is a diagram showing the structure of a key file in FIG. 1.
  • FIG. 3 is a graph showing the relationship between the size of an overall index file and the amount of position data for each key character string.
  • FIG. 4 is a block diagram illustrating the arrangement of an apparatus used for the search method that employs the index file of the present invention.
  • FIG. 5 is a flowchart for explaining the position data deleting process performed by an index merging unit according to the present invention.
  • FIG. 6 is a diagram for explaining an example for the merging of key character strings according to the present invention.
  • FIG. 7 is a flowchart showing a search process performed when a search character string consists of one word.
  • FIG. 8 is a flowchart showing a search process performed when an index is prepared by the N-gram method and there is only one search character string.
  • FIG. 9 is a flowchart showing a search process performed when a search character string consists of a plurality of words.
  • an index file that consists of a key file, which includes key character strings, and a position data file, which includes position data that correspond to the key character strings.
  • FIG. 1 is a diagram showing the structure of an example index file that is used to carry out the search method of the present invention.
  • an index file 1 consists of a key file 2 and a position data file 3.
  • the key file 2 includes keywords, such as JAVA and SQL, and information associated with these keywords.
  • the position data file 3 includes all the locations recorded for each keyword in the key file 2.
  • the keyword “JAVA” occupies the 001st character position in document 1 and the 001st and 100th character positions in document 3
  • the keyword “SQL” occupies the 010th character position in document 2.
  • FIG. 2 is a diagram showing an example structure for the key file 2 in FIG. 1.
  • the key file 2 consists of keywords 4, position data file locations 5, position data sizes 6, and position data delete flags 7.
  • the key file 2 holds data describing pertinent portions of the position data file 3 that in turn hold information describing the documents in which the keywords 4 can be found and the positions in the documents that are occupied by the keywords 4 .
  • a location 500 in the position data file holds 4000 entries and the position data delete flag is set to “NO,” which indicates that this keyword is to be employed for a search.
  • the index file 1 used for the search consists of the key file 2 and the position data file 3 described above.
  • the size of the key file 2 changes very little, but the position data file 3 grows in step with the index file 1, so that the ratio of the position data file 3 to the index file 1 is constantly increasing.
  • when the size of the index file 1 reaches a specific level, e.g., 20 MB or 100 MB, efficient management of the index file 1 may no longer be possible; and even where it is possible, so many keywords are entered that meaningless keywords are processed during a search. Therefore, in this invention, the delete flag is set for such keywords, so that the locations in the position data file 3 that correspond to those keywords are deleted and the size of the index file 1 is reduced. This method will now be described.
  • FIG. 3 is a graph showing the relationship between the overall size of an index file 1 and the amount of position data acquired for each keyword 4.
  • A: the size of the index file at which contents of the position data file 3 are to be deleted
  • B: the amount of position data held for each keyword that is to be deleted at that time
  • C: the amount of position data held for a keyword that is to be deleted because the data can not be employed for a search
  • the search for position data file 3 contents continues, and no contents are deleted until the size of the index file reaches A, 20 MB, for example.
  • the horizontal axis represents the total size of the index file and includes a portion that is to be deleted
  • when the size of the index file reaches A, only position data whose amount is equal to or greater than B are deleted, e.g., only the position data for character strings that occupy 0.1% or more of the index file (straight lines a and b in FIG. 3). This is the first time the deletion of position data is performed.
  • the deletion of position data is performed again when the size of the index file exceeds A and the amount of position data that can not actually be used in a search, because of the system resources or because too many hits were recorded during the course of the search, reaches the amount represented by C (straight line c in FIG. 3).
  • FIG. 4 is a block diagram illustrating the arrangement of an example apparatus used for the search method that employs the index file of this invention.
  • the apparatus comprises: a new difference index preparation unit 13, for preparing a new difference index file 14 from newly registered documents 12; an index merging unit 15, for merging the new difference index file 14 prepared by the new difference index preparation unit 13 with a conventional index file 11 to prepare a new index file 16, and for determining whether the above described position data file should be deleted; and a search unit 17, for executing an all-sentences search based on the new index file 16 that is prepared by the index merging unit 15.
  • the conventional index file 11 is the most recently prepared index file. Normally, for an all-sentences search, data is added to the index each time the number of original documents increases.
  • the newly registered document 12 comprises the most recently registered document group.
  • the new difference index preparation unit 13 defines the newly registered difference as the new difference index file 14, and assembles the file 14 using a structure corresponding to that of the index, so that the new difference index file 14 and the conventional index file 11 can be merged at the next step.
  • the index merging unit 15 prepares a new index file 16 by combining the conventional index file 11 and the new difference index file 14 that is prepared by the new difference index preparation unit 13, and determines whether the position data file of the present invention should be deleted.
  • the new index file 16, which is output by the index merging unit 15, is the one that is thereafter used to perform a search. Later, when the search unit 17 receives a search character string entered by a user, it searches the new index file 16 to find a document in which the pertinent character string is included.
  • FIG. 5 is a flowchart for explaining the position data deletion process performed by the index merging unit 15 of the present invention.
  • FIG. 6 is a diagram for explaining one example of the merging of keywords according to this invention. Referring to FIGS. 5 and 6, first, at step 401 a check is performed to determine whether, for the first time, the sum of the size of a conventional index file and the size of a new difference index file has exceeded size A. When the sum of the files has exceeded size A for the first time, process (1) at steps 402 to 406 is performed.
  • process (2) at steps 407 to 412 is performed.
  • the position data deletion process including processes (1) and (2)
  • the deletion process of the present invention can be carried out by performing only one of the processes (1) and (2).
  • program control moves to step 402 .
  • keywords are read in order beginning with key files in a conventional index file 501 and a new difference index file 502 in FIG. 6, i.e., in the order employed in a new index file 503 provided by merging the conventional index file 501 and the new difference index file 502 .
  • a check is performed to determine whether the sum of the position data held by a conventional index file and a new difference file that correspond to the keyword that was read is equal to or greater than B. If the sum of the position data held by these files is less than B, program control advances to step 404.
  • At step 404, to add the keyword to the new index file, the data are written to the key file and the position data file.
  • At step 405, data are not added to the position data file of the new index file, only the entry for the keyword is added to the key file of the new index file, and a corresponding position data delete flag is set to “YES.”
  • At step 406, if the keyword that is currently being processed is the last keyword, the merging procedure performed in process (1) is terminated. If the currently processed keyword is not the last one, the process beginning at step 402 is repeated.
  • program control moves to step 407 .
  • At step 407, keywords are read in order beginning with the key files in the conventional index file 501 and in the new difference index file 502 in FIG. 6, i.e., in the order of their arrangement in the new index file 503 that is obtained by merging the conventional index file 501 and the new difference index file 502.
  • At step 408, a check is performed to determine whether the sum of the position data held by a conventional index file and that held by a new difference file that correspond to the keyword that was read is equal to or greater than C.
  • If the sum of the position data held by these files is less than C, program control advances to step 409. If the sum of the position data held by these files is equal to or greater than C, program control advances to step 411.
  • At step 409, the position data delete flag of the key file in the index file before it was merged is examined to determine whether it is set to “YES.” If the flag is not set to “YES,” program control goes to step 410. If the flag is set to “YES,” program control moves to step 411.
  • At step 410, to add the keyword to the new index file, the data are written to the key file and the position data file.
  • At step 411, data are not added to the position data file of the new index file, only the entry for the keyword is added to the key file of the new index file, and a corresponding position data delete flag is set to “YES.”
  • At step 412, if the keyword that is currently being processed is the last keyword, the merging procedure in process (2) is terminated. If the currently processed keyword is not the last one, the process beginning at step 407 is repeated.
  • According to the search method that uses the index file of this invention, the structure of the search unit 17 is not specifically prescribed; any conventional structure can be employed. The present invention can be employed most effectively when a search is performed based on one of the following three types of search character string. The three typical search character strings will now be described.
  • FIG. 7 is a flowchart showing the search process when a search character string consists of one word.
  • a search character string is accepted, and at step 602 a position data delete flag is examined to determine whether the position data for the search character string have been deleted. If the position data have not been deleted, program control advances to step 603. If the position data have been deleted, program control goes to step 604.
  • the character string that corresponds to the search character string is found in the key file, and the contents of a pertinent position data file are returned to a user.
  • a message is returned to the user to request a search using another search character string because too many hits were found. The search process is thereafter terminated.
  • FIG. 8 is a flowchart showing the search process when an index is prepared using the N-gram method and only one search character string exists.
  • a search character string is accepted, and at step 702 a position data delete flag is examined to determine whether the position data for the search character string have been deleted. If the position data have not been deleted, program control advances to step 703. If the position data have been deleted, program control goes to step 704.
  • a search is conducted for the character string that corresponds to the search character string, and the contents of a pertinent position data file are returned to a user.
  • a search is conducted by using a character string that is shifted one character from the search character string for which the position data have been deleted. For example, when the search character string is “Nippon Ginko,” at step 703 the search process is initiated with “Nippon” and “Ginko”; however, at step 704 the search process is initiated with “Hon Gin” and “Ginko” if the position data for “Nippon” have been deleted.
  • the position data delete flag for the search character string “Hon Gin” is examined to determine whether the position data for “Hon Gin” have also been deleted. If the position data have been deleted, program control advances to step 706. If the position data have not been deleted, program control goes to step 707.
  • a message is returned to a user to request that a search be performed using another search character string because too many hits were found.
  • the results obtained by the search for which “Hon Gin” and “Ginko” were used are transmitted to the user. The search process is thereafter terminated.
  • FIG. 9 is a flowchart for explaining the search process performed when a search character string consists of a plurality of words.
  • a search character string is accepted, and at step 802 a position data delete flag for each word of the string is examined to determine whether the position data for each word have been deleted. If the position data have not been deleted, program control advances to step 803. If the position data have been deleted, program control goes to step 804.
  • the character strings that correspond to the individual search character strings are obtained from the key file, and the contents of the pertinent position data files are returned to a user.
  • a check is performed to determine whether the position data for the entire search character string have been deleted, and if the position data for the entire string have been deleted, program control advances to step 805. If the position data for the entire string have not been deleted, program control advances to step 806.
  • a message is returned to the user to request that a search be made using another character string because too many hits have been found.
  • the search results are returned to the user, with the observation that no computations are performed for a search character string for which the position data have been deleted. At the same time, for the character string for which the position data have been deleted, a message is returned to the user indicating that too many hits were made.
  • search results obtained by using only “SQL” are returned to the user, as is a message indicating too many hits have been found for “FORUM.” The search process is thereafter terminated.
  • an entire character string is deleted for which the size of the position data that is reached is equal to or greater than B or C.
  • a Japanese character string may be included as a target to be deleted. If the pertinent Japanese string is not to be deleted, it can be so designated that a character string consisting of only Chinese characters, for example, should not be deleted, and thus, position data can be maintained for a character string group for which position data are not supposed to be deleted.
  • Values A, B and C in the above embodiment are not constant values, and arbitrary values can be selected as needed in accordance with the configuration of a system and the performance that is demanded.
  • a position data delete flag is correlated with a predetermined key character string, and position data corresponding to the key character string are deleted from a position data file. Specifically, when the size of the overall index file reaches A for the first time, a position data delete flag is set that corresponds to a key character string for which the amount of position data is equal to or greater than B, and the position data are deleted from the position data file. And/or when the size of the entire index file has exceeded A, a position data delete flag is set that corresponds to a key character string for which the amount of position data is equal to or greater than C, and the position data are deleted from the position data file. Therefore, while a request for an all-sentences search is satisfied, position data for a character string that can not be used for the actual search can be deleted, and the size of an index file can be considerably reduced.
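
The search flows described above for FIG. 8 (N-gram index) and FIG. 9 (several words) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the use of document-id sets in place of a position data file, the combination of words by intersection, and the handling of a deleted final gram are all assumptions made for the example.

```python
TOO_MANY_HITS = "Too many hits; please search with another character string."

def pick_grams(query, n, deleted):
    """FIG. 8 flow: split the query into n-grams; if a gram's position data
    were deleted, retry with the gram shifted one character to the right.
    Returns the grams to search with, or None when the shifted gram is
    unavailable too (assumed to mean "report too many hits")."""
    chosen = []
    for i in range(0, len(query) - n + 1, n):
        gram = query[i:i + n]
        if gram in deleted:
            shifted = query[i + 1:i + 1 + n]
            if len(shifted) < n or shifted in deleted:
                return None
            gram = shifted
        chosen.append(gram)
    return chosen

def search_words(words, index, deleted):
    """FIG. 9 flow: search with the words that still have position data.
    index maps word -> set of document ids (hypothetical stand-in).
    Returns (document ids or None, list of messages for the user)."""
    usable = [w for w in words if w not in deleted]
    skipped = [w for w in words if w in deleted]
    if not usable:
        return None, [TOO_MANY_HITS]          # every word was deleted
    # Assumption: words are combined with AND (set intersection).
    hits = set.intersection(*(index.get(w, set()) for w in usable))
    notes = [f"Too many hits for {w!r}; it was not used." for w in skipped]
    return hits, notes
```

With `deleted = {"AB"}`, `pick_grams("ABCD", 2, deleted)` selects `"BC"` and `"CD"`, which mirrors the text's replacement of “Nippon” by “Hon Gin”. For the words SQL and FORUM with FORUM deleted, `search_words` returns the SQL results together with a too-many-hits note for FORUM.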

Abstract

In a search method that uses an index file consisting of a key file that includes key character strings and a position data file that includes position data corresponding to the key character strings, a position data delete flag is correlated with a specific key character string, and position data that correspond to the specific key character string are deleted from the position data file. Here, A denotes the size of the index at which the contents of the position data file are to be deleted, B denotes the amount of position data for each keyword that is to be deleted at that time, and C denotes the amount of position data for a keyword that is to be deleted because the data can not be employed for the search. When (1) the size of the overall index file reaches A for the first time, a position data delete flag is set that corresponds to a key character string for which the amount of the position data is equal to or greater than B, and the position data are deleted from the position data file; and/or when (2) the size of the entire index file exceeds A, a position data delete flag is set that corresponds to a key character string for which the amount of position data is equal to or greater than C, and the position data are deleted from the position data file.

Description

FIELD OF THE INVENTION
The present invention relates to a search method that uses an index file that is composed of a key file, including key character strings, and a position data file, including position data that correspond to key character strings, and to an apparatus used for the search method.
BACKGROUND OF THE INVENTION
In order to quickly search for in-house document data or a home page on the Internet, conventionally, an index is prepared for character strings that appear in documents that are sought, and then an all-sentences search is conducted, based on the index, to examine all available documents for the desired character string, or document. The importance of such a search is acknowledged; however, with time, the amount of data searched for increases, and since the search index is thereby expanded, the required hard disk space may grow until it is almost prohibitively large. Further, for an all-sentences search, all character strings for which search requests are submitted must be included, and a corresponding index be prepared. Thus, again, the size of the index increases, and as it does, so too does the size of the obtained results; and this makes it difficult for a user to find a desired document. Furthermore, unsuccessful searches, for character strings for which positive results are not obtained, may occur due to an existing system/resources relationship.
The following two conventional methods are well known techniques employed to reduce the size of index files, and thus resolve the above problems. For the first method, a stop word system is employed. According to this method, a list is prepared of words such as THIS, A and THE in English, for example, that seem to be most frequently used, and these words are not included in an index file. For the second method, a compressing scheme is used to reduce the size of the index information.
Although the stop word method and the compression method can reduce the size of an index file, the following shortcomings make them less effective.
For the stop word method:
Although the amount of information included for a frequently used character string can be reduced, once an index for FORUM has been prepared, information can not be deleted for a word, such as FORUM or APPENDED, that appears even more frequently, is inherent to that index, and is not searched for using the pertinent index.
Since stop words are inherent to a language, stop words that are unique to a pertinent language must be selected.
Although a stop word may be included in a string for which a search request is submitted, data relative to the stop word is always deleted, and can not be searched for.
If the number of words that can be handled by a system is set, and there is an increase in the size of an index that causes it to exceed the limit, even though a large number of search results may be obtained, a search will be interrupted and the system will be adversely affected, because with the stop word method, index information that is not needed for a search can not be deleted.
For the compression method:
Although the size of the index can be reduced by data compression, the index information that is not actually required for a search can not be deleted.
A technique for eliminating inefficient searches and useless search results is disclosed in Japanese Unexamined Patent Publication No. Hei 10-171692. However, this technique involves the deletion, from a search index, of very common words that are located at the ends of the index terms, and provides an approach that differs from the method for reducing the size of a position data file that constitutes a problem when an index for an all-sentences search is prepared.
To resolve the above shortcomings, it is one object of the present invention to provide a search method using an index file, whereby the size of an index file can be considerably reduced, and an apparatus that is to be used for the search method.
SUMMARY OF THE INVENTION
A search method according to the present invention that uses an index file is a method for using an index file consisting of a key file, which includes key character strings, and a position data file, which includes position data corresponding to the key character strings. According to this search method, a position data delete flag is correlated with a specific key character string, and position data that correspond to the specific key character string are deleted from the position data file. In one preferred aspect, the position data is deleted when the size of the position data corresponding to the specific key character string reaches a specific ratio of the position data file size. Furthermore, the position data is also deleted when the size of the position data reaches a specific value that corresponds to the specific key character string.
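The two deletion criteria named in this paragraph, a relative one (ratio) and an absolute one (specific value), can be illustrated with a small predicate. This is only a sketch: the function name, its parameters and the byte-based units are assumptions, and the 20 MB and 0.1% figures are the illustrative values mentioned elsewhere in this document.

```python
def should_delete(posting_size, total_size, ratio=None, max_size=None):
    """Return True when a key's position data is large enough to be dropped.

    ratio:    relative test, posting_size / total_size >= ratio
    max_size: absolute test, posting_size >= max_size
    Either test may be disabled by leaving it as None.
    """
    if ratio is not None and total_size > 0 and posting_size / total_size >= ratio:
        return True
    return max_size is not None and posting_size >= max_size

MB = 1024 * 1024
# Illustrative numbers only: against a 20 MB file, a 0.1% ratio puts the
# cut-off at roughly 20 KB of position data for a single key.
big = should_delete(30 * 1024, 20 * MB, ratio=0.001)     # 30 KB of position data
small = should_delete(10 * 1024, 20 * MB, ratio=0.001)   # 10 KB of position data
```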
An apparatus used for the search method of the present invention comprises: a new difference index preparation unit for preparing a new difference index file using a newly registered document; an index merge unit for merging a conventional index file with the new difference index file prepared by the new difference index preparation unit, for determining whether the above described position data file is to be deleted, and for preparing a new index file; and a search unit for beginning a search based on the new index file generated by the index merge unit.
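As a rough sketch of what the index merge unit might do, the following merges a previous index with a difference index and applies the A, B and C thresholds described in the Abstract. The `Entry` class, the `merge` signature, the `crossed_before` bookkeeping and the measurement of size as a count of retained position entries are assumptions made for illustration; the patent text speaks of file sizes and does not prescribe this layout.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    """One key's record: its position data and its delete flag (hypothetical layout)."""
    positions: list = field(default_factory=list)   # (doc_id, char_offset) pairs
    deleted: bool = False                            # position data delete flag

def index_size(index):
    # Assumption: size is the number of retained position entries, not bytes.
    return sum(len(e.positions) for e in index.values())

def merge(old, diff, A, B, C, crossed_before):
    """Merge old and diff; return (new_index, crossed).

    crossed_before: True if an earlier merge already exceeded size A.
    """
    total = index_size(old) + index_size(diff)
    if total <= A:
        limit = None        # below A: no position data are deleted
    elif not crossed_before:
        limit = B           # process (1): size exceeds A for the first time
    else:
        limit = C           # process (2): later merges above A
    new = {}
    for key in sorted(set(old) | set(diff)):         # keys in merged order
        o = old.get(key, Entry())
        d = diff.get(key, Entry())
        merged = o.positions + d.positions
        too_big = limit is not None and len(merged) >= limit
        if o.deleted or too_big:
            # Keep only the key entry and set the delete flag.
            new[key] = Entry(positions=[], deleted=True)
        else:
            new[key] = Entry(positions=merged, deleted=False)
    return new, crossed_before or total > A
```

A key whose flag was already set in the previous index stays flagged, which corresponds to the check of the pre-merge delete flag in process (2).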
According to the present invention, the position data delete flag is correlated with a specific key character string, and the position data that corresponds to the specific key character string is deleted from the position data file. Therefore, while a request for an all-sentence search is satisfied, the position data for a character string that can not actually be employed for a search can be deleted, so that a considerable reduction in the size of an index file can be realized.
In another preferred aspect, the structure of a key file is constituted by a key character string, the location of the key character string in the position data file, the size of the position data, and the position data delete flag. A group of key character strings is specified in advance, for which position data are not to be deleted, even when it would otherwise be determined that the position data are to be deleted. When the position data delete flag, corresponding to a specific key character string, is set, position data for the key character string is not added to the position data file for the index file. When the position data delete flag, corresponding to a specific key character string, is not set, position data for the key character string is added to the position data file for the index file. A search is to be performed by a method that uses a search key character string consisting of one word, a method that uses one search key character string when an index is prepared using the N-gram method, or a method that uses a search key character string consisting of a plurality of words. All of these methods can appropriately carry out the present invention.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention will now be described in greater detail with reference to the appended figures wherein:
FIG. 1 is a diagram showing the structure of an index file used to carry out a search method according to the present invention.
FIG. 2 is a diagram showing the structure of a key file in FIG. 1.
FIG. 3 is a graph showing the relationship between the size of an overall index file and the amount of position data for each key character string.
FIG. 4 is a block diagram illustrating the arrangement of an apparatus used for the search method that employs the index file of the present invention.
FIG. 5 is a flowchart for explaining the position data deleting process performed by an index merging unit according to the present invention.
FIG. 6 is a diagram for explaining an example for the merging of key character strings according to the present invention.
FIG. 7 is a flowchart showing a search process performed when a search character string consists of one word.
FIG. 8 is a flowchart showing a search process performed when an index is prepared by the N-gram method and there is only one search character string.
FIG. 9 is a flowchart showing a search process performed when a search character string consists of a plurality of words.
DETAILED DESCRIPTION OF THE INVENTION
In this invention, a search method is employed whereby an index file is used that consists of a key file, which includes key character strings, and a position data file, which includes position data that correspond to the key character strings. First, the structures of the index file and the key file will be described.
FIG. 1 is a diagram showing the structure of an example index file that is used to carry out the search method of the present invention. In FIG. 1, an index file 1 consists of a key file 2 and a position data file 3. The key file 2 includes keywords, such as JAVA and SQL, and information associated with these keywords. The position data file 3 includes all the locations recorded for each keyword in the key file 2. For example, the keyword “JAVA” occupies the 001st character position in document 1 and the 001st and 100th character positions in document 3, and the keyword “SQL” occupies the 010th character position in document 2.
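By way of illustration only, the position data of the FIG. 1 example can be sketched as a small in-memory mapping. The names `position_data`, `positions_for` and `documents_for`, and the (document number, character position) tuple layout, are assumptions made for this sketch and are not prescribed by the invention.

```python
# Illustrative sketch of the FIG. 1 example. Each keyword maps to the
# (document number, character position) pairs recorded for it.
# The in-memory layout is an assumption, not the on-disk format.
position_data = {
    "JAVA": [(1, 1), (3, 1), (3, 100)],
    "SQL": [(2, 10)],
}


def positions_for(keyword):
    """Return the recorded (document, position) pairs, or an empty list."""
    return position_data.get(keyword, [])


def documents_for(keyword):
    """Return the sorted document numbers in which the keyword occurs."""
    return sorted({doc for doc, _ in positions_for(keyword)})
```

In this sketch a keyword without position data simply yields an empty list; how the invention distinguishes a deleted keyword from an unknown one is the role of the position data delete flag in the key file.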
FIG. 2 is a diagram showing an example structure for the key file 2 in FIG. 1. In FIG. 2, the key file 2 consists of keywords 4, position data file locations 5, position data sizes 6, and position data delete flags 7. The key file 2 holds data describing pertinent portions of the position data file 3 that in turn hold information describing the documents in which the keywords 4 can be found and the positions in the documents that are occupied by the keywords 4. As can be determined from the example structure, for the keyword “JAVA,” for example, a location 500 in the position data file holds 4000 entries and the position data delete flag is set to “NO,” which indicates that this keyword is to be employed for a search. In like manner, for the keyword “FORUM,” since the position data delete flag is set to “YES,” it is evident that for this keyword no position data is available that can be used during a search, while from the “N/A” entry it can be inferred that no storage location has been allocated for the position data. Note that in the key file 2 the keywords 4 are arranged in alphabetical order.
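The four fields of the key file 2 can be sketched, under stated assumptions, as a record type with a binary search over the alphabetically ordered keywords. The names `KeyEntry`, `key_file` and `find_entry` are hypothetical, and only the “JAVA” values (location 500, size 4000) come from the example above; the “FORUM” size and the “SQL” row are invented for illustration.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KeyEntry:
    keyword: str
    location: Optional[int]  # offset in the position data file; None for "N/A"
    size: int                # size of the position data
    deleted: bool            # position data delete flag ("YES" maps to True)


# Entries are kept in alphabetical order, as in FIG. 2.
# Only the JAVA values are taken from the text; the rest are assumptions.
key_file: List[KeyEntry] = [
    KeyEntry("FORUM", None, 0, True),
    KeyEntry("JAVA", 500, 4000, False),
    KeyEntry("SQL", 4500, 120, False),
]


def find_entry(entries, keyword):
    """Binary-search the sorted key file; return the entry or None."""
    keys = [entry.keyword for entry in entries]
    i = bisect_left(keys, keyword)
    if i < len(keys) and keys[i] == keyword:
        return entries[i]
    return None
```

Because the keywords are sorted, a lookup needs only a logarithmic number of comparisons, which is one plausible reason for the alphabetical arrangement noted above.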
According to this invention, the index file 1 used for the search consists of the key file 2 and the position data file 3 described above. As the size of the index file 1 increases, the size of the key file 2 changes very little, but the position data file 3 grows in step with the index file 1, so that the ratio of the position data file 3 to the index file 1 is constantly increasing. When the size of the index file 1 reaches a specific level, e.g., 20 MB or 100 MB, efficient management of the index file 1 may no longer be possible; and even where efficient management is still possible, so many keywords have been entered that meaningless keywords are processed during a search. Therefore, in this invention, the delete flag is set for such keywords, so that the portions of the position data file 3 that correspond to the keywords are deleted and the size of the index file 1 is reduced. This method will now be described.
FIG. 3 is a graph showing the relationship between the overall size of an index file 1 and the amount of position data acquired for each keyword 4. In FIG. 3, A denotes the size of the index file at which deletion of contents of the position data file 3 begins, B denotes the amount of position data held for a keyword that is to be deleted at that time, and C denotes the amount of position data held for a keyword that is to be deleted because the data can not be employed for a search. There are two points at which the position data stored in the position data file 3 are deleted. First, there is the time at which the overall size of the index file reaches A. Since a request to search for all character strings is issued for an all-sentences search, the contents of the position data file 3 continue to be searched, and no contents are deleted until the size of the index file reaches A, 20 MB, for example. According to the graph, wherein the horizontal axis represents the total size of the index file and includes a portion that is to be deleted, when the size of the index file reaches A, only the position data whose amount is equal to or greater than B, e.g., only the position data for character strings that occupy space at a ratio to the size of the index file of 0.1% or greater, are deleted (straight lines a and b in FIG. 3). This is the first time the deletion of position data is performed. Then, the deletion of position data is again performed when the size of the index file exceeds A, and the amount of position data for a keyword that can not actually be used in a search, because of the system resources or because too many hits were recorded during the course of the search, reaches the amount represented by C (straight line c in FIG. 3).
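The choice between the thresholds B and C can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `deletion_threshold`, the `exceeded_before` flag, the treatment of “reaches A” as a greater-than-or-equal comparison, and the value chosen for C are all hypothetical; only the 20 MB and 0.1% figures are taken from the example above.

```python
MB = 1024 * 1024

A = 20 * MB      # index size at which deletion begins (example value from the text)
B = A // 1000    # 0.1% of A, in bytes (ratio taken from the text)
C = 50_000       # assumed value; the text gives no example figure for C


def deletion_threshold(index_size, exceeded_before, a, b, c):
    """Return the per-keyword threshold to apply, or None for no deletion.

    exceeded_before is True when the index size reached `a` during an
    earlier merge (an assumed piece of state kept between merges).
    """
    if exceeded_before:
        return c          # second and later deletions use C
    if index_size >= a:
        return b          # first deletion uses B
    return None           # below A: nothing is deleted
```

For example, with these values a 10 MB index that has never reached A yields no deletion, while the merge that first reaches 20 MB applies B.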
An apparatus that is used for the search method and that employs the index file of the present invention will now be described. FIG. 4 is a block diagram illustrating the arrangement of an example apparatus used for the search method that employs the index file of this invention. In FIG. 4, the apparatus comprises: a new difference index preparation unit 13, for preparing a new difference index file 14 from newly registered documents 12; an index merging unit 15, for merging a new difference index file 14 prepared by the new difference index preparation unit 13 and a conventional index file 11 and preparing a new index file 16, and for determining whether the above described position data file should be deleted; and a search unit 17, for executing an all-sentences search based on the new index file 16 that is prepared by the index merging unit 15.
The individual units will now be described in detail. First, the conventional index file 11 is the index file that was most recently prepared. Normally, for an all-sentences search, data is added to the index each time the number of original documents is increased. The newly registered document 12 comprises the most recently registered document group. The new difference index preparation unit 13 defines, as the new difference index file 14, a new difference that is registered, and assembles the file 14 using a structure corresponding to that of the index structure, so that the new difference index file 14 and the conventional index file 11 can be merged at the next step. The index merging unit 15 prepares a new index file 16 by combining the conventional index file 11 and the new difference index file 14 that is prepared by the new difference index preparation unit 13, and determines whether the position data file of the present invention should be deleted. The new index file 16, which is output by the index merging unit 15, is the one that is thereafter used to perform a search. Later, when the search unit 17 receives a search character string entered by a user, it searches the new index file 16 to find a document in which the pertinent character string is included.
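One possible form of the new difference index preparation unit 13 is sketched below. The function name `build_difference_index`, the whitespace tokenization, the upper-case normalization and the 1-based character positions are assumptions made for illustration; the invention does not prescribe how keywords are extracted from the newly registered documents.

```python
import re
from collections import defaultdict


def build_difference_index(documents):
    """Map each word to a list of (document id, 1-based character position).

    `documents` maps a document id to its text. Splitting on whitespace and
    upper-casing keywords are simplifying assumptions of this sketch.
    """
    index = defaultdict(list)
    for doc_id, text in sorted(documents.items()):
        for match in re.finditer(r"\S+", text):
            index[match.group().upper()].append((doc_id, match.start() + 1))
    # Return keywords in alphabetical order so the result can be merged
    # with an existing index in a single ordered pass.
    return dict(sorted(index.items()))
```

Keeping the keywords of the difference index in the same alphabetical order as the conventional index is what allows the two to be merged keyword by keyword.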
The position data deleting process performed by the index merging unit 15 will now be described. FIG. 5 is a flowchart for explaining the position data deletion process performed by the index merging unit 15 of the present invention. FIG. 6 is a diagram for explaining one example of the merging of keywords according to this invention. While referring to FIGS. 5 and 6, first, at step 401 a check is performed to determine whether, for the first time, the sum of the size of a conventional index file and the size of a new difference index file has exceeded size A. When, for the first time, the sum of the files has exceeded size A, process (1) at steps 402 to 406 is performed. When it is not the first time that the sum of the two files has exceeded size A, i.e., when the sum of the two files had also exceeded size A in the past, process (2) at steps 407 to 412 is performed. In this embodiment, the position data deletion process, including processes (1) and (2), is explained; however, the deletion process of the present invention can be carried out by performing only one of the processes (1) and (2).
Process (1):
When the sum of the size of the conventional index file and the size of the new difference index file has exceeded size A for the first time, program control moves to step 402. At step 402, keywords are read in order beginning with key files in a conventional index file 501 and a new difference index file 502 in FIG. 6, i.e., in the order employed in a new index file 503 provided by merging the conventional index file 501 and the new difference index file 502. Then, at step 403, a check is performed to determine whether the sum of the position data held by the conventional index file and the new difference index file that correspond to the keyword that was read is equal to or greater than B. If the sum of the position data held by these files is less than B, program control advances to step 404. If the sum of the position data held by these files is equal to or greater than B, program control advances to step 405. At step 404, to add the keyword to the new index file, the data are written to the key file and the position data file. At step 405, data are not added to the position data file of the new index file, only the entry for the keyword is added to the key file of the new index file, and a corresponding position data delete flag is set to “YES.” At step 406, if the keyword that is currently being processed is the last keyword, the merging procedure performed in process (1) is terminated. If the currently processed keyword is not the last one, the process beginning at step 402 is repeated.
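Steps 402 to 406 can be sketched as a single ordered pass over the two indexes. This is an illustrative sketch only: the function name `merge_process_1`, the dictionary representation, and the use of the number of position entries as the “amount of position data” are assumptions.

```python
def merge_process_1(old_index, diff_index, b):
    """Merge two keyword -> positions mappings, applying threshold b.

    Returns (flags, positions): flags maps every keyword to its position
    data delete flag, and positions holds data only for kept keywords.
    """
    flags = {}
    positions = {}
    # Step 402: read keywords in the order of the merged index.
    for keyword in sorted(set(old_index) | set(diff_index)):
        merged = old_index.get(keyword, []) + diff_index.get(keyword, [])
        # Step 403: compare the combined amount of position data with B.
        if len(merged) >= b:
            # Step 405: keep the key entry only and set the delete flag.
            flags[keyword] = True
        else:
            # Step 404: write both the key entry and the position data.
            flags[keyword] = False
            positions[keyword] = merged
    # Step 406: the loop ends after the last keyword.
    return flags, positions
```

Note that a flagged keyword still appears in the key file, so a later search can tell that the keyword exists but is no longer searchable.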
Process (2):
If it is not the first time that the sum of the size of the conventional index file and the size of the new difference index file has exceeded size A, program control moves to step 407. At step 407, keywords are read in order beginning with the key files in the conventional index file 501 and in the new difference index file 502 in FIG. 6, i.e., in the order of their arrangement in the new index file 503 that is obtained by merging the conventional index file 501 and the new difference index file 502. At step 408, a check is performed to determine whether the sum of the position data held by the conventional index file and that held by the new difference index file that correspond to the keyword that was read is equal to or greater than C. If the sum of the position data held by these files is less than C, program control advances to step 409. If the sum of the position data held by these files is equal to or greater than C, program control advances to step 411. At step 409, the position data delete flag of the key file in the index file before it was merged is examined to determine whether it is set to “YES.” If the flag is not set to “YES,” program control goes to step 410. If the flag is set to “YES,” program control moves to step 411. At step 410, to add the keyword to the new index file, the data are written to the key file and the position data file. At step 411, data are not added to the position data file of the new index file, only the entry for the keyword is added to the key file of the new index file, and a corresponding position data delete flag is set to “YES.” At step 412, if the keyword that is currently being processed is the last keyword, the merging procedure in process (2) is terminated. If the currently processed keyword is not the last one, the process beginning at step 407 is repeated.
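Steps 407 to 412 differ from process (1) in that a delete flag already set in the conventional index file is carried forward. A minimal sketch follows, in which the function name `merge_process_2`, the dictionary representation, and the counting of position entries as the “amount of position data” are assumptions.

```python
def merge_process_2(old_flags, old_positions, diff_index, c):
    """Merge a flagged index with a difference index, applying threshold c.

    old_flags maps keyword -> delete flag in the index before merging;
    old_positions holds position data only for keywords not yet deleted.
    """
    flags = {}
    positions = {}
    # Step 407: read keywords in the order of the merged index.
    for keyword in sorted(set(old_flags) | set(diff_index)):
        merged = old_positions.get(keyword, []) + diff_index.get(keyword, [])
        over_limit = len(merged) >= c                    # step 408
        already_deleted = old_flags.get(keyword, False)  # step 409
        if over_limit or already_deleted:
            flags[keyword] = True                        # step 411
        else:
            flags[keyword] = False                       # step 410
            positions[keyword] = merged
    # Step 412: the loop ends after the last keyword.
    return flags, positions
```

In this sketch, once a keyword has been flagged its position data never return, even if a later difference index contains only a few occurrences of it.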
The operation performed by the search unit 17 will now be described. According to the search method that uses the index file of this invention, the structure of the search unit 17 is not specifically prescribed; any conventional structure can be employed. When a search is performed based on the following three search character strings, the present invention can be employed more effectively. The three typical search character strings will now be described.
(1) When a search character string consists of one word:
(2) When an index is prepared by the N-gram method and only one search character string exists:
Assume a search is made for “Nippon Ginko” in an N-gram index file. If position data for “Nippon” is deleted, the pertinent document can not be designated; however, if character strings “Hon Gin” and “Ginko” are searched for, search results close to those for “Nippon Ginko” can be obtained. In other words, a search can be made for a character string that includes a stop word.
(3) When a search character string consists of a plurality of words:
If the size of an index file is greatly increased, considerable system resources are consumed during a search that includes a search character string for which the number of documents found by the search is very large. For a search that uses a plurality of character strings, the search results are returned to the user, with the observation that no computations are performed for a character string for which the number of documents found by the search is equal to or greater than a predetermined number, and for a character string for which position data are deleted. Further, a message is transmitted suggesting that a different character string be employed for the search because too many documents were found for the pertinent character string. As a result, the results can be quickly provided for a search that conventionally requires an extended period of time, and the system resources can be effectively employed.
The above three specific search character strings will now be described in detail.
(1) When a search character string consists of one word: FIG. 7 is a flowchart showing the search process when a search character string consists of one word. First, at step 601 in FIG. 7 a search character string is accepted, and at step 602 a position data delete flag is examined to determine whether the position data for the search character string have been deleted. If the position data have not been deleted, program control advances to step 603. If the position data have been deleted, program control goes to step 604. At step 603, the character string that corresponds to the search character string is found in the key file, and the contents of a pertinent position data file are returned to a user. At step 604, a message is returned to the user to request a search using another search character string because too many hits were found. The search process is thereafter terminated.
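The one-word search of FIG. 7 can be sketched as follows. The function name `search_one_word`, the message text, and the handling of a word that is absent from the key file are assumptions; the flowchart itself covers only the deleted and not-deleted cases.

```python
TOO_MANY_HITS = "Too many hits were found; please use another search string."


def search_one_word(flags, positions, word):
    """Return (results, message) for a single search word.

    flags maps keyword -> position data delete flag; positions maps
    keyword -> list of (document, position) pairs for kept keywords.
    """
    # Step 601: the search character string has been accepted as `word`.
    if word not in flags:
        return [], None                 # unknown keyword (assumed behavior)
    if flags[word]:                     # step 602: position data deleted?
        return [], TOO_MANY_HITS        # step 604
    return positions.get(word, []), None  # step 603
```

Returning a message instead of an empty result lets the user distinguish a keyword that is too common from one that does not occur at all.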
(2) When an index is prepared using the N-gram method and only one search character string exists: FIG. 8 is a flowchart showing the search process when an index is prepared using the N-gram method and only one search character string exists. First, at step 701 in FIG. 8 a search character string is accepted, and at step 702 a position data delete flag is examined to determine whether the position data for the search character string have been deleted. If the position data have not been deleted, program control advances to step 703. If the position data have been deleted, program control goes to step 704. At step 703, a search is conducted for the character string that corresponds to the search character string, and the contents of a pertinent position data file are returned to a user. At step 704, a search is conducted by using a character string that is shifted one character from the search character string for which the position data have been deleted. For example, when the search character string is “Nippon Ginko,” at step 703 the search process is initiated with “Nippon” and “Ginko”; however, at step 704 the search process is initiated with “Hon Gin” and “Ginko” if the position data for “Nippon” have been deleted. At step 705, the position data delete flag for the search character string “Hon Gin” is examined to determine whether the position data for “Hon Gin” have also been deleted. If the position data have been deleted, program control advances to step 706. If the position data have not been deleted, program control goes to step 707. At step 706, a message is returned to a user to request that a search be performed using another search character string because too many hits were found. At step 707, the results obtained by the search for which “Hon Gin” and “Ginko” were used are transmitted to the user. The search process is thereafter terminated.
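The one-character shift of step 704 can be sketched for a bigram index. This sketch assumes that the romanized example corresponds to the four-character string 日本銀行, with the bigrams 日本 (“Nippon”), 本銀 (“Hon Gin”) and 銀行 (“Ginko”); it also assumes a query whose length is a multiple of n. The function name `plan_ngram_search` is hypothetical.

```python
def plan_ngram_search(query, deleted, n=2):
    """Return the n-grams to look up, or None when no usable plan exists.

    `deleted` is the set of n-grams whose position data delete flag is set.
    None corresponds to the "too many hits" message of step 706.
    """
    plan = []
    # Start from non-overlapping n-grams covering the query (step 703).
    for start in range(0, len(query) - n + 1, n):
        gram = query[start:start + n]
        if gram in deleted:                              # step 702
            # Step 704: shift the deleted n-gram by one character.
            shifted = query[start + 1:start + 1 + n]
            if len(shifted) < n or shifted in deleted:   # step 705
                return None                              # step 706
            gram = shifted
        if gram not in plan:
            plan.append(gram)
    return plan                                          # steps 703 and 707
```

The caller would then look up each returned n-gram and combine the position data, for instance by checking that the positions are adjacent; that combination step is outside this sketch.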
(3) When a search character string consists of a plurality of words:
FIG. 9 is a flowchart for explaining the search process performed when a search character string consists of a plurality of words. First, at step 801 in FIG. 9 a search character string is accepted, and at step 802 a position data delete flag for each word of the string is examined to determine whether the position data for each word have been deleted. If the position data have not been deleted, program control advances to step 803. If the position data have been deleted, program control goes to step 804. At step 803, the character strings that correspond to the individual search character strings are obtained from the key file, and the contents of pertinent position data files are returned to a user. At step 804, a check is performed to determine whether the position data for the entire search character string have been deleted, and if the position data for the entire string have been deleted, program control advances to step 805. If the position data for the entire string have not been deleted, program control advances to step 806. At step 805, a message is returned to the user to request that a search be made using another character string because too many hits have been found. At step 806, the search results are returned to the user, with the observation that no computations are performed for a search character string for which the position data have been deleted. At the same time, for the character string for which the position data have been deleted, a message is returned to the user indicating that too many hits were made. When, for example, a search character string is “FORUM and SQL” and the position data for “FORUM” have been deleted, the search results obtained by using only “SQL” are returned to the user, as is a message indicating too many hits have been found for “FORUM.” The search process is thereafter terminated.
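The multi-word search of FIG. 9 can be sketched as follows. The function name `search_many`, the message texts, and the combination of the remaining words as a logical AND over documents are assumptions made for this sketch.

```python
def search_many(words, flags, positions):
    """Return (matching document numbers, messages) for several words.

    Words whose position data delete flag is set are skipped and reported;
    the remaining words are combined with AND (an assumption).
    """
    # Step 802: examine the delete flag of each word.
    usable = [w for w in words if not flags.get(w, False)]
    skipped = [w for w in words if flags.get(w, False)]
    if not usable:
        # Steps 804 and 805: every word has had its position data deleted.
        return set(), ["Too many hits were found; please use another search string."]
    docs = None
    for word in usable:
        found = {doc for doc, _ in positions.get(word, [])}
        docs = found if docs is None else docs & found
    # Step 803 when nothing was skipped; step 806 otherwise.
    messages = [f"Too many hits for {w!r}; it was not used." for w in skipped]
    return docs, messages
```

With the “FORUM and SQL” example above, the call would be made with the words “FORUM” and “SQL”, and the result is computed from “SQL” alone together with one message about “FORUM”.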
In processes (1) and (2), position data are deleted for every character string for which the size of the position data is equal to or greater than B or C. However, in process (1), if B is reduced to some degree, then for Japanese newspaper article data, for example, a Japanese character string may be included as a target to be deleted. If the pertinent Japanese string is not to be deleted, it can be designated that a character string consisting of only Chinese characters, for example, should not be deleted, and thus position data can be maintained for a character string group for which position data are not supposed to be deleted. Values A, B and C in the above embodiment are not constant values, and arbitrary values can be selected as needed in accordance with the configuration of a system and the performance that is demanded.
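The exemption of a designated character string group can be sketched as a predicate consulted before the threshold test. The names `is_kanji_only` and `should_delete` are hypothetical, and approximating “Chinese characters” by the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is an assumption that omits extension blocks.

```python
def is_kanji_only(text):
    """True when every character lies in the basic CJK ideograph block."""
    return bool(text) and all("\u4e00" <= ch <= "\u9fff" for ch in text)


def should_delete(keyword, amount, threshold, is_protected=is_kanji_only):
    """Decide whether position data for a keyword are to be deleted.

    Keywords in the protected group are never deleted, even when the
    amount of position data reaches the threshold (B or C).
    """
    if is_protected(keyword):
        return False
    return amount >= threshold
```

Passing the predicate as a parameter allows a different protected group, such as an explicit list of key character strings, to be substituted without changing the threshold logic.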
As is apparent from the above explanation, according to the present invention, a position data delete flag is correlated with a predetermined key character string, and position data corresponding to the key character string are deleted from a position data file. Specifically, when the size of the overall index file reaches A for the first time, a position data delete flag is set that corresponds to a key character string for which the amount of position data is equal to or greater than B, and the position data are deleted from the position data file. Additionally or alternatively, when the size of the entire index file has exceeded A, a position data delete flag is set that corresponds to a key character string for which the amount of position data is equal to or greater than C, and the position data are deleted from the position data file. Therefore, while a request for an all-sentences search is satisfied, position data for a character string that can not be used for the actual search can be deleted, and the size of an index file can be considerably reduced.

Claims (20)

What is claimed is:
1. A search method, for using an index file consisting of a key file, which includes key character strings, and a position data file, having position data corresponding to the key character strings, said search method comprising the steps of:
correlating a position data delete flag with a specific key character string; and
deleting, from said position data file, position data that correspond to said specific key character string, wherein said position data are deleted when the size of the position data, corresponding to said specific key character string, reaches a specific ratio.
2. The search method according to claim 1, wherein the structure of a key file is constituted by a key character string, the location of said key character string in said position data file, the size of said position data, and said position data delete flag.
3. The search method according to claim 1, wherein a group of key character strings is specified in advance, for which position data are not to be deleted.
4. The search method according to claim 1, wherein a search is to be performed by one of the group consisting of a method that uses a search key character string consisting of one word, a method that uses one search key character string when an index is prepared using the N-gram method, and a method that uses a search key character string consisting of a plurality of words.
5. A search method for using an index file consisting of a key file, which includes key character strings, and a position data file, having position data corresponding to the key character strings, said search method comprising the steps of:
correlating a position data delete flag with a specific key character string; and
deleting, from said position data file, position data that correspond to said specific key character string, wherein said position data are deleted when the size of said position data reaches a specific value that corresponds to said specific key character string.
6. The search method according to claim 5, wherein the structure of a key file is constituted by a key character string, the location of said key character string in said position data file, the size of said position data, and said position data delete flag.
7. The search method according to claim 5, wherein a group of key character strings is specified in advance, for which position data are not to be deleted.
8. The search method according to claim 5, wherein a search is to be performed by one of the group consisting of a method that uses a search key character string consisting of one word, a method that uses one search key character string when an index is prepared using the N-gram method, and a method that uses a search key character string consisting of a plurality of words.
9. A search method, for using an index file consisting of a key file, which includes key character strings, and a position data file, having position data corresponding to the key character strings, said search method comprising the steps of:
correlating a position data delete flag with a specific key character string; and
deleting, from said position data file, position data that correspond to said specific key character string, wherein, when said position data delete flag corresponding to a specific key character string, is set, position data for said key character string is not added to said position data file for said index file; and wherein, when said position data delete flag, corresponding to a specific key character string, is not set, position data for said key character string is added to said position data file for said index file.
10. The search method according to claim 9, wherein said position data are deleted when the size of the position data, corresponding to said specific key character string, reaches a specific ratio.
11. The search method according to claim 9, wherein said position data are deleted when the size of said position data reaches a specific value that corresponds to said specific key character string.
12. The search method according to claim 9, wherein the structure of a key file is constituted by a key character string, the location of said key character string in said position data file, the size of said position data, and said position data delete flag.
13. The search method according to claim 9, wherein a group of key character strings is specified in advance, for which position data are not to be deleted.
14. The search method according to claim 9, wherein a search is to be performed by one of the group consisting of a method that uses a search key character string consisting of one word, a method that uses one search key character string when an index is prepared using the N-gram method, and a method that uses a search key character string consisting of a plurality of words.
15. An apparatus comprising:
a new difference index preparation unit for preparing a new difference index file using a newly registered document, said index file consisting of a key file, which includes key character strings, and a position data file, having position data corresponding to the key character strings;
an index merge unit for merging a conventional index file with said new difference index file prepared by said new difference index preparation unit, for determining whether a position data file is to be deleted by the steps of correlating a position data delete flag with a specific key character string; and deleting, from said position data file, position data that correspond to said specific key character string, and wherein said position data are deleted when the size of the position data, corresponding to said specific key character string, reaches a specific ratio, and for preparing a new index file; and
a search unit for beginning a search based on said new index file generated by said index merge unit.
16. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for using an index file consisting of a key file, which includes key character strings, and a position data file, having position data corresponding to the key character strings, said search method comprising the steps of:
correlating a position data delete flag with a specific key character string; and
deleting, from said position data file, position data that correspond to said specific key character string, wherein said position data are deleted when the size of the position data, corresponding to said specific key character string, reaches a specific ratio.
17. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for using an index file consisting of a key file, which includes key character strings, and a position data file, having position data corresponding to the key character strings, said search method comprising the steps of:
correlating a position data delete flag with a specific key character string; and
deleting, from said position data file, position data that correspond to said specific key character string, wherein said position data are deleted when the size of said position data reaches a specific value that corresponds to said specific key character string.
18. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for using an index file consisting of a key file, which includes key character strings, and a position data file, having position data corresponding to the key character strings, said search method comprising the steps of:
correlating a position data delete flag with a specific key character string; and
deleting, from said position data file, position data that correspond to said specific key character string, wherein, when said position data delete flag corresponding to a specific key character string, is set, position data for said key character string is not added to said position data file for said index file; and wherein, when said position data delete flag, corresponding to a specific key character string, is not set, position data for said key character string is added to said position data file for said index file.
19. An apparatus comprising:
a new difference index preparation unit for preparing a new difference index file using a newly registered document, said index file consisting of a key file, which includes key character strings, and a position data file, having position data corresponding to the key character strings;
an index merge unit for merging a conventional index file with said new difference index file prepared by said new difference index preparation unit, for determining whether a position data file is to be deleted by the steps of correlating a position data delete flag with a specific key character string; and deleting, from said position data file, position data that correspond to said specific key character string, and wherein said position data are deleted when the size of said position data reaches a specific value that corresponds to said specific key character string, and for preparing a new index file; and
a search unit for beginning a search based on said new index file generated by said index merge unit.
20. An apparatus comprising:
a new difference index preparation unit for preparing a new difference index file using a newly registered document, said index file consisting of a key file, which includes key character strings, and a position data file, having position data corresponding to the key character strings;
an index merge unit for merging a conventional index file with said new difference index file prepared by said new difference index preparation unit, for determining whether a position data file is to be deleted by the steps of correlating a position data delete flag with a specific key character string; and deleting, from said position data file, position data that correspond to said specific key character string, and wherein, when said position data delete flag, corresponding to a specific key character string, is set, position data for said key character string is not added to said position data file for said index file; and wherein, when said position data delete flag, corresponding to a specific key character string, is not set, position data for said key character string is added to said position data file for said index file, and for preparing a new index file; and
a search unit for beginning a search based on said new index file generated by said index merge unit.
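
The merge behavior recited in claims 18 to 20 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `KeyEntry`, `merge_indexes` and `max_positions` are hypothetical, both files of the index are modeled as one in-memory dictionary, and the "specific value" of claim 19 is assumed to be a simple count of position entries. How the search unit treats a flagged key is not covered here.

```python
from dataclasses import dataclass, field


@dataclass
class KeyEntry:
    """One key-file record: a delete flag plus the key's position data.

    The positions list stands in for the key's slice of the position
    data file; each item is an assumed (document_id, offset) pair.
    """
    delete_flag: bool = False
    positions: list = field(default_factory=list)


def merge_indexes(old_index, diff_index, max_positions):
    """Merge a difference index into an existing index.

    old_index, diff_index: dict mapping key character string -> KeyEntry.
    max_positions: assumed threshold at which position data are deleted.
    """
    merged = {}
    for key in sorted(set(old_index) | set(diff_index)):
        old = old_index.get(key, KeyEntry())
        new = diff_index.get(key, KeyEntry())
        entry = KeyEntry(delete_flag=old.delete_flag or new.delete_flag)
        if not entry.delete_flag:
            # Flag not set: position data are added to the position data file.
            entry.positions = old.positions + new.positions
            if len(entry.positions) >= max_positions:
                # Size reached the threshold: delete the position data
                # and set the flag so later merges skip this key.
                entry.delete_flag = True
                entry.positions = []
        # Flag set: the key is kept, but no position data are added.
        merged[key] = entry
    return merged
```

In this sketch a key whose flag is set stays in the key file with an empty position list, so later merges never grow its position data again.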
US09/676,803 1999-09-30 2000-09-29 Search method using an index file and an apparatus therefor Expired - Fee Related US6640225B1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP27797899A JP2001109754A (en) 1999-09-30 1999-09-30 Retrieving method using index file and device used for the method
JP11-277978 1999-09-30

Publications (1)

Publication Number Publication Date
US6640225B1 true US6640225B1 (en) 2003-10-28

Family

ID=17590930

Family Applications (1)

Application Number Title Priority Date Filing Date
US09/676,803 Expired - Fee Related US6640225B1 (en) 1999-09-30 2000-09-29 Search method using an index file and an apparatus therefor

Country Status (2)

Country Link
US (1) US6640225B1 (en)
JP (1) JP2001109754A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5265638B2 (en) * 2010-09-28 2013-08-14 ヤフー株式会社 Electronic terminal and method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5745894A (en) * 1996-08-09 1998-04-28 Digital Equipment Corporation Method for generating and searching a range-based index of word-locations
US6317741B1 (en) * 1996-08-09 2001-11-13 Altavista Company Technique for ranking records of a database
US6427147B1 (en) * 1995-12-01 2002-07-30 Sand Technology Systems International Deletion of ordered sets of keys in a compact O-complete tree
US6460047B1 (en) * 1998-04-02 2002-10-01 Sun Microsystems, Inc. Data indexing technique

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020194398A1 (en) * 2001-06-15 2002-12-19 Bentley Keith C. System and method for building a target operating system from a source operating system
US7702666B2 (en) 2002-06-06 2010-04-20 Ricoh Company, Ltd. Full-text search device performing merge processing by using full-text index-for-registration/deletion storage part with performing registration/deletion processing by using other full-text index-for-registration/deletion storage part
US7730069B2 (en) 2002-06-06 2010-06-01 Ricoh Company, Ltd. Full-text search device performing merge processing by using full-text index-for-registration/ deletion storage part with performing registration/deletion processing by using other full-text index-for-registration/deletion storage part
US7644097B2 (en) * 2002-06-06 2010-01-05 Ricoh Company, Ltd. Full-text search device performing merge processing by using full-text index-for-registration/deletion storage part with performing registration/deletion processing by using other full-text index-for-registration/deletion storage part
US20040006555A1 (en) * 2002-06-06 2004-01-08 Kensaku Yamamoto Full-text search device performing merge processing by using full-text index-for-registration/deletion storage part with performing registration/deletion processing by using other full-text index-for-registration/deletion storage part
US20070118543A1 (en) * 2002-06-06 2007-05-24 Kensaku Yamamoto Full-text search device performing merge processing by using full-text index-for-registration/ deletion storage part with performing registration/deletion processing by using other full-text index-for-registration/deletion storage part
US20070136258A1 (en) * 2002-06-06 2007-06-14 Kensaku Yamamoto Full-text search device performing merge processing by using full-text index-for-registration/deletion storage part with performing registration/deletion processing by using other full-text index-for-registration/deletion storage part
US20050198010A1 (en) * 2004-03-04 2005-09-08 Veritas Operating Corporation System and method for efficient file content searching within a file system
US7636710B2 (en) 2004-03-04 2009-12-22 Symantec Operating Corporation System and method for efficient file content searching within a file system
US20060282669A1 (en) * 2005-06-11 2006-12-14 Legg Stephen P Method and apparatus for virtually erasing data from WORM storage devices
US8429401B2 (en) * 2005-06-11 2013-04-23 International Business Machines Corporation Method and apparatus for virtually erasing data from WORM storage devices
US20070100816A1 (en) * 2005-09-30 2007-05-03 Brother Kogyo Kabushiki Kaisha Information management device, information management system, and computer usable medium
US7685111B2 (en) * 2005-09-30 2010-03-23 Brother Kogyo Kabushiki Kaisha Information management device, information management system, and computer usable medium
US20080071805A1 (en) * 2006-09-18 2008-03-20 John Mourra File indexing framework and symbolic name maintenance framework
US7873625B2 (en) * 2006-09-18 2011-01-18 International Business Machines Corporation File indexing framework and symbolic name maintenance framework
US8768893B2 (en) 2006-11-21 2014-07-01 International Business Machines Corporation Identifying computer users having files with common attributes
US20090037381A1 (en) * 2007-07-31 2009-02-05 Hitachi Ltd. Data registration and retrieval method, data registration and retrieval program and database system
US9275128B2 (en) 2009-07-23 2016-03-01 Alibaba Group Holding Limited Method and system for document indexing and data querying
US9946753B2 (en) 2009-07-23 2018-04-17 Alibaba Group Holding Limited Method and system for document indexing and data querying
US20110022596A1 (en) * 2009-07-23 2011-01-27 Alibaba Group Holding Limited Method and system for document indexing and data querying
CN102081649B (en) * 2010-12-31 2012-08-15 深圳联友科技有限公司 Method and system for searching computer files
CN102081649A (en) * 2010-12-31 2011-06-01 深圳联友科技有限公司 Method and system for searching computer files
US11138218B2 (en) * 2016-01-29 2021-10-05 Splunk Inc. Reducing index file size based on event attributes
US11934418B2 (en) 2016-01-29 2024-03-19 Splunk, Inc. Reducing index file size based on event attributes

Also Published As

Publication number Publication date
JP2001109754A (en) 2001-04-20

Similar Documents

Publication Publication Date Title
US9619565B1 (en) Generating content snippets using a tokenspace repository
JP4805267B2 (en) Multi-stage query processing system and method for use with a token space repository
US6212525B1 (en) Hash-based system and method with primary and secondary hash functions for rapidly identifying the existence and location of an item in a file
US6668263B1 (en) Method and system for efficiently searching for free space in a table of a relational database having a clustering index
US5963954A (en) Method for mapping an index of a database into an array of files
US7840774B2 (en) Compressibility checking avoidance
KR100971863B1 (en) System and method for batched indexing of network documents
US6317741B1 (en) Technique for ranking records of a database
US20040205044A1 (en) Method for storing inverted index, method for on-line updating the same and inverted index mechanism
US20060041606A1 (en) Indexing system for a computer file store
US8209305B2 (en) Incremental update scheme for hyperlink database
US20050033779A1 (en) Database management program, a database managing method and an apparatus therefor
CA2387653C (en) File processing method, data processing device and storage medium
US20070208733A1 (en) Query Correction Using Indexed Content on a Desktop Indexer Program
US6640225B1 (en) Search method using an index file and an apparatus therefor
WO2008042442A2 (en) Systems and methods for providing a dynamic document index
US6721753B1 (en) File processing method, data processing apparatus, and storage medium
US20110113052A1 (en) Query result iteration for multiple queries
US6397216B1 (en) Ordering keys in a table using an ordering mask
JPH10111821A (en) Client server system
US6076089A (en) Computer system for retrieval of information
JP2000322293A (en) Data base managing method, its execution device and recoring medium recording its processing program

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:TAKISHITA, NOBUAKI;SUZUKI, TAKAKO;REEL/FRAME:011491/0155

Effective date: 20001020

FEPP Fee payment procedure

Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

FPAY Fee payment

Year of fee payment: 4

REMI Maintenance fee reminder mailed
LAPS Lapse for failure to pay maintenance fees
STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20111028