CN102411580A

CN102411580A - Retrieval method and device for extensible markup language (XML) files

Info

Publication number: CN102411580A
Application number: CN2010102905414A
Authority: CN
Inventors: 邓慧芳
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date: 2010-09-20
Filing date: 2010-09-20
Publication date: 2012-04-11
Anticipated expiration: 2030-09-20
Also published as: CN102411580B

Abstract

The invention discloses a retrieval method and device for Extensible Markup Language (XML) files, belonging to the field of retrieval. The method comprises the following steps: a query path is constructed according to keywords input by a user; an XML query statement is constructed according to the query path and the keywords; and retrieval is carried out in an XML file database according to the XML query statement. The device comprises a first construction module, a second construction module and a retrieval module. Because XML retrieval is carried out according to keywords, users can retrieve without knowing the XML file structure, which reduces the retrieval complexity of XML files while preserving retrieval accuracy and also improves the user experience.

Description

Retrieval method and device for XML documents

Technical field

The present invention relates to the field of retrieval, and in particular to a retrieval method and device for XML documents.

Background technology

With the rapid development of the Internet, XML (Extensible Markup Language), an important component of Internet resources, is being applied more and more widely. XML stores data efficiently, effectively remedies deficiencies in information storage, and shows strong advantages in centralized storage, scalability and portability. Designing a retrieval method that retrieves XML documents efficiently therefore not only increases the flexibility of retrieval but is also highly practical.

When retrieving XML documents, the prior art adopts structure-based queries. A structural query requires the internal structural information of the XML document to be queried, and partial information about the result, to be known in advance; the description of this structural information is then expressed through some mechanism.

In the course of realizing the present invention, the inventor found that the prior art has at least the following shortcomings:

Structure-based queries are generally complex and hard to master, and they require the user to know the structure of the XML document to be queried, which places high demands on the user. In addition, with the rapid development of the Internet and the wide use of XML on it, the number and variety of XML documents keep growing, which makes structure-based retrieval even more difficult.

Summary of the invention

To reduce the retrieval complexity of XML documents while preserving retrieval accuracy, the embodiments of the present invention provide a retrieval method and device for XML documents. The technical solution is as follows:

In one aspect, a retrieval method for XML documents is provided, the method comprising:

constructing a query path according to a keyword input by a user;

constructing an Extensible Markup Language (XML) query statement according to the query path and the keyword;

retrieving in an XML document database according to the XML query statement.

Further, before constructing the query path according to the keyword input by the user, the method also comprises:

parsing the acquired XML documents to obtain structural path information of the XML documents;

clustering the XML documents and structural paths according to the structural path information of the XML documents, and storing the clustered XML documents and structural paths to obtain the XML document database.

After clustering the XML documents and structural paths according to the structural path information of the XML documents, the method also comprises:

building and storing index information for each class of XML documents;

Correspondingly, constructing the query path according to the keyword input by the user specifically comprises:

preprocessing the keyword input by the user, searching for the corresponding index information according to the preprocessed keyword, and determining the class of the corresponding XML documents;

constructing the query path according to the determined class of XML documents.

Before constructing the query path according to the keyword input by the user, the method also comprises:

receiving an uploaded retrieval algorithm and XML documents, storing the uploaded retrieval algorithm and XML documents at a designated location, and recording the sizes of the uploaded retrieval algorithm and XML documents.

After recording the sizes of the uploaded retrieval algorithm and XML documents, the method also comprises:

prompting the user to specify retrieval information, the retrieval information comprising the size of the XML documents and the retrieval algorithm;

Correspondingly, retrieving in the XML document database according to the XML query statement specifically comprises:

retrieving in the XML document database according to the XML query statement and the retrieval information specified by the user.

Optionally, after retrieving in the XML document database according to the XML query statement, the method also comprises:

collecting statistics on the retrieval efficiency of the retrieval algorithms, so that the user can select a retrieval algorithm according to the statistics.

After retrieving in the XML document database according to the XML query statement, the method also comprises:

displaying the retrieval results and the performance indicators of the retrieval algorithm.

In another aspect, a retrieval device for XML documents is also provided, the device comprising:

a first construction module, configured to construct a query path according to a keyword input by a user;

a second construction module, configured to construct an Extensible Markup Language (XML) query statement according to the query path constructed by the first construction module and the preprocessed keyword;

a retrieval module, configured to retrieve in the XML document database according to the XML query statement constructed by the second construction module.

Further, the device also comprises:

a parsing module, configured to parse the acquired XML documents to obtain structural path information of the XML documents;

a clustering module, configured to cluster the XML documents and structural paths according to the structural path information obtained by the parsing module;

a storage module, configured to store the XML documents and structural paths clustered by the clustering module to obtain the XML document database.

The storage module is also configured to build and store index information for each class of XML documents;

Correspondingly, the first construction module specifically comprises:

a preprocessing unit, configured to preprocess the keyword input by the user;

a construction unit, configured to search for the corresponding index information according to the keyword preprocessed by the preprocessing unit, determine the class of the corresponding XML documents, and construct the query path according to the determined class of XML documents.

The device also comprises:

a receiving module, configured to receive an uploaded retrieval algorithm and XML documents and to store the uploaded retrieval algorithm and XML documents at a designated location;

a recording module, configured to record the sizes of the uploaded retrieval algorithm and XML documents received by the receiving module.

The device also comprises:

a prompting module, configured to prompt the user to specify retrieval information, the retrieval information comprising the size of the XML documents and the retrieval algorithm;

Correspondingly, the retrieval module is configured to retrieve in the XML document database according to the XML query statement and the retrieval information specified by the user.

Preferably, the device also comprises:

a statistics module, configured to collect statistics on the retrieval efficiency of the retrieval algorithms, so that the user can select a retrieval algorithm according to the statistics.

The device also comprises:

a display module, configured to display the retrieval results and the performance indicators of the retrieval algorithm.

The technical solution provided by the embodiments of the present invention has the following beneficial effects:

By performing XML retrieval according to keywords, the user can retrieve without understanding the structure of the XML documents; this reduces the retrieval complexity of XML documents while preserving retrieval accuracy, and also improves the user experience.

Description of drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of the XML document retrieval method provided by embodiment one of the present invention;

Fig. 2 is a flowchart of the XML document retrieval method provided by embodiment two of the present invention;

Fig. 3 is the convergence schematic diagram provided by embodiment two of the present invention;

Fig. 4 is a schematic diagram of the distributed retrieval architecture provided by embodiment two of the present invention;

Fig. 5 is a schematic structural diagram of the first XML document retrieval device provided by embodiment three of the present invention;

Fig. 6 is a schematic structural diagram of the second XML document retrieval device provided by embodiment three of the present invention;

Fig. 7 is a schematic structural diagram of the first construction module provided by embodiment three of the present invention;

Fig. 8 is a schematic structural diagram of the third XML document retrieval device provided by embodiment three of the present invention;

Fig. 9 is a schematic structural diagram of the fourth XML document retrieval device provided by embodiment three of the present invention;

Fig. 10 is a schematic structural diagram of the fifth XML document retrieval device provided by embodiment three of the present invention;

Fig. 11 is a schematic structural diagram of the sixth XML document retrieval device provided by embodiment three of the present invention.

Embodiment

To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Embodiment one

Referring to Fig. 1, this embodiment provides a retrieval method for XML documents. The method flow is as follows:

101: Construct a query path according to the keyword input by the user;

102: Construct an Extensible Markup Language (XML) query statement according to the query path and the keyword;

103: Retrieve in the XML document database according to the XML query statement.

Further, before constructing the query path according to the keyword input by the user, the method also comprises:

parsing the acquired XML documents to obtain the structural path information of the XML documents;

clustering the XML documents and structural paths according to the structural path information of the XML documents, and storing the clustered XML documents and structural paths to obtain the XML document database.

Further, after clustering the XML documents and structural paths according to the structural path information of the XML documents, the method also comprises:

building and storing index information for each class of XML documents;

Correspondingly, constructing the query path according to the keyword input by the user specifically comprises:

preprocessing the keyword input by the user, searching for the corresponding index information according to the preprocessed keyword, and determining the class of the corresponding XML documents;

constructing the query path according to the determined class of XML documents.

Optionally, before constructing the query path according to the keyword input by the user, the method also comprises:

receiving an uploaded retrieval algorithm and XML documents, storing the uploaded retrieval algorithm and XML documents at a designated location, and recording the sizes of the uploaded retrieval algorithm and XML documents.

Further, after recording the sizes of the uploaded retrieval algorithm and XML documents, the method also comprises: prompting the user to specify retrieval information, the retrieval information comprising the size of the XML documents and the retrieval algorithm;

Correspondingly, retrieving in the XML document database according to the XML query statement specifically comprises:

retrieving in the XML document database according to the XML query statement and the retrieval information specified by the user.

Preferably, after retrieving in the XML document database according to the XML query statement, the method also comprises: collecting statistics on the retrieval efficiency of the retrieval algorithms, so that the user can select a retrieval algorithm according to the statistics.

After retrieving in the XML document database according to the XML query statement and the retrieval information specified by the user, the method also comprises: displaying the retrieval results and the performance indicators of the retrieval algorithm.

With the method provided by this embodiment, XML retrieval is performed according to keywords, so the user can retrieve without understanding the structure of the XML documents; this reduces the retrieval complexity of XML documents while preserving retrieval accuracy, and also improves the user experience.

Embodiment two

This embodiment provides a retrieval method for XML documents. By performing XML retrieval according to keywords, the method lets the user retrieve without understanding the structure of the XML documents. Referring to Fig. 2, the method flow provided by this embodiment is as follows:

201: Parse the acquired XML documents to obtain the structural path information of the XML documents;

This embodiment does not limit how XML documents are acquired. For single-document data, this embodiment supports a data upload function: the user can click to upload XML document data; only XML document data is accepted, and other data is filtered out. If the uploaded XML document data already exists, the original document is overwritten. The record information of the uploaded XML document data is stored in a txt document, and a JavaBean is provided to operate on the upload records in the txt document (mainly reading and adding records); a dedicated class, AccessTextFile, manipulates the data, algorithm and index records (add, delete, modify). Data upload is implemented through the upload function of JSP. When an upload occurs, DataUploadServlet is called to upload the corresponding data, and if the upload succeeds a record is added to the data record information document. The size of the data can also be recorded during upload so that the user can select it later in advanced search. All uploaded data is placed in a folder under a designated directory, where the data-reading class can find it. Besides accepting XML document data uploaded by users, a web crawler can periodically crawl XML pages on the Internet so that the XML document data is updated dynamically.

When parsing the acquired XML documents, this embodiment adopts the VTD-XML parsing technique: the XML document is first loaded into memory in binary form and then traversed depth-first, and identical paths are merged while their frequencies of occurrence are recorded. The detailed process is as follows:

a) Load the XML document into memory in binary form (as a byte array);

b) Use the AutoPilot class of VTD-XML to navigate to the root node of the XML document, allocate a node stack for the depth-first traversal, and push the root node onto the stack;

c) Traverse the XML document from the root node using depth-first search. For each non-leaf node in the XML document, save its traversal path in a hashtable (hash table): if the path has already appeared, increase its frequency by 1; otherwise insert the node path and set its frequency to 1;

d) When the node stack is empty, the depth-first traversal is finished; store the path and frequency information held in the hashtable in a BDB (Berkeley DB) database.
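The traversal in steps a) to d) can be sketched roughly as follows. This is only an illustration under stated assumptions: it swaps in the JDK's built-in DOM parser for VTD-XML, keeps the result in an in-memory map instead of a BDB database, and treats a "non-leaf node" as an element that has at least one child element. The class and method names (PathFrequencySketch, countPaths) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class PathFrequencySketch {
    /** Returns path -> frequency for every non-leaf element of the document. */
    static Map<String, Long> countPaths(String xml) {
        Element root;
        try {
            // Step a): hand the document to the parser as a byte array.
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes)).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("document could not be parsed", e);
        }
        Map<String, Long> freq = new HashMap<>();
        // Step b): explicit node stack; a parallel stack carries each node's path.
        Deque<Element> nodes = new ArrayDeque<>();
        Deque<String> paths = new ArrayDeque<>();
        nodes.push(root);
        paths.push(root.getTagName());
        // Steps c) and d): depth-first traversal until the stack is empty.
        while (!nodes.isEmpty()) {
            Element current = nodes.pop();
            String path = paths.pop();
            boolean hasChildElement = false;
            for (Node c = current.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c instanceof Element) {
                    hasChildElement = true;
                    nodes.push((Element) c);
                    paths.push(path + "/" + ((Element) c).getTagName());
                }
            }
            if (hasChildElement) {
                // Seen before: frequency + 1; otherwise insert with frequency 1.
                freq.merge(path, 1L, Long::sum);
            }
        }
        return freq;
    }
}
```

For a document with two user elements under a users root, each holding only leaf children, the map would contain the path users once and users/user twice.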

202: Cluster the XML documents and structural paths according to the structural path information of the XML documents;

Specifically, for each XML document, the parsed structural paths can be used to build a Trie, which resembles the DTD (Document Type Definition) tree of the XML document; a Trie is a multiway tree structure used for fast retrieval. For every two XML documents, if their two Tries meet a certain similarity, they can be grouped into one class. Finally, the class structure information after clustering is optimized with the parent-node storage scheme for trees and stored in the BDB database, and the clustered documents are stored in BDB XML for XQuery queries.

A standard Trie is used to match words: each node contains a letter key and 26 letter pointer fields. When a Trie is used to match paths, the key of a node becomes a word, and 26 letter pointers would waste a great deal of storage space. This embodiment therefore stores the pointer field in a hash table, which locates the next word directly in O(1) time. The improved node storage structure is as follows:

import java.util.Hashtable;

class TrieNode {
    private String elemString;                  // stores the element value
    private long freq;                          // records the frequency of the path
    public Hashtable<String, TrieNode> points;  // pointer field: next word -> child node
}

The parsed XML structural information is stored in the form of paths. Building the structural paths of an XML document into a Trie is a process of repeatedly inserting nodes, as follows:

a) For a path P of length L, split it into an array E of L words;

b) If the root node is empty, set the root node key to E[0]; otherwise go to c);

c) Look up the next element of path P in the hash pointer field of the current node. If it is found, go to d); if it is not found, insert a node keyed by this element into the hash pointer field of the current node, then go to d);

d) Repeat step c) until all elements of P have been located; path P has then been inserted successfully.

Given an XML document structure Trie and a path P, the prefix matching process of path P is as follows:

a) Split P into an array E of L words;

b) Let the current node be the root node and compare the root node key with the first element of E. If they are equal, go to c); otherwise the matching ends and null is returned;

c) Locate the next element of E in the hash pointer field of the current node. If it is found, set the current node to the located node and repeat c); otherwise the matching ends and the prefix matching path prefixMatchedPath is returned.
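The insertion and prefix-matching steps above can be sketched in Java as follows. This is a minimal illustration, not the patented implementation: paths are assumed to be "/"-separated element names, a HashMap stands in for the Hashtable pointer field, the frequency is assumed to be counted at the node where a path ends, and inserting a path whose first element differs from the existing root is rejected (the text does not say how that case is handled). All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PathTrieSketch {
    static final class Node {
        final String element;                                  // key of the node: one path element
        long freq;                                             // assumed: number of paths ending here
        final Map<String, Node> children = new HashMap<>();    // hash pointer field
        Node(String element) { this.element = element; }
    }

    private Node root;

    /** Inserts a path such as "user/personalInfo/name" (insertion steps a to d). */
    void insert(String path) {
        String[] e = path.split("/");
        if (root == null) {
            root = new Node(e[0]);                             // step b): empty root takes E[0]
        }
        if (!root.element.equals(e[0])) {
            throw new IllegalArgumentException("path does not start at the trie root");
        }
        Node current = root;
        for (int i = 1; i < e.length; i++) {
            // Step c): locate the next element, inserting a node if it is missing.
            current = current.children.computeIfAbsent(e[i], Node::new);
        }
        current.freq++;
    }

    /** Returns the longest prefix of the path found in the trie, or null if the root differs. */
    String prefixMatch(String path) {
        String[] e = path.split("/");
        if (root == null || !root.element.equals(e[0])) {
            return null;                                       // step b): root mismatch
        }
        List<String> matched = new ArrayList<>();
        matched.add(e[0]);
        Node current = root;
        for (int i = 1; i < e.length; i++) {
            Node next = current.children.get(e[i]);
            if (next == null) {
                break;                                         // step c): location failed
            }
            matched.add(e[i]);
            current = next;
        }
        return String.join("/", matched);
    }
}
```

After inserting user/personalInfo/name, matching user/personalInfo/age would return user/personalInfo, while a path that starts with a different root element returns null.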

To measure the similarity of two document structure Tries, the following two definitions are given first:

Definition 1: total path length Lp of an XML document structure Trie

For a given XML document x, if x has n_1 paths and the length of the i-th path is l_i, then the total length is

L_p = \sum_{i=1}^{n_1} l_i

Definition 2: similarity θ of two XML document structures

For two given XML documents x_1 and x_2, let x_1 be the matched object and x_2 the matching object, let the total Trie path length of x_1 be Lp_1, and let the total path length of the prefix matches be Lpp. The structural similarity θ(x_1, x_2) of the two XML documents is defined in terms of these quantities.

If θ(x_1, x_2) reaches the threshold θ_Threshold, the two document structures are considered similar and can be grouped into one class; otherwise the two document structures are dissimilar. Here 0 ≤ θ_Threshold ≤ 1, and θ_Threshold is defined by the user: the larger θ_Threshold is, the higher the similarity required for clustering; the smaller it is, the lower the requirement.

Suppose there are N XML documents x_1, x_2, ..., x_N, whose structural information is stored in Stru_1, Stru_2, ..., Stru_N respectively, where the array Stru_i stores all paths of document x_i. The clustering algorithm based on Trie matching is described as follows:

1) Define an array currentXMLFiles that records the names of the XML documents not yet clustered, initialized to all document names; L is the length of the array currentXMLFiles;

2) Take all paths Stru_1 of the first element x_1 in currentXMLFiles and build the structure Trie_1 from Stru_1;

3) Take Stru_i of x_i (i = 2..L). For every path p_i of Stru_i, perform a matching query in Trie_1 to obtain the total prefix-matched path length Lpp_i. If the resulting similarity reaches θ_Threshold (θ_Threshold = 0.5 in this algorithm), go to 4); otherwise, if i < L continue with 3), else go to 5);

4) Insert every path p_i of Stru_i into Trie_1 in the same way the Trie is built, that is, save all path information of this class, and delete x_1 from currentXMLFiles. If i < L continue with 3), otherwise go to 5);

5) Save the structural information Trie_1 of the completed class and the document names X that belong to it; the document names can be regarded as the index information of the class. If the array currentXMLFiles is not empty, continue with 2); otherwise the clustering ends.
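A greedy clustering loop along the lines of steps 1) to 5) might look like the sketch below. Several points are assumptions of this sketch rather than statements of the source: the similarity is taken as the prefix-matched length divided by the total path length of the candidate document (the source's exact ratio is not reproduced here), every document that joins a class is removed from the unclustered list together with the seed document, and paths are "/"-separated element names. All identifiers are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TrieClusteringSketch {
    /** Minimal path trie: each node maps an element name to its child node. */
    static final class Node {
        final Map<String, Node> children = new HashMap<>();
    }

    static void insert(Node top, String path) {
        Node current = top;
        for (String e : path.split("/")) {
            current = current.children.computeIfAbsent(e, k -> new Node());
        }
    }

    /** Number of leading elements of the path that are present in the trie. */
    static int matchedLength(Node top, String path) {
        Node current = top;
        int n = 0;
        for (String e : path.split("/")) {
            current = current.children.get(e);
            if (current == null) break;
            n++;
        }
        return n;
    }

    /** Assumed similarity: matched prefix length over the candidate's total path length. */
    static double similarity(Node trie, List<String> paths) {
        int lpp = 0;
        int lp = 0;
        for (String p : paths) {
            lpp += matchedLength(trie, p);
            lp += p.split("/").length;
        }
        return lp == 0 ? 0.0 : (double) lpp / lp;
    }

    /** docs maps document name -> structural paths; returns the document names of each class. */
    static List<List<String>> cluster(Map<String, List<String>> docs, double threshold) {
        Map<String, List<String>> remaining = new LinkedHashMap<>(docs);   // step 1)
        List<List<String>> clusters = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Step 2): seed the class trie from the first unclustered document.
            String seed = remaining.keySet().iterator().next();
            Node trie = new Node();
            for (String p : remaining.remove(seed)) insert(trie, p);
            List<String> members = new ArrayList<>();
            members.add(seed);
            // Steps 3) and 4): merge every sufficiently similar document into the class.
            for (String name : new ArrayList<>(remaining.keySet())) {
                List<String> paths = remaining.get(name);
                if (similarity(trie, paths) >= threshold) {
                    for (String p : paths) insert(trie, p);
                    members.add(name);
                    remaining.remove(name);
                }
            }
            clusters.add(members);                                          // step 5)
        }
        return clusters;
    }
}
```

With a threshold of 0.5, two documents that both describe user records with mostly shared paths would fall into one class, while a document rooted at a different element would start a class of its own.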

203: Store the clustered XML documents and structural paths to obtain the XML document database;

In this step, the path information of each Trie after clustering could be stored directly, but path information contains a great deal of redundant prefix information and wastes storage space. This embodiment therefore proposes a parent-node storage method for trees. When path information is stored directly, user/personalInfo may be stored 6 times, whereas under the parent-node storage of the tree the pair <personalInfo, user> is stored only once, which saves considerable storage space. In the experimental data, the optimized parent-node storage saves 60-70% of the space needed for structural information compared with path storage; moreover, under the parent-node storage scheme, a query path can be generated dynamically for a given keyword with constant-level time complexity.
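The parent-node storage idea can be illustrated with a small sketch. It assumes that each element name has a single parent within one class, so that a map from child to parent is enough to rebuild the path from the root to a keyword; the source does not state how repeated element names are handled. The class and method names are hypothetical, and an in-memory map stands in for the BDB database.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class ParentStoreSketch {
    // child element -> parent element, e.g. personalInfo -> user
    private final Map<String, String> parentOf = new HashMap<>();

    /** Stores each <child, parent> pair of a path once; repeated prefixes add nothing. */
    void addPath(String path) {
        String[] e = path.split("/");
        for (int i = 1; i < e.length; i++) {
            parentOf.put(e[i], e[i - 1]);
        }
    }

    int storedPairs() {
        return parentOf.size();
    }

    /** Rebuilds the root-to-element path by following parent links upward. */
    String pathTo(String element) {
        Deque<String> parts = new ArrayDeque<>();
        int remainingHops = parentOf.size() + 1;   // bounds the walk if the data contains a cycle
        String current = element;
        while (current != null && remainingHops-- > 0) {
            parts.addFirst(current);
            current = parentOf.get(current);        // one hash lookup per level
        }
        return String.join("/", parts);
    }
}
```

Storing user/personalInfo/name, user/personalInfo/age and user/personalInfo/address/city this way keeps five pairs, and the pair for personalInfo and user appears only once even though all three paths share that prefix.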

For large-scale XML document data, the document set first needs to be clustered; after clustering, documents with the same structure are placed in one class, and the structural information of each class is saved (this is done during clustering). The XML documents belonging to the same class are put into the same container, and index information is built for each class of XML documents, so that the index information can be searched by keyword to determine the container name and narrow the retrieval range step by step.

The above mainly explains why documents are assigned to different containers by class. The storage process is as follows:

Step 1: Initialize the BDB XML environment variable. The conventional approach is to create a separate class to carry out the related operations; this embodiment creates a class (myDbEnv), and initializing the environment variable only requires instantiating this class;

Step 2: Obtain the XMLManager object to operate on the BDB database; some configuration information also needs to be created to complete the related operations;

Step 3: Open the designated container, creating it if it does not exist;

Step 4: Store all XML documents into the corresponding container in a for loop.

Optionally, after the XML document database is obtained, the method provided by this embodiment also supports resetting the database, that is, rebuilding it. When a reset occurs, the original database files are deleted first; the clustering algorithm is then restarted to cluster all existing documents and build the storage index information; finally, according to the clustering result, XML documents of the same class are stored in the same container.

204: Preprocess the keyword input by the user, and construct a query path according to the preprocessed keyword;

Specifically, when the keyword input by the user is preprocessed, the input can first be segmented on runs of spaces, after which the string edit distance is calculated and synonym correction is performed. The detailed process is as follows:

1) Segment the input on multiple spaces with a regular expression;

One or more spaces are used as the separator for segmentation.

2) Calculate the edit distance between two strings;

edit(i, j) = \begin{cases} 1, & i = j = 0 \\ edit(0, j-1) + 1, & i = 0,\ j > 0 \\ edit(i-1, 0) + 1, & i > 0,\ j = 0 \\ \min\big(edit(i-1, j) + 1,\ edit(i, j-1) + 1,\ edit(i-1, j-1) + f(i, j)\big), & i, j > 0 \end{cases}

Here edit(i, j) denotes the edit distance between the substring s[0..i] of the source string S and the substring t[0..j] of the target string T. f(i, j) denotes the number of operations needed to transform the i-th character s(i) of S into the j-th character t(j) of T: if s(i) == t(j), no operation is needed and f(i, j) = 0; otherwise a replacement is needed and f(i, j) = 1.

The edit distance problem between long strings is thus reduced step by step to the edit distance problem between shorter strings, down to strings of a single character, for which the edit distance here is 1.

3) Judge whether the edit distance d is less than the correction threshold D; if so, perform 4); if not, perform 5);

This embodiment does not limit the size of the correction threshold D, which can be set according to the actual situation. If the edit distance d is less than the correction threshold D, the keyword is handled in step 4) without synonym correction; otherwise the keyword needs synonym correction in step 5).

4) If the edit distance d = 0, the keyword is judged to have been input correctly and preprocessing ends; if 0 < d < D, spelling correction is applied to the keyword and preprocessing then ends;

5) Apply synonym correction to the keyword; if the correction succeeds, preprocessing ends. If the correction does not succeed, the keyword is judged to be an input error, an error prompt is returned, and preprocessing ends.
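The preprocessing steps 1) to 5) can be sketched as below. The sketch makes several assumptions that are not in the source: the edit distance is the standard Levenshtein formulation with empty prefixes as the base case (not the single-character base case written above), the keyword is compared against a vocabulary of known element names, and a plain map stands in for the synonym database. The names KeywordPreprocessSketch, tokenize, editDistance and correct are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class KeywordPreprocessSketch {
    /** Step 1): split on runs of one or more whitespace characters. */
    static List<String> tokenize(String input) {
        return Arrays.stream(input.trim().split("\\s+"))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    /** Step 2): standard Levenshtein distance computed by dynamic programming. */
    static int editDistance(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++) {
            for (int j = 1; j <= t.length(); j++) {
                int f = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;   // replacement cost
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + f);
            }
        }
        return d[s.length()][t.length()];
    }

    /**
     * Steps 3) to 5): returns the keyword to search with, or null when the
     * input cannot be corrected and an error prompt should be shown.
     */
    static String correct(String keyword, List<String> vocabulary,
                          Map<String, String> synonyms, int threshold) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String word : vocabulary) {
            int dist = editDistance(keyword, word);
            if (dist < bestDistance) {
                bestDistance = dist;
                best = word;
            }
        }
        if (bestDistance < threshold) {
            return best;              // d = 0: input was correct; 0 < d < D: spelling corrected
        }
        return synonyms.get(keyword); // synonym correction; null signals an input error
    }
}
```

With a threshold of 3, a misspelling such as pirce would be corrected to the vocabulary word price, while a word far from every vocabulary entry falls through to the synonym lookup.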

Specifically, when synonym correction of a keyword is performed, the keyword can be used as the key to search the myElemSynSetDB database; if the search succeeds, the data value found is used as the synonym correction. After keyword preprocessing, if the keyword was input correctly or was successfully corrected by spelling correction or synonym correction, the input keyword is stored in the myFrequentInputDB database, which supports auto-completion, related search and popular vocabulary. Returning related searches means matching the preprocessed keyword against the myFrequentInputDB database and returning the 6 words with the highest matching degree, which are shown as related searches in the showOptions module of the result page; the matching algorithm ranks candidates by the number of identical characters shared by the two strings. Returning popular vocabulary likewise queries the myFrequentInputDB database and returns the 6 input words with the highest frequency of occurrence, which are shown as popular vocabulary in the showOptions module of the result page. When a keyword is stored in the myFrequentInputDB database, if the keyword already appears in the database its frequency is increased by 1; otherwise the keyword is inserted with frequency 1.
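The frequency bookkeeping and the popular-vocabulary lookup can be sketched as follows. An in-memory map stands in for the myFrequentInputDB database, ties are broken alphabetically (an assumption of this sketch), and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class FrequentInputSketch {
    private final Map<String, Long> frequency = new HashMap<>();

    /** Existing keyword: frequency + 1; new keyword: inserted with frequency 1. */
    void record(String keyword) {
        frequency.merge(keyword, 1L, Long::sum);
    }

    /** Returns up to 'limit' keywords, most frequent first (the text uses a limit of 6). */
    List<String> hotWords(int limit) {
        return frequency.entrySet().stream()
                .sorted((a, b) -> {
                    int byFrequency = Long.compare(b.getValue(), a.getValue());
                    return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey());
                })
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```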

Further, after keyword preprocessing ends, the corresponding index information is searched according to the preprocessed keyword, the class of the corresponding XML documents is determined, and the query path is constructed according to the determined class, as follows:

1) Locate the clusters to which the keyword belongs according to the element inverted list, and save them in the array clusterNames;

2) For each cluster database, use the hash parent key pairs to dynamically construct the path from the root node to the query keyword;

3) For a single keyword, directly save the path and its frequency under this cluster and go to 4); otherwise use Trie matching to obtain the common prefix path, save the common prefix path and its frequency under this cluster, and go to 4);

4) Repeat 2) until every cluster in the array clusterNames has been visited, then go to 5);

5) Sort the paths in descending order of their frequency of occurrence, which provides the first-level ranking for the XQuery query results.
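Steps 3) and 5) can be illustrated with a short sketch. It computes the common prefix of several root-to-keyword paths element by element (a simple stand-in for the Trie matching query mentioned above) and sorts candidate query paths by descending frequency. The QueryPath holder and all other names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class QueryPathRankingSketch {
    /** One candidate query path found in one cluster, with its recorded frequency. */
    static final class QueryPath {
        final String cluster;
        final String path;
        final long freq;
        QueryPath(String cluster, String path, long freq) {
            this.cluster = cluster;
            this.path = path;
            this.freq = freq;
        }
    }

    /** Step 3): longest common prefix, element by element, of the given paths. */
    static String commonPrefix(List<String> paths) {
        if (paths.isEmpty()) {
            return "";
        }
        String[] prefix = paths.get(0).split("/");
        int len = prefix.length;
        for (String p : paths) {
            String[] e = p.split("/");
            len = Math.min(len, e.length);
            for (int i = 0; i < len; i++) {
                if (!prefix[i].equals(e[i])) {
                    len = i;      // the shared prefix stops before the first difference
                    break;
                }
            }
        }
        return String.join("/", Arrays.copyOf(prefix, len));
    }

    /** Step 5): highest frequency first. */
    static void sortByFrequency(List<QueryPath> paths) {
        paths.sort((a, b) -> Long.compare(b.freq, a.freq));
    }
}
```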

In the dynamic construction of the query path, steps 2) and 3) are the core: the hash parent key pairs are used to dynamically construct the path from the root node to the keyword, and Trie matching is used to find the common prefix path. For example, after the XML documents are clustered and the XML documents belonging to the automobile class are stored, the index information built for that class is "automobile". When the keyword input by the user is an automobile brand or a word related to automobiles, the corresponding index information "automobile" can be found from the keyword, and the class of the corresponding XML documents can be determined. Constructing the query path within this class narrows the retrieval range and improves retrieval efficiency.

205: Construct an XML query statement according to the query path and the preprocessed keyword, and retrieve in the XML document database according to the XML query statement.

Specifically, the query path has been produced by the prefix algorithm described above. When retrieval occurs, the query path and the text keyword are used to create an XQuery (XML Query) statement, which pinpoints the container that holds the documents to be retrieved and so makes retrieval more efficient and more accurate.

During retrieval, the corresponding container is opened and queried with the XQuery statement in the following steps:

Step 1: Locate and open the container. After the environment variable has been initialized as described above, open the corresponding container using the given container name;

Step 2: Construct the XQuery statement and execute the query;

Step 3: Process the query results obtained.
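How the XQuery statement in the second step might be assembled from a container name, a query path and a keyword is sketched below. The exact statement shape is an assumption of this sketch (a collection() call on the container followed by the path and a contains() predicate); the source does not give the statement text. Only string building is shown, without any BDB XML API calls, and the class and method names are hypothetical.

```java
final class XQueryBuilderSketch {
    /** Builds an XQuery string that selects elements on the path whose text contains the keyword. */
    static String build(String container, String queryPath, String keyword) {
        // Accept only simple element paths such as "cars/car/brand" to avoid injecting query syntax.
        if (!queryPath.matches("[A-Za-z_][\\w.-]*(/[A-Za-z_][\\w.-]*)*")) {
            throw new IllegalArgumentException("unexpected query path: " + queryPath);
        }
        return "collection(\"" + escape(container) + "\")/" + queryPath
                + "[contains(., \"" + escape(keyword) + "\")]";
    }

    /** Escapes characters that are special inside a double-quoted XQuery string literal. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("\"", "&quot;");
    }
}
```

For a container named cars.dbxml, the path cars/car/brand and the keyword BMW, the result would be collection("cars.dbxml")/cars/car/brand[contains(., "BMW")].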

When retrieving in the XML document database according to the XML query statement, this embodiment does not limit the specific retrieval algorithm used. As with obtaining XML documents, the method provided by this embodiment also supports uploading retrieval algorithms, and the upload is likewise implemented through upload + Servlet. When an algorithm is uploaded, AlgorithmUploadServlet is called to transfer the algorithm's jar package to a designated directory, which here is set to the lib directory under webroot (to support reflection). Files other than jar packages are not supported: an uploaded file that is not a jar package is intercepted. If the upload succeeds, the algorithm record file is updated; after the upload, the system should be restarted so that it can recognize the uploaded algorithm and read the algorithm's classes.

For query results there are generally two ways of processing: one is to store the query results directly in a result container, the other is to write them to an XML file. Because the query results need to be sent to the page for display, the processing is relatively complex and is not detailed here; it suffices to return the XMLResults object of the query results.

The method provided by this embodiment applies two-level sorting to the result documents. The first-level sort is performed when the query paths are generated and is based on the path frequencies counted during clustering, with higher-frequency paths ranked first. The second-level sort follows classical TF (Term Frequency)/IDF (Inverse Document Frequency).
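The two-level ordering can be sketched as a two-key comparator. This is a simplification: in the embodiment the first level is applied while query paths are generated, whereas the sketch sorts finished hits by both keys at once. The Hit class and its fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical ranking helper for the two-level ordering.
class ResultRanker {
    static final class Hit {
        final String docId;
        final int pathFrequency; // frequency of the hit's path, counted during clustering
        final double tfIdf;      // TF/IDF score of the hit

        Hit(String docId, int pathFrequency, double tfIdf) {
            this.docId = docId;
            this.pathFrequency = pathFrequency;
            this.tfIdf = tfIdf;
        }
    }

    // Higher path frequency first; ties are broken by higher TF/IDF score.
    static List<Hit> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Hit h) -> h.pathFrequency).reversed()
                .thenComparing(Comparator.comparingDouble((Hit h) -> h.tfIdf).reversed()));
        return sorted;
    }
}
```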

Definition: the attribute-value structure ratio is denoted ω_t(v_i, t_i). A keyword t_i may be contained in one Text node or in several Text nodes, so one t_i may belong to several attributes. This step follows TF/IDF theory and refines it by changing the granularity at which TF/IDF is computed to the elements of the XML document. The more often t_i occurs in the attribute values under an Element node v_i, the better t_i expresses the meaning of attribute v_i; the fewer attribute values an Element node v_i contains, the better the keyword t_i describes the meaning of v_i. The structure ratio of attribute value t_i is therefore computed as:

$$\omega_t(v_i, t_i) = \frac{tf(v_i, t_i) \times idf(v_i, t_i)}{\max\{\, tf(v_i, t_i) \times idf(v_i, t_i) \mid 1 \leq i \leq n \,\}}$$
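The formula divides each tf × idf product by the largest such product over the n candidates, so the best candidate scores 1. A minimal sketch with invented numbers follows; the class and method names are hypothetical, and how tf and idf themselves are obtained is outside the sketch.

```java
// Hypothetical helper: max-normalised tf*idf, following the formula above.
class StructureRatio {
    // tf[i] and idf[i] belong to the i-th (v_i, t_i) pair.
    static double[] normalize(double[] tf, double[] idf) {
        if (tf.length != idf.length) {
            throw new IllegalArgumentException("tf and idf must have the same length");
        }
        double[] weights = new double[tf.length];
        double max = 0.0;
        for (int i = 0; i < tf.length; i++) {
            weights[i] = tf[i] * idf[i];
            max = Math.max(max, weights[i]);
        }
        // Leave all weights at zero if every product is zero.
        if (max > 0.0) {
            for (int i = 0; i < weights.length; i++) {
                weights[i] /= max;
            }
        }
        return weights;
    }
}
```

For example, with tf = (2, 1, 4) and idf = (0.5, 1.0, 0.5) the products are (1, 1, 2), and the structure ratios are (0.5, 0.5, 1.0).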

When retrieval is executed according to the XQuery statement, each execution yields an XMLResults object containing all records that satisfy the XQuery condition. All XMLResults objects from one retrieval are kept in an XMLResultsSet object, and the data volume can be very large. Converting every record in all XMLResults objects into records for front-end display would take too long (generally more than 3 s), so this embodiment adopts a segmented strategy. The specific steps are:

First, 100 records are fetched for display (if fewer than 100 exist, all are fetched). When the user pages past page 10, the next 100 are fetched (the records fetched earlier are all retained). Because clicking to the next page takes the user very little time, the user does not notice whether newly fetched records are being shown; the roughly 3 s conversion cost is spread across the user's page clicks, so the perceived response time improves by about 3 s. In addition, because EntityBean integrates the various pieces of information used for front-end display, and this embodiment uses the JavaScript dTree to display the XML tree structure, the content text is converted into an XML tree (the directory tree shown on the front end) when an XMLSegment is converted into an EntityBean.
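The segmented strategy can be sketched as a lazily filled list. The sketch assumes that raw records arrive through an Iterator and that conversion to display records happens as they are pulled; SegmentedResults and its methods are hypothetical names, not classes from the embodiment.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical pager: pulls records in batches only when a page needs them.
class SegmentedResults<T> {
    private final Iterator<T> source;
    private final int batchSize;
    private final List<T> loaded = new ArrayList<>(); // earlier batches are kept

    SegmentedResults(Iterator<T> source, int batchSize) {
        this.source = source;
        this.batchSize = batchSize;
    }

    // Returns the requested page, loading further batches if the page
    // reaches beyond what has been loaded so far.
    List<T> page(int pageIndex, int pageSize) {
        int from = pageIndex * pageSize;
        int to = from + pageSize;
        while (loaded.size() < to && source.hasNext()) {
            loadBatch();
        }
        int end = Math.min(to, loaded.size());
        List<T> result = new ArrayList<>();
        if (from < end) {
            result.addAll(loaded.subList(from, end));
        }
        return result;
    }

    int loadedCount() {
        return loaded.size();
    }

    private void loadBatch() {
        for (int i = 0; i < batchSize && source.hasNext(); i++) {
            loaded.add(source.next());
        }
    }
}
```

With a batch size of 100 and 10 records per page, pages 1 to 10 are served from the first batch, and the second batch is pulled only when the user moves on to page 11.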

Furthermore, for XML document retrieval, different algorithms differ in retrieval time, accuracy, and ranking. To provide a platform for algorithm research from which a more complete algorithm can be selected, making XML document retrieval more efficient and enabling faster and more accurate retrieval over large single documents, the method provided by this embodiment also supports different algorithms, data, and indexes: after preprocessing in the back end, the user selects an algorithm and data on the front end to run an advanced query. The method can also collect statistics on the retrieval efficiency of the retrieval algorithms in advanced search and can display their performance indicators. By viewing the statistics, the user can compare the retrieval efficiency of the algorithms for a given data set and keywords; by viewing the performance indicators, the user can compare the algorithms' performance and select an algorithm accordingly. This realizes dynamic use of algorithms.

Dynamic use of algorithms requires dynamic class loading and therefore involves the reflection mechanism. Reflection is the key feature that lets Java be regarded as a dynamic (or quasi-dynamic) language. It allows a program, at run time, to obtain through the Reflection APIs the internal information of any class whose name is known, including its modifiers (such as public and static), its superclass (for example Object), the interfaces it implements (for example Cloneable), and all information about its fields and methods; fields can be changed and methods invoked at run time. This embodiment uses one aspect of reflection: loading a class definition by name. Class.forName("package name.class name") returns a Class object that holds information about the class, including its constructors, member functions, and variables. Calling newInstance() on that Class object yields an Object that is actually an instance of "package name.class name" and can be cast to the type actually needed. In practice, while the back-end site initializes, the folder holding the algorithms is scanned, all conforming jar files are loaded into the site, and the results are stored in a global variable. When an algorithm is needed later, it is looked up by name in that global variable, and the algorithm class found is called through a common interface.
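Loading by name and calling through a common interface can be sketched as follows. The registry class is hypothetical, and java.util.ArrayList with the java.util.List interface stands in for an uploaded algorithm class and its public interface. The sketch calls getDeclaredConstructor().newInstance() because Class.newInstance(), the call named in the text, is deprecated in current Java releases; scanning and loading jar files is not shown.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry: plays the role of the global variable that holds
// the loaded algorithm instances, keyed by class name.
class AlgorithmRegistry {
    private final Map<String, Object> byName = new HashMap<>();

    // Loads the class by its fully qualified name and stores one instance.
    void register(String className) {
        try {
            Class<?> cls = Class.forName(className);
            byName.put(className, cls.getDeclaredConstructor().newInstance());
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot load " + className, e);
        }
    }

    // Looks an instance up by name and casts it to the shared interface;
    // returns null if no class was registered under that name.
    <T> T lookup(String className, Class<T> iface) {
        return iface.cast(byName.get(className));
    }
}
```

A caller would register each algorithm class found at start-up and later obtain it with lookup(name, SomeAlgorithmInterface.class), where the interface name is again only a placeholder.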

Advanced search uses the same preprocessing as the multi-file retrieval in the steps above: the XML document is traversed depth-first with VTD-XML. The difference is that more information is collected, which is reflected in two database stores, SLCA+xmlfileName+TokensIdDB and SLCA+xmlfileName+TokensRecordDB. Note that before an XML file is analyzed, line breaks and the spaces and Tab characters around them are first removed from the XML document. This saves storage space once the document is stored, removes redundant symbols, normalizes the XML document format, and makes it easier to compute the endOffset attribute value of Object nodes. The regular-expression processing is: xmlValue=xmlValue.replaceAll (" * n{1, * ", " "); Next, after the converged Object node has been obtained, the XML document must be loaded into memory before the corresponding fragment can be taken out. If the XML document is small, it is loaded into memory directly; if it is large, it need not be loaded in full, and only the part of the document containing the fragment is loaded. This embodiment does not limit the storage mode; taking block-wise storage of the XML document as an example, the block size may be 1M, and a final block smaller than 1M is stored as it is.
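The whitespace clean-up described above can be sketched with a single regular expression. The pattern below is an assumption that matches the description (line breaks plus the spaces and tabs around them); it is not a reconstruction of the expression printed in the text, and the class name is hypothetical.

```java
// Hypothetical helper for the pre-analysis clean-up of an XML string.
class XmlWhitespace {
    // Removes every run of line breaks together with the spaces and tabs
    // immediately before and after it. Spaces inside a line are kept.
    static String strip(String xmlValue) {
        return xmlValue.replaceAll("[ \\t]*[\\r\\n]+[ \\t]*", "");
    }
}
```

Text that is wrapped across lines is joined without a separator by this pattern, so it suits data-oriented XML better than mixed content.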

For storage, SLCA+xmlfileName+TokensIdDB is used: tokenValue is the string value of each word in the document, tokenId is the number of that word in the document, and the TokensIdDB storage format is <tokenValue, tokenId>. During depth-first traversal, each non-leaf node is given a unique number, and the text words of a leaf node share one number. A keyword entered by the user can thus be located directly by its number in the document; if the keyword occurs several times in the document, several numbers are obtained. In addition, when the text words of leaf nodes are stored, this embodiment removes stop words such as "a" and "the" to save storage space. The TokensRecordDB storage format is <tokenId, tokenRecord>, where tokenRecord is a byte array whose length depends on whether the token is an object node: 23 if it is, 11 if it is not. To save storage space, all attributes of a tokenRecord are converted to bytes when it is stored.

For a given keyword, whitespace-based word segmentation is performed first, and the tokenId of each keyword is then looked up in TokensIdDB. All token ids are converged upward until an object node is reached, and the converged tokenRecord is taken from TokensRecordDB. Finally, the corresponding fragment is read from the disk file using the start offset and end offset of each tokenRecord, and the results are sorted by tf*idf. The three processes in keyword retrieval, namely converging tokenIds, fetching tokenRecords from TokensRecordDB, and reading fragments by offset, are described in detail below:
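The <tokenValue, tokenId> mapping can be sketched as an in-memory map. The sketch stands in for TokensIdDB only conceptually: the real store is a database file, the stop-word list here contains just the two words named in the text, and lower-casing the words is an added assumption. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the tokenValue -> tokenId store.
class TokenIdIndex {
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "the"));
    private final Map<String, List<Integer>> idsByValue = new HashMap<>();

    // Records that tokenValue occurs under the node numbered tokenId;
    // stop words are skipped to save space.
    void add(String tokenValue, int tokenId) {
        String key = tokenValue.toLowerCase();
        if (STOP_WORDS.contains(key)) {
            return;
        }
        idsByValue.computeIfAbsent(key, k -> new ArrayList<>()).add(tokenId);
    }

    // A keyword that occurs several times yields several ids.
    List<Integer> lookup(String keyword) {
        return idsByValue.getOrDefault(keyword.toLowerCase(), Collections.<Integer>emptyList());
    }
}
```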

1) Converging tokenIds

For an input of N keywords, the first two keywords are converged first; the resulting node is then converged with the third node, and so on iteratively until everything has converged to a single node. Fig. 3 shows, as an example, the convergence of three keywords K1, K2, and K3 in an XML document tree, where the arrowed lines represent the convergence process.
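The pairwise convergence can be sketched as a repeated lowest-common-ancestor computation. The sketch assumes that each node id has a known parent id held in a map; it stops at the common ancestor and does not model the further climb to an object node. The class name and the map representation are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: converges N node ids to one node, two at a time.
class TokenConverger {
    // parent maps a node id to its parent id; the root has no entry.
    static int converge(Map<Integer, Integer> parent, List<Integer> ids) {
        int current = ids.get(0);
        for (int i = 1; i < ids.size(); i++) {
            current = lowestCommonAncestor(parent, current, ids.get(i));
        }
        return current;
    }

    static int lowestCommonAncestor(Map<Integer, Integer> parent, int a, int b) {
        // Collect a and all of its ancestors.
        Set<Integer> ancestors = new HashSet<>();
        for (Integer n = a; n != null; n = parent.get(n)) {
            ancestors.add(n);
        }
        // The first node on b's upward path that is also above a is the answer.
        for (Integer n = b; n != null; n = parent.get(n)) {
            if (ancestors.contains(n)) {
                return n;
            }
        }
        throw new IllegalArgumentException("nodes are not in the same tree");
    }
}
```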

2) Fetching tokenRecords from TokensRecordDB

The tokenRecords taken from TokensRecordDB are of two types: for an Object node, a byte array of length 23 is taken out; otherwise, a byte array of length 11. For attribute values of a tokenRecord such as id, pid, endOffset, and type, this embodiment provides an intermediate class, SlcaTokenRecord, which is created directly from the byte array and whose methods return id, pid, endOffset, type, and the other attribute values, which makes it convenient to convert particular positions of the byte array to int values. For example, if type has only one digit, it need not be stored as a 4-byte int: one byte is converted to a string and then to an int, which saves 3 bytes of storage. If type has only three digits, it likewise need not be stored as 4 bytes: 3 bytes are converted to a string and then to an int, which saves 1 byte of storage.

3) Reading fragments by offset

According to the start offset and end offset of the converged id, the disk file block is loaded into memory and the corresponding fragment is read. Note that if the start offset and end offset of the converged id fall within the same block, that block is loaded into memory and read directly. If they do not fall within the same block, two or more blocks must be loaded into memory and the end offset recalculated across blocks; the result fragment is obtained after splicing.
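Reading a fragment that may span blocks can be sketched with in-memory arrays standing in for the disk blocks. The sketch assumes fixed-size blocks (only the last one may be shorter), a start offset that is inclusive, and an end offset that is exclusive; the embodiment does not state these conventions, and the class name is hypothetical.

```java
// Hypothetical reader: copies bytes [start, end) out of block-wise storage.
class BlockReader {
    static byte[] read(byte[][] blocks, int blockSize, int start, int end) {
        byte[] out = new byte[end - start];
        int firstBlock = start / blockSize;
        int written = 0;
        for (int b = firstBlock; written < out.length; b++) {
            // Only the first block is entered part-way; later blocks start at 0.
            int from = (b == firstBlock) ? start % blockSize : 0;
            int len = Math.min(blocks[b].length - from, out.length - written);
            System.arraycopy(blocks[b], from, out, written, len);
            written += len; // splice the pieces together in order
        }
        return out;
    }
}
```

When start and end fall in the same block the loop runs once; otherwise it touches each block between the two offsets.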

For retrieval over massive numbers of XML documents on the Internet, this embodiment proposes a distributed architecture that improves retrieval efficiency considerably compared with traditional distributed retrieval; the distributed retrieval architecture is shown in Fig. 4. A query server (the slave server) need not return all result information: it returns only the summary information to be displayed, the ranking score, and its own server address, while the detailed result information stays in its own database. This greatly reduces the network bandwidth a query server needs to return results. The terminal server (the master server) sorts the result summaries returned by all query servers and returns them to the user. When the user clicks a summary to view the detailed result, the web page is redirected to the query server where that result resides, and the result only needs to be fetched from that query server and shown to the user.

This embodiment does not specifically limit how XML documents are displayed. In practice, an XML DOM (Document Object Model) object can be used to display XML document fragments dynamically. An XML DOM object loads an XML file or a string through the load and loadXML methods; XML DOM loads XML files asynchronously by default, and XML fragments or string information are loaded asynchronously when the user clicks.
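The master-side merge can be sketched as follows, assuming that each query server has already returned its list of summaries. The Summary class, its fields, and the server addresses in the example are hypothetical; network transport and the redirect to the owning server are not shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical merge step on the terminal (master) server.
class SummaryMerger {
    static final class Summary {
        final String text;          // summary shown to the user
        final double score;         // ranking score computed by the query server
        final String serverAddress; // where the detailed result is stored

        Summary(String text, double score, String serverAddress) {
            this.text = text;
            this.score = score;
            this.serverAddress = serverAddress;
        }
    }

    // Concatenates the per-server lists and sorts by score, best first.
    static List<Summary> merge(List<List<Summary>> perServer) {
        List<Summary> all = new ArrayList<>();
        for (List<Summary> part : perServer) {
            all.addAll(part);
        }
        all.sort(Comparator.comparingDouble((Summary s) -> s.score).reversed());
        return all;
    }
}
```

Because every summary carries its server address, the front end can redirect a click on a summary to the query server that holds the full result.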

Embodiment three

Referring to Fig. 5, this embodiment provides a retrieval device for XML documents, the device comprising:

A first building module 501, configured to construct a query path according to the keyword entered by the user;

A second building module 502, configured to construct an Extensible Markup Language (XML) query statement according to the keyword and the query path constructed by the first building module 501;

A retrieval module 503, configured to retrieve in the XML document database according to the XML query statement constructed by the second building module 502.

Referring to Fig. 6, the device further comprises:

A parsing module 504, configured to parse the obtained XML documents to obtain their structure path information;

A clustering module 505, configured to cluster the XML documents and structure paths according to the structure path information obtained by the parsing module 504;

A storage module 506, configured to store the XML documents and structure paths clustered by the clustering module 505 to obtain the XML document database.

The storage module 506 is further configured to establish and store index information for each category of XML documents;

Correspondingly, referring to Fig. 7, the first building module 501 specifically comprises:

A preprocessing unit 501a, configured to preprocess the keyword entered by the user;

A building unit 501b, configured to look up the corresponding index information according to the keyword preprocessed by the preprocessing unit 501a, determine the category of the corresponding XML documents, and construct the query path according to the determined category.

Referring to Fig. 8, the device further comprises:

A receiving module 507, configured to receive uploaded retrieval algorithms and XML documents and store them at a designated location;

A recording module 508, configured to record the uploaded retrieval algorithms received by the receiving module 507 and the sizes of the XML documents.

Referring to Fig. 9, the device further comprises:

A prompting module 509, configured to prompt the user to specify retrieval information, the retrieval information comprising the size of the XML document and the retrieval algorithm;

Correspondingly, the retrieval module 503 is configured to retrieve in the XML document database according to the XML query statement and the retrieval information specified by the user.

Referring to Fig. 10, the device further comprises:

A statistics module 510, configured to collect statistics on the retrieval efficiency of the retrieval algorithms so that the user can select a retrieval algorithm according to the statistics.

Referring to Fig. 11, the device further comprises:

A display module 511, configured to display the retrieval results and the performance indicators of the retrieval algorithms.

The device provided by this embodiment performs XML retrieval according to keywords, so that users can retrieve without needing to know the structure of the XML documents. This reduces the complexity of retrieving XML documents while maintaining retrieval accuracy, and also improves the user experience.

It should be noted that the description of retrieval by the XML document retrieval device provided in the above embodiment uses the above division into functional modules only as an example. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the XML document retrieval device provided in the above embodiment and the embodiments of the XML document retrieval method belong to the same concept; for its specific implementation, see the method embodiments, which are not repeated here.

The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the relative merit of the embodiments.

All or some of the steps in the embodiments of the present invention may be implemented in software, and the corresponding software program may be stored in a readable storage medium such as an optical disc or a hard disk.

The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A retrieval method for XML documents, characterized in that the method comprises:

constructing a query path according to a keyword entered by a user;

retrieving in an XML document database according to said XML query statement.

2. The method according to claim 1, characterized in that, before said constructing a query path according to a keyword entered by a user, the method further comprises:

3. The method according to claim 2, characterized in that, after said clustering of said XML documents and structure paths according to the structure path information of said XML documents, the method further comprises:

establishing and storing index information for each category of XML documents;

constructing the query path according to the determined category of XML documents.

4. The method according to claim 1, characterized in that, before said constructing a query path according to a keyword entered by a user, the method further comprises:

5. The method according to claim 4, characterized in that, after said recording of the uploaded retrieval algorithms and the sizes of the XML documents, the method further comprises:

6. The method according to claim 5, characterized in that, after said retrieving in the XML document database according to said XML query statement, the method further comprises:

7. The method according to claim 5, characterized in that, after said retrieving in the XML document database according to said XML query statement and the retrieval information specified by the user, the method further comprises:

8. A retrieval device for XML documents, characterized in that said device comprises:

9. The device according to claim 8, characterized in that said device further comprises:

10. The device according to claim 9, characterized in that said storage module is further configured to establish and store index information for each category of XML documents;

Correspondingly, said first building module specifically comprises:

11. The device according to claim 8, characterized in that said device further comprises:

12. The device according to claim 11, characterized in that said device further comprises:

13. The device according to claim 12, characterized in that said device further comprises:

14. The device according to claim 12, characterized in that said device further comprises: