CN103955514A

CN103955514A - Image feature indexing method based on Lucene inverted index

Info

Publication number: CN103955514A
Application number: CN201410185288.4A
Authority: CN
Inventors: 叶柏龙; 龙坡; 陈浩; 姚明东; 程京; 杨国龙
Original assignee: CHANGSHA BOLONG INFORMATION TECHNOLOGY Co Ltd
Current assignee: CHANGSHA BOLONG INFORMATION TECHNOLOGY Co Ltd
Priority date: 2014-05-05
Filing date: 2014-05-05
Publication date: 2014-07-30

Abstract

The invention discloses an image feature indexing method based on a Lucene inverted index. Position information is added to each dimension of the image feature to be indexed, so that every value is identified as belonging to a fixed dimension. This solves the problem that the dimension position of each value cannot be distinguished. Irrelevant images can be filtered out during retrieval, which improves retrieval efficiency.

Description

An image feature indexing method based on a Lucene inverted index

Technical field

The invention belongs to the technical field of multimedia information retrieval and relates to an improvement of the inverted index, specifically to an image feature indexing method based on a Lucene inverted index.

Background art

In recent years, with the spread of Web 2.0, demand for multimedia information retrieval has grown steadily, and demand drives technical development. Research on image retrieval has made great progress, and content-based image retrieval (CBIR) has become one of the most active research areas in search engine research.

At present, the main indexing approaches for CBIR are the LSH (locality-sensitive hashing) algorithm and its data structures, the multidimensional spatial tree R-tree, the hierarchical k-means tree, and the inverted index.

LSH: based on the locality-sensitive principle, LSH filters out most irrelevant data, so that similarity needs to be computed only between the source image and a small number of images. Its advantage is speed; its disadvantages are very high memory consumption and no guarantee of finding the optimal solution.

R-tree: a high-dimensional spatial index structure similar to the B-tree. The R-tree is suited only to low dimensionality, between 2 and 5 dimensions. As the dimensionality increases, performance degrades exponentially, which is the so-called curse of dimensionality. The subsequent series of improved R-tree algorithms have not overcome the curse of dimensionality either.

Hierarchical k-means tree: this currently gives the best retrieval performance. A hierarchical k-means clustering algorithm groups similar images within a similarity radius, so that a query only needs to compute similarity against the images inside that radius. The advantage of this algorithm is that it avoids computing similarity linearly against every image, which greatly improves retrieval speed. One disadvantage is that the number of clusters chosen at each level strongly affects retrieval efficiency. If the initial values are poorly chosen during k-means clustering, the result is a local optimum rather than the global optimum, so the initial values must be chosen at random several times and the best solution selected. A further disadvantage is that when new images are added to the structure, the whole tree must be hierarchically clustered again, which is too costly.

By comparison, indexing and retrieving image features with an inverted index is a good choice. An inverted index does not suffer from the curse of dimensionality, inverted index technology is by now quite mature, and the cost of updating the index is relatively small. However, an inverted index applied directly to image retrieval is too fuzzy. An inverted index scores documents by relevance: a document that contains a given word receives a certain score. In an image feature vector, by contrast, the value in each dimension is related to the values in the other dimensions. Applying a general-purpose inverted index to image retrieval therefore makes the search results too fuzzy and the result set too large, which lowers retrieval performance.

In existing research, an inverted index applied to image search cannot distinguish the dimension position of a value, so a large number of irrelevant images enter the candidate set during retrieval, and the size of the candidate set directly affects retrieval efficiency. An index built over ordinary text in an inverted list has no notion of order; that is, the vector [1,2,3] can be considered similar to [3,1,2]. For image feature vectors this is illogical: the value in each dimension may be compared only with the value in the corresponding dimension, not with values in other dimensions. We therefore need a way to make the inverted index treat [1,2,3] and [1,2,4] as similar and [1,2,3] and [3,1,2] as dissimilar, which requires adding position information to the index when the inverted index is built over image feature vectors.

Summary of the invention

To overcome the defects of the prior art, the invention provides an image feature indexing method based on a Lucene inverted index for use in image retrieval. By improving how image features are stored and indexed, the method improves the retrieval speed and overall retrieval performance for images. The technical solution is as follows:

An image feature indexing method based on a Lucene inverted index, comprising the following steps:

A. Index creation process:

A1. Lexical analysis and language processing: segment the text string into words;

A2. Add position information: to each segmented word, add a prefix containing a position number according to the word's position in the text, for example L1_ for the first position;

A3. Index creation: create the index from the segmented words, building the inverted list of words and documents;

A4. Write the inverted index table to disk for storage;

B. Retrieval process:

B1. Lexical analysis and language processing: segment the text string into words;

B2. Add position information: to each segmented word, add a prefix containing a position number according to the word's position in the text;

B3. Syntax analysis: analyze the query logic of the query statement and submit the search to the searcher according to that query logic;

B4. Index search: search for relevant documents according to the combined query submitted by the syntax analyzer;

B5. Relevance ranking: rank the candidate-set documents by their relevance to the query document and return the top N as the result set.
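The steps above can be sketched with a small in-memory inverted index. This is a minimal illustration in plain Python rather than Lucene itself: the function names (`analyze`, `build_index`, `search`), the image ids, and the sample vectors are all assumptions made for the example, the query is treated as an OR over its terms, and step A4 (writing the index to disk) is left out.

```python
from collections import defaultdict

def analyze(vector):
    """Steps A1/A2 and B1/B2: one word per dimension, prefixed with its position."""
    return [f"L{i}_{value}" for i, value in enumerate(vector, start=1)]

def build_index(images):
    """Step A3: map each positional word to the set of image ids that contain it."""
    postings = defaultdict(set)
    for image_id, vector in images.items():
        for term in analyze(vector):
            postings[term].add(image_id)
    return postings

def search(postings, query_vector, top_n=10):
    """Steps B3-B5: look up every query word, count matches, return the top N."""
    scores = defaultdict(int)
    for term in analyze(query_vector):
        for image_id in postings.get(term, ()):
            scores[image_id] += 1
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

# Hypothetical feature library: img_b holds the same values as img_a, permuted.
images = {"img_a": [1, 2, 3], "img_b": [3, 1, 2], "img_c": [1, 2, 4]}
index = build_index(images)
results = search(index, [1, 2, 3])
print(results)  # [('img_a', 3), ('img_c', 2)]
```

Because `img_b` shares no positional word with the query, it never enters the candidate set, which is the filtering effect the method aims for.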

Further preferably, in the lexical analysis and language processing of step A1, the feature string obtained by converting the image feature vector to text is segmented on spaces using Lucene's WhitespaceAnalyzer.

Further preferably, in step A2, position information is added by giving each segmented word a prefix containing its position number according to its position in the text, and words whose value is 0 are removed at the same time.

Further preferably, the index search of step B4 includes the logic for scoring documents by index relevance. In line with the method of computing the distance between vectors, Lucene's Similarity object is overridden so that only the term-frequency (tf) attribute is retained as the scoring criterion; that is, documents are scored by term frequency.

Beneficial effect of the present invention:

The invention proposes adding position information to each dimension of the image feature to be indexed, so that every value is identified as belonging to a fixed dimension. This solves the problem that the dimension position of each value cannot be distinguished, so irrelevant images can be filtered out at retrieval time and retrieval efficiency is improved.

Brief description of the drawings

Fig. 1 is a flow chart of the image feature indexing method based on a Lucene inverted index according to the present invention.

Embodiment

The technical solution of the present invention is described in more detail below with reference to the drawing and specific embodiments.

A. Concept

First, based on previous research results, define the image feature vectors:

image1: $v = \{a_1, a_2, a_3, a_4, \dots, a_n\}$

image2: $v = \{b_1, b_2, b_3, b_4, \dots, b_n\}$

Before an image vector is indexed, it must be converted to a text string: v becomes a1 a2 a3 a4 ... an, with the components separated by spaces. Position information is then added to each component of the feature vector. For example, the first position receives the prefix L1_, so the feature string becomes L1_a1 L2_a2 L3_a3 ... Ln_an. The position information is thus carried into the index: when the string is segmented on spaces and indexed, every word carries its position information, which marks it as the value in a particular dimension, so it cannot be confused with values in other dimensions.
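A minimal sketch of this conversion follows. The helper names `to_feature_string` and `whitespace_tokens` and the sample values are hypothetical, the feature values are assumed to be already quantized to integers, and `str.split` merely stands in for a whitespace analyzer.

```python
def to_feature_string(vector):
    """Convert a feature vector into space-separated, position-prefixed words."""
    return " ".join(f"L{i}_{a}" for i, a in enumerate(vector, start=1))

def whitespace_tokens(text):
    """Split on whitespace, standing in for a whitespace analyzer."""
    return text.split()

s = to_feature_string([12, 7, 12])
print(s)                     # L1_12 L2_7 L3_12
print(whitespace_tokens(s))  # ['L1_12', 'L2_7', 'L3_12']
```

The value 12 occurs in dimensions 1 and 3, yet it yields two different words, so a match in one dimension can no longer be mistaken for a match in the other.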

B. Basic idea of the algorithm

V = {d1, d2, d3, ..., dn} is the n-dimensional feature vector of an image.

S = {v1, v2, v3, ..., vn} is the image feature library. Using V, we find in S the image features whose similarity is greater than the threshold th, and thereby obtain the candidate set.

Lucene's traditional comparison scores a document according to how frequently the words occur in it.

The similarity of V and vj is computed as:

$$\mathrm{score}(v, v_j) = \sum_{i=1}^{n} p(d_i)$$

where $d_i \in v$, $i \in (1, 2, 3, \dots, n)$.

$p(d_i)$ is the frequency with which the word di occurs in vj. Finally, the documents are sorted by score(v, vj), and those whose similarity is greater than the threshold th are taken as the candidate set of images similar to the query image.

This in effect ignores position. For example, V1 = {d1, d2, d3, d4} and V2 = {d3, d4, d1, d2} are, in terms of image similarity, the feature vectors of completely different pictures, yet a Lucene query would consider them highly similar, or even the same picture. This is clearly unacceptable.

We therefore need to add position information to each word in the feature vector, telling Lucene that the d1 at position L1 is written L1_d1. The function p(di) then becomes:

$$\mathrm{score}(v, v_j) = \sum_{i=1}^{n} p(L_i, d_i)$$

where $d_i \in v$, $i \in (1, 2, 3, \dots, n)$.

The position Li is thus taken into account when the frequency of di is computed, so the problem described above no longer arises.
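The effect of the prefix on the two scoring formulas can be checked with a small sketch. Here p is taken to be the number of times a query word occurs in the stored document, following the definition of p(di) above. The function names `score_plain`, `with_positions` and `score_positional` are illustrative assumptions, not names used by the method or by Lucene.

```python
from collections import Counter

def score_plain(v, vj):
    """score(v, vj) = sum of p(d_i): occurrences of each query word in vj, position ignored."""
    freq = Counter(vj)
    return sum(freq[d] for d in v)

def with_positions(vec):
    """Rewrite each word d at position i (1-based) as the word 'L{i}_{d}'."""
    return [f"L{i}_{d}" for i, d in enumerate(vec, start=1)]

def score_positional(v, vj):
    """score(v, vj) = sum of p(L_i, d_i): a word only matches in its own dimension."""
    return score_plain(with_positions(v), with_positions(vj))

v1 = ["d1", "d2", "d3", "d4"]
v2 = ["d3", "d4", "d1", "d2"]  # same words, different dimensions

print(score_plain(v1, v2))       # 4
print(score_positional(v1, v2))  # 0
print(score_positional(v1, v1))  # 4
```

Without prefixes the permuted vector scores as a perfect match; with prefixes it scores zero, while an identical vector still scores the maximum.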

This approach is similar to computing the edit distance used for text similarity. The similarity of images is therefore:

$$\mathrm{score}(v, v_j) = \sum_{i=1}^{n} p(L_i, d_i) = (v, v_j)$$

According to the above scoring rule, we take the images whose similarity score is greater than th as the result set:

$$\mathrm{set}(v) = \mathrm{sort}(\{v_j \mid \mathrm{score}(v, v_j) > th\})$$

where set(v) is the query result set of image v.
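As a minimal sketch of this selection rule, assuming the scores have already been computed and that sorting means descending order of score (the text does not state the direction), the result set can be built as follows. The name `result_set` and the sample scores are hypothetical.

```python
def result_set(scores, th):
    """Keep images whose score exceeds th, ordered from highest to lowest score."""
    kept = [(image_id, s) for image_id, s in scores.items() if s > th]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical scores of three library images against one query image.
scores = {"img_a": 3, "img_b": 0, "img_c": 2}
print(result_set(scores, th=1))  # [('img_a', 3), ('img_c', 2)]
```

The comparison is strict, matching the ">" in the formula, so an image whose score equals th is excluded.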

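With reference to Fig. 1, an image feature indexing method based on a Lucene inverted index comprises the following steps:

A. Index creation process:

A1. Lexical analysis and language processing: segment the text string into words;

A2. Add position information: to each segmented word, add a prefix containing a position number according to the word's position in the text, for example L1_ for the first position;

A3. Index creation: create the index from the segmented words, building the inverted list of words and documents;

A4. Write the inverted index table to disk for storage;

B. Retrieval process:

B1. Lexical analysis and language processing: segment the text string into words;

B2. Add position information: to each segmented word, add a prefix containing a position number according to the word's position in the text;

B3. Syntax analysis: analyze the query logic of the query statement and submit the search to the searcher according to that query logic;

B4. Index search: search for relevant documents according to the combined query submitted by the syntax analyzer;

B5. Relevance ranking: rank the candidate-set documents by their relevance to the query document and return the top N as the result set.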

In the lexical analysis and language processing of step A1, the feature string obtained by converting the image feature vector to text is segmented on spaces using Lucene's WhitespaceAnalyzer.

In step A2, position information is added by giving each segmented word a prefix containing its position number according to its position in the text, and words whose value is 0 are removed at the same time. Words whose value is 0 carry no meaning, and keeping them would greatly enlarge the result set and hurt performance, so they must be removed.
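A minimal sketch of step A2 with zero removal is shown below. It assumes that positions are numbered before the zero-valued words are dropped, so the surviving words keep their original dimension numbers; the function name `positional_terms` and the sample vector are hypothetical.

```python
def positional_terms(vector):
    """Prefix every value with its 1-based position, then drop zero-valued words."""
    return [f"L{i}_{a}" for i, a in enumerate(vector, start=1) if a != 0]

print(positional_terms([5, 0, 0, 2]))  # ['L1_5', 'L4_2']
```

The last value stays at position 4 rather than being renumbered to 2. If the zeros were removed first, the positions would shift and the prefixes would no longer identify the dimension.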

The index search of step B4 includes the logic for scoring documents by index relevance. In line with the method of computing the distance between vectors, Lucene's Similarity object is overridden so that only the term-frequency (tf) attribute is retained as the scoring criterion; that is, documents are scored by term frequency alone.
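The difference between scoring by term frequency alone and a scheme that also weights by inverse document frequency can be sketched in plain Python. This is not Lucene's Similarity API: the `idf` formula, the function names, and the sample documents are assumptions chosen only to show why the other weighting factors are dropped.

```python
import math
from collections import Counter

def idf(term, docs):
    """Illustrative smoothed inverse document frequency (an assumed formula)."""
    df = sum(1 for doc in docs if term in doc)
    return 1.0 + math.log(len(docs) / (df + 1))

def score(query, doc, docs, tf_only=True):
    """Sum the term frequencies of the query words in doc, optionally weighted by idf."""
    freq = Counter(doc)
    return sum(freq[t] * (1.0 if tf_only else idf(t, docs)) for t in query)

# Hypothetical library of three images with two dimensions each.
docs = [["L1_1", "L2_2"], ["L1_1", "L2_9"], ["L1_1", "L2_7"]]
query = ["L1_1", "L2_2"]

print([score(query, d, docs) for d in docs])  # [2.0, 1.0, 1.0]
```

With `tf_only=True` the score is simply the number of dimensions in which the two vectors agree, so every dimension counts equally. With idf weighting, a value that is rare in the library would count for more than a common one, which has no counterpart in a distance between feature vectors.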

The above is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited thereto. Any simple change or equivalent replacement of the technical solution that a person skilled in the art could obviously obtain within the technical scope disclosed by the present invention falls within the scope of protection of the present invention.

Claims

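1. An image feature indexing method based on a Lucene inverted index, characterized by comprising the following steps:

A. Index creation process:

A1. Lexical analysis and language processing: segment the text string into words;

A2. Add position information: to each segmented word, add a prefix containing a position number according to the word's position in the text, for example L1_ for the first position;

A3. Index creation: create the index from the segmented words, building the inverted list of words and documents;

A4. Write the inverted index table to disk for storage;

B. Retrieval process:

B1. Lexical analysis and language processing: segment the text string into words;

B2. Add position information: to each segmented word, add a prefix containing a position number according to the word's position in the text;

B3. Syntax analysis: analyze the query logic of the query statement and submit the search to the searcher according to that query logic;

B4. Index search: search for relevant documents according to the combined query submitted by the syntax analyzer;

B5. Relevance ranking: rank the candidate-set documents by their relevance to the query document and return the top N as the result set.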

2. The image feature indexing method based on a Lucene inverted index according to claim 1, characterized in that, in the lexical analysis and language processing of step A1, the feature string obtained by converting the image feature vector to text is segmented on spaces using Lucene's WhitespaceAnalyzer.

3. The image feature indexing method based on a Lucene inverted index according to claim 1, characterized in that, in step A2, position information is added by giving each segmented word a prefix containing its position number according to its position in the text, and words whose value is 0 are removed at the same time.

4. The image feature indexing method based on a Lucene inverted index according to claim 1, characterized in that the index search of step B4 includes the logic for scoring documents by index relevance, and, in line with the method of computing the distance between vectors, Lucene's Similarity object is overridden so that only the term-frequency (tf) attribute is retained as the scoring criterion, that is, documents are scored by term frequency.