CN103955514A - Image feature indexing method based on Lucene inverted index - Google Patents

Image feature indexing method based on Lucene inverted index

Info

Publication number
CN103955514A
Authority
CN
China
Prior art keywords
index
word
lucene
adds
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410185288.4A
Other languages
Chinese (zh)
Inventor
叶柏龙
龙坡
陈浩
姚明东
程京
杨国龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CHANGSHA BOLONG INFORMATION TECHNOLOGY Co Ltd
Original Assignee
CHANGSHA BOLONG INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CHANGSHA BOLONG INFORMATION TECHNOLOGY Co Ltd filed Critical CHANGSHA BOLONG INFORMATION TECHNOLOGY Co Ltd
Priority to CN201410185288.4A priority Critical patent/CN103955514A/en
Publication of CN103955514A publication Critical patent/CN103955514A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5838Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using colour

Abstract

The invention discloses an image feature indexing method based on a Lucene inverted index. The method adds position information to each dimension of the image feature to be indexed, so that each value is identified with its fixed dimension. This solves the problem that the dimension position of each value cannot be distinguished; irrelevant images can be filtered out during retrieval, which improves retrieval efficiency.

Description

An image feature indexing method based on a Lucene inverted index
Technical field
The invention belongs to the technical field of multimedia information retrieval and relates to an improved inverted-index scheme; specifically, it relates to an image feature indexing method based on a Lucene inverted index.
Background
In recent years, with the spread of Web 2.0, the demand for multimedia information retrieval has grown steadily, and that demand has driven the technology forward. Research on image retrieval has made considerable progress, and content-based image retrieval (CBIR) has become one of the most active areas of search engine research.
At present, the main indexing approaches for CBIR are the LSH algorithm and its data structures, the multidimensional spatial tree (R-tree), the hierarchical k-means tree, and the inverted index.
The LSH algorithm uses the locality-sensitive principle to filter out most irrelevant data, so that similarity needs to be computed only between a small number of images and the source image. Its advantage is speed; its drawbacks are very high memory consumption and no guarantee of finding the optimal solution.
The R-tree is a high-dimensional spatial index structure similar to the B-tree. It copes only with low dimensionality, roughly 2 to 5 dimensions. As the dimension grows, performance degrades exponentially, which is the so-called curse of dimensionality, and the subsequent series of improved R-tree algorithms have not overcome it.
The approach with the best retrieval performance at present is the hierarchical k-means tree. It uses hierarchical k-means clustering to group similar images within a similarity radius, so that at query time similarity is computed only against the images inside that radius. The advantage of this algorithm is that it avoids linearly computing the similarity against every image, which greatly improves retrieval speed. Its drawbacks are as follows. The number of clusters chosen at each level strongly affects retrieval efficiency. If the initial values are chosen badly during k-means clustering, the result is a local optimum rather than the global optimum, so the initial values must be chosen at random several times and the best result selected. Finally, adding a new image to this structure requires re-running the hierarchical clustering over the whole tree, which is too costly.
By comparison, indexing and retrieving image features with an inverted index is a good choice. An inverted index does not suffer from the curse of dimensionality, inverted-index technology is by now quite mature, and its index update cost is relatively small. Applied directly to image retrieval, however, an inverted index is too fuzzy. An inverted index scores a document by relevance: a document that contains a given term receives a certain score. In an image feature vector, by contrast, the value in each dimension is related to the values in the other dimensions. Applying a general-purpose inverted index to image retrieval therefore produces overly fuzzy search results and an excessively large result set, which lowers retrieval performance.
In existing research, an inverted index applied to image retrieval cannot distinguish the dimension in which a value lies, so a large number of irrelevant images enter the candidate set during retrieval, and the size of the candidate set directly affects retrieval efficiency. An inverted list built for ordinary text has no notion of order; that is, the vector [1,2,3] can be considered similar to [3,1,2]. This is illogical for image feature vectors, where the value in each dimension can be compared only with the value in the corresponding dimension and not with values in other dimensions. We therefore need a way to make the inverted index regard [1,2,3] and [1,2,4] as similar, and [1,2,3] and [3,1,2] as dissimilar. This requires the index to carry position information when an inverted index is built over image feature vectors.
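The ordering problem described above can be seen in a small sketch. It is an illustration only and does not use Lucene: the helper names `plain_terms`, `positional_terms`, and `shared` are hypothetical, and counting shared terms stands in for the term matching an inverted index performs.

```python
def plain_terms(vector):
    """Terms as an order-blind index sees them: the values only."""
    return {str(value) for value in vector}


def positional_terms(vector):
    """Terms with a position prefix, e.g. L1_1 for value 1 in dimension 1."""
    return {f"L{i}_{value}" for i, value in enumerate(vector, start=1)}


def shared(terms_a, terms_b):
    """Number of terms that two term sets have in common."""
    return len(terms_a & terms_b)


query = [1, 2, 3]
permuted = [3, 1, 2]
near = [1, 2, 4]

# Without positions, the permutation is indistinguishable from the query.
plain_permuted = shared(plain_terms(query), plain_terms(permuted))
plain_near = shared(plain_terms(query), plain_terms(near))

# With positions, only values in the same dimension can match.
pos_permuted = shared(positional_terms(query), positional_terms(permuted))
pos_near = shared(positional_terms(query), positional_terms(near))
```

Under these assumptions the permuted vector shares every term with the query when positions are ignored, and none once each term carries its dimension.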
Summary of the invention
To overcome the defects of the prior art, the invention provides an image feature indexing method based on a Lucene inverted index for use in image retrieval. By improving how image features are stored and indexed, the method improves the retrieval speed and overall retrieval performance for images. The technical scheme is as follows.
An image feature indexing method based on a Lucene inverted index comprises the following steps:
A. Index creation process:
A1 Lexical analysis and language processing: tokenize the text string;
A2 Add position information: to each token, add a prefix containing its position number according to its position in the text, for example L1_ for the first position;
A3 Index creation: build the index from the tokens, establishing the inverted list of terms and documents;
A4 Write the inverted index table to disk for storage;
B. Retrieval process:
B1 Lexical analysis and language processing: tokenize the text string;
B2 Add position information: to each token, add a prefix containing its position number according to its position in the text;
B3 Syntax analysis: analyze the query logic of the query statement and submit the search to the searcher according to that query logic;
B4 Index search: search for relevant documents according to the combined query submitted by the syntax analyzer;
B5 Relevance ranking: rank the candidate-set documents by their relevance to the query document and return the top N as the result set.
Further preferably, the lexical analysis and language processing of step A1 operates on the text string produced from the image feature vector and uses Lucene's WhitespaceAnalyzer to tokenize on whitespace.
Further preferably, when position information is added in step A2, each token receives a prefix containing its position number according to its position in the text, and tokens whose value is 0 are removed.
Further preferably, the index search of step B4 includes the logic that scores documents by index relevance: following the method for computing the distance between vectors, Lucene's Similarity object is overridden so that only the term-frequency (tf) attribute is retained as the scoring criterion, that is, scoring is by term frequency.
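Steps A1 to A4 and B1 to B5 can be sketched end to end with a plain dictionary standing in for the Lucene index. This is a minimal sketch under stated assumptions, not the Lucene implementation: the function names, the JSON file format, and the hit-count ranking are all hypothetical choices made for illustration.

```python
import json
import os
import tempfile
from collections import Counter, defaultdict


def analyze(text):
    """A1/B1 and A2/B2: split on whitespace, then prefix each token with L<position>_."""
    return [f"L{i}_{tok}" for i, tok in enumerate(text.split(), start=1)]


def build_index(docs):
    """A3: map each positional term to the ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].append(doc_id)
    return dict(index)


def save_index(index, path):
    """A4: persist the inverted table to disk (JSON is an assumption of this sketch)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(index, fh)


def load_index(path):
    """Read the inverted table back from disk before searching."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def search(index, query_text, top_n):
    """B3 to B5: look up every query term, count hits per document, return the top N."""
    hits = Counter()
    for term in analyze(query_text):
        for doc_id in index.get(term, []):
            hits[doc_id] += 1
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


docs = {"img1": "1 2 3", "img2": "3 1 2", "img3": "1 2 4"}
path = os.path.join(tempfile.mkdtemp(), "index.json")
save_index(build_index(docs), path)
results = search(load_index(path), "1 2 3", top_n=2)
```

Here `img2` never enters the candidate set, because none of its positional terms match the query, while `img3` is retained with two matching dimensions.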
Beneficial effects of the invention:
The invention adds position information to each dimension of the image feature to be indexed, identifying each value with its fixed dimension. This solves the problem that the dimension in which each value lies cannot be distinguished, so irrelevant images can be filtered out at retrieval time and retrieval efficiency is improved.
Brief description of the drawings
Fig. 1 is a flowchart of the image feature indexing method based on a Lucene inverted index according to the invention.
Embodiment
The technical scheme of the invention is described in more detail below with reference to the drawing and specific embodiments.
A. Concept
First, building on earlier research results, define the image feature vectors:
image1: $v = \{a_1, a_2, a_3, a_4, \ldots, a_n\}$
image2: $v = \{b_1, b_2, b_3, b_4, \ldots, b_n\}$
Before an image vector is indexed, it must be converted to a text string; that is, v becomes a1 a2 a3 a4 ... an (separated by spaces). Position information is then added to each component of the feature vector, for example the prefix L1_ for the first position, so the feature string becomes L1_a1 L2_a2 L3_a3 ... Ln_an. The position information is thereby carried into the index: when the string is tokenized on whitespace and indexed, every term carries its position, marking it as the value in a particular dimension, so it cannot be confused with values in other dimensions.
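The conversion from vector to prefixed feature string can be sketched as below. The function names are hypothetical, and the example assumes the feature components are already integers; how real-valued features are quantized into terms is not stated in the text.

```python
def to_text(vector):
    """Join the components with single spaces: [a1, a2, ...] -> "a1 a2 ..."."""
    return " ".join(str(component) for component in vector)


def add_positions(text):
    """Prefix the i-th whitespace-separated token with L<i>_ (1-based)."""
    tokens = text.split()
    return " ".join(f"L{i}_{token}" for i, token in enumerate(tokens, start=1))


def parse_term(term):
    """Recover (dimension, value) from a term such as "L4_255"."""
    prefix, value = term.split("_", 1)
    return int(prefix[1:]), value


feature = [12, 7, 0, 255]
plain = to_text(feature)          # "12 7 0 255"
indexed = add_positions(plain)    # "L1_12 L2_7 L3_0 L4_255"
terms = indexed.split()           # what whitespace tokenization yields
```

Because the dimension is part of the term itself, `parse_term` can recover it from any single term, which is what keeps a value in one dimension from matching the same value in another.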
B. Basic idea of the algorithm
$V = \{d_1, d_2, d_3, \ldots, d_n\}$ is the n-dimensional feature vector of an image.
$S = \{v_1, v_2, v_3, \ldots, v_n\}$ is the image feature library. Using V, we look up in S the image features whose similarity exceeds the threshold th, which gives the candidate set.
The conventional Lucene comparison scores a document by how frequently the terms occur in it.
The similarity of $v$ and $v_j$ is computed as:
$$\mathrm{score}(v, v_j) = \sum_{i=1}^{n} p(d_i), \quad d_i \in v,\ i \in (1, 2, 3, \ldots, n)$$
where $p(d_i)$ is the frequency with which the term $d_i$ occurs in $v_j$. The documents are then sorted by $\mathrm{score}(v, v_j)$, and those whose similarity exceeds the threshold th are taken as the candidate set of images similar to the query image.
This scoring does not actually distinguish positions. For example, V1 = {d1, d2, d3, d4} and V2 = {d3, d4, d1, d2} are, in terms of image similarity, the feature vectors of completely different pictures, yet a Lucene query would consider them highly similar or even the same picture. This is clearly unacceptable.
We therefore add position information to each term of the feature vector, telling Lucene that d1 is the d1 at position L1, written L1_d1. The scoring function then becomes
$$\mathrm{score}(v, v_j) = \sum_{i=1}^{n} p(L_i, d_i), \quad d_i \in v,\ i \in (1, 2, 3, \ldots, n)$$
The position $L_i$ is thus taken into account when $p$ is computed for $d_i$, and the problem above no longer arises.
This approach is similar to using edit distance to compute text similarity. The similarity of pictures is therefore:
$$\mathrm{score}(v, v_j) = \sum_{i=1}^{n} p(L_i, d_i) = (v, v_j)$$
According to this scoring rule, the images whose similarity score exceeds th are taken as the result set:
$$\mathrm{set}(v) = \mathrm{sort}(\{ v_j \mid \mathrm{score}(v, v_j) > th \})$$
where set(v) is the query result set for image v.
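The positional score and the thresholded result set can be illustrated directly on vectors, without an index. The sketch assumes that p(L_i, d_i) is 1 when v_j holds the same value in dimension i and 0 otherwise; the text only calls p a frequency, so this reading is an assumption. Under it, the score equals n minus the Hamming distance between the two vectors. The function and variable names are hypothetical.

```python
def score(v, vj):
    """Sum over dimensions of p(L_i, d_i), with p = 1 on an exact match at position i."""
    return sum(1 for d, e in zip(v, vj) if d == e)


def candidate_set(v, library, th):
    """set(v): names of library vectors scoring strictly above th, best first."""
    scored = [(score(v, vj), name) for name, vj in library.items()]
    kept = [(s, name) for s, name in scored if s > th]
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in kept]


library = {
    "a": [1, 2, 3, 4],
    "b": [3, 4, 1, 2],   # same values as the query, different positions
    "c": [1, 2, 3, 9],
    "d": [1, 7, 3, 9],
}
result = candidate_set([1, 2, 3, 4], library, th=2)
```

With th = 2, only `a` (score 4) and `c` (score 3) survive; `b` scores 0 even though it contains exactly the query's values.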
With reference to Fig. 1, an image feature indexing method based on a Lucene inverted index comprises the following steps:
A. Index creation process:
A1 Lexical analysis and language processing: tokenize the text string;
A2 Add position information: to each token, add a prefix containing its position number according to its position in the text, for example L1_ for the first position;
A3 Index creation: build the index from the tokens, establishing the inverted list of terms and documents;
A4 Write the inverted index table to disk for storage;
B. Retrieval process:
B1 Lexical analysis and language processing: tokenize the text string;
B2 Add position information: to each token, add a prefix containing its position number according to its position in the text;
B3 Syntax analysis: analyze the query logic of the query statement and submit the search to the searcher according to that query logic;
B4 Index search: search for relevant documents according to the combined query submitted by the syntax analyzer;
B5 Relevance ranking: rank the candidate-set documents by their relevance to the query document and return the top N as the result set.
The lexical analysis and language processing of step A1 operates on the text string produced from the image feature vector and uses Lucene's WhitespaceAnalyzer to tokenize on whitespace.
When position information is added in step A2, each token receives a prefix containing its position number according to its position in the text, and tokens whose value is 0 are removed. Tokens with the value 0 carry no meaning; they also greatly enlarge the result set and hurt performance, so they must be removed.
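The effect of dropping zero-valued tokens can be sketched with sparse vectors. The example is hypothetical and uses in-memory posting sets rather than Lucene; it assumes positions are numbered before the zeros are removed, so the surviving terms keep their original dimension.

```python
from collections import defaultdict


def positional_terms(vector, drop_zeros):
    """Prefix each value with its 1-based position; optionally skip zero values."""
    return [
        f"L{i}_{value}"
        for i, value in enumerate(vector, start=1)
        if not (drop_zeros and value == 0)
    ]


def posting_lists(library, drop_zeros):
    """Map each term to the set of images whose vector produces it."""
    index = defaultdict(set)
    for name, vector in library.items():
        for term in positional_terms(vector, drop_zeros):
            index[term].add(name)
    return index


def candidates(index, query, drop_zeros):
    """Every stored image sharing at least one term with the query."""
    found = set()
    for term in positional_terms(query, drop_zeros):
        found |= index.get(term, set())
    return found


library = {
    "x": [5, 0, 0, 0],
    "y": [0, 0, 0, 8],
    "z": [0, 3, 0, 0],
}
query = [5, 0, 0, 0]

with_zeros = candidates(posting_lists(library, False), query, False)
without_zeros = candidates(posting_lists(library, True), query, True)
```

When zeros are indexed, `y` and `z` become candidates only because they share zero-valued dimensions with the query; once zeros are dropped, the candidate set shrinks to `x`.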
The index search of step B4 includes the logic that scores documents by index relevance: following the method for computing the distance between vectors, Lucene's Similarity object is overridden so that only the term-frequency (tf) attribute is retained as the scoring criterion, that is, scoring is by term frequency alone.
The above is only a preferred embodiment of the invention, and the scope of protection of the invention is not limited to it. Any simple change or equivalent replacement of the technical scheme that a person skilled in the art could obviously obtain within the technical scope disclosed by the invention falls within the scope of protection of the invention.

Claims (4)

1. An image feature indexing method based on a Lucene inverted index, characterized in that it comprises the following steps:
A. Index creation process:
A1 Lexical analysis and language processing: tokenize the text string;
A2 Add position information: to each token, add a prefix containing its position number according to its position in the text, for example L1_ for the first position;
A3 Index creation: build the index from the tokens, establishing the inverted list of terms and documents;
A4 Write the inverted index table to disk for storage;
B. Retrieval process:
B1 Lexical analysis and language processing: tokenize the text string;
B2 Add position information: to each token, add a prefix containing its position number according to its position in the text;
B3 Syntax analysis: analyze the query logic of the query statement and submit the search to the searcher according to that query logic;
B4 Index search: search for relevant documents according to the combined query submitted by the syntax analyzer;
B5 Relevance ranking: rank the candidate-set documents by their relevance to the query document and return the top N as the result set.
2. The image feature indexing method based on a Lucene inverted index according to claim 1, characterized in that the lexical analysis and language processing of step A1 operates on the text string produced from the image feature vector and uses Lucene's WhitespaceAnalyzer to tokenize on whitespace.
3. The image feature indexing method based on a Lucene inverted index according to claim 1, characterized in that, when position information is added in step A2, each token receives a prefix containing its position number according to its position in the text, and tokens whose value is 0 are removed.
4. The image feature indexing method based on a Lucene inverted index according to claim 1, characterized in that the index search of step B4 includes the logic that scores documents by index relevance: following the method for computing the distance between vectors, Lucene's Similarity object is overridden so that only the term-frequency (tf) attribute is retained as the scoring criterion, that is, scoring is by term frequency.
CN201410185288.4A 2014-05-05 2014-05-05 Image feature indexing method based on Lucene inverted index Pending CN103955514A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410185288.4A CN103955514A (en) 2014-05-05 2014-05-05 Image feature indexing method based on Lucene inverted index

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410185288.4A CN103955514A (en) 2014-05-05 2014-05-05 Image feature indexing method based on Lucene inverted index

Publications (1)

Publication Number Publication Date
CN103955514A true CN103955514A (en) 2014-07-30

Family

ID=51332789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410185288.4A Pending CN103955514A (en) 2014-05-05 2014-05-05 Image feature indexing method based on Lucene inverted index

Country Status (1)

Country Link
CN (1) CN103955514A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991102A (en) * 2016-01-21 2017-07-28 腾讯科技(深圳)有限公司 The processing method and processing system of key-value pair in inverted index
CN107533547A (en) * 2015-02-24 2018-01-02 拍搜有限公司 Product index editing method and its system
CN108241713A (en) * 2016-12-27 2018-07-03 南京烽火软件科技有限公司 A kind of inverted index search method based on polynary cutting
CN109543696A (en) * 2018-11-06 2019-03-29 吉林大学 A kind of image-recognizing method neural network based and its application
CN109617769A (en) * 2018-11-01 2019-04-12 长沙理工大学 A method of by scanning the two-dimensional code replacement smart home device
CN110019865A (en) * 2017-09-08 2019-07-16 北京京东尚科信息技术有限公司 Mass picture processing method, device, electronic equipment and storage medium
CN111814923A (en) * 2020-09-10 2020-10-23 上海云从汇临人工智能科技有限公司 Image clustering method, system, device and medium
CN112148831A (en) * 2020-11-26 2020-12-29 广州华多网络科技有限公司 Image-text mixed retrieval method and device, storage medium and computer equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6243713B1 (en) * 1998-08-24 2001-06-05 Excalibur Technologies Corp. Multimedia document retrieval by application of multimedia queries to a unified index of multimedia data for a plurality of multimedia data types
CN101458695A (en) * 2008-12-18 2009-06-17 西交利物浦大学 Mixed picture index construct and enquiry method based on key word and content characteristic and use thereof
CN102360431A (en) * 2011-10-08 2012-02-22 大连海事大学 Method for automatically describing image

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6243713B1 (en) * 1998-08-24 2001-06-05 Excalibur Technologies Corp. Multimedia document retrieval by application of multimedia queries to a unified index of multimedia data for a plurality of multimedia data types
CN101458695A (en) * 2008-12-18 2009-06-17 西交利物浦大学 Mixed picture index construct and enquiry method based on key word and content characteristic and use thereof
CN102360431A (en) * 2011-10-08 2012-02-22 大连海事大学 Method for automatically describing image

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
徐璐: ""基于Lucene和文本图像的全文检索系统的研究与应用"", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
曹玉娟,牛振东,赵堃,彭学平: ""基于概念和语义网络的近似网页检测算法"", 《软件学报》 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107533547A (en) * 2015-02-24 2018-01-02 拍搜有限公司 Product index editing method and its system
CN106991102A (en) * 2016-01-21 2017-07-28 腾讯科技(深圳)有限公司 The processing method and processing system of key-value pair in inverted index
CN106991102B (en) * 2016-01-21 2021-06-08 腾讯科技(深圳)有限公司 Processing method and processing system for key value pairs in inverted index
CN108241713A (en) * 2016-12-27 2018-07-03 南京烽火软件科技有限公司 A kind of inverted index search method based on polynary cutting
CN108241713B (en) * 2016-12-27 2021-12-28 南京烽火星空通信发展有限公司 Inverted index retrieval method based on multi-element segmentation
CN110019865B (en) * 2017-09-08 2021-01-26 北京京东尚科信息技术有限公司 Mass image processing method and device, electronic equipment and storage medium
CN110019865A (en) * 2017-09-08 2019-07-16 北京京东尚科信息技术有限公司 Mass picture processing method, device, electronic equipment and storage medium
US11395010B2 (en) 2017-09-08 2022-07-19 Beijing Jingdong Shangke Information Technology Co., Ltd. Massive picture processing method converting decimal element in matrices into binary element
CN109617769A (en) * 2018-11-01 2019-04-12 长沙理工大学 A method of by scanning the two-dimensional code replacement smart home device
CN109543696A (en) * 2018-11-06 2019-03-29 吉林大学 A kind of image-recognizing method neural network based and its application
CN111814923B (en) * 2020-09-10 2020-12-25 上海云从汇临人工智能科技有限公司 Image clustering method, system, device and medium
CN111814923A (en) * 2020-09-10 2020-10-23 上海云从汇临人工智能科技有限公司 Image clustering method, system, device and medium
CN112148831B (en) * 2020-11-26 2021-03-19 广州华多网络科技有限公司 Image-text mixed retrieval method and device, storage medium and computer equipment
CN112148831A (en) * 2020-11-26 2020-12-29 广州华多网络科技有限公司 Image-text mixed retrieval method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140730