CN103399913A - Encryption compressing method and information searching method for index of search engine - Google Patents

Info

Publication number
CN103399913A
Authority
CN
China
Prior art keywords
index
encryption
compression
adopt
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013103288463A
Other languages
Chinese (zh)
Inventor
姜贤武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING HUAYI INTERACTIVE TECHNOLOGY Co Ltd
Original Assignee
BEIJING HUAYI INTERACTIVE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING HUAYI INTERACTIVE TECHNOLOGY Co Ltd
Priority to CN2013103288463A
Publication of CN103399913A
Legal status: Pending

Abstract

The invention relates to an encryption-compression method for a search engine index and to a method of searching information that uses it. First, a numeric index of the information resources is built with the inverted file method; the numeric index is then compressed with a variable-length compression method; finally, the compressed index is encrypted with the Base64 encryption algorithm, and the encrypted keys serve as the final index. At query time, the keyword to be searched is encrypted with the Base64 encryption algorithm, and information is then retrieved against the final index. By building on existing mature compression technology and combining it with a mature encryption algorithm, the invention forms an effective compression method that enables fast and accurate searching of information and effectively saves server resources.

Description

An encryption-compression method and an information retrieval method for a search engine index
Technical field
The present invention relates to an encryption-compression method, in particular a bit-encryption compression method for a search engine index, and to an information retrieval method that applies it. It belongs to the field of information technology.
Background
With the rapid growth of the Internet, people's demand for information has increased sharply, and the ways in which people obtain information are multiplying.
Search engines, as the core function of network information search, play a huge role in daily life. There are currently many domestic search products, and the large scale of the domestic Internet and its large volume of information pose considerable challenges for search technology. How to be faster, more accurate, and more economical with resources has become a problem that search engine vendors need to solve.
Building an index is one of the core technologies of a search engine; its purpose is to respond quickly to user queries. The index data structure most commonly used by search engines is the inverted index. Its principle is quite simple: for ease of processing, words and document numbers are usually converted to numeric form.
Compressing the index has many benefits: it reduces the disk space and memory the index occupies, reduces the amount of I/O, and speeds up query response. To improve the compression effect, the index content is generally rewritten before compression: the values in the inverted index are first sorted by size and then represented by differences rather than actual values (d-gaps). This step is carried out before any of the compression algorithms is applied.
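The d-gap rewriting described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function names `to_dgaps` and `from_dgaps` and the sample document numbers are assumptions made for the example.

```python
def to_dgaps(doc_ids):
    """Sort document numbers and store each as the difference from the previous one."""
    gaps = []
    previous = 0
    for doc_id in sorted(doc_ids):
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_dgaps(gaps):
    """Rebuild the sorted document numbers by summing the gaps."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


postings = [33, 3, 12, 7, 40]      # hypothetical document numbers for one term
gaps = to_dgaps(postings)
print(gaps)                        # [3, 4, 5, 21, 7]
print(from_dgaps(gaps))            # [3, 7, 12, 33, 40]
```

The gaps are smaller than the original values, which is what lets the variable-length codes in the following sections spend fewer bits on each entry.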
Current compression methods can be divided into fixed-length and variable-length compression.
1. Fixed-length compression methods
The typical method is byte-aligned compression, which uses the byte as its coding unit, unlike variable-length encodings, which all use the bit as their coding unit. For a number to be compressed, two bits generally represent the length and the remaining bits hold the binary code of the value itself, as shown below:
[Table, an image in the original (Figure BDA00003600140800011); columns: numerical range, two-bit length code, compressed size]
Both fixed-length and variable-length index compression rest on one basic assumption: most of the values to be compressed are small, so they take up little space after compression. The compression ratio of this method is 10-20% of the original, uncompressed index.
2. Variable-length compression methods
a. Unary compression method
A value N to be compressed is represented with N+1 bits: the first N bits are 1 and the last bit is 0, which acts as the end marker. For example:
5: 111110
10: 11111111110
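A minimal sketch of the unary code as described above, using strings of "0" and "1" for readability. The function names are assumptions for illustration; a real index would pack the bits rather than keep them as text.

```python
def unary_encode(n):
    """Encode a non-negative integer as n ones followed by a terminating zero."""
    if n < 0:
        raise ValueError("unary coding is defined for n >= 0")
    return "1" * n + "0"


def unary_decode(bits):
    """Count the ones before the first zero."""
    return bits.index("0")


print(unary_encode(5))              # 111110
print(unary_encode(10))             # 11111111110
print(unary_decode("111110"))       # 5
```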
b. Elias compression method (gamma & delta)
A value X to be compressed is decomposed, using log2(X), into two parts. The first is N = floor(log2(X)), represented by N ones. The second is the remainder k = X - 2^N, represented in binary with a length of N bits. A 0 separates the two parts.
For example, the code for the value 2 is 100: floor(log2(2)) = 1 and the remainder is 0, with a separating 0 inserted in between.
For the value 10, N = floor(log2(10)) = 3, so the first part is 111; 10 - 2^3 = 2, so the second part is 010; with the separating 0 in between, the code is 1110010.
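The gamma code can be sketched as below, following the layout given in the text (N ones, a separating 0, then the remainder in N bits). The function names are assumptions, and bit strings are used only to keep the example readable.

```python
def gamma_encode(x):
    """Encode x >= 1 as N ones, a separating zero, and the remainder in N bits."""
    if x < 1:
        raise ValueError("gamma coding is defined for x >= 1")
    n = x.bit_length() - 1                    # floor(log2(x))
    remainder = x - (1 << n)                  # x - 2^N
    suffix = format(remainder, "b").zfill(n) if n > 0 else ""
    return "1" * n + "0" + suffix


def gamma_decode(bits):
    """Read N from the run of ones, then the N-bit remainder after the zero."""
    n = bits.index("0")
    suffix = bits[n + 1:n + 1 + n]
    remainder = int(suffix, 2) if n > 0 else 0
    return (1 << n) + remainder


print(gamma_encode(2))              # 100
print(gamma_encode(10))             # 1110010
print(gamma_decode("1110010"))      # 10
```

The code length is 2N + 1 bits, so small values, such as the d-gaps of frequent terms, are the cheapest to store.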
c. Golomb compression method
A value X to be compressed is decomposed with the formula X = q*b + r + 1, where 0 <= r < b.
Here b is called the bucket size and can be set as the situation requires. Suppose b = 3; then for the value 10 to be compressed, 10 = 3*3 + 0 + 1 (q = 3, b = 3, r = 0).
The first part encodes the quotient q, in a manner similar to unary coding; for 10 = 3*3 + 0 + 1, q = 3 is encoded as 1110.
The second part encodes the remainder r, still in binary, with a code length of log2(b) rounded down or rounded up; the r = 0 above is encoded as 0.
The advantage of Golomb over the Elias method lies in the bucket size: this value is configurable, so it can be tuned to the distribution of the values to be compressed in the index to obtain better compression.
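A sketch of the Golomb code under the decomposition X = q*b + r + 1 given above. The remainder is written with truncated binary coding, which is a common choice and an assumption here, since the text only says that the code length depends on log2(b). The function names are illustrative.

```python
def truncated_binary(r, b):
    """Encode a remainder 0 <= r < b in floor(log2(b)) or ceil(log2(b)) bits."""
    k = (b - 1).bit_length()                  # ceil(log2(b))
    if k == 0:                                # b == 1: the remainder is always 0
        return ""
    cutoff = (1 << k) - b                     # number of remainders that get the short code
    if r < cutoff:
        return format(r, "b").zfill(k - 1)
    return format(r + cutoff, "b").zfill(k)


def golomb_encode(x, b):
    """Encode x >= 1 with bucket size b: unary quotient, then the remainder."""
    if x < 1 or b < 1:
        raise ValueError("requires x >= 1 and b >= 1")
    q, r = divmod(x - 1, b)                   # x = q*b + r + 1
    return "1" * q + "0" + truncated_binary(r, b)


print(golomb_encode(10, 3))         # 11100  (q = 3 -> 1110, r = 0 -> 0)
print(golomb_encode(5, 3))          # 1010   (q = 1 -> 10,   r = 1 -> 10)
```

With b = 3 the remainders 0, 1, 2 map to 0, 10, 11, which matches the text's statement that r = 0 is encoded as 0.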
d. Mixed use
In general, different index fields have differently distributed values, each with its own characteristics. After analyzing the distribution of the values, a mixed compression strategy can be adopted; for example, d-gaps use Golomb compression and tf (term frequency) uses gamma compression.
Index compression brings many benefits, so practical search engines all adopt it. It also brings a problem, however: it requires more computation than leaving the index uncompressed.
Summary of the invention
Building on the prior art, the object of the present invention is to provide a more efficient index compression method, and an information retrieval method that applies it, so that information can be searched quickly and accurately and server resources are effectively saved.
The technical solution adopted by the present invention is as follows.
An encryption-compression method for a search engine index, comprising the steps of:
1) building a numeric index of the information resources using the inverted index method;
2) compressing the numeric index using a variable-length compression method;
3) encrypting the compressed index using the Base64 encryption algorithm, and using the encrypted keys as the final index.
Further, the inverted index method comprises word segmentation and high-frequency word filtering.
Further, the variable-length compression method is one of the following: the Unary method, the Elias method, the Golomb method.
An information retrieval method, comprising the steps of:
1) building a numeric index of the information resources using the inverted index method;
2) compressing the numeric index using a variable-length compression method;
3) encrypting the compressed index using the Base64 encryption algorithm, and using the encrypted keys as the final index;
4) encrypting the keyword to be queried with the Base64 encryption algorithm, and then performing information retrieval against the final index.
Further, the inverted index method comprises word segmentation and high-frequency word filtering.
Further, the variable-length compression method is one of the following: the Unary method, the Elias method, the Golomb method.
The present invention builds on existing mature compression technology and combines it with a mature encryption algorithm to form an effective compression method. In tests, the compression ratio reached up to 50% (the ratio varies with the length of the index information). Applying the method makes it possible to search information quickly and accurately and to save server resources effectively.
Description of the drawings
Fig. 1 is a flow chart of the steps of the encryption-compression method for a search engine index according to an embodiment of the present invention.
Fig. 2 is a flow chart of the steps of the information query method according to an embodiment of the present invention.
Detailed description of the embodiments
The present invention is described further below by way of a specific embodiment.
Fig. 1 is a flow chart of the steps of the encryption-compression method for a search engine index according to an embodiment of the present invention, described as follows:
1. Index building
Following the inverted index principle, the information content undergoes word segmentation and high-frequency word filtering, and words and document numbers are converted to numeric form to produce a numeric index.
For example, suppose the information content is:
D1: "The character 'demolish' appears on a gatepost of the Chinese embassy compound in the U.S."
D2: "Cheng Long and Mr. Li Lianjie represent the way forward for kung fu culture"
D3: "The Chinese Communist Party represents the development requirements of the 'Three Represents'"
After word segmentation and high-frequency word filtering, the following inverted index document can be built:
China -> D1,1; D2,1; D3,2;
Cheng Long -> D2,1;
Representative -> D2,1; D3,1;
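The index-building step can be sketched as follows. Word segmentation is out of scope here, so the sketch assumes the documents are already segmented; the token lists and the stop-word set are hypothetical and were chosen only so that the three terms above receive the same postings as in the example.

```python
from collections import Counter, defaultdict

# Hypothetical high-frequency words to filter out.
STOP_WORDS = {"the", "of"}


def build_inverted_index(documents, stop_words=STOP_WORDS):
    """Map each term to a list of (document id, term frequency) pairs."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        counts = Counter(t for t in documents[doc_id] if t not in stop_words)
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return dict(index)


# Hypothetical pre-segmented documents (not the patent's actual segmentation).
documents = {
    "D1": ["China", "embassy", "the", "gatepost"],
    "D2": ["Cheng Long", "Li Lianjie", "Representative", "the", "China", "kung fu"],
    "D3": ["China", "Representative", "China", "of", "development"],
}

index = build_inverted_index(documents)
print(index["China"])             # [('D1', 1), ('D2', 1), ('D3', 2)]
print(index["Cheng Long"])        # [('D2', 1)]
print(index["Representative"])    # [('D2', 1), ('D3', 1)]
```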
2. Index compression
The numeric index built with the inverted index is compressed with a variable-length method; any one of the Unary, Elias, or Golomb methods may be used. The principle of variable-length compression is as described in the Background.
3. Bit encryption
In the index compressed with the variable-length method, the keywords are encrypted with the Base64 algorithm, and the encrypted keys are used as the final index.
Base64 is one of the most common encodings for transmitting 8-bit byte code over networks and is a two-way encryption scheme. Base64 encoding can be used to pass longer identifying information in an HTTP environment. For example, the Java persistence framework Hibernate uses Base64 to encode a long unique identifier (typically a 128-bit UUID) as a string that serves as a parameter in HTTP forms and HTTP GET URLs. Other applications also often need to encode binary data into a form suitable for placing in a URL (including hidden form fields). In such cases Base64 encoding is not only fairly compact but also not human-readable: the encoded data cannot be read directly by eye. There are many encryption algorithms: one-way and two-way, symmetric and asymmetric; one-way algorithms are considered unsafe. Base64 is comparatively well suited to network transmission and has security properties; information such as documents and images can be stored in a container in encoded form, which helps save space.
For example, the index content is:
China -> D1,1; D2,1; D3,2;
Cheng Long -> D2,1;
Representative -> D2,1; D3,1;
After Base64 encoding, it becomes:
5Lit5Zu9->D1,1;D2,1;D3,2;
5oiQ6b6Z->D2,1;
5Luj6KGo->D2,1;D3,1;
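The key-encoding step can be reproduced with Python's standard `base64` module. The encoded strings above decode to the Chinese keywords 中国, 成龙 and 代表 in UTF-8, so the sketch uses those as index keys; the function names are assumptions. Note that Base64 as standardized is a reversible encoding that needs no key, so the sketch treats this step as key encoding rather than as cryptographic protection.

```python
import base64


def encode_keyword(keyword):
    """Base64-encode the UTF-8 bytes of a keyword and return ASCII text."""
    return base64.b64encode(keyword.encode("utf-8")).decode("ascii")


def encode_index_keys(index):
    """Replace every keyword in the index by its Base64 form; postings are unchanged."""
    return {encode_keyword(term): postings for term, postings in index.items()}


index = {
    "中国": [("D1", 1), ("D2", 1), ("D3", 2)],
    "成龙": [("D2", 1)],
    "代表": [("D2", 1), ("D3", 1)],
}

final_index = encode_index_keys(index)
print(encode_keyword("中国"))     # 5Lit5Zu9
print(encode_keyword("成龙"))     # 5oiQ6b6Z
print(encode_keyword("代表"))     # 5Luj6KGo
```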
An index file contains a large number of keywords, and after Base64 encoding the whole file is much larger than the unencoded file. At query time, the query keyword is likewise Base64-encoded and then used for retrieval or lookup, as shown in Fig. 2.
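The query step can be sketched as a dictionary lookup on the encoded keyword. The in-memory dictionary and the function names are assumptions for illustration; the patent does not specify how the final index is stored or searched.

```python
import base64


def encode_keyword(keyword):
    """Base64-encode the UTF-8 bytes of a keyword and return ASCII text."""
    return base64.b64encode(keyword.encode("utf-8")).decode("ascii")


def search(final_index, query):
    """Encode the query the same way as the index keys, then look it up."""
    return final_index.get(encode_keyword(query), [])


# Final index with Base64-encoded keys, as in the example above.
final_index = {
    "5Lit5Zu9": [("D1", 1), ("D2", 1), ("D3", 2)],
    "5oiQ6b6Z": [("D2", 1)],
    "5Luj6KGo": [("D2", 1), ("D3", 1)],
}

print(search(final_index, "成龙"))    # [('D2', 1)]
print(search(final_index, "中国"))    # [('D1', 1), ('D2', 1), ('D3', 2)]
```

Because the index keys and the query go through the same encoding, an exact-match lookup behaves the same as it would on the unencoded keywords.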
The above is the method of further compressing, with a bit encryption algorithm, the index produced by the inverted index principle and variable-length compression; that is, a bit-encryption compression method combining a bit encryption algorithm, the inverted index principle, and variable-length compression. Index compression usually just uses the inverted index principle to build a segmented-word index and then applies variable-length compression. The present invention is a solution to the problems of oversized index files and slow query response; it further reduces the server disk and memory footprint and speeds up responses. Comparing the index processed as above with the original, and with the operating efficiency of the original index unchanged, the compression ratio of the index improves noticeably, reaching up to 50% in tests; the longer the index content, the more pronounced the compression effect.
The above embodiment is intended only to illustrate the technical solution of the present invention, not to limit it. A person of ordinary skill in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the present invention; the scope of protection shall be as defined by the claims.

Claims (6)

1. An encryption-compression method for a search engine index, comprising the steps of:
1) building a numeric index of the information resources using the inverted index method;
2) compressing the numeric index using a variable-length compression method;
3) encrypting the compressed index using the Base64 encryption algorithm, and using the encrypted keys as the final index.
2. The method of claim 1, wherein the inverted index method comprises word segmentation and high-frequency word filtering.
3. The method of claim 1, wherein the variable-length compression method is one of the following: the Unary method, the Elias method, the Golomb method.
4. An information retrieval method, comprising the steps of:
1) building a numeric index of the information resources using the inverted index method;
2) compressing the numeric index using a variable-length compression method;
3) encrypting the compressed index using the Base64 encryption algorithm, and using the encrypted keys as the final index;
4) encrypting the keyword to be queried with the Base64 encryption algorithm, and then performing information retrieval against the final index.
5. The method of claim 4, wherein the inverted index method comprises word segmentation and high-frequency word filtering.
6. The method of claim 4, wherein the variable-length compression method is one of the following: the Unary method, the Elias method, the Golomb method.
CN2013103288463A 2013-07-31 2013-07-31 Encryption compressing method and information searching method for index of search engine Pending CN103399913A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013103288463A CN103399913A (en) 2013-07-31 2013-07-31 Encryption compressing method and information searching method for index of search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013103288463A CN103399913A (en) 2013-07-31 2013-07-31 Encryption compressing method and information searching method for index of search engine

Publications (1)

Publication Number Publication Date
CN103399913A true CN103399913A (en) 2013-11-20

Family

ID=49563541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013103288463A Pending CN103399913A (en) 2013-07-31 2013-07-31 Encryption compressing method and information searching method for index of search engine

Country Status (1)

Country Link
CN (1) CN103399913A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110908998A (en) * 2019-11-13 2020-03-24 广联达科技股份有限公司 Data storage and search method, system and computer readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7007015B1 (en) * 2002-05-01 2006-02-28 Microsoft Corporation Prioritized merging for full-text index on relational store
CN101520800A (en) * 2009-03-27 2009-09-02 华中科技大学 Cryptogram-based safe full-text indexing and retrieval system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
席齐: "基于Lucene的网页抓取与检索系统", 《中国优秀硕士学位论文全文数据库 信息科技辑》, 15 March 2012 (2012-03-15) *
苏潭英: "面向中文的数据库全文检索及其相关安全技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》, 15 June 2008 (2008-06-15) *

Similar Documents

Publication Publication Date Title
CN104408177B (en) Cipher text retrieval method based on cloud document system
CN105426709A (en) JPEG image information hiding based private information communication method and system
CN100492368C (en) Mobile terminal apparatus electronic file storage and management method
CN104657362A (en) Method and device for storing and querying data
US20090115646A1 (en) Data processing system and method
CN104753540A (en) Data compression method, data decompression method and device
CN103873860A (en) Document transmission method and device
CN104408100B (en) The compression method of structured web site daily record
CN109165144A (en) A kind of security log compression storage and search method based on variable-length record
CN113111090B (en) Multidimensional data query method based on order-preserving encryption
CN103731154B (en) Data compression algorithm based on semantic analysis
CN108737353B (en) Data encryption method and device based on data analysis system
CN101477539B (en) Information acquisition method and device
CN103701470B (en) Stream intelligence prediction differencing and compression algorithm and corresponding control device
CN103399913A (en) Encryption compressing method and information searching method for index of search engine
CN104767710B (en) The transmission payload extracting method of HTTP block transmissions coding based on DFA
CN106789938B (en) Method for monitoring search trace of browser at mobile phone end in real time
CN111414341B (en) Data normalization description method in Internet of things environment
CN114461768A (en) Homomorphic encryption-based multi-keyword file encryption retrieval method and system
CN112417843B (en) IDcode identification analysis system and implementation method thereof
CN109923549B (en) Searchable symmetric encryption system and method for processing inverted index
Jain et al. An efficient compression algorithm (ECA) for text data
CN102801430B (en) Compression algorithm for Chinese parameters of URL
KR101315683B1 (en) Encrypting and decrypting method without causing change of data size and type
US20070280474A1 (en) Encryption Method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20131120
