CN103399913A - Encryption compressing method and information searching method for index of search engine - Google Patents
- Publication number
- CN103399913A
- Authority
- CN
- China
- Prior art keywords
- index
- encryption
- compression
- adopt
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to an encryption-compression method for a search engine index and a method for searching information that uses it. First, a numeric index is built for the information resources using the inverted file method; the numeric index is then compressed with a variable-length compression method; finally, the compressed index is encrypted with the Base64 encryption algorithm, and the encrypted key serves as the final index. A keyword to be searched is encrypted with the Base64 encryption algorithm, and information retrieval is then performed against the final index. Building on existing mature compression technology and combining it with a mature encryption algorithm, the invention forms an effective compression method that enables fast and accurate information search and effectively saves server resources.
Description
Technical field
The present invention relates to an encryption-compression method, specifically a bit-encryption compression method for search engine indexes and an information retrieval method that applies it, and belongs to the field of information technology protocols.
Background art
With the rapid growth of the Internet, people's demand for information has increased sharply, and the ways in which people obtain information have become more and more varied.
Search engines, as the core function of network information search, play a huge role in daily life. There are currently many domestic search products; the large scale of the domestic Internet and the large volume of information also pose considerable challenges for search technology. How to search faster, more accurately, and with fewer resources has become a problem that search engine vendors need to solve.
Building an index is one of the core technologies of a search engine; the purpose of the index is to respond quickly to user queries. The index data structure most commonly used by search engines is the inverted index (inverted file). Its principle is quite simple; for ease of processing, words and document numbers are usually converted to numeric form.
Compressing the index has many benefits: it reduces the disk space and memory that the index occupies, reduces the amount of I/O, and speeds up query responses. To improve the compression effect, the index content is generally rewritten before compression: the values in the inverted index are first sorted by size and then represented by differences (d-gaps) rather than by their actual values. This step is carried out before any of the compression algorithms is applied.
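The d-gap rewriting described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function names `to_dgaps` and `from_dgaps` and the sample document numbers are hypothetical.

```python
from typing import List


def to_dgaps(doc_ids: List[int]) -> List[int]:
    """Sort the document numbers and replace each one by its gap to the previous one."""
    gaps = []
    previous = 0
    for doc_id in sorted(doc_ids):
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_dgaps(gaps: List[int]) -> List[int]:
    """Rebuild the sorted document numbers by accumulating the gaps."""
    doc_ids = []
    total = 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


# Hypothetical posting list for one term (document numbers in arbitrary order).
postings = [33, 3, 17, 12, 40]
gaps = to_dgaps(postings)      # [3, 9, 5, 16, 7]
restored = from_dgaps(gaps)    # [3, 12, 17, 33, 40]
```

The gaps are smaller than the original document numbers, which is what lets the codes described below spend fewer bits per value.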
Existing compression methods can be divided into fixed-length and variable-length compression.
1. Fixed-length compression method
A typical method is aligned compression, which uses the byte as its coding unit (byte-aligned), unlike variable-length coding, which uses the bit as its coding unit. For a number to be compressed, two bits generally represent the length and the remaining bits hold the binary code of the value itself, as shown below:
(Table not reproduced in the source; its columns were: numerical range, two-bit length field, compressed size.)
Index compression, both fixed-length and variable-length, rests on one basic assumption: most of the values to be compressed are small, so they occupy little space after compression. The compression ratio of this method is 10-20% of the original uncompressed index.
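One possible reading of the byte-aligned scheme with a two-bit length field is sketched below. The exact bit layout is not given in the text, so the layout here is an assumption: the two most significant bits of the first byte store the byte count minus one, leaving 6, 14, 22, or 30 bits for the value. The function names are hypothetical.

```python
def encode_byte_aligned(value: int) -> bytes:
    """Encode a value in 1 to 4 whole bytes; the top two bits hold (byte count - 1)."""
    if not 0 <= value < (1 << 30):
        raise ValueError("value must fit in 30 bits")
    n_bytes = 1
    # Grow the code until the value fits in the bits left after the length field.
    while value >= (1 << (8 * n_bytes - 2)):
        n_bytes += 1
    length_field = (n_bytes - 1) << (8 * n_bytes - 2)
    return (length_field | value).to_bytes(n_bytes, "big")


def decode_byte_aligned(data: bytes) -> int:
    """Read the length field from the first byte, then mask it off the value."""
    n_bytes = (data[0] >> 6) + 1
    word = int.from_bytes(data[:n_bytes], "big")
    return word & ((1 << (8 * n_bytes - 2)) - 1)


small = encode_byte_aligned(5)     # one byte: b"\x05"
medium = encode_byte_aligned(64)   # two bytes: b"\x40\x40"
```

Under this assumed layout, any value below 64 costs a single byte, which matches the assumption that most values in a d-gap list are small.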
2. Variable-length compression methods
a. Unary compression method
A value N to be compressed is represented with N+1 bits: the first N bits are 1 and the last bit is 0, which serves as the end marker. For example:
5: 111110
10: 11111111110
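The unary code above is short enough to sketch directly. The function names `unary_encode` and `unary_decode` are hypothetical, and bit strings are modelled as Python strings of "0" and "1" purely for readability.

```python
def unary_encode(n: int) -> str:
    """Represent n as n ones followed by a terminating zero (n + 1 bits in total)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return "1" * n + "0"


def unary_decode(bits: str) -> int:
    """The value is the number of ones before the first zero."""
    return bits.index("0")


code_5 = unary_encode(5)     # "111110"
code_10 = unary_encode(10)   # "11111111110"
```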
b. Elias compression method (gamma & delta)
A value X to be compressed is decomposed into two parts using log2(X). The first part is N = floor(log2(X)), represented by N ones. The second part is the remainder k = X - 2^N, represented in binary with a length of N bits. The two parts are separated by a 0.
For the value 2, for example, the code is 100: floor(log2(2)) = 1 and the remainder is 0, with a separating 0 inserted between the two parts.
For the value 10, N = floor(log2(10)) = 3, so the first part is 111; 10 - 2^3 = 2, so the second part is 010; with the separating 0 in between, the code is 1110010.
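A minimal sketch of the gamma code exactly as described above (N ones, a separating 0, then the remainder in N bits). The function name `gamma_encode` is hypothetical; `int.bit_length()` is used to obtain floor(log2(X)) without floating-point arithmetic.

```python
def gamma_encode(x: int) -> str:
    """Encode x >= 1 as N ones, a separating zero, and the N-bit remainder."""
    if x < 1:
        raise ValueError("x must be at least 1")
    n = x.bit_length() - 1          # floor(log2(x))
    remainder = x - (1 << n)        # k = x - 2^N
    # The remainder is written in exactly n bits; for n == 0 it is empty.
    remainder_bits = format(remainder, "b").zfill(n) if n > 0 else ""
    return "1" * n + "0" + remainder_bits


code_2 = gamma_encode(2)     # "100"
code_10 = gamma_encode(10)   # "1110010"
```

Both results agree with the worked examples in the text: 2 gives 100 and 10 gives 1110010.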
c. Golomb compression method
A value X to be compressed is decomposed by the formula X = q*b + r + 1, where 0 <= r < b.
Here b is called the bucket size and can be set to suit the situation. Suppose b = 3; then for the value 10: 10 = 3*3 + 0 + 1 (q = 3, b = 3, r = 0).
The first part encodes the quotient q using a method similar to unary coding; in 10 = 3*3 + 0 + 1, for example, q = 3 is encoded as 1110.
The second part encodes the remainder r, again in binary, with a code length of log2(b) rounded up, or that rounded-up value minus 1. For r = 0 above, the code is 0.
The advantage of Golomb over the Elias method lies in the bucket size: this value is adjustable, so it can be tuned to the distribution of the values in the index to obtain better compression.
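The Golomb decomposition above can be sketched as follows. The text does not say which remainders get the shorter code, so this sketch assumes the usual truncated-binary rule (the smallest remainders use one bit fewer); the function names are hypothetical.

```python
def truncated_binary(r: int, b: int) -> str:
    """Encode remainder r (0 <= r < b) in ceil(log2(b)) bits, or one bit fewer."""
    if b == 1:
        return ""                       # only one possible remainder, no bits needed
    k = (b - 1).bit_length()            # ceil(log2(b))
    short_codes = (1 << k) - b          # how many remainders get k - 1 bits
    if r < short_codes:
        return format(r, "b").zfill(k - 1)
    return format(r + short_codes, "b").zfill(k)


def golomb_encode(x: int, b: int) -> str:
    """Encode x >= 1 with bucket size b, using x = q*b + r + 1."""
    if x < 1 or b < 1:
        raise ValueError("x and b must be at least 1")
    q, r = divmod(x - 1, b)
    return "1" * q + "0" + truncated_binary(r, b)


code = golomb_encode(10, 3)   # q = 3 -> "1110", r = 0 -> "0", giving "11100"
```

Changing `b` shifts bits between the unary part and the remainder part, which is the tuning knob the text refers to: a larger bucket shortens the unary prefix for large gaps at the cost of a longer remainder for small ones.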
d. Mixed use
Different index fields generally have different value distributions, each with its own characteristics; by analysing the distribution of the values, a mixed compression strategy can be adopted. For example, d-gaps can use Golomb compression while tf (term frequency) uses gamma compression.
Index compression brings many benefits, so practical search engines all use it. It also brings a problem, however: it requires more computation than leaving the index uncompressed.
Summary of the invention
Building on the prior art, the object of the present invention is to provide a more efficient index compression method, and an information retrieval method that applies it, so as to achieve fast and accurate information search and effectively save server resources.
The technical solution adopted by the present invention is:
An encryption-compression method for a search engine index, comprising the steps of:
1) building a numeric index of the information resources using the inverted index method;
2) compressing the numeric index using a variable-length compression method;
3) encrypting the compressed index using the Base64 encryption algorithm, with the encrypted key serving as the final index.
Further, the inverted index method comprises word segmentation and high-frequency word filtering.
Further, the variable-length compression method is one of the following: the Unary method, the Elias method, the Golomb method.
An information retrieval method, comprising the steps of:
1) building a numeric index of the information resources using the inverted index method;
2) compressing the numeric index using a variable-length compression method;
3) encrypting the compressed index using the Base64 encryption algorithm, with the encrypted key serving as the final index;
4) encrypting the keyword to be queried with the Base64 encryption algorithm, then performing information retrieval against the final index.
Further, the inverted index method comprises word segmentation and high-frequency word filtering.
Further, the variable-length compression method is one of the following: the Unary method, the Elias method, the Golomb method.
The present invention builds on existing mature compression technology and combines it with a mature encryption algorithm to form an effective compression method. In testing, the compression ratio reached up to 50% (the ratio varies with the length of the index information). Applying this method enables fast and accurate information search and effectively saves server resources.
Brief description of the drawings
Fig. 1 is a flowchart of the steps of the encryption-compression method for a search engine index according to an embodiment of the present invention.
Fig. 2 is a flowchart of the steps of the information query method according to an embodiment of the present invention.
Embodiments
The present invention is further described below through specific embodiments.
Fig. 1 is a flowchart of the steps of the encryption-compression method for a search engine index according to an embodiment of the present invention, described as follows:
1. Index building:
Using the inverted index principle, the information content undergoes word segmentation and high-frequency word filtering; words and document numbers are converted to numeric form to produce a numeric index.
For example, the information content is:
D1: "The character for 'demolish' appears on the courtyard gatepost of the Chinese embassy in the United States"
D2: "Cheng Long and Mr. Li Lianjie represent the way forward for kung fu culture"
D3: "The Chinese Communist Party represents the development requirements of the 'Three Represents'"
After word segmentation and high-frequency word filtering, the following inverted index can be built:
China-> D1,1; D2,1; D3,2;
Cheng Long-> D2,1;
Representative-> D2,1; D3,1;
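The index-building step can be sketched as below. Word segmentation of the original Chinese sentences is outside the scope of this sketch, so it assumes the documents have already been segmented; the token lists and the stop-word set are hypothetical and were chosen only so that the result reproduces the three postings shown above. The function name `build_inverted_index` is also hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Postings = List[Tuple[str, int]]  # (document number, term frequency)


def build_inverted_index(docs: Dict[str, List[str]],
                         stop_words: Iterable[str]) -> Dict[str, Postings]:
    """Map each remaining term to the documents it occurs in, with its frequency."""
    filtered = set(stop_words)
    index: Dict[str, Postings] = defaultdict(list)
    for doc_id in sorted(docs):
        # Count the terms of this document after dropping high-frequency words.
        counts = Counter(t for t in docs[doc_id] if t not in filtered)
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return dict(index)


# Hypothetical, already-segmented token lists (not the patent's actual segmentation).
docs = {
    "D1": ["China", "embassy", "the"],
    "D2": ["Cheng Long", "Representative", "China", "the"],
    "D3": ["China", "China", "Representative", "the"],
}
index = build_inverted_index(docs, stop_words={"the", "embassy"})
# index["China"] == [("D1", 1), ("D2", 1), ("D3", 2)]
```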
2. Index compression method
The numeric index built with the inverted index is compressed using a variable-length compression method; any one of the Unary, Elias, or Golomb methods may be used. The principle of variable-length compression is as described in the background section.
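As one illustration of this step, the sketch below compresses a list of document numbers by taking d-gaps and gamma-coding each gap, then reverses the process. It assumes distinct positive document numbers and picks the Elias gamma variant from the background section; the function names and the sample numbers are hypothetical.

```python
from typing import List


def gamma_encode(x: int) -> str:
    """N ones, a separating zero, then the N-bit remainder x - 2^N."""
    n = x.bit_length() - 1
    tail = format(x - (1 << n), "b").zfill(n) if n > 0 else ""
    return "1" * n + "0" + tail


def compress_postings(doc_ids: List[int]) -> str:
    """Sort, take d-gaps, and concatenate the gamma code of every gap."""
    bits, previous = [], 0
    for doc_id in sorted(doc_ids):
        bits.append(gamma_encode(doc_id - previous))
        previous = doc_id
    return "".join(bits)


def decompress_postings(bits: str) -> List[int]:
    """Decode the gamma codes one by one and accumulate the gaps."""
    doc_ids, total, pos = [], 0, 0
    while pos < len(bits):
        n = 0
        while bits[pos] == "1":      # count the leading ones
            n += 1
            pos += 1
        pos += 1                     # skip the separating zero
        remainder = int(bits[pos:pos + n], 2) if n else 0
        pos += n
        total += (1 << n) + remainder
        doc_ids.append(total)
    return doc_ids


packed = compress_postings([3, 12, 17, 33, 40])   # hypothetical document numbers
unpacked = decompress_postings(packed)            # [3, 12, 17, 33, 40]
```

Because every code is self-delimiting, the gaps can be concatenated without separators and still be decoded unambiguously.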
3. Bit encryption method
In the index compressed by the variable-length method, the keywords are encrypted with the Base64 algorithm, and the encrypted keys serve as the final index.
Base64 is one of the most common encoding schemes for transmitting 8-bit byte code over networks, and is a two-way (reversible) encryption scheme. Base64 encoding can be used to transmit longer identification information in an HTTP environment. For example, Hibernate, a Java persistence framework, uses Base64 to encode a long unique identifier (generally a 128-bit UUID) as a string for use as a parameter in HTTP forms and HTTP GET URLs. Other applications also often need to encode binary data into a form suitable for placing in a URL (including hidden form fields). In such cases, Base64 encoding is not only fairly compact but also unreadable, in that the encoded data cannot be read directly by the naked eye. There are many encryption algorithms: one-way and two-way, symmetric and asymmetric; one-way algorithms are regarded as a kind of unsafe approach. Base64 is comparatively well suited to network transmission and has security properties; information such as documents and pictures can be stored in a container in encoded form, which helps save space.
For example, the index content is:
China-> D1,1; D2,1; D3,2;
Cheng Long-> D2,1;
Representative-> D2,1; D3,1;
After Base64 encoding, it becomes:
5Lit5Zu9->D1,1;D2,1;D3,2;
5oiQ6b6Z->D2,1;
5Luj6KGo->D2,1;D3,1;
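The keyword encoding shown above can be reproduced with Python's standard `base64` module. Decoding the three encoded keys as UTF-8 yields the Chinese terms 中国, 成龙, and 代表, so the sketch assumes the keywords were UTF-8 encoded before Base64 was applied. Base64 is a keyless, reversible encoding, so anyone can run the same decode step. The helper names are hypothetical.

```python
import base64


def encode_keyword(keyword: str) -> str:
    """Base64-encode the UTF-8 bytes of a keyword."""
    return base64.b64encode(keyword.encode("utf-8")).decode("ascii")


def decode_keyword(token: str) -> str:
    """Reverse the encoding; no key is involved."""
    return base64.b64decode(token).decode("utf-8")


# Encoded keys copied from the example above, with their postings.
final_index = {
    "5Lit5Zu9": [("D1", 1), ("D2", 1), ("D3", 2)],
    "5oiQ6b6Z": [("D2", 1)],
    "5Luj6KGo": [("D2", 1), ("D3", 1)],
}

# Recover the original keywords from the encoded keys.
original_terms = [decode_keyword(key) for key in final_index]
```

Each two-character keyword occupies 6 bytes in UTF-8 and 8 characters after Base64 encoding.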
An index file contains a large number of keywords, and after Base64 encoding the size of the whole document is much larger than that of the unencoded file. At query time, the query keyword is likewise Base64-encoded and then used for retrieval or lookup, as shown in Fig. 2.
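The query step can be sketched as a dictionary lookup on the encoded key. This assumes the final index is held in memory as a mapping from Base64 keys to postings, which the text does not specify; the function name `search` is hypothetical, and the index entries are the encoded keys from the example above.

```python
import base64
from typing import Dict, List, Tuple

Postings = List[Tuple[str, int]]  # (document number, term frequency)


def search(keyword: str, index: Dict[str, Postings]) -> Postings:
    """Encode the query keyword the same way as the index keys, then look it up."""
    key = base64.b64encode(keyword.encode("utf-8")).decode("ascii")
    return index.get(key, [])


final_index = {
    "5Lit5Zu9": [("D1", 1), ("D2", 1), ("D3", 2)],
    "5oiQ6b6Z": [("D2", 1)],
    "5Luj6KGo": [("D2", 1), ("D3", 1)],
}

hits = search("成龙", final_index)     # [("D2", 1)]
misses = search("功夫", final_index)   # [] (keyword not in this sample index)
```

Because the encoding is deterministic, encoding the query and comparing encoded keys gives the same matches as comparing the original keywords.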
The above is the method of using a bit-encryption algorithm to further compress the index produced by the inverted index principle and variable-length compression, that is, a bit-encryption compression method combining an encryption algorithm, the inverted index principle, and variable-length compression. Conventionally, index compression merely builds a segmented-word index using the inverted index principle and then applies variable-length compression. The present invention provides a solution to oversized index files and slow query responses; it further reduces the server disk space and memory occupied and speeds up responses. Comparing the index after the above processing with the original, and with the operating efficiency of the original index unchanged, the compression ratio of the index improves noticeably, reaching up to 50% in testing; the longer the index content, the more pronounced the compression effect.
The above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Those of ordinary skill in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the present invention; the scope of protection shall be as set out in the claims.
Claims (6)
1. An encryption-compression method for a search engine index, comprising the steps of:
1) building a numeric index of the information resources using the inverted index method;
2) compressing the numeric index using a variable-length compression method;
3) encrypting the compressed index using the Base64 encryption algorithm, with the encrypted key serving as the final index.
2. The method of claim 1, characterized in that the inverted index method comprises word segmentation and high-frequency word filtering.
3. The method of claim 1, characterized in that the variable-length compression method is one of the following: the Unary method, the Elias method, the Golomb method.
4. An information retrieval method, comprising the steps of:
1) building a numeric index of the information resources using the inverted index method;
2) compressing the numeric index using a variable-length compression method;
3) encrypting the compressed index using the Base64 encryption algorithm, with the encrypted key serving as the final index;
4) encrypting the keyword to be queried with the Base64 encryption algorithm, then performing information retrieval against the final index.
5. The method of claim 4, characterized in that the inverted index method comprises word segmentation and high-frequency word filtering.
6. The method of claim 4, characterized in that the variable-length compression method is one of the following: the Unary method, the Elias method, the Golomb method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013103288463A CN103399913A (en) | 2013-07-31 | 2013-07-31 | Encryption compressing method and information searching method for index of search engine |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013103288463A CN103399913A (en) | 2013-07-31 | 2013-07-31 | Encryption compressing method and information searching method for index of search engine |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103399913A true CN103399913A (en) | 2013-11-20 |
Family
ID=49563541
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2013103288463A Pending CN103399913A (en) | 2013-07-31 | 2013-07-31 | Encryption compressing method and information searching method for index of search engine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103399913A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110908998A (en) * | 2019-11-13 | 2020-03-24 | 广联达科技股份有限公司 | Data storage and search method, system and computer readable storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7007015B1 (en) * | 2002-05-01 | 2006-02-28 | Microsoft Corporation | Prioritized merging for full-text index on relational store |
CN101520800A (en) * | 2009-03-27 | 2009-09-02 | 华中科技大学 | Cryptogram-based safe full-text indexing and retrieval system |
-
2013
- 2013-07-31 CN CN2013103288463A patent/CN103399913A/en active Pending
Non-Patent Citations (2)
Title |
---|
席齐: "基于Lucene的网页抓取与检索系统", 《中国优秀硕士学位论文全文数据库 信息科技辑》, 15 March 2012 (2012-03-15) * |
苏潭英: "面向中文的数据库全文检索及其相关安全技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》, 15 June 2008 (2008-06-15) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104408177B (en) | Cipher text retrieval method based on cloud document system | |
CN105426709A (en) | JPEG image information hiding based private information communication method and system | |
CN100492368C (en) | Mobile terminal apparatus electronic file storage and management method | |
CN104657362A (en) | Method and device for storing and querying data | |
US20090115646A1 (en) | Data processing system and method | |
CN104753540A (en) | Data compression method, data decompression method and device | |
CN103873860A (en) | Document transmission method and device | |
CN104408100B (en) | The compression method of structured web site daily record | |
CN109165144A (en) | A kind of security log compression storage and search method based on variable-length record | |
CN113111090B (en) | Multidimensional data query method based on order-preserving encryption | |
CN103731154B (en) | Data compression algorithm based on semantic analysis | |
CN108737353B (en) | Data encryption method and device based on data analysis system | |
CN101477539B (en) | Information acquisition method and device | |
CN103701470B (en) | Stream intelligence prediction differencing and compression algorithm and corresponding control device | |
CN103399913A (en) | Encryption compressing method and information searching method for index of search engine | |
CN104767710B (en) | The transmission payload extracting method of HTTP block transmissions coding based on DFA | |
CN106789938B (en) | Method for monitoring search trace of browser at mobile phone end in real time | |
CN111414341B (en) | Data normalization description method in Internet of things environment | |
CN114461768A (en) | Homomorphic encryption-based multi-keyword file encryption retrieval method and system | |
CN112417843B (en) | IDcode identification analysis system and implementation method thereof | |
CN109923549B (en) | Searchable symmetric encryption system and method for processing inverted index | |
Jain et al. | An efficient compression algorithm (ECA) for text data | |
CN102801430B (en) | Compression algorithm for Chinese parameters of URL | |
KR101315683B1 (en) | Encrypting and decrypting method without causing change of data size and type | |
US20070280474A1 (en) | Encryption Method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20131120 |