CN103207835A - Mass data storage method through self-adaptive Range partitions - Google Patents

Publication number
CN103207835A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013101305375A
Other languages
Chinese (zh)
Inventor
张志远
刘金晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DONGGUAN MUNICIPAL PUBLIC SECURITY BUREAU
Beijing Ruian Technology Co Ltd
Original Assignee
DONGGUAN MUNICIPAL PUBLIC SECURITY BUREAU
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DONGGUAN MUNICIPAL PUBLIC SECURITY BUREAU, Beijing Ruian Technology Co Ltd
Priority to CN2013101305375A
Publication of CN103207835A
Legal status: Pending

Abstract

The invention relates to a method for storing mass data through adaptive Range partitioning. The method comprises: acquiring the range information of each level of the mass data and determining the number of levels n, an integer greater than 2; determining the total number of levels m to be partitioned, and setting, for the i-th level, a standard number of values pi0 per partition and a maximum number of values pi1 allowed in any partition; dividing the value range of the i-th level evenly into partitions of pi0 values each; judging whether the number of values in the last partition plus pi0 is less than pi1 and, if so, merging the last partition into the previous one, otherwise keeping it as a separate partition; and finally storing the mass data according to the partitioning result. Here m is an integer from 1 to n, and i is an integer from (n-m+1) to n. The method is adaptive and can improve the storage and query performance of mass data.

Description

A method for storing mass data through adaptive Range partitioning
Technical field
The invention belongs to the field of computer databases and specifically relates to a method for storing mass data through adaptive Range partitioning, which can improve the storage and query performance of mass data.
Background art
In database applications, mass data is generally stored using partitioning. This technique splits a very large table into multiple smaller tables according to some rule and stores them in different areas. The table remains a single table logically, but physically it can be stored in different locations as if it were many tables, which simplifies database administration and can also improve application performance. Because the optimizer knows the value ranges used as the basis for partitioning, it can query only the relevant partitions when accessing the table. Since only a small amount of data is scanned during a query, query performance improves naturally. At the same time, because the external interface is still a single table, partitioning is transparent to users and applications, which do not perceive the partitions at all. Large-table partitioning is therefore very widely used in mass data storage.
In Oracle, a table is partitioned by a "partition key". The partition key is one or more columns whose values determine the partition in which a given row is stored. The partitioning methods supported by Oracle 10g include Range partitioning, List partitioning, Hash partitioning and composite partitioning. Each has its own characteristics and advantages.
Range partitioning is Oracle's earliest and most classic partitioning scheme: data is stored in the partition that corresponds to a specified range of partition key values. With range partitioning it is known which data is stored in which partition, which makes large-scale data operations more convenient. The drawback of Range partitioning is that data can be unevenly distributed, which degrades data access performance.
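To illustrate why range partitioning lets a query touch only a few partitions, the following Python sketch maps a key to its partition with a binary search over partition upper bounds. The bounds and the function names are hypothetical and chosen only for illustration; they say nothing about how Oracle implements this internally.

```python
from bisect import bisect_right

# Hypothetical exclusive upper bounds of four range partitions on an integer key:
# partition 0 holds keys < 1000, partition 1 holds 1000..1999, and so on.
UPPER_BOUNDS = [1000, 2000, 3000, 4000]


def partition_for(key: int) -> int:
    """Return the index of the partition whose range contains key."""
    idx = bisect_right(UPPER_BOUNDS, key)
    if idx == len(UPPER_BOUNDS):
        # A key outside every range cannot be stored in the table.
        raise ValueError(f"key {key} is outside every partition range")
    return idx


def partitions_for_range(lo: int, hi: int) -> range:
    """Partitions that a query on lo <= key <= hi has to scan."""
    return range(partition_for(lo), partition_for(hi) + 1)
```

A query on keys 1500 to 2500 only needs partitions 1 and 2; the other partitions are never read.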
List partitioning is similar to Range partitioning. The difference is that Range partitioning divides the data by contiguous ranges, whereas List partitioning divides it by discrete values of the records. It is only suitable when the partition key takes values from a finite set.
Hash partitioning is the opposite of Range partitioning. Its biggest advantage is that data is distributed evenly across the partitions, so data access performance is good. Its drawbacks are that it is unsuitable for bulk data management operations and that its range query performance is not as good as that of Range partitioning.
Composite partitioning is multi-level partitioning, for example Range partitioning first and then Hash partitioning within each range partition. The advantages and disadvantages of the partitioning method chosen for the first level remain, and the maintenance cost is higher.
In practical scenarios the following situation arises: a table holds a very large amount of data, range queries and updates on a certain key are both very frequent, and the value range of that key column is known. Range partitioning on this key is therefore chosen to distribute the data. However, the data is unevenly distributed over the key values, and for different deployment environments the initial distribution of data over the key values cannot be known in advance. In this case, if fixed partition ranges are used, the data will be distributed unevenly in different deployment environments and the benefits of partitioning are not realized. If the range is set too small, the data is relatively balanced among the partitions that contain data, but partition coverage is too low; if the range is set too large, partition coverage is comparatively higher, but the data is unbalanced. Either way, the point of partitioning is lost.
Summary of the invention
To address the above technical problem, the invention proposes a method for storing mass data through adaptive Range partitioning. It intelligently generates the value range of each partition so that, whatever environment the system is deployed in, data is distributed evenly across the partitions, realizing the benefits of large-table partitioning in mass data processing.
To achieve the above objective, the invention adopts the following technical solution:
A method for storing mass data through adaptive Range partitioning, as shown in Fig. 1, comprising the following steps:
1) obtain the range information of each level of the mass data and determine the number of levels n, where n is an integer greater than 2;
2) determine the total number of levels m to be partitioned, and set the preferred number of values pi0 for a level-i partition and the maximum number of values pi1 allowed in any level-i partition section, where m is an integer in [1, n] and i is an integer in [(n-m+1), n];
3) automatically divide the value range of level i evenly using pi0 as the standard to form partition sections; for the last section, judge whether its number of values plus pi0 is less than pi1; if so, merge it with the previous section into one section, otherwise keep it as an independent section;
4) store the mass data based on the partition sections obtained in step 3).
In step 1), the outermost value range corresponds to level 1, the next outermost to level 2, and so on; the innermost value range used corresponds to level n. Because data may fall anywhere within the total value range, the total value range is necessarily the first level. Otherwise, if the value of some record fell in no partition, that record could not be written to the table.
Preferably, pi1 is no more than 150% of pi0.
Preferably, only the value ranges with higher level numbers are partitioned, to obtain a better data balancing effect.
In the Range partitioning method of the invention, the value range of level i is divided evenly using the configured pi0 as the standard, and the number of values in a non-standard section may not exceed pi1. Take the level-i value range as an example and suppose it consists of two ranges, { [V1, V2], [V3, V4] }, with partition parameters pi0 and pi1. pi0 means that, starting from V1, [V1, V2] and [V3, V4] are divided into consecutive sections of pi0 values each. pi1 governs the last section of a range: if that section contains too few values, it is merged into the preceding section. However, the merged non-standard section may not exceed pi1 values; if merging would exceed pi1, the sections are not merged, and the final section then contains fewer than pi0 values.
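The division and merge rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name split_sizes is hypothetical, only section sizes are computed, and each contiguous range is assumed to be divided independently, as the worked example in the embodiment suggests.

```python
from typing import List


def split_sizes(total: int, p0: int, p1: int) -> List[int]:
    """Sizes of the sections for one contiguous range holding `total` values.

    p0 is the standard number of values per section and p1 the maximum
    allowed in any section (the pi0 and pi1 of the text).
    """
    if total <= 0 or p0 <= 0 or p1 < p0:
        raise ValueError("need total > 0, p0 > 0 and p1 >= p0")
    full, rest = divmod(total, p0)
    sizes = [p0] * full
    if rest:
        if sizes and rest + p0 < p1:
            sizes[-1] += rest      # short tail: merge into the previous section
        else:
            sizes.append(rest)     # merging would reach p1: keep the tail separate
    return sizes
```

For instance, split_sizes(131072, 5000, 8000) yields 26 sections, the last of which holds 6072 values, while split_sizes(393216, 20000, 30000) keeps a separate 20th section of 13216 values.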
The invention has the following advantages and positive effects:
1. The partition ranges obtained with the method of the invention are generated automatically from the actual data of each deployment environment. The invention can therefore distribute data across the partitions as evenly as possible; it is adaptive and fully realizes the benefits of partitioning.
2. The invention divides the value range into multiple levels. Depending on the actual data of each environment, value ranges where data is concentrated are divided using the standard parameters, while value ranges with relatively little data can be divided at a coarser granularity according to the corresponding parameters. This creates as few nearly empty partitions as possible, ensures that each partition holds as much data as possible, reduces the total number of partitions, balances the data across partitions, and ensures that range queries touch as few partitions as possible, improving the query performance and DML performance of mass data.
Description of drawings
Fig. 1 is a flow chart of the steps of the method of the invention.
Embodiment
To help those skilled in the art better understand the invention, it is described in further detail below using certain data of an operator as an example. In the mass data storage method of the invention the emphasis is on implementing adaptive Range partitioning, so the description below focuses on the partitioning process. The partition key of the table is the IP address.
1) Division into levels.
In this example, the total value range of the partition key is all IP values, i.e. the worldwide IP address range. The IP range is divided level by level from the worldwide range down to local ranges, giving n levels in total. The number of levels reflects how detailed the IP address information is: the larger n is, the more precisely IP addresses can be classified. For example, the worldwide IP range 0.0.0.0~255.255.255.255 is level 1; the IP ranges of China, a subset of the worldwide IP database, are level 2; the IP database of a province or municipality, a subset of the Chinese IP ranges, is level 3; a district within the province, a subset of the provincial IP database, is level 4; and so on.
The number of levels can be decided according to how IP addresses are used in the specific region. If the deployment environment is in Beijing, there are two ways to set the number of levels: obtain the IP address ranges of both China and Beijing, giving 3 levels; or treat everything outside the Beijing IP ranges as a single range and obtain only the Beijing IP address ranges, giving 2 levels. When the deployment environment is Beijing, the table inevitably stores the most data for Beijing, relatively less for other Chinese provinces and municipalities, and less still for foreign addresses. If 2 levels were chosen, the IPs of other provinces and municipalities could only be grouped together with foreign IPs, which would distribute domestic data unevenly. Therefore 3 levels are chosen: the IP address ranges of China and of Beijing are imported separately, and the Beijing IPs and the IPs of other provinces and municipalities are partitioned with different standards. In this way the IPs of other provinces and municipalities also have their own allocation rule, which relatively reduces the imbalance of the data distribution.
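The nesting of levels can be pictured with a small Python sketch that finds the deepest level whose ranges contain a given address. The table LEVEL_RANGES and the function level_of are hypothetical; for brevity each level holds a single section (the two sections used in the worked example further below), whereas a real deployment would load every section of each level.

```python
from ipaddress import IPv4Address

# Hypothetical level table: level number -> list of (first IP, last IP) sections.
# Level 1 is the whole address space; deeper levels are subsets of shallower ones.
LEVEL_RANGES = {
    1: [("0.0.0.0", "255.255.255.255")],
    2: [("124.220.0.0", "124.240.191.155")],   # one national section
    3: [("124.226.0.0", "124.227.255.255")],   # one regional section
}


def level_of(ip: str) -> int:
    """Return the deepest level whose ranges contain the address."""
    addr = int(IPv4Address(ip))
    deepest = 0
    for level, sections in LEVEL_RANGES.items():
        for first, last in sections:
            if int(IPv4Address(first)) <= addr <= int(IPv4Address(last)):
                deepest = max(deepest, level)
    return deepest
```

Because level 1 covers every address, every IP resolves to at least level 1, which reflects the requirement that no record may fall outside all partitions.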
2) Value range: the basis of partitioning.
A partition at any given level necessarily has a value range. 0.0.0.0~255.255.255.255, for example, is also a range; it is simply the total worldwide IP range. The IP ranges are determined by the number of levels n. In this example, if three IP ranges are used for the division (the total IP range, the IP ranges of China and the IP ranges of Beijing), the parameter is 3. If two IP ranges suffice, the parameter is 2, and the ranges involved are the total IP range and the Beijing IP ranges.
3) Setting the value-count parameters of the sections.
The value-count parameters of the sections are [pi0, pi1; pn0, pn1], where i is the level number of the value range. This group of parameters gives the number of values contained in a data-bearing partition and reflects how fine-grained the partitioning is. pi0 ((n-m+1) ≤ i ≤ n) is the preferred number of values in a level-i partition and serves as the standard for dividing the value range. There are n levels in total, but not every level necessarily needs to be divided, so m denotes the total number of levels to be divided. For example, n=5 and m=3 means there are five levels of ranges in total but only levels 3, 4 and 5 are divided; the outermost and next outermost levels are not.
The larger pi0 is, the more values each partition contains and the more data each partition can hold, but the total number of sections decreases accordingly. The smaller pi0 is, the fewer values each partition contains and the less data it holds, but the total number of sections increases accordingly. pi1 is the maximum number of values allowed in a level-i partition. This parameter mainly prevents the last of the sections into which a value range is successively divided from becoming a separate section when its number of values falls far short of pi0. With this parameter, even if the number of values after merging the last short section with the preceding section exceeds the standard pi0, as long as it does not exceed pi1, the configured maximum, the last short section does not become a separate section but is merged with the preceding one. The initial value of pi0 is set according to the deployment environment. In general the data in a partition should preferably not exceed the order of one million rows, so pi0 is determined from an assessment of the expected data. Because pi1 only affects the last section of each original value range, it is not applied very often; pi1 is generally set to no more than 150% of pi0. In practice it can be set according to the actual data and the developers' experience.
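The relationship between n, m and the levels that are actually divided is simple index arithmetic. The following Python helper is a hypothetical sketch of it; the name levels_to_partition does not come from the source.

```python
from typing import List


def levels_to_partition(n: int, m: int) -> List[int]:
    """Levels i with (n - m + 1) <= i <= n, i.e. the m innermost levels."""
    if not 1 <= m <= n:
        raise ValueError("m must be an integer in [1, n]")
    return list(range(n - m + 1, n + 1))
```

With n=5 and m=3 this returns levels 3, 4 and 5, matching the example in the text; levels 1 and 2 keep their ranges undivided.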
4) Determining the levels to partition.
In practice not every level of data needs partitioning. In general, the higher the level number of the range being partitioned, the better the data balance after partitioning.
For example, the IP ranges of China and of Beijing are known to be multiple non-overlapping, non-contiguous sections, i.e. n=3. Let m=2; the IP ranges are then divided according to [p20, p21; p30, p31]. Here p20=20000, p21=30000; p30=5000, p31=8000. Domestic data is divided on the standard of 20000 IPs per section and at most 30000 IPs, Beijing IPs are divided on the standard of 5000 IPs per section and at most 8000 IPs, and foreign IP sections are not subdivided.
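One way to hold these per-level parameters is a small lookup table, sketched below in Python. The names LevelParams, PARAMS and params_for are hypothetical. The ratio method makes it easy to compare pi1 with pi0; note that the [5000, 8000] pair gives 160%, slightly above the 150% that the text describes as the general guideline, so that guideline is treated here as advisory and is not enforced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LevelParams:
    p0: int  # standard number of values per section (pi0)
    p1: int  # maximum number of values allowed in a section (pi1)

    def ratio(self) -> float:
        """pi1 expressed as a multiple of pi0."""
        return self.p1 / self.p0


# Parameters from the example: n=3, m=2, so only levels 2 and 3 are divided.
PARAMS = {
    2: LevelParams(20000, 30000),  # national ranges
    3: LevelParams(5000, 8000),    # local ranges
}


def params_for(level: int) -> Optional[LevelParams]:
    """Return the parameters of a level, or None if it is not subdivided."""
    return PARAMS.get(level)
```

Level 1 has no entry, which corresponds to foreign IP sections being stored whole.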
An example of partitioning the IP ranges of a certain region is given below:
Three IP ranges are used: the worldwide IP range, the IP ranges of China and the IP ranges of the region, so n=3. The Chinese IP ranges comprise 371 non-contiguous sections and the IP ranges of the province comprise 27 non-contiguous sections. The standard and maximum numbers of IPs per section are set to [20000, 30000; 5000, 8000].
One IP section is taken from each of the regional and national levels to illustrate the partitioning: one of the Chinese IP sections is 124.220.0.0~124.240.191.155, and one of the region's IP sections is 124.226.0.0~124.227.255.255.
The detailed partitioning steps are as follows:
1. The region's IP section 124.226.0.0~124.227.255.255 is divided into sections according to the standard [5000, 8000], as shown in Table 1:
Table 1. Partition table of IP section 124.226.0.0~124.227.255.255
No.   From IP           To IP             IPs in section
1     124.226.0.0       124.226.19.135    5000
2     124.226.19.136    124.226.39.15     5000
3     124.226.39.16                       5000
...                                       5000
26    124.227.232.72    124.227.251.207   5000
27    124.227.251.208   124.227.255.255   1072
As Table 1 shows, with the standard [5000, 8000] this IP section is first divided evenly into 26 sections of 5000 IPs each, plus a 27th section with only 1072 IPs, far fewer than the standard 5000. Since 1072+5000=6072 is less than the maximum allowed 8000, the 27th section is merged into the 26th. The IP section is therefore finally divided into 26 sections: the first 25 each contain 5000 IPs, and the last, the 26th, contains 6072 IPs.
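The boundaries of this regional example can be reproduced with Python's standard ipaddress module. The sketch below is illustrative only; partition_range is a hypothetical name, and it applies the merge rule as stated in the method (the tail is merged when its size plus pi0 is less than pi1).

```python
from ipaddress import IPv4Address
from typing import List, Tuple


def partition_range(first: str, last: str, p0: int, p1: int) -> List[Tuple[str, str, int]]:
    """Split one contiguous IP section into (from IP, to IP, count) sections."""
    lo, hi = int(IPv4Address(first)), int(IPv4Address(last))
    sections = []
    start = lo
    while start <= hi:
        end = min(start + p0 - 1, hi)   # a standard section holds p0 addresses
        sections.append([start, end])
        start = end + 1
    if len(sections) > 1:
        tail = sections[-1][1] - sections[-1][0] + 1
        if tail < p0 and tail + p0 < p1:
            # Short tail: extend the previous section to the end of the range.
            sections[-2][1] = sections[-1][1]
            sections.pop()
    return [(str(IPv4Address(a)), str(IPv4Address(b)), b - a + 1) for a, b in sections]


sections = partition_range("124.226.0.0", "124.227.255.255", 5000, 8000)
```

The resulting list has 26 entries. The first is 124.226.0.0 to 124.226.19.135 with 5000 addresses, and the last runs from 124.227.232.72 to 124.227.255.255 with 6072 addresses.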
2. The Chinese IP section 124.220.0.0~124.240.191.155 is divided according to the standard [20000, 30000]. Because the IPs from step 1 lie within this section, the section is in effect cut by step 1 into two non-contiguous IP sections, 124.220.0.0~124.225.255.255 and 124.228.0.0~124.240.191.155, as shown in Table 2:
Table 2. Partition table of IP section 124.220.0.0~124.240.191.155
As Table 2 shows, this large IP section is finally split into 62 sections. Except for the 20th and the 62nd, each section contains 20000 IPs. If either of these two special sections were merged with the adjacent preceding section, the limit of at most 30000 IPs would be exceeded, so they are not merged and each remains an independent section.
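The count of 62 sections can be checked with a short Python sketch that first removes the regional section from the national one and then divides each remaining piece. The helper names ip, subtract and section_sizes are hypothetical, and subtract assumes the inner section lies entirely inside the outer one, as it does here.

```python
from ipaddress import IPv4Address
from typing import List, Tuple


def ip(text: str) -> int:
    return int(IPv4Address(text))


def subtract(outer: Tuple[int, int], inner: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Remove `inner` (assumed to lie inside `outer`) and return what remains."""
    (o_lo, o_hi), (i_lo, i_hi) = outer, inner
    pieces = []
    if o_lo < i_lo:
        pieces.append((o_lo, i_lo - 1))
    if i_hi < o_hi:
        pieces.append((i_hi + 1, o_hi))
    return pieces


def section_sizes(lo: int, hi: int, p0: int, p1: int) -> List[int]:
    """Section sizes for one contiguous piece, applying the merge rule."""
    full, rest = divmod(hi - lo + 1, p0)
    sizes = [p0] * full
    if rest:
        if sizes and rest + p0 < p1:
            sizes[-1] += rest
        else:
            sizes.append(rest)
    return sizes


national = (ip("124.220.0.0"), ip("124.240.191.155"))
regional = (ip("124.226.0.0"), ip("124.227.255.255"))
pieces = subtract(national, regional)
sizes = [s for lo, hi in pieces for s in section_sizes(lo, hi, 20000, 30000)]
```

Under these assumptions the first piece contributes 19 full sections plus a 20th of 13216 addresses, and the second contributes 41 full sections plus a final one of 15484 addresses, for 62 in total. Neither short section is merged, because 13216+20000 and 15484+20000 both exceed 30000.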
3. The IP sections in each range are divided according to the method of the invention, finally yielding 3841 partitions. Each partition contains either 5000 IPs of the region (at most 8000), or 20000 IPs of other domestic regions (at most 30000), or one complete foreign IP section of unrestricted size.
With the partitioning method of the invention, in an actual project using the IP address as the partition key, the number of partitions of the table that hold data reached 2135, and the partition data coverage rate (the percentage of partitions holding data out of the total number of partitions) rose from 4.8% previously to 55.6%. With the table holding 30 million rows, a batch update of 30 million rows took only about 20 minutes, at least 33% less than with a table created using the prior art.
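The coverage rate defined above is a plain ratio, sketched here in Python with a hypothetical function name. The reported 55.6% is reproduced if the 3841 partitions of the worked example are taken as the total, which the text does not state explicitly and is therefore an assumption.

```python
def coverage_rate(populated: int, total: int) -> float:
    """Percentage of partitions that hold data out of all partitions."""
    if total <= 0 or not 0 <= populated <= total:
        raise ValueError("need 0 <= populated <= total and total > 0")
    return 100.0 * populated / total


rate = round(coverage_rate(2135, 3841), 1)
```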
Experiments show that the technical solution of the invention achieves adaptive, dynamic division of Range partition ranges in different environments, distributes data evenly across the partitions of the table, and optimizes data loading performance, the storage structure and query performance.
The embodiments disclosed above are intended to help in understanding and implementing the invention. The invention is not limited to what is disclosed in the preferred embodiment of this specification; the scope of protection is defined by the claims.

Claims (5)

1. A method for storing mass data through adaptive Range partitioning, comprising the steps of:
1) obtaining the range information of each level of the mass data and determining the number of levels n, where n is an integer greater than 2;
2) determining the total number of levels m to be partitioned, and setting the preferred number of values pi0 for a level-i partition and the maximum number of values pi1 allowed in any level-i partition section, where m is an integer in [1, n] and i is an integer in [(n-m+1), n];
3) automatically dividing the value range of level i evenly using pi0 as the standard to form partition sections; for the last section, judging whether its number of values plus pi0 is less than pi1; if so, merging it with the previous section into one section, otherwise keeping it as an independent section;
4) storing the mass data based on the partition sections obtained in step 3).
2. The method of claim 1, characterized in that: in step 1), the outermost value range corresponds to level 1, the next outermost value range to level 2, and so on, and the innermost value range used corresponds to level n.
3. The method of claim 1, characterized in that: pi1 is no more than 150% of pi0.
4. The method of claim 1, characterized in that: only the value ranges with higher level numbers are partitioned, to obtain a better data balancing effect.
5. The method of claim 1, characterized in that: the IP address is used as the partition key.
CN2013101305375A 2013-04-15 2013-04-15 Mass data storage method through self-adaptive Range partitions Pending CN103207835A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013101305375A CN103207835A (en) 2013-04-15 2013-04-15 Mass data storage method through self-adaptive Range partitions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013101305375A CN103207835A (en) 2013-04-15 2013-04-15 Mass data storage method through self-adaptive Range partitions

Publications (1)

Publication Number Publication Date
CN103207835A true CN103207835A (en) 2013-07-17

Family

ID=48755064

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013101305375A Pending CN103207835A (en) 2013-04-15 2013-04-15 Mass data storage method through self-adaptive Range partitions

Country Status (1)

Country Link
CN (1) CN103207835A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7774304B2 (en) * 2005-01-31 2010-08-10 International Business Machines Corporation Method, apparatus and program storage device for managing buffers during online reorganization
CN101145158A (en) * 2007-06-06 2008-03-19 中兴通讯股份有限公司 Data base table partition method
CN101572625A (en) * 2009-04-24 2009-11-04 北京锐安科技有限公司 IP partition method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103744617A (en) * 2013-12-20 2014-04-23 北京奇虎科技有限公司 Merging and compressing method and device for data files in key-value storage system
CN103744617B (en) * 2013-12-20 2016-09-28 北京奇虎科技有限公司 The merging compression method of a kind of key-value storage Data File and device
CN104156400A (en) * 2014-07-22 2014-11-19 中国科学院信息工程研究所 Storage method and device of mass network flow data
CN104156400B (en) * 2014-07-22 2017-07-11 中国科学院信息工程研究所 The storage method and device of a kind of mass network flow data
CN104461920A (en) * 2014-12-09 2015-03-25 杭州华为数字技术有限公司 Method and device for storing data
CN104461920B (en) * 2014-12-09 2019-04-12 杭州华为数字技术有限公司 A kind of method and device of storing data
CN105512268A (en) * 2015-12-03 2016-04-20 曙光信息产业(北京)有限公司 Data query method and device
CN105512268B (en) * 2015-12-03 2019-05-10 曙光信息产业(北京)有限公司 A kind of data query method and device

Similar Documents

Publication Publication Date Title
CN104348679B (en) A kind of methods, devices and systems of point of bucket test
CN102929989B (en) The load-balancing method of a kind of geographical spatial data on cloud computing platform
JP6243045B2 (en) Graph data query method and apparatus
CN101741907A (en) Method and system for balancing server load and main server
CN108897761A (en) A kind of clustering storage method and device
CN105354255A (en) Data query method and apparatus
CN103064890A (en) Global position system (GPS) mass data processing method
CN104657430A (en) Method and system for data acquisition
US10904107B2 (en) Service resource management system and method thereof
CN110362380A (en) A kind of multiple-objection optimization virtual machine deployment method in network-oriented target range
CN108268614B (en) Distributed management method for forest resource spatial data
CN103207835A (en) Mass data storage method through self-adaptive Range partitions
CN104021205A (en) Method and device for establishing microblog index
CN105897887A (en) Clouding computing-based remote sensing satellite big data processing system and method
CN106407191A (en) Data processing method and server
CN106897281B (en) Log fragmentation method and device
CN105138638A (en) Database distribution method based on application layer
CN109150964A (en) A kind of transportable data managing method and services migrating method
CN104461736B (en) Resource allocation and searching method, resource allocation and search system and Cloud Server
CN103365923A (en) Method and device for assessing partition schemes of database
US10482076B2 (en) Single level, multi-dimension, hash-based table partitioning
CN105991571B (en) A kind of information processing method and device
CN102855278B (en) A kind of emulation mode and system
CN107273443A (en) A kind of hybrid index method based on big data model metadata
CN104102557A (en) Cloud computing platform data backup method based on clustering

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130717