CN103207835A

CN103207835A - Mass data storage method through self-adaptive Range partitions

Info

Publication number: CN103207835A
Application number: CN2013101305375A
Authority: CN
Inventors: 张志远; 刘金晶
Original assignee: DONGGUAN MUNICIPAL PUBLIC SECURITY BUREAU; Beijing Ruian Technology Co Ltd
Current assignee: DONGGUAN MUNICIPAL PUBLIC SECURITY BUREAU; Beijing Ruian Technology Co Ltd
Priority date: 2013-04-15
Filing date: 2013-04-15
Publication date: 2013-07-17

Abstract

The invention relates to a method for storing mass data using adaptive Range partitions. The method comprises: acquiring the range information of the mass data at each level and determining the number of levels n, where n is an integer greater than 2; determining the total number of levels m to be partitioned, and setting, for the i-th level, a standard number of values pi0 and a maximum number of values pi1 allowed per partition; evenly dividing the value range of the i-th level into partitions using pi0 as the standard; judging whether the number of values in the last partition plus pi0 is smaller than pi1, and if so merging the last partition with the previous one, otherwise keeping the last partition as an independent partition; and finally storing the mass data according to the partitioning result. Here m is an integer from 1 to n, and i is an integer from (n-m+1) to n. The method is adaptive and can improve the storage and query performance of mass data.

Description

A method for storing mass data using adaptive Range partitioning

Technical field

The invention belongs to the field of computer databases and specifically relates to a method for storing mass data using adaptive Range partitioning, which can improve the storage and query performance of mass data.

Background art

In database applications, mass data is generally stored using partitioning. Partitioning splits a very large table into multiple smaller tables according to a certain rule and stores them in different areas. The table is thus a single table logically, but physically it is stored in different locations as if it were many tables, which simplifies database administration and can also improve application performance. Because the optimizer knows the value ranges used as the basis for partitioning, it can direct a query to only the specific partitions involved when accessing the table. Since only a small amount of data is scanned during the query, query performance improves naturally. At the same time, because the external interface is still a single table, partitioning is transparent to users and applications, which cannot perceive the existence of the partitions. Partitioning of large tables is therefore very widely used in mass data storage.

In Oracle, a table is partitioned by its "partition key". The partition key is one or more columns whose values determine the partition in which a given row is placed. The partitioning modes supported by Oracle 10g include Range partitioning, List partitioning, Hash partitioning and composite partitioning. Each type has its own characteristics and advantages.

Range partitioning is Oracle's earliest and most classic partitioning algorithm: data is stored in the partition that corresponds to the specified numeric range of the partition key value. With Range partitioning it is known which data resides in which partition, which makes large-scale data operations more convenient. The drawback of Range partitioning is uneven data distribution, which degrades data access performance.

List partitioning is similar to Range partitioning. The difference is that Range partitioning divides data by continuous ranges, whereas List partitioning divides data by discrete values of the records. It is suitable only when the partition key takes values from a finite set.

Hash partitioning is the opposite of Range partitioning: its biggest advantage is that data is distributed evenly across the partitions, so data access performance is excellent. Its drawbacks are that it is not suitable for bulk data management operations and that its range query performance is not as good as that of Range partitioning.

Composite partitioning is multi-level partitioning, for example applying Range partitioning first and then Hash partitioning within each partition. The advantages and disadvantages of the method chosen for the first level still apply, and the maintenance cost is higher.

In practical applications the following situation arises: a table holds a very large amount of data, range queries and update operations on a certain key are very frequent, and the value range of that key column is known. Range partitioning on this key is therefore chosen to distribute the data. However, the data is unevenly distributed over the key values, and for different deployment environments the initial distribution of data over the key values cannot be known in advance. In this case, if a fixed partition span is used, the data will be distributed unevenly under different deployment environments, and the advantages of partitioning are not realized. If the span is set too small, the partitions that hold data are relatively balanced, but partition coverage is too low; if the span is set too large, partition coverage is comparatively high, but the data is unbalanced. Either way, the point of using partitions is lost.

Summary of the invention

To address the above technical problem, the invention proposes a method for storing mass data using adaptive Range partitioning. The method can intelligently generate the value ranges of the Range partitions, so that data is evenly distributed across the partitions whatever environment the system is deployed in, realizing the advantages of large-table partitioning in mass data processing.

To achieve the above object, the invention adopts the following technical solution:

A method for storing mass data using adaptive Range partitioning, as shown in Fig. 1, comprising the following steps:

1) obtaining the range information of the mass data at each level and determining the number of levels n, where n is an integer greater than 2;

2) determining the total number of levels m to be partitioned, and setting the preferred number of values pi0 for partitions of level i and the maximum number of values pi1 allowed in each partition segment of level i, where m is an integer in [1, n] and i is an integer in [(n-m+1), n];

3) automatically and evenly dividing the value range of level i into partitions using pi0 as the standard; for the last segment, judging whether its number of values plus pi0 is less than pi1, and if so merging it with the previous segment into one segment, otherwise keeping it as an independent segment;

4) storing the mass data based on the partitioning result of step 3).

In step 1), the outermost value range corresponds to level 1, the next outermost value range corresponds to level 2, and so on; the value range finally used corresponds to level n. Because data may fall anywhere in the total value range, taking the total value range as the first level is the only possible choice. Otherwise, if the value of some record does not fall in any partition, that record cannot be written to the table.

Preferably, pi1 is not more than 150% of pi0.

Preferably, only the ranges with higher level numbers are partitioned, to obtain a better data-balancing effect.

In the Range partitioning method of the invention, the value range of level i is divided evenly using the configured pi0 as the standard, and the number of values in a non-standard segment is at most pi1. Take a level-i value range as an example and suppose it consists of two ranges, { [V1, V2], [V3, V4] }, with partitioning parameters pi0 and pi1. pi0 means that, starting from V1, [V1, V2] and [V3, V4] are divided into consecutive segments of pi0 values each. pi1 means that, when the last segment is cut off, it is merged into the preceding segment if it contains too few values. The resulting non-standard segment, however, may not contain more than pi1 values; if it would, no merge takes place, and the final segment then contains fewer than pi0 values.
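The division rule described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: the function names `split_range` and `split_ranges` are hypothetical, values are modelled as integers, ranges are inclusive, and each discontinuous range is divided on its own, with the merge test taken from step 3) (remainder + pi0 < pi1).

```python
def split_range(start, end, p0, p1):
    """Split the inclusive integer range [start, end] into segments of p0 values.

    The trailing remainder is merged into the previous segment only when
    remainder + p0 < p1 (the test of step 3); otherwise it stays separate.
    Returns a list of inclusive (first, last) pairs.
    """
    if p0 <= 0 or end < start:
        raise ValueError("need p0 > 0 and start <= end")
    segments = []
    lo = start
    while lo <= end:
        hi = min(lo + p0 - 1, end)
        segments.append((lo, hi))
        lo = hi + 1
    if len(segments) >= 2:
        last_lo, last_hi = segments[-1]
        remainder = last_hi - last_lo + 1
        # Merge only a short tail, and only if the merged size stays below p1.
        if remainder < p0 and remainder + p0 < p1:
            prev_lo, _ = segments[-2]
            segments[-2:] = [(prev_lo, last_hi)]
    return segments


def split_ranges(ranges, p0, p1):
    """Apply split_range to each disjoint range, e.g. [(V1, V2), (V3, V4)]."""
    return [seg for lo, hi in ranges for seg in split_range(lo, hi, p0, p1)]


# Hypothetical toy values: p0 = 10, p1 = 16.
merged = split_range(0, 24, 10, 16)      # tail of 5 values is merged
separate = split_range(0, 27, 10, 16)    # tail of 8 values stays separate
both = split_ranges([(0, 24), (100, 109)], 10, 16)
```

With the toy parameters, a tail of 5 values is merged because 5 + 10 < 16, whereas a tail of 8 values is kept as its own segment because 8 + 10 is not below 16.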

The invention has the following advantages and positive effects:

1. The partition ranges obtained by the method are generated automatically from the actual data of each deployment environment. The invention can therefore distribute data as evenly as possible across the partitions, is adaptive, and fully realizes the advantages of partitioning.

2. The invention divides the value range into multiple levels. According to the actual data of each environment, the value ranges where data is concentrated are divided using the standard parameters, while value ranges with relatively little data can be divided at a coarser granularity through the corresponding parameters. As few near-empty partitions as possible are created, each partition holds as much data as possible, the total number of partitions is reduced, the data on the partitions is balanced, and range queries fall on as few partitions as possible, which improves the query performance and DML performance on mass data.

Description of drawings

Fig. 1 is a flow chart of the steps of the method of the invention.

Detailed description of the embodiments

To help those skilled in the art better understand the invention, it is described in further detail below using certain data of a certain operator as an example. Since the key point of the mass data storage method is the implementation of adaptive Range partitioning, the description below focuses on the partitioning process. The partition key of the table is the IP address.

1) Dividing into levels.

In this example, the total value range of the partition key is all possible IP values, i.e. the worldwide IP address range. The IP range is divided level by level from the worldwide range down to successive localities, giving n levels in total. The level number reflects how detailed the IP address information is: the larger n is, the more precisely IP addresses can be classified. For example, the worldwide IP range 0.0.0.0～255.255.255.255 is level 1; the IP ranges of China are a subset of the worldwide ranges and are level 2; the IP ranges of a given province or municipality are a subset of the Chinese ranges and are level 3; the IP ranges of an area within the province are a subset of the provincial ranges and are level 4; and so on.
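The nesting of levels can be illustrated by treating each IPv4 range as an inclusive pair of integers. The sketch below uses Python's standard `ipaddress` module; the helper names `ip_range`, `count` and `contains` are hypothetical, and the level-2 and level-3 ranges are the sample segments quoted in the worked example later in this description.

```python
import ipaddress


def ip_range(first, last):
    """Return an inclusive (first, last) pair of integers for two dotted IPv4 strings."""
    return int(ipaddress.IPv4Address(first)), int(ipaddress.IPv4Address(last))


def count(r):
    """Number of addresses in an inclusive range."""
    return r[1] - r[0] + 1


def contains(outer, inner):
    """True if the inner range lies entirely within the outer range."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


level1 = ip_range("0.0.0.0", "255.255.255.255")      # worldwide range
level2 = ip_range("124.220.0.0", "124.240.191.155")  # one Chinese segment
level3 = ip_range("124.226.0.0", "124.227.255.255")  # one regional segment

# Each level is a subset of the level above it.
nested = contains(level1, level2) and contains(level2, level3)
```

Representing addresses as integers makes "number of values in a range" a simple subtraction, which is the quantity that pi0 and pi1 are compared against.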

The number of levels is decided according to how IP addresses are used in the specific deployment area. If the system is deployed in Beijing, there are two ways to set the number of levels: obtain the IP address ranges of both China and Beijing, giving 3 levels; or treat everything outside Beijing as one range and obtain only the Beijing IP address ranges, giving 2 levels. When the system is deployed in Beijing, the table inevitably stores the most data for Beijing, relatively less for other provinces and municipalities in China, and less still for foreign addresses. If 2 levels are chosen, the IPs of other Chinese provinces and municipalities can only be allocated together with foreign IPs, which makes the domestic data unevenly distributed. Therefore 3 levels are chosen: the IP address ranges of China and Beijing are imported separately, and Beijing IPs and the IPs of other Chinese provinces and municipalities are partitioned with different standards. In this way the IPs of the other provinces and municipalities also have an allocation rule of their own, which relatively alleviates the imbalance of the data distribution.

2) Value ranges: the basis of partitioning.

Partitioning at any chosen level necessarily involves a value range. 0.0.0.0～255.255.255.255, for example, is also a range; it is simply the total worldwide IP range. The IP ranges used are determined by the number of levels n. In this example, if three IP ranges are used for the division (the total IP range, the IP range of China and the IP range of Beijing), the parameter is 3. If two IP ranges are sufficient, the parameter is 2, and the IP ranges taking part in the division are the total IP range and the Beijing IP range.

3) Setting the number of values contained in each segment.

The parameters for the number of values per segment are [pi0, pi1; pn0, pn1], where i is the level number of the value range. This group of parameters represents the number of values contained in a data-carrying partition and reflects how finely the range is partitioned. pi0 ((n-m+1) ≤ i ≤ n) is the preferred number of values in a level-i partition and serves as the standard for dividing the value range. Because there are n levels in total and not every level necessarily needs to be divided, m denotes the total number of levels to be divided. For example, n=5 and m=3 means there are 5 levels in total but only levels 3, 4 and 5 are divided; the outermost and second-outermost levels are not.

The larger pi0 is, the more values each partition contains and the more data each partition can hold, but the total number of segments decreases accordingly. The smaller pi0 is, the fewer values each partition contains and the less data it holds, but the total number of segments increases accordingly. pi1 is the maximum number of values allowed in a level-i partition. This parameter mainly prevents the last segment cut from a value range from becoming a separate segment when its number of values falls far short of pi0. With this parameter, even though the number of values after merging the last small segment with the preceding one exceeds the standard pi0, the last segment is not kept as a separate segment but is merged with the preceding one, provided the merged count does not exceed pi1, the configured maximum. The initial value of pi0 is set according to the deployment environment. In general, the data in a partition should preferably not exceed the order of one million records, so pi0 is determined from an estimate of the expected data. Because pi1 acts only on the final segment of each initial value range, it is applied relatively rarely, and pi1 is generally set to no more than 150% of pi0. The specific values can be set according to the actual data and the developer's experience.

4) Determining the levels to be partitioned.

In practice, not every level needs to be partitioned. In general, the higher the level number of the range being partitioned, the better the data balance after partitioning.

For example, the IP ranges of China and Beijing are known, each consisting of multiple non-overlapping, discontinuous segments, i.e. n=3. Let m=2; the IP ranges of China and Beijing are then divided according to [p20, p21; p30, p31]. Here p20=20000, p21=30000, p30=5000 and p31=8000. Domestic IPs are divided using a standard of 20000 IPs per segment and at most 30000, Beijing IPs are divided using a standard of 5000 IPs per segment and at most 8000, and foreign IP ranges are not subdivided further.
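The relationship between n, m and the per-level parameters can be written down as a small sketch. The names `levels_to_partition`, `exceeds_preferred_ratio` and `params` are hypothetical and only illustrate the text; the 150% figure is treated as a soft guideline, since the description says pi1 is "generally" no more than 150% of pi0.

```python
def levels_to_partition(n, m):
    """Return the level numbers i with (n - m + 1) <= i <= n."""
    if not 1 <= m <= n:
        raise ValueError("m must be an integer in [1, n]")
    return list(range(n - m + 1, n + 1))


def exceeds_preferred_ratio(p0, p1, ratio=1.5):
    """True if p1 is above the preferred upper bound of ratio * p0."""
    return p1 > ratio * p0


# Parameters from the example: level -> (pi0, pi1).
params = {2: (20000, 30000), 3: (5000, 8000)}

beijing_levels = levels_to_partition(3, 2)   # levels 2 and 3 are divided
generic_levels = levels_to_partition(5, 3)   # levels 3, 4 and 5 are divided
flags = {level: exceeds_preferred_ratio(p0, p1) for level, (p0, p1) in params.items()}
```

With these numbers, the level-2 pair sits exactly at 150%, while the level-3 pair (5000, 8000) is at 160%, slightly above the preferred bound, which is consistent with the bound being a recommendation rather than a hard limit.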

An example of partitioning the IP ranges of a certain region is given below:

Three IP ranges are used: the worldwide IP range, the IP range of China and the IP range of the region, so n=3. The Chinese IP range consists of 371 discontinuous segments, and the IP range of the province consists of 27 discontinuous segments. The standard and maximum numbers of IPs per segment are set to [20000, 30000; 5000, 8000].

One IP segment is taken from each of the regional and national levels to illustrate the partitioning: one of the Chinese IP segments is 124.220.0.0～124.240.191.155, and one of the regional IP segments is 124.226.0.0～124.227.255.255.

The detailed partitioning steps are as follows:

1. The regional IP segment 124.226.0.0～124.227.255.255 is divided into small segments according to the standard [5000, 8000], as shown in Table 1:

Table 1. Partitioning of the IP segment 124.226.0.0～124.227.255.255

No.	Start IP	End IP	Number of IPs in segment
1	124.226.0.0	124.226.19.135	5000
2	124.226.19.136	124.226.39.15	5000
3	124.226.39.16	…	5000
…	…	…	5000
26	124.227.232.72	124.227.251.207	5000
27	124.227.251.208	124.227.255.255	1072

As Table 1 shows, under the standard [5000, 8000] the IP segment is first divided evenly into 26 small segments of 5000 IPs each, leaving a 27th segment of only 1072 IPs, far fewer than the standard 5000. Since 1072+5000=6072 is less than the maximum of 8000 IPs, the 27th segment is merged into the 26th. The IP segment is therefore finally divided into 26 segments: the first 25 each contain 5000 IPs, and the last, the 26th, contains 6072 IPs.
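The figures in Table 1 can be recomputed with a short sketch. The function `split_range` is a hypothetical helper that applies the rule of step 3) to one inclusive integer range; address conversion uses Python's standard `ipaddress` module. It is offered as a check of the arithmetic, not as the implementation of the invention.

```python
import ipaddress


def split_range(start, end, p0, p1):
    """Cut [start, end] into segments of p0 values; merge a short tail if tail + p0 < p1."""
    segments = []
    lo = start
    while lo <= end:
        hi = min(lo + p0 - 1, end)
        segments.append((lo, hi))
        lo = hi + 1
    if len(segments) >= 2:
        last_lo, last_hi = segments[-1]
        tail = last_hi - last_lo + 1
        if tail < p0 and tail + p0 < p1:
            segments[-2:] = [(segments[-2][0], last_hi)]
    return segments


def describe(segment):
    """Return (start IP, end IP, number of IPs) for an integer segment."""
    lo, hi = segment
    return str(ipaddress.IPv4Address(lo)), str(ipaddress.IPv4Address(hi)), hi - lo + 1


first = int(ipaddress.IPv4Address("124.226.0.0"))
last = int(ipaddress.IPv4Address("124.227.255.255"))
segments = split_range(first, last, 5000, 8000)

for number, segment in enumerate(segments, start=1):
    print(number, *describe(segment))
```

The range holds 131072 addresses, i.e. 26 full segments of 5000 plus a tail of 1072; after the merge the sketch yields 26 segments, the last one running from 124.227.232.72 to 124.227.255.255 with 6072 addresses.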

2. The Chinese IP segment 124.220.0.0～124.240.191.155 is divided according to the standard [20000, 30000]. Because the IPs handled in step 1 lie within this segment, the segment is in effect cut by step 1 into two discontinuous IP segments, 124.220.0.0～124.225.255.255 and 124.228.0.0～124.240.191.155, as shown in Table 2:

Table 2. Partitioning of the IP segment 124.220.0.0～124.240.191.155

As Table 2 shows, this large IP segment is finally split into 62 small segments; except for the 20th and the 62nd, each contains 20000 IPs. If either of these two special segments were merged with the adjacent preceding segment, the limit of at most 30000 IPs would be exceeded, so they are not merged and each remains an independent segment.
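The count of 62 segments can be checked by removing the regional segment from the national one and dividing each remaining piece. The helpers `to_int`, `subtract` and `split_sizes` below are hypothetical names, and the sketch assumes the inner range lies entirely inside the outer one; it recomputes only segment sizes, not the rows of Table 2.

```python
import ipaddress


def to_int(ip):
    return int(ipaddress.IPv4Address(ip))


def subtract(outer, inner):
    """Remove an inner inclusive range from an outer one; return the remaining pieces."""
    pieces = []
    if outer[0] < inner[0]:
        pieces.append((outer[0], inner[0] - 1))
    if inner[1] < outer[1]:
        pieces.append((inner[1] + 1, outer[1]))
    return pieces


def split_sizes(size, p0, p1):
    """Sizes of the segments for a range of `size` values under the rule of step 3)."""
    full, tail = divmod(size, p0)
    sizes = [p0] * full
    if tail:
        if full and tail + p0 < p1:
            sizes[-1] += tail        # short tail merged into the previous segment
        else:
            sizes.append(tail)       # tail kept as an independent segment
    return sizes


national = (to_int("124.220.0.0"), to_int("124.240.191.155"))
regional = (to_int("124.226.0.0"), to_int("124.227.255.255"))

sizes = []
for lo, hi in subtract(national, regional):
    sizes.extend(split_sizes(hi - lo + 1, 20000, 30000))
```

The two pieces hold 393216 and 835484 addresses. The first gives 19 full segments and a tail of 13216; the second gives 41 full segments and a tail of 15484. Neither tail can be merged, because adding 20000 would exceed 30000, so the total is 20 + 42 = 62 segments, with the 20th and 62nd being the short ones.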

3. The IP segments in each range are divided according to the method of the invention, finally yielding 3841 partitions. Each partition contains either 5000 IPs of the region (at most 8000), or 20000 IPs of other domestic regions (at most 30000), or one complete foreign IP segment of unrestricted size.
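Once the partition boundaries are known, routing a record to its partition is a sorted-boundary lookup. The sketch below is an assumption about how such a lookup could be done, using Python's standard `bisect` module; the function `find_partition` and the short boundary list are hypothetical, with the lower bounds taken from the first rows of Table 1 and a catch-all partition starting at 0.0.0.0 so that every address falls in some partition.

```python
import bisect
import ipaddress


def to_int(ip):
    return int(ipaddress.IPv4Address(ip))


def find_partition(starts, ip):
    """Return the index of the partition whose lower bound is the largest one <= ip.

    `starts` is the sorted list of integer lower bounds of consecutive partitions;
    partition k covers starts[k] up to starts[k + 1] - 1.
    """
    return bisect.bisect_right(starts, to_int(ip)) - 1


# Hypothetical boundary list: a catch-all partition followed by three Table 1 segments.
starts = [
    to_int("0.0.0.0"),
    to_int("124.226.0.0"),
    to_int("124.226.19.136"),
    to_int("124.226.39.16"),
]

print(find_partition(starts, "124.226.19.135"))  # last address of Table 1 segment 1
print(find_partition(starts, "124.226.19.136"))  # first address of Table 1 segment 2
```

Keeping the outermost range as a partition of its own mirrors the requirement that the total value range be level 1, so that no record is left without a partition.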

With the partitioning method of the invention, in an actual project using IP as the partition key, the number of partitions of the table that hold data reached 2135, and the partition data coverage (the percentage of partitions holding data out of the total number of partitions) rose from the previous 4.8% to 55.6%. With 30 million records in the table, a batch update of 30 million records took only about 20 minutes, at least 33% less than with a table created using the prior art, which improved speed.

Experiments show that the technical solution of the invention achieves adaptive, dynamic range division for Range partitioning in different environments, distributes data evenly across the partitions of the table, and optimizes data loading performance, the storage structure and query performance.

The purpose of the embodiments disclosed above is to help understand the content of the invention and to implement it accordingly. The invention should not be limited to the content disclosed in the preferred embodiment of this specification; the scope of protection claimed is defined by the claims.
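The coverage figure follows directly from the partition counts given in the text. The helper name `coverage_percent` below is hypothetical; the sketch only restates the definition in parentheses above.

```python
def coverage_percent(partitions_with_data, total_partitions):
    """Percentage of partitions that hold data."""
    return 100.0 * partitions_with_data / total_partitions


# 2135 partitions hold data out of the 3841 partitions obtained in step 3.
coverage = round(coverage_percent(2135, 3841), 1)
print(coverage)
```

Rounded to one decimal place this gives 55.6, matching the reported coverage.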

Claims

1. A method for storing mass data using adaptive Range partitioning, comprising the steps of:

1) obtaining the range information of the mass data at each level and determining the number of levels n, where n is an integer greater than 2;

2. The method of claim 1, characterized in that: in step 1), the outermost value range corresponds to level 1, the next outermost value range corresponds to level 2, and so on, and the value range finally used corresponds to level n.

3. The method of claim 1, characterized in that: pi1 is not more than 150% of pi0.

4. The method of claim 1, characterized in that: only the ranges with higher level numbers are partitioned, to obtain a better data-balancing effect.

5. The method of claim 1, characterized in that: the IP address is used as the partition key.