CN101566930B - Virtual disk drive system and method - Google Patents

Virtual disk drive system and method

Info

Publication number
CN101566930B
CN101566930B, CN200910004737A
Authority
CN
China
Prior art keywords
data
disk
raid
page
volume
Prior art date
Legal status
Active
Application number
CN 200910004737
Other languages
Chinese (zh)
Other versions
CN101566930A (en)
Inventor
P. E. Soran
J. P. Guider
L. E. Aszmann
M. J. Klemm
Current Assignee
DELL International Ltd
Original Assignee
Compellent Technologies Inc
Priority date
Filing date
Publication date
Application filed by Compellent Technologies Inc
Priority claimed from US 10/918,329 (US 7,613,945 B2)
Publication of CN101566930A
Application granted
Publication of CN101566930B

Abstract

A disk drive system and method capable of dynamically allocating data is provided. The disk drive system may include a RAID subsystem having a pool of storage, for example a page pool of storage that maintains a free list of RAIDs, or a matrix of disk storage blocks that maintains a null list of RAIDs, and a disk manager having at least one disk storage system controller. The RAID subsystem and disk manager dynamically allocate data across the pool of storage and a plurality of disk drives based on RAID-to-disk mapping. The RAID subsystem and disk manager determine whether additional disk drives are required, and a notification is sent if additional disk drives are required. Dynamic data allocation and data progression allow a user to acquire a disk drive later in time, when it is needed. Dynamic data allocation also allows efficient data storage of snapshots/point-in-time copies of the virtual volume pool of storage, instant data replay and instant data fusion for data backup, recovery, etc., remote data storage, and data progression.

Description

Virtual disk drive system and method
This application is a divisional application of the patent application entitled "Virtual disk drive system and method," having an international filing date of August 13, 2004, international application number PCT/US2004/026499, and Chinese national application number 200480026308.8.
Technical field
The present invention relates generally to disk drive systems and methods, and in particular to disk drive systems designed with capabilities such as dynamic data allocation and disk drive virtualization.
Background art
Existing disk drive systems are designed in such a way that a virtual volume of data storage space is statically associated with physical disks of specific size and location for storing data. These disk drive systems need to know and monitor/control the exact devices and sizes of the virtual volumes of data storage space used to store data. In addition, such systems often require a large data storage space up front, so that more RAID devices are added. Yet these additional RAID devices are usually expensive and are not needed before the extra data storage space is actually required.
Fig. 13A shows an existing disk drive system that includes a virtual volume of data storage space associated with physical disks of specific size and location for storing, reading/writing, and/or recovering data. The disk drive system statically allocates data based on the specific location and size of the virtual volume of the data storage space. As a result, emptied data storage space goes unused, and additional, sometimes expensive, data storage devices, e.g. RAID devices, are acquired in advance for storing, reading/writing, and/or recovering data in the system, even though the extra data storage space is not needed and/or used until much later.
Accordingly, there is a need for an improved disk drive system and method. There is also a need for an efficient, dynamic data allocation and disk drive space and time management system and method.
Summary of the invention
The present invention provides an improved disk drive system and method capable of dynamically allocating data. The disk drive system may include a RAID subsystem having a matrix of disk storage blocks and a disk manager having at least one disk storage system controller. The RAID subsystem and disk manager dynamically allocate data across the matrix of disk storage blocks and a plurality of disk drives based on RAID-to-disk mapping. The RAID subsystem and disk manager determine whether additional disk drives are required, and a notification is sent if additional disk drives are required. Dynamic data allocation allows a user to acquire a disk drive later in time, when it is needed. Dynamic data allocation also allows efficient data storage of snapshots/point-in-time copies of the virtual volume matrix or pool of disk storage blocks, instant data replay and instant data fusion for data backup, recovery, etc., remote data storage, and data progression. Because less expensive disk drives can be purchased later in time, data progression also allows deferring the purchase of less expensive disk drives.
In one embodiment, a matrix or pool of virtual volumes or disk storage blocks is provided and associated with physical disks. The matrix or pool of virtual volumes or disk storage blocks is dynamically monitored/controlled by a plurality of disk storage system controllers. In one embodiment, the size of each virtual volume can be a default or can be predefined by the user, and the location of each virtual volume defaults to null. Before data is allocated, the virtual volumes are empty. Data can be allocated in any grid of the matrix or pool (for example, once data is allocated in a grid, it becomes a "dot" in that grid). Once the data is deleted, the virtual volume is available again and is designated as "null". Therefore, additional and sometimes expensive disk storage devices, e.g. RAID devices, can be acquired later in time on an as-needed basis.
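By way of illustration only (this sketch is not part of the original disclosure), the following Python fragment models the allocate-on-demand behavior described above; the class name, the 2 MB default block size, and the error message are assumptions.

```python
class VirtualVolumePool:
    """Simplified model of a pool/matrix of disk storage blocks that start out empty."""

    def __init__(self, num_blocks, block_size_mb=2):
        self.block_size_mb = block_size_mb
        self.blocks = [None] * num_blocks        # every block defaults to null/empty
        self.free = list(range(num_blocks))      # free list of unallocated blocks

    def allocate(self, data):
        """Physical space is consumed only when data is actually written."""
        if not self.free:
            raise RuntimeError("pool exhausted: additional disk drives required")
        idx = self.free.pop()
        self.blocks[idx] = data                  # the grid cell becomes a "dot"
        return idx

    def delete(self, idx):
        """Deleting data returns the block to the free list, i.e. marks it "null"."""
        self.blocks[idx] = None
        self.free.append(idx)
```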
In one embodiment, the disk manager may manage a plurality of disk storage system controllers, and a plurality of redundant disk storage system controllers can be implemented to cover the failure of an operating disk storage system controller.
In one embodiment, the RAID subsystem includes a combination of at least one of each RAID type, such as RAID-0, RAID-1, RAID-5, and RAID-10. It is appreciated that other RAID types, such as RAID-3, RAID-4, RAID-6, and RAID-7, can be used in alternative RAID subsystems.
The present invention also provides a dynamic data allocation method which includes the following steps: providing a default size of a logical block or disk storage block such that the disk space of the RAID subsystem forms a matrix of disk storage blocks; writing data and allocating data in the matrix of disk storage blocks; determining the occupancy rate of the disk space of the RAID subsystem based on the historical occupancy rate of the disk space of the RAID subsystem; determining whether additional disk drives are needed; and sending a notification to the RAID subsystem if additional disk drives are needed. In one embodiment, the notification is sent via e-mail.
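A minimal sketch of such an occupancy check, not taken from the original disclosure: the threshold, the growth estimate, and the notification text are assumptions, and `send_email` stands in for whatever notification mechanism is used.

```python
def check_occupancy(used_blocks, total_blocks, history, threshold=0.80):
    """Record the current occupancy and decide whether more drives should be requested."""
    occupancy = used_blocks / total_blocks
    history.append(occupancy)                     # historical occupancy record
    growth = history[-1] - history[0] if len(history) > 1 else 0.0
    return occupancy >= threshold or occupancy + growth >= 1.0

def notify_if_needed(used_blocks, total_blocks, history, send_email):
    if check_occupancy(used_blocks, total_blocks, history):
        send_email("RAID subsystem nearing capacity: additional disk drives needed")
```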
One of the advantages of the disk drive system of the present invention is that the RAID subsystem is capable of employing RAID techniques across a virtual number of disks. The remaining storage space is freely available. By monitoring the storage space and determining the occupancy rate of the storage space of the RAID subsystem, the user does not have to acquire a large sum of drives that are expensive but have no use at the time of purchase. Thus, adding drives when they are actually needed to satisfy the increasing demand for storage space significantly reduces the overall cost of the disk drives. Meanwhile, the efficiency of disk usage is substantially improved.
Another advantage of the present invention is that the disk storage system controller is universal to any computer file system, not just to a particular computer file system.
The present invention also provides a method of data instant replay. In one embodiment, the data instant replay method includes the following steps: providing a default size of a logical block or disk storage block such that the disk space of the RAID subsystem forms a page pool of storage or a matrix of disk storage blocks; automatically generating a snapshot of the volumes of the page pool of storage or a snapshot of the matrix of disk storage blocks at predetermined time intervals; and storing an allocation index of the snapshots or deltas of the page pool of storage or the matrix of disk storage blocks, such that the snapshots or deltas of the matrix of disk storage blocks can be instantly located via the stored allocation index.
The data instant replay method automatically generates snapshots of the RAID subsystem at user-defined time intervals, user-configured dynamic time stamps (for example, every few minutes or hours, etc.), or times directed by the server. In case of a system failure or virus attack, these time-stamped virtual snapshots allow data instant replay and data instant recovery in a matter of a few minutes or hours, etc. The technique is also referred to as instant replay fusion, i.e., the data shortly before the crash or attack is fused in time, and the snapshots stored before the crash or attack can be instantly used for future operation.
In one embodiment, the snapshots can be stored at a local RAID subsystem or at a remote RAID subsystem, such that if a major system crash occurs, for example due to a terrorist attack, the integrity of the data is not affected, and the data can be instantly recovered.
Another advantage of the data instant replay method is that the snapshots can be used for testing while the system remains in operation. Live data can be used for real-time testing.
The present invention also provides a system of data instant replay including a RAID subsystem and a disk manager having at least one disk storage system controller. In one embodiment, the RAID subsystem and disk manager automatically allocate data across disk space of a plurality of disk drives based on RAID-to-disk mapping, where the disk space of the RAID subsystem forms a matrix of disk storage blocks. The disk storage system controller automatically generates a snapshot of the matrix of disk storage blocks at predetermined time intervals and stores an allocation index of the snapshots or deltas of the matrix of disk storage blocks, such that the snapshots or deltas of the matrix of disk storage blocks can be instantly located via the stored allocation index.
In one embodiment, the disk storage system controller monitors the frequency of data use from the snapshots of the matrix of disk storage blocks and applies an aging rule such that less frequently used or accessed data is moved to a less expensive RAID subsystem. Similarly, when data located in the less expensive RAID subsystem begins to be used more frequently, the controller moves the data to a more expensive RAID subsystem. Thereby, the user is able to choose a desired RAID subsystem portfolio to meet its own storage needs. As a result, the cost of the disk drive system can be significantly reduced and dynamically controlled by the user.
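For illustration only (not part of the original disclosure), one possible aging pass could look like the sketch below; the tier names, thresholds, and per-page access counters are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Page:
    tier: str               # e.g. "raid10_expensive" or "raid5_cheap" (assumed names)
    access_count: int = 0   # accesses observed since the last aging pass

def age_pages(pages, hot=100, cold=5):
    """One aging pass: demote rarely used pages, promote frequently used ones."""
    for p in pages:
        if p.tier == "raid10_expensive" and p.access_count < cold:
            p.tier = "raid5_cheap"          # less-used data moves to cheaper RAID
        elif p.tier == "raid5_cheap" and p.access_count > hot:
            p.tier = "raid10_expensive"     # frequently used data moves back
        p.access_count = 0                  # reset counters for the next period
```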
These and other features and advantages of the present invention will become apparent to those skilled in the art from the following detailed description, wherein illustrative embodiments of the invention are shown and described, including the best mode contemplated for carrying out the invention. It will be realized that the invention is capable of modification in various obvious respects, all without departing from the spirit and scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not restrictive.
Description of drawings
Fig. 1 illustrates one embodiment of a disk drive system in a computer environment in accordance with the principles of the present invention.
Fig. 2 illustrates one embodiment of dynamic data allocation of a page pool of storage of a RAID subsystem of disk drives in accordance with the principles of the present invention.
Fig. 2A illustrates conventional data allocation in a RAID subsystem of a disk drive system.
Fig. 2B illustrates data allocation in a RAID subsystem of a disk drive system in accordance with the principles of the present invention.
Fig. 2C illustrates a dynamic data allocation method in accordance with the principles of the present invention.
Figs. 3A and 3B are schematic views of snapshots of disk storage blocks of a RAID subsystem at a plurality of time intervals in accordance with the principles of the present invention.
Fig. 4 is a schematic view of a data instant fusion function using snapshots of disk storage blocks of a RAID subsystem in accordance with the principles of the present invention.
Fig. 5 is a schematic view of a local-remote data replication and instant replay function using snapshots of disk storage blocks of a RAID subsystem in accordance with the principles of the present invention.
Fig. 6 is a schematic view of performing I/O with the same RAID interface and concatenating a plurality of RAID devices into a snapshot volume in accordance with the principles of the present invention.
Fig. 7 illustrates one embodiment of a snapshot structure in accordance with the principles of the present invention.
Fig. 8 illustrates one embodiment of a PITC life cycle in accordance with the principles of the present invention.
Fig. 9 illustrates one embodiment of a PITC table structure having a multi-level index in accordance with the principles of the present invention.
Fig. 10 illustrates one embodiment of recovery of a PITC table in accordance with the principles of the present invention.
Fig. 11 illustrates one embodiment of a write process with owned page and non-owned page sequences in accordance with the principles of the present invention.
Fig. 12 illustrates exemplary snapshot operations in accordance with the principles of the present invention.
Fig. 13A illustrates an existing disk drive system for statically allocating data, having a virtual data storage space associated with physical disks of specific size and location.
Fig. 13B illustrates volume logical block mapping in the existing disk drive system of Fig. 13A.
Fig. 14A illustrates one embodiment of a disk drive system having a disk storage block virtual volume matrix for dynamically allocating data in the system in accordance with the principles of the present invention.
Fig. 14B illustrates one embodiment of dynamic data allocation in the disk storage block virtual volume matrix as shown in Fig. 14A.
Fig. 14C illustrates a schematic view of volume-RAID page remapping of one embodiment of a virtual volume page pool of storage in accordance with the principles of the present invention.
Fig. 15 illustrates an example of three disk drives mapped to a plurality of disk storage blocks of a RAID subsystem in accordance with the principles of the present invention.
Fig. 16 illustrates an example of remapping of disk drive storage blocks after a disk drive is added to the three disk drives as shown in Fig. 15.
Fig. 17 illustrates one embodiment of accessible data pages in a data progression operation in accordance with the principles of the present invention.
Fig. 18 illustrates a flow chart of one embodiment of a data progression operation in accordance with the principles of the present invention.
Fig. 19 illustrates one embodiment of a compressed page layout in accordance with the principles of the present invention.
Fig. 20 illustrates one embodiment of data progression in a high-end disk drive system in accordance with the principles of the present invention.
Fig. 21 illustrates one embodiment of external data flow in a subsystem in accordance with the principles of the present invention.
Fig. 22 illustrates one embodiment of internal data flow in a subsystem.
Fig. 23 illustrates one embodiment in which each subsystem maintains coherency independently.
Fig. 24 illustrates one embodiment of mixed RAID waterfall data progression in accordance with the principles of the present invention.
Fig. 25 illustrates one embodiment of multiple free lists of a page pool of storage in accordance with the principles of the present invention.
Fig. 26 illustrates one embodiment of a database example in accordance with the principles of the present invention.
Fig. 27 illustrates one embodiment of an MRI image example in accordance with the principles of the present invention.
Embodiment
The present invention provides an improved disk drive system and method capable of dynamically allocating data. The disk drive system may include a RAID subsystem having a page pool of storage that maintains a free list of RAIDs, or a matrix of disk storage blocks, and a disk manager having at least one disk storage system controller. The RAID subsystem and disk manager dynamically allocate data across the page pool of storage or the matrix of disk storage blocks and a plurality of disk drives based on RAID-to-disk mapping. The RAID subsystem and disk manager determine whether additional disk drives are required, and a notification is sent if additional disk drives are required. Dynamic data allocation allows a user to acquire a disk drive later in time, when it is needed. Dynamic data allocation also allows efficient data storage of snapshots/point-in-time copies of the virtual volume matrix or pool of disk storage blocks, instant data replay and instant data fusion for data backup, recovery, etc., remote data storage, and data progression. Because less expensive disk drives can be purchased later in time, data progression also allows deferring the purchase of less expensive disk drives.
Fig. 1 illustrates one embodiment of a disk drive system 100 in a computer environment 102 in accordance with the principles of the present invention. As shown in Fig. 1, the disk drive system 100 includes a RAID subsystem 104 and a disk manager 106 having at least one disk storage system controller (Fig. 16). The RAID subsystem 104 and the disk manager 106 dynamically allocate data across the disk space of a plurality of disk drives 108 based on RAID-to-disk mapping. In addition, the RAID subsystem 104 and the disk manager 106 are able to determine whether additional disk drives are required based on the data allocation across the disk space. If additional disk drives are required, a notification is sent to the user so that additional disk space may be added if desired.
In accordance with the principles of the present invention, Fig. 2 illustrates one embodiment of the disk drive system 100 having dynamic data allocation (also referred to as "disk drive virtualization"); another embodiment of the system is shown in Figs. 14A and 14B. As shown in Fig. 2, a disk storage system 110 includes a page pool of storage 112, i.e., a pool of data storage including a list of data storage space that is free to store data. The page pool 112 maintains a free list of RAID devices 114 and manages read/write assignments based on user requests. User-requested disk storage volumes 116 are sent to the page pool 112 to obtain storage space. Each volume can request the same or different classes of storage devices with the same or different RAID levels (for example, RAID 10, RAID 5, RAID 0, etc.).
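A minimal sketch, not part of the original disclosure, of a page pool that keeps a free list per RAID class and hands pages to volumes on request; the class names and error handling are assumptions.

```python
class PagePool:
    """Free lists of pages, keyed by RAID class, handed out to volumes on request."""

    def __init__(self):
        self.free = {"RAID10": [], "RAID5": [], "RAID0": []}

    def add_raid_device(self, raid_class, page_ids):
        self.free[raid_class].extend(page_ids)      # new RAID space joins the free list

    def request_page(self, raid_class):
        """A volume asks the pool for a page of the requested storage class."""
        if not self.free[raid_class]:
            raise RuntimeError(f"no free {raid_class} pages: expand the pool")
        return self.free[raid_class].pop()
```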
Another embodiment of the dynamic data allocation of the present invention is shown in Figs. 14A and 14B, in which, in accordance with the principles of the present invention, a disk storage system 1400 includes a plurality of disk storage system controllers 1402 and a matrix of disk storage blocks 1404 controlled by the plurality of disk storage system controllers 1402 for dynamically allocating data in the system. A matrix of virtual volumes or blocks 1404 is provided and associated with physical disks. The matrix of virtual volumes or blocks 1404 is dynamically monitored/controlled by the plurality of disk storage system controllers 1402. In one embodiment, the size of each virtual volume 1404 can be predefined, for example 2 Mbytes, and the location of each virtual volume 1404 defaults to null. Before data is allocated, each of the virtual volumes 1404 is null. Data can be allocated in any grid of the matrix or pool (for example, once data is allocated in a grid, it becomes a "dot" in that grid). Once the data is deleted, the virtual volume 1404 is available again and is designated as "null". Therefore, additional and sometimes expensive disk storage devices, e.g. RAID devices, can be acquired later in time on an as-needed basis.
Thus, the RAID subsystem is capable of employing RAID techniques across a virtual number of disks. The remaining storage space is freely available. By monitoring the storage space and determining the occupancy rate of the storage space of the RAID subsystem, the user does not have to acquire a large sum of drives that are expensive but have no use at the time of purchase. Thus, adding drives when they are actually needed to satisfy the increasing demand for storage space significantly reduces the overall cost of the disk drives. Meanwhile, the efficiency of disk usage is substantially improved.
Also, the dynamic data allocation of the disk drive system of the present invention allows efficient data storage of snapshots/point-in-time copies of the virtual volume page pool of storage or the virtual volume matrix of disk storage blocks, instant data replay and instant data fusion for data recovery and remote data storage, and data progression.
The above features and advantages, achieved through the dynamic data allocation system and method and its implementation in the disk drive system 100, are discussed in detail below.
Dynamic Data Allocation
Fig. 2A illustrates conventional data allocation in a RAID subsystem of a disk drive system, where emptied data storage space is captive and cannot be allocated for data storage.
Fig. 2B illustrates data allocation in a RAID subsystem of a disk drive system in accordance with the principles of the present invention, where the data storage space freed for data storage can be mixed to form a page pool, for example a single page pool in one embodiment of the present invention.
Fig. 2C illustrates a dynamic data allocation method 200 in accordance with the principles of the present invention. The dynamic data allocation method 200 includes a step 202 of defining a default size of a logical block or disk storage block such that the disk space of the RAID subsystem forms a matrix of disk storage blocks, and a step 204 of writing data and allocating data in disk storage blocks of the matrix that are designated as "null". The method also includes a step 206 of determining the occupancy rate of the disk space of the RAID subsystem based on the historical occupancy rate of the disk space of the RAID subsystem, and a step 208 of determining whether additional disk drives are needed and, if so, sending a notification to the RAID subsystem. In one embodiment, the notification is sent via e-mail. Further, the disk storage block size can be set as a default or changed by the user.
In one embodiment, dynamic data allocation, sometimes also referred to as "virtualization" or "disk space virtualization," efficiently handles a large number of read and write requests per second. The architecture may require that interrupt handlers call the cache subsystem directly. Because dynamic data allocation queues requests, it may not optimize the requests, but it can have a large number of requests outstanding at a time.
Dynamic data allocation also maintains data integrity and protects the contents of the data against any controller failure. To do so, dynamic data allocation writes state information to the RAID devices for reliable storage.
Dynamic data allocation can also maintain the order of read and write requests and complete read or write requests in the exact order in which the requests were received. Dynamic data allocation allows maximum system availability and supports remote replication of data to a different geographical location.
In addition, dynamic data allocation provides the ability to recover from data errors. Through snapshots, the user can view the state of a disk in the past.
Dynamic data allocation manages the RAID devices and provides a storage abstraction to create and expand large devices.
Dynamic data allocation presents a virtual disk device, called a volume, to the servers. To a server, the volume acts the same as a disk. It may return different information for a serial number, but the volume basically behaves like a disk drive. A volume provides a storage abstraction over multiple RAID devices to create a larger dynamic volume device. A volume comprises multiple RAID devices for efficient use of disk space.
Fig. 13B shows existing volume logical block mapping. Fig. 14C shows volume-RAID page remapping of one embodiment of a virtual volume page pool of storage in accordance with the principles of the present invention. Each volume is broken up into a set of pages, e.g. 1, 2, 3, etc., and each RAID is broken up into a set of pages. In one embodiment, the volume page size and the RAID page size can be the same. Thus, one example of the volume-RAID page mapping of the present invention is that volume page #1 is mapped to RAID page #1 of device RAID-2.
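A hypothetical illustration of that remapping, not part of the original disclosure; the page size, the map contents, and the 1-based page numbering are assumptions chosen to mirror the example above.

```python
PAGE_SIZE_SECTORS = 4096   # assumed page size (a power of two)

# volume page number -> (RAID device, RAID page number); contents are made up
volume_page_map = {
    1: ("RAID-2", 1),      # e.g. volume page #1 maps to RAID page #1 of RAID-2
    2: ("RAID-1", 7),
    3: ("RAID-3", 4),
}

def volume_lba_to_raid(lba):
    """Translate a volume LBA into a (RAID device, RAID LBA) pair."""
    vpage = lba // PAGE_SIZE_SECTORS + 1           # volume pages numbered from 1 here
    offset = lba % PAGE_SIZE_SECTORS
    raid_dev, rpage = volume_page_map[vpage]
    return raid_dev, rpage * PAGE_SIZE_SECTORS + offset
```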
Dynamic data allocation maintains the data integrity of the volumes. Data is written to a volume and acknowledged to the server. Data integrity covers various controller configurations, including standalone and redundant, through a controller failure. Controller failures include power failure, power cycle, software exception, and hard reset. Dynamic data allocation generally does not handle disk drive failures, which are covered by RAID.
Dynamic data allocation provides the highest level of data abstraction for the controller. It accepts requests from the front end and ultimately writes the data to disk using the RAID devices.
Dynamic data allocation includes various internal subsystems:
Caching: smooths out read and write operations to a volume by providing fast response times to the server and bundling writes down to the data plug-in.
Configuration: includes the methods to create, delete, retrieve, and update data allocation objects. Provides the building blocks used by higher-level applications to create a toolbox.
Data plug-in: depending on the volume configuration, distributes volume read and write requests to the various subsystems.
RAID interface: provides a RAID device abstraction to the user and the other dynamic data allocation subsystems to create larger volumes.
Copy/mirror/swap: replicates volume data to local and remote volumes. In one embodiment, it may only copy blocks that have been written by the server.
Snapshot: provides incremental volume recovery of data. It instantly creates view volumes (ViewVolume) of past volume states.
Proxy volume: implements request communication to a remote destination volume in support of remote replication.
Billing: charges the user for the storage, activity, performance, and data recovery that the user is allocating.
Dynamic data allocation also logs any errors and significant changes in configuration.
Fig. 21 illustrates one embodiment of external data flow in the subsystem. External requests come from the front end. The requests include getting volume information, reads, and writes. All requests contain a volume ID. Volume information is handled by the volume configuration subsystem. Read and write requests include an LBA. Write requests also include the data.
Depending on the volume configuration, dynamic data allocation passes a request to a number of external layers. Remote replication passes requests to the front end, destined for a remote destination volume. The RAID interface passes requests to RAID. Copy/mirror/swap passes requests back into dynamic data allocation for the destination volume.
Fig. 22 illustrates one embodiment of internal data flow in the subsystem. Internal data flow starts with the cache. The cache may place write requests into the cache or pass the requests directly to the data plug-in. The cache supports direct DMA from the front-end HBA devices. Requests can be completed quickly and a response returned to the server. Below the cache, the data plug-in manager is the center of request flow. For each volume, it calls the registered subsystem objects for each request.
Dynamic data allocation subsystems that affect data integrity may require controller coherency support. As shown in Fig. 23, each subsystem maintains coherency independently. Coherency updates avoid copying blocks across the coherency link. Cache coherency may require copying data to the peer controller.
The disk storage system controller
Fig. 14A illustrates, in accordance with the principles of the present invention, a disk storage system 1400 having a plurality of disk storage system controllers 1402 and a matrix of disk storage blocks or virtual volumes 1404 controlled by the plurality of disk storage system controllers 1402 for dynamically allocating data in the system. Fig. 14B illustrates one embodiment of dynamic data allocation in the matrix of disk storage blocks or virtual volumes 1404.
In one operation, the disk storage system 1400 automatically generates a snapshot of the matrix of disk storage blocks or virtual volumes 1404 at predetermined time intervals and stores an allocation index of the snapshots or deltas of the matrix of disk storage blocks or virtual volumes 1404, such that the snapshots or deltas of the matrix of disk storage blocks or virtual volumes 1404 can be instantly located via the stored allocation index.
In another operation, the disk storage system controllers 1402 monitor the frequency of data use from the snapshots of the matrix of disk storage blocks 1404 and apply an aging rule such that less frequently used or accessed data is moved to a less expensive RAID subsystem. Similarly, when data located in the less expensive RAID subsystem begins to be used more frequently, the controllers move the data to a more expensive RAID subsystem. Thereby, the user is able to choose a desired RAID subsystem portfolio to meet its own storage needs. As a result, the cost of the disk drive system can be significantly reduced and dynamically controlled by the user.
RAID-to-disk mapping
The RAID subsystem and the disk manager dynamically allocate data across the disk space of a plurality of disk drives based on RAID-to-disk mapping. In one embodiment, the RAID subsystem and disk manager determine whether additional disk drives are required, and a notification is sent if additional disk drives are required.
Fig. 15 illustrates an example of three disk drives 108 (Fig. 1) mapped to a plurality of disk storage blocks 1502-1512 in a RAID-5 subsystem 1500 in accordance with the principles of the present invention.
Fig. 16 illustrates an example of remapping 1600 of the disk drive storage blocks after a disk drive 1602 is added to the three disk drives as shown in Fig. 15.
Disk manager
As shown in Fig. 1, the disk manager 106 generally manages disks and disk arrays, including grouping/resource pooling, abstraction of disk attributes, formatting, addition/subtraction of disks, and tracking of disk service times and error rates. The disk manager 106 does not distinguish the differences between various models of disks and presents a generic storage device for the RAID component. The disk manager 106 also provides grouping capabilities which facilitate the construction of RAID groups with specific characteristics, such as 10,000 RPM disks.
In one embodiment of the present invention, the disk manager 106 is at least three-fold: abstraction, configuration, and I/O optimization. The disk manager 106 presents "disks" to the upper layers, which could be, for example, locally or remotely attached physical disk drives or remotely attached disk systems.
The common underlying characteristic is that any of these devices can be the target of I/O operations. The abstraction service provides a uniform data path interface for the upper layers, particularly the RAID subsystem, and provides a generic mechanism for the administrator to manage target devices.
The disk manager 106 of the present invention also provides disk grouping capabilities to simplify administration and configuration. Disks can be named and placed into groups, which can also be named. Grouping is a powerful feature which simplifies tasks such as migrating volumes from one group of disks to another, dedicating a group of disks to a particular function, or designating a group of disks as spares.
The disk manager also interfaces with devices, such as a SCSI device subsystem, which is responsible for detecting the presence of external devices. The SCSI device subsystem is able to determine, at least for fibre channel/SCSI type devices, a subset of devices which are block-type target devices. It is these devices that are managed and abstracted by the disk manager.
In addition, the disk manager is responsible for responding to flow control from the SCSI device layer. The disk manager has queuing capabilities, which provide an opportunity to aggregate I/O requests as a method of optimizing the throughput of the disk drive system.
Further, the disk manager of the present invention manages a plurality of disk storage system controllers. Also, a plurality of redundant disk storage system controllers can be implemented to cover the failure of an operating disk storage system controller. The redundant disk storage system controllers are also managed by the disk manager.
Relationship of the disk manager to other subsystems
The disk manager interacts with several other subsystems. The RAID subsystem is the major client of the services provided by the disk manager for data path activities. The RAID subsystem uses the disk manager as the exclusive path to the disks for I/O. The RAID system also listens for events from the disk manager to determine the presence and operational status of disks. The RAID subsystem also works with the disk manager to allocate extents for the construction of RAID devices. Management control listens for disk events to learn of the presence of disks and of changes in their operational status. In one embodiment of the present invention, the RAID subsystem 104 may include a combination of at least one RAID type, such as RAID-0, RAID-1, RAID-5, and RAID-10. It is appreciated that other RAID types, such as RAID-3, RAID-4, RAID-6, and RAID-7, can be used in alternative RAID subsystems.
In one embodiment of the present invention, the disk manager utilizes the configuration access service to store persistent configuration information and to present transient read-only information, such as statistics, to the presentation layer. The disk manager registers handler routines with configuration access for requests for these parameters.
The disk manager also utilizes the services of the SCSI device layer to learn of the existence and operational status of block devices and holds the I/O path to these block devices. The disk manager queries the SCSI device subsystem for devices as a supporting method of uniquely identifying disks.
Data instant replay and data instant fusion
The present invention also provides methods of data instant replay and data instant fusion. Figs. 3A and 3B show schematic views of snapshots of disk storage blocks of a RAID subsystem at a plurality of time intervals in accordance with the principles of the present invention. Fig. 3C shows a data instant replay method 300 which includes a step 302 of defining a default size of a logical block or disk storage block such that the disk space of the RAID subsystem forms a page pool of storage or a matrix of disk storage blocks; a step 304 of automatically generating a snapshot of the volumes of the page pool or a snapshot of the matrix of disk storage blocks at predetermined time intervals; and a step of storing an allocation index of the snapshots or deltas of the page pool of storage or the matrix of disk storage blocks, such that the snapshots or deltas of the matrix of disk storage blocks can be instantly located via the stored allocation index.
As shown in Fig. 3B, at each predetermined time interval, e.g. 5 minutes, such as T1 (12:00 PM), T2 (12:05 PM), T3 (12:10 PM), and T4 (12:15 PM), a snapshot of the page pool of storage or the matrix of disk storage blocks is automatically generated. The snapshots or deltas of the page pool of storage or the matrix of disk storage blocks are stored such that the snapshots or deltas can be instantly located via the stored allocation index.
Accordingly, the data instant replay method automatically generates snapshots of the RAID subsystem at user-defined time intervals, user-configured dynamic time stamps (for example, every few minutes or hours, etc.), or times directed by the server. In case of a system failure or virus attack, these time-stamped virtual snapshots allow data instant replay and data instant recovery in a matter of a few minutes or hours, etc. The technique is also referred to as instant replay fusion, i.e., the data shortly before the crash or attack is fused in time, and the snapshots stored before the crash or attack can be instantly used for future operation.
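A simplified sketch, not part of the original disclosure, of interval-based snapshots and instant replay; only a delta (allocation index of changed pages) is stored per interval, and the names, the 5-minute default, and the lookup rule are assumptions.

```python
import time

class SnapshotSeries:
    def __init__(self, interval_seconds=300):        # e.g. a snapshot every 5 minutes
        self.interval_seconds = interval_seconds
        self.snapshots = {}                           # time stamp -> allocation index

    def take_snapshot(self, changed_page_index):
        """Store only the delta: the allocation index of pages changed since the last PITC."""
        stamp = time.time()
        self.snapshots[stamp] = dict(changed_page_index)
        return stamp

    def replay(self, stamp):
        """Instantly locate the most recent snapshot taken at or before `stamp`."""
        earlier = [t for t in self.snapshots if t <= stamp]
        return self.snapshots[max(earlier)] if earlier else None
```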
Fig. 4 further shows a schematic view of a data instant fusion function 400 using a plurality of snapshots of disk storage blocks of a RAID subsystem in accordance with the principles of the present invention. At T3, a parallel chain of snapshots T3'-T5' is generated, whereby the fused data T3' and/or recovered data can be used to replace the data fused at T4. Similarly, a plurality of parallel chains of snapshots T3'', T4''' can be generated to replace the data fused at T4'-T5' and T4''-T5''. In an alternative embodiment, the snapshots at T4, T4'-T5', and T5'' can still be stored in the page pool or the matrix.
The snapshots can be stored at a local RAID subsystem or at a remote RAID subsystem, such that if a major system crash occurs, for example due to a terrorist attack, the integrity of the data is not affected, and the data can be instantly recovered. Fig. 5 shows a schematic view of a local-remote data replication and instant recovery function 500 using snapshots of disk storage blocks of a RAID subsystem in accordance with the principles of the present invention.
Remote replication performs the service of replicating volume data to a remote system. It attempts to keep the local and remote volumes as closely synchronized as possible. In one embodiment, the data of the remote volume may not reflect a perfect copy of the data of the local volume. Network connectivity and performance may cause the remote volume to be out of sync with the local volume.
Another feature of the data instant replay and data instant fusion methods is that the snapshots can be used for testing while the system still maintains its operation. Live data can be used for real-time testing.
Snapshot and point-in-time copy (PITC)
In accordance with the principles of the present invention, one example of data instant replay utilizes snapshots of the disk storage blocks of the RAID subsystem. Snapshot records write operations to a volume so that a view volume can be created to see the contents of the volume as they existed in the past. Snapshot thus also supports data recovery by creating views to a previous point-in-time copy (PITC) of a volume.
The core of snapshot implements the creation, coalescing, management, and I/O operations of snapshots. Snapshot monitors writes to a volume and creates point-in-time copies (PITCs) for access through view volumes. It adds a logical block address (LBA) remapping layer to the data path within the virtualization layer. This is another layer of virtual LBA mapping within the I/O path. A PITC does not copy all the volume information; it only modifies the remapping tables that it uses.
Snapshot tracks changes to the volume data and provides the ability to view the volume data from a previous point in time. Snapshot performs this function by maintaining a list of delta writes for each PITC.
Snapshot provides several methods for PITC profiles: application-initiated and time-initiated. Snapshot provides the ability for an application to create a PITC. The application controls creation through an API on the server, which is passed to the snapshot API. Likewise, snapshot provides the ability to create a time profile.
Snapshot may not implement a journaling system or recover all writes to a volume. Snapshot may only keep the last write to an individual address within a PITC window. Snapshot allows the user to create a PITC that covers a user-defined short period of time, such as a few minutes or hours, etc. To handle failures, snapshot writes all the information to disk. Snapshot maintains volume data page pointers containing the delta writes. Since the tables provide the map to the volume data, and without them the volume data is inaccessible, the table data must handle controller failure cases.
The view volume function provides access to a PITC. The view volume function may attach to any PITC within the volume except the active PITC. Attaching to a PITC is a relatively fast operation. Uses of the view volume function include testing, training, backup, and recovery. The view volume function allows write operations without modifying the underlying PITC on which it is based.
In one embodiment, the snapshot design optimizes performance at the cost of disk space usage:
Snapshot provides quick response to user requests. User requests include I/O operations, creating a PITC, and creating/deleting view volumes. To this end, snapshot uses more disk space than the minimum necessary to store its table information. For I/O, snapshot summarizes the current state of a volume in a single table, so that all read and write requests may be satisfied by a single table. Snapshot reduces the impact on normal I/O operations as much as possible. Second, for view volume operations, snapshot uses the same table mechanism as the main volume data path.
Snapshot minimizes the amount of data copied. To this end, snapshot maintains a table of pointers for each PITC. Snapshot copies and moves pointers, but it does not move the data on the volume.
Snapshot manages the volume using fixed-size data pages. Tracking individual sectors could require a large amount of memory for even a single reasonably sized volume. By using data pages larger than a sector, some pages may contain a certain percentage of information duplicated directly from another page.
Snapshot stores the data page tables using data space on the volume. After a controller failure, the lookup tables are regenerated. The lookup tables allocate pages and further subdivide them.
In one embodiment, snapshot handles controller failure by requiring that a volume with snapshot operate on a single controller. This embodiment requires no coherency. All changes to the volume are recorded on disk or in reliable cache for recovery by a replacement controller. In one embodiment, recovery from a controller failure requires reading the snapshot information from disk.
Snapshot uses the virtual RAID interface to access storage. Snapshot may use a number of RAID devices as a single data space.
Snapshot supports 'n' PITCs per volume and 'm' views per volume. The limits on 'n' and 'm' are a function of disk space and controller memory.
Volume and volume allocation/layout
Snapshot adds an LBA remapping layer to the volume. The remapping uses the I/O request LBA and the lookup table to translate the address into a data page. As shown in Fig. 6, a presented volume using snapshot and a volume without snapshot operate in the same manner. It has a linear LBA space and handles I/O requests. Snapshot performs I/O with the RAID interface and includes a number of RAID devices in a volume. In one embodiment, the size of the RAID devices of a snapshot volume is not the size of the presented volume. The RAID devices allow snapshot to expand the space for data pages within the volume.
A new volume with snapshot enabled from the start only needs to include space for new data pages. Snapshot does not create a page list to place at the bottom PITC; in this case, the bottom PITC is empty. At allocation time, all PITC pages are on the free list. A volume with snapshot enabled from the start is created by allocating less physical space than the volume presents. Snapshot tracks writes to the volume. In one embodiment of the present invention, NULL volumes are not copied and/or stored in the page pool or the matrix, thereby improving the efficiency of storage space usage.
In one embodiment, for both allocation schemes, the PITCs place a virtual NULL volume at the bottom of the list. Reads to the NULL volume return blocks of zeros. The NULL volume handles the sectors not previously written by the server. Writes to the NULL volume cannot occur. Volumes use the NULL volume for reads to unwritten sectors.
The number of free pages depends on the size of the volume, the number of PITCs, and the expected rate of data changes. The system determines the number of pages to allocate for a given volume. The number of data pages can expand over time. Expansion may support faster-than-expected data changes, more PITCs, or a larger volume. New pages are added to the free list. Adding pages to the free list can occur automatically.
Snapshot uses data pages to manage the volume space. Each data page may contain several megabytes of data. Operating systems typically write a number of sectors in the same area of a volume. Memory requirements also dictate that snapshot manage the volume using pages. Maintaining a single 32-bit pointer for each sector of a 1-terabyte volume could require 8 gigabytes of RAM. Different volumes may have different page sizes.
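A back-of-the-envelope check of the sizing claim above (not from the original text; the 2 MB page size is an assumed example):

```python
TB = 2 ** 40
SECTOR = 512          # bytes per sector
POINTER = 4           # bytes per 32-bit pointer

per_sector_table = (TB // SECTOR) * POINTER          # 2**31 sectors * 4 B = 8 GiB
per_page_table = (TB // (2 * 2 ** 20)) * POINTER     # 2 MB pages -> 2 MiB of pointers

print(per_sector_table // 2 ** 30, "GiB of RAM vs", per_page_table // 2 ** 20, "MiB")
```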
Fig. 7 illustrates one embodiment of the snapshot structure. Snapshot adds a number of objects to the volume structure. The objects include the PITCs, a pointer to the active PITC, the data page free list, the child view volumes, and the PITC coalesce object.
The active PITC (AP) pointer is maintained by the volume. The AP handles the mapping of read and write requests to the volume. The AP contains a summary of the current location of all the data within the volume.
The data page free list tracks the available pages on the volume.
The optional child view volumes provide access to the volume's PITCs. The view volumes contain their own AP to record writes to the PITC without modifying the underlying data. A volume may support multiple child view volumes.
The snapshot coalesce object temporarily links two PITCs for the purpose of removing the previous PITC. Coalescing PITCs involves moving data page ownership and freeing data pages.
A PITC contains a table and the data pages for the pages written while the PITC was active. The PITC contains a freeze time stamp at which the PITC stopped accepting write requests. The PITC also contains a time-to-live value that determines when the PITC will be coalesced.
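A hypothetical sketch of such a PITC object (not part of the original disclosure); the field names, the one-hour default time-to-live, and the `owns` helper are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class PITC:
    table: Dict[int, int] = field(default_factory=dict)  # volume page -> data page pointer
    freeze_time: Optional[float] = None   # set when the PITC stops accepting writes
    time_to_live: float = 3600.0          # seconds until the PITC is coalesced (assumed)

    def owns(self, volume_page: int) -> bool:
        """True when this PITC wrote the page and may therefore overwrite it."""
        return volume_page in self.table
```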
Also, at the moment a PITC is taken, snapshot summarizes the data page pointers for the entire volume, providing predictable read and write performance. Other solutions may require reading and checking multiple PITCs to find the most recent pointer. Such solutions would need table caching algorithms, yet still have poor worst-case performance.
The snapshot summary of the present invention also reduces the worst-case memory usage of the tables. It may require an entire table to be loaded into memory, but it may only require a single table to be loaded.
The summary includes the pages owned by the current PITC and may include pages from all previous PITCs. To determine which pages the PITC may write, it tracks page ownership for each data page. It also tracks ownership for coalesce operations. For this purpose, the data page pointer includes a page index.
Fig. 8 illustrates one embodiment of the PITC life cycle. Each PITC goes through the following states before being committed as read-only:
1. Create table: the table is created when the PITC is created.
2. Commit to disk: this creates the storage for the PITC on disk. By writing the table at this point, it guarantees that the space necessary to store the table information is allocated before the PITC is taken. At the same time, the PITC object is also committed to disk.
3. Accept I/O: it becomes the active PITC (AP) and now handles read and write requests for the volume. This is the only state that accepts write requests to the table. The PITC generates an event that it is now active.
4. Commit the table to disk: the PITC is no longer the AP and no longer accepts additional pages. A new AP has taken over. From this point, the table will not change unless it is removed during a coalesce operation; it is read-only. At this point, the PITC generates an event that it is frozen and committed. Any service may listen to this event.
5. Release the table memory: the memory needed by the table is freed. This step also clears the log, since all changes have been written to disk.
The top-level PITC of a volume or view volume is called the active PITC (AP). The AP satisfies all read and write requests to the volume. For a volume, the AP is the only PITC that may accept write requests. The AP contains a summary of the data page pointers for the entire volume.
For a coalesce operation, the AP may be the destination, but not the source. As a destination, the AP increases the number of pages it owns, but it does not change the view of the data.
When a volume is expanded, the AP immediately grows with the volume. The new pages point to the NULL volume. Non-AP PITCs do not need to be modified for volume expansion.
Each PITC maintains a table that maps an incoming LBA to a data page pointer on the underlying volume. The table contains pointers to the data pages. The table needs to address more physical disk space than the logical space presented. Fig. 9 illustrates one embodiment of the table structure with a multi-level index. The structure decodes a volume LBA into a data page pointer. As shown in Fig. 9, each level decodes increasingly lower-order bits of the address. This table structure allows fast lookup and provides the ability to expand the volume. For fast lookup, the multi-level index structure keeps the table shallow, with a number of entries at each level. Each level performs an array lookup on its index. To support volume expansion, the multi-level index structure allows additional layers to be added to support the expansion. In all cases, volume expansion is an expansion of the LBA count presented to the upper levels, rather than an expansion of the actual amount of storage space allocated for the volume.
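A sketch of such a lookup, not taken from the original disclosure; the field widths and the page size of 2**12 sectors are assumptions, and `top_table` stands for a two-level nested array of data page pointers.

```python
L1_BITS, L2_BITS, PAGE_BITS = 10, 10, 12   # assumed widths; page size = 2**12 sectors

def lookup(top_table, lba):
    """Decode a volume LBA level by level, ending in a data page pointer."""
    l1 = lba >> (L2_BITS + PAGE_BITS)                  # top-level array index
    l2 = (lba >> PAGE_BITS) & ((1 << L2_BITS) - 1)     # mid-level array index
    offset = lba & ((1 << PAGE_BITS) - 1)              # offset within the data page
    data_page = top_table[l1][l2]                      # bottom level: data page pointer
    if data_page is None:                              # NULL pointer: page never written
        return None
    return data_page * (1 << PAGE_BITS) + offset       # LBA on the underlying RAID
```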
The multi-level index contains a summary of the entire volume data page remapping. Each PITC contains a complete remapping list for the volume as of the point in time at which the PITC was committed.
Each level of the multi-level index structure uses a different entry type for its tables. The different entry types support the demands of reading the information from disk and storing the information in memory. The bottom-level entries may only contain data page pointers. The top-level and mid-level entries contain two arrays: one for the LBAs of the next-level table entries, and another for the memory pointers used to navigate the table.
When the presented volume size is expanded, the size of the previous PITC tables does not need to increase, and those tables do not need to be modified. Since the tables are read-only, the information in them cannot change, and the expansion process modifies the table by appending NULL page pointers to the end. Snapshot does not present the tables from previous PITCs directly to the user.
I/O operations require the table to map an LBA to a data page pointer. The I/O then multiplies the data page pointer by the data page size to obtain the LBA of the underlying RAID. In one embodiment, the data page size is a power of two.
The table provides APIs to remap LBAs, add pages, and coalesce tables.
Snapshot uses data pages to store the PITC objects and the LBA mapping tables. The tables access the RAID interface directly for I/O to their table entries. The tables minimize modification when being read from or written to the RAID devices. Without modification, the table information may be read and written directly into the table entry structures. This reduces the copying required for I/O.
Snapshot may use a change log to avoid creating a hot spot on the disk. A hot spot is a location that is reused to track updates to the volume. The change log records updates to the PITC table and the free list of the volume. During recovery, snapshot uses the change log to recreate the AP and the free list in memory. Fig. 10 illustrates one embodiment of table recovery, showing the relationship between the in-memory AP, the on-disk AP, and the change log; it also shows the same relationship for the free list. The in-memory AP table can be rebuilt from the on-disk AP and the log. For any controller failure, the in-memory AP is rebuilt by reading the on-disk AP and applying the change log to it. Depending on the system configuration, the change log uses different physical resources. For multi-controller systems, the change log relies on battery-backed cache memory for storage. Using cache memory allows snapshot to reduce the number of times it writes the tables to disk while still maintaining data integrity. The change log is replicated to a backup controller for recovery. For single-controller systems, the change log writes all information to disk. This has the side effect of creating a hot spot on the disk at the log location. It allows a number of changes to be written to a single device block.
Periodically, snapshot writes the PITC table and the free list to disk, thereby creating a checkpoint and clearing the checkpointed entries from the log. The period depends on the number of changes made to the PITC table. Coalesce operations do not use the change log.
Snapshot data page I/O may require that a request fit within the data page boundaries. If snapshot encounters an I/O request that crosses a page boundary, it splits the request. It then passes the requests down to the request handlers. The write and read descriptions below assume that the I/O fits within the page boundaries. The AP provides the LBA remapping to satisfy I/O requests.
The AP satisfies all write requests. Snapshot supports two different write sequences, for owned and non-owned pages. The different write sequences allow pages to be added to the table. Fig. 11 illustrates one embodiment of the write process with owned page and non-owned page sequences (a simplified sketch of both sequences follows the lists below).
For the owned page sequence, the process includes the following:
1) find the table mapping; and
2) write the owned page: remap the LBA and write the data to the RAID interface.
A previously written page is a simple write request. Snapshot writes the data to the page, overwriting the current contents. Only data pages owned by the AP are written. Pages owned by other PITCs are read-only.
For a non-owned page, the process comprises the following:
1) look up the table mapping;
2) read the previous page: read the data page so that the write request and the data read together form a complete page (this is the start of the copy-on-write process);
3) combine the data: place the data read from the page and the write request payload in a single contiguous block;
4) allocate from the free list: obtain a new data page pointer from the free list;
5) write the combined data to the new data page;
6) commit the new page information to the change log;
7) update the table: change the LBA remapping in the table to reflect the new data page pointer; the data page is now owned by this PITC.
Adding a page may require blocking read and write requests until the page has been added to the table. By writing table updates to disk and keeping multiple cached copies of the log, snapshot achieves controller coherency.
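The two write sequences above might be sketched roughly as follows; the structures and helper names (PageEntry, free_list, raid_read, raid_write, journal) are assumptions chosen for illustration, not interfaces defined by the patent:

```python
from dataclasses import dataclass

@dataclass
class PageEntry:
    pointer: int      # data page pointer on the underlying RAID
    owned: bool       # True if this page is owned by the active PITC (AP)

def merge(old_page, payload, offset):
    """Place the previously read page data and the write payload in one contiguous block."""
    combined = bytearray(old_page)
    combined[offset:offset + len(payload)] = payload
    return combined

def ap_write(ap_table, page_index, offset, payload, page_size,
             free_list, raid_read, raid_write, journal):
    entry = ap_table.get(page_index)
    if entry is not None and entry.owned:
        # Owned page: remap the LBA and overwrite the current contents.
        raid_write(entry.pointer * page_size + offset, payload)
        return
    # Non-owned (or never-written) page: copy-on-write.
    old = raid_read(entry.pointer * page_size, page_size) if entry else bytearray(page_size)
    new_pointer = free_list.pop()                         # allocate a new data page
    raid_write(new_pointer * page_size, merge(old, payload, offset))
    journal.append((page_index, new_pointer))             # commit the new page to the change log
    ap_table[page_index] = PageEntry(new_pointer, True)   # the AP now owns this page
```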
For read requests, the AP fulfills all reads. Using the AP table, a read request's LBA is remapped to the LBA of the data page, and the remapped LBA is passed to the RAID interface to satisfy the request. A volume can fulfill read requests for data pages that have never been written to the volume; such pages are marked with a NULL address (all ones) in the PITC table. Requests to that address are satisfied by the NULL volume, which returns a constant data pattern. A read request that crosses a page boundary may be satisfied by pages owned by different PITCs.
Snapshot uses the NULL volume to satisfy read requests for data pages that have never been written. It returns all zeros for every sector read, and it has no RAID device or allocated space behind it. A block of zeros is expected to be kept in memory to satisfy read requests directed to the NULL volume. All volumes share the NULL volume for satisfying such reads.
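A corresponding read-path sketch, continuing the previous sketch, in which a NULL table entry is satisfied from a shared in-memory block of zeros (again, the names are illustrative):

```python
ZERO_BLOCK = bytes(512)   # the shared "NULL volume": a block of zeros kept in memory

def ap_read(ap_table, page_index, offset, length, page_size, raid_read):
    entry = ap_table.get(page_index)
    if entry is None:                        # NULL address in the PITC table: never written
        return ZERO_BLOCK * (length // 512)  # constant zero pattern, no RAID space needed
    return raid_read(entry.pointer * page_size + offset, length)
```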
In one embodiment, the coalesce process removes a PITC, and some of the pages it owns, from the volume. Removing a PITC creates more free space for tracking new differences. Coalescing compares the differences in two adjacent tables and keeps only the newer differences. Depending on the user's configuration, coalescing occurs periodically or is triggered manually.
The process may involve two PITCs, a source and a destination. In one embodiment, the eligibility rules are as follows:
1) the source must precede the destination PITC: the source must have been created before the destination;
2) the destination cannot simultaneously be a source;
3) the source cannot be referenced more than once; multiple references occur when a view volume is created from a PITC;
4) the destination may support multiple references;
5) the AP may be the destination, but cannot be the source.
The coalesce process writes all of its changes to disk and requires no coherency. If a controller fails, the volume recovers the PITC information from disk and restarts the coalesce process.
The process marks the two PITCs for coalescing and comprises the following steps:
1) set the source state to coalesce-source: this state is committed to disk for failure recovery. From this point on the source is no longer accessed, because its data pages may become invalid; data pages may be returned to the free list, or their ownership may be transferred to the destination;
2) set the destination state to coalesce-destination: this state is committed to disk for recovery from a controller failure;
3) load and compare the tables: this step moves the data page pointers, and freed data pages are added to the free list immediately;
4) set the destination state back to normal: the process is complete;
5) adjust the list: change the previous pointer of the PITC following the source so that it points to the destination, which effectively removes the source from the list;
6) release the source: return any data pages used for control information to the free list.
The above process supports combining two PITCs. Those skilled in the art will appreciate that coalescing can be designed to remove multiple PITCs by handling multiple sources in a single pass.
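A simplified sketch of the eligibility rules and the table comparison described above; the field names and state strings are assumptions chosen for illustration:

```python
def can_coalesce(source, destination):
    """Eligibility rules for merging a source PITC into a newer destination PITC."""
    return (source.created < destination.created       # source must precede the destination
            and not destination.is_coalesce_source     # destination is not itself a source
            and source.ref_count == 1                   # no view volume references the source
            and not source.is_active_pitc)              # the AP may be a destination, never a source

def coalesce(source, destination, free_list):
    source.state = "coalesce source"                    # committed to disk for failure recovery
    destination.state = "coalesce destination"
    for page_index, src_entry in source.table.items():  # load and compare the tables
        if page_index in destination.table:
            free_list.append(src_entry)                  # destination already holds a newer page
        else:
            destination.table[page_index] = src_entry    # ownership transfers to the destination
    destination.state = "normal"
    destination.previous = source.previous               # unlink the source from the PITC list
    free_list.extend(source.control_pages)               # return the source's control pages
```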
As shown in Figure 2, the page pool maintains a data page free list for use by all volumes associated with that page pool. The free-list manager commits the free list to permanent storage using data pages from the page pool. Updates to the free list come from more than one source: the write process allocates pages, the control page manager allocates pages, and the coalesce process returns pages.
The free list maintains a threshold that triggers automatic expansion of the free list itself. The trigger uses the page pool expansion method to add pages to the page pool. Automatic expansion may be governed by volume policy: more important volumes may be allowed to expand, while less important volumes are forced to coalesce.
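The threshold-triggered expansion could be sketched as follows (the threshold value and the expansion interface are assumptions for illustration):

```python
class PagePoolFreeList:
    """Illustrative free list that expands itself when it falls below a threshold."""
    def __init__(self, pages, expand_threshold, expand_pool):
        self.pages = list(pages)
        self.expand_threshold = expand_threshold
        self.expand_pool = expand_pool               # e.g. the page-pool expansion method

    def allocate(self):
        page = self.pages.pop()
        if len(self.pages) < self.expand_threshold:
            self.pages.extend(self.expand_pool())    # add pages to the page pool
        return page

    def release(self, page):
        self.pages.append(page)

free_list = PagePoolFreeList(range(8), expand_threshold=4, expand_pool=lambda: range(8, 16))
for _ in range(6):
    free_list.allocate()
print(len(free_list.pages))   # expanded once when the list dropped below the threshold
```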
A view volume provides access to a previous point in time and supports normal volume I/O operations. A PITC tracks the differences since the preceding PITC, and the view volume allows the user to access the information contained in a PITC. A view volume branches from a PITC. View volumes support recovery, testing, backup operations, and so on. Because a view volume requires no data copy, its creation occurs almost immediately. A view volume may require its own AP to support writes to the view volume.
A view taken from the current state of the volume may be created by copying the current volume AP. Using the AP, the view volume allows write operations to the view volume without modifying the underlying data. The OS may require a file system or file rebuild to use the data. The view volume allocates space from the parent volume for its AP and for written data pages; the view volume has no associated RAID device information of its own. Deleting a view volume releases its space back to the parent volume.
Figure 12 shows an exemplary snapshot operation in which the snapshot tables of a volume migrate over time. The figure shows a volume with 10 pages. Each state includes the list used to fulfill read requests to the volume. Shaded blocks indicate owned data page pointers.
The transition from the left of the figure (the original state) to the middle illustrates writes to pages 3 and 8. The write to page 3 requires changing PITC I (the AP). PITC I follows the new-page write process to add page 3 to its table; it reads the unchanged information from page J and stores the page on drive page B. All future writes to page 3 can be handled within this PITC without moving the page. The write to page 8 shows the second case, a write to a page already owned: because PITC I already contains page 8, it overwrites part of the data in page 8, which in this case resides on drive page C.
The transition from the middle of the figure to the right (the final state) illustrates the coalescing of PITC II and PITC III. Snapshot coalescing removes the older pages while still preserving all the changes contained in the two PITCs. Both PITCs contain page 3 and page 8. The process keeps the newer pages from PITC II and releases the pages from PITC III, returning pages A and D to the free list.
Snapshot allocates data pages from the page pool to store the free list and the PITC table information. A control page sub-allocates data pages to match the sizes required by these objects.
The volume contains a page pointer to the top of the control page information; from that page, all other information can be read.
Snapshot tracks the number of pages in use over a time interval. This allows snapshot to predict when the user needs to add more physical disk space to the system to prevent snapshot from running out.
Data progression
In one embodiment of the invention, data progression (DP) is used to move data gradually to storage space of an appropriate cost. The present invention allows the user to add drives only when they are actually needed, which significantly reduces the overall cost of the disk drives.
Data progression moves non-recently-accessed data and historical snapshot data to less expensive storage. For non-recently-accessed data, it gradually reduces the cost of storage for any page that has not been accessed recently; it does not move the data immediately to the least-cost storage. For historical snapshot data, it moves read-only pages to more efficient storage space, such as RAID 5, and if a page is no longer accessible by the volume it moves that page to the least expensive storage.
Other advantages of data progression according to the present invention include maintaining fast I/O access to currently accessed data and reducing the need to purchase fast but expensive disk drives.
In operation, data progression determines the cost of storage using the cost of the physical media and the efficiency of the RAID devices used for data protection. Data progression also determines storage efficiency and moves data accordingly. For example, data progression may convert RAID 10 devices to RAID 5 devices in order to use physical disk space more efficiently.
Data progression defines accessible data as data that can currently be read or written by a server. It uses accessibility to determine the class of storage a page should use. If a page belongs to a historical PITC, it is read-only. If the server has not updated the page in the most recent PITC, the page is still accessible.
Figure 17 shows one embodiment of accessible data pages in data progression. Accessible data pages fall into the following categories:
Accessible, recently accessed: the active pages used most by the volume.
Accessible, not recently accessed: read-write pages that have not been used recently.
Historical, accessible: read-only pages that can still be read by the volume; applies to snapshot volumes.
Historical, non-accessible: read-only data pages that the volume can no longer access; applies to snapshot volumes. Snapshot maintains these pages for recovery purposes, and they are generally placed in the least-cost storage possible.
Figure 17 shows three PITCs of a snapshot volume, each with different owned pages. A dynamic capacity volume is represented by PITC C alone; all of its pages are accessible and read-write, although they may have different access times.
The following table lists various storage devices in order of increasing efficiency or decreasing monetary cost. The list is also in the general order of progressively slower write I/O access. Data progression computes efficiency as the logical protected space divided by the total physical space of the RAID device.
Table 1: RAID types
RAID 5 efficiency increases as the number of drives in the stripe increases. As the number of disks in a stripe increases, the fault domain also increases, and so does the minimum number of disks necessary to create the RAID device. In one embodiment, because of the growth in fault domain size and the limited gain in efficiency, data progression does not use RAID 5 stripe sizes larger than 9 drives. Data progression uses RAID 5 stripe sizes that align evenly with the snapshot page size; this allows data progression to perform full-stripe writes when moving a page to RAID 5, making the move more efficient. For the purposes of data progression, all RAID 5 configurations have the same write I/O characteristics. For example, RAID 5 on 2.5-inch FC disks might not use the performance of those disks effectively; to prevent such combinations, data progression needs to support the ability to prevent certain RAID types from running on certain disk types. Data progression configuration can also prevent the system from using RAID 10 or RAID 5 space.
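As a sketch of the efficiency calculation referred to above, the protected (logical) space divided by the physical space gives, for example, (n-1)/n for an n-drive RAID 5 stripe and 1/2 for RAID 10 (illustrative code, not from the patent):

```python
def raid5_efficiency(drives_in_stripe):
    """Logical protected space divided by total physical space for RAID 5."""
    return (drives_in_stripe - 1) / drives_in_stripe

def raid10_efficiency():
    """Every block is mirrored, so half the physical space holds protected logical data."""
    return 0.5

# A 5-drive stripe protects 80% of its physical space; a 9-drive stripe about 89%.
print(raid5_efficiency(5), raid5_efficiency(9), raid10_efficiency())
```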
Disk types are shown in the following table:
Table 2: Disk types
Data progression includes the ability to automatically classify disk drives relative to the other drives within the system. The system examines a disk to determine its performance relative to the other disks in the system. Faster disks are placed in a higher value classification, and slower disks in a lower value classification. When disks are added to the system, the system automatically rebalances the value classifications of the disks. This method handles both systems that never change and systems that change frequently as new disks are added. Automatic classification may place multiple disk types in the same value classification; if drives are determined to be close enough in value, they can have the same value.
In one embodiment, the system contains the following drives:
High: 10K FC drives
Low: SATA drives
With the addition of 15K FC drives, data progression automatically reclassifies the disks, demoting the 10K FC drives. This yields the following classification:
High: 15K FC drives
Middle: 10K FC drives
Low: SATA drives
In another embodiment, the system may have the following drive types:
High: 25K FC drives
Low: 15K FC drives
Here the 15K FC drives fall into the lower value classification, while the 25K FC drives fall into the higher value classification. If SATA drives are added to the system, data progression automatically reclassifies the disks, yielding the following classification:
High: 25K FC drives
Middle: 15K FC drives
Low: SATA drives
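A sketch of the automatic reclassification idea, in which drive types are ranked by a relative performance score and rebucketed into high, middle, and low value classes whenever disks are added (the scores below are invented for illustration only):

```python
from collections import namedtuple

Drive = namedtuple("Drive", "name score")   # score: relative performance measured by the system

def classify(drives):
    """Bucket distinct drive types into value classes by relative performance."""
    ranked = sorted({d.name: d for d in drives}.values(), key=lambda d: d.score, reverse=True)
    if len(ranked) == 1:
        return {ranked[0].name: "high"}
    classes = {ranked[0].name: "high", ranked[-1].name: "low"}
    for d in ranked[1:-1]:
        classes[d.name] = "middle"
    return classes

# Before: 10K FC and SATA.  Adding 15K FC demotes 10K FC to the middle class.
print(classify([Drive("10K FC", 3), Drive("SATA", 1)]))
print(classify([Drive("15K FC", 5), Drive("10K FC", 3), Drive("SATA", 1)]))
```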
Data progression may include waterfall progression. Typically, waterfall progression moves data to a less expensive resource only when the current resource is completely used. Waterfall progression effectively maximizes the use of the most expensive system resources, and it also minimizes the cost of the system: adding cheap disks to the lowest pool creates a larger pool at the bottom.
Typical waterfall progression uses RAID 10 space first and then the next RAID space, such as RAID 5; this forces the waterfall directly to RAID 10 of the next disk class. Alternatively, data progression may include a mixed RAID waterfall progression, as shown in Figure 24. This alternative method addresses the problem of maximizing disk space and performance, and allows storage to be converted to a more efficient form within the same disk class. It also supports the requirement that RAID 10 and RAID 5 share the total resources of a disk class, which may require configuring a fixed percentage of the disk class's space that a RAID level may use. The alternative method thus maximizes the use of expensive storage while allowing space for another RAID class to coexist.
The mixed RAID waterfall method also moves pages to less expensive storage only when storage is limited. A threshold, such as a percentage of total disk space, limits the storage space of a given RAID type. This maximizes the use of the expensive storage in the system. As storage approaches its limit, data progression automatically moves pages to lower-cost storage, leaving a buffer for write peaks.
It is appreciated that the above waterfall methods may also move a page immediately to the least-cost storage, since in some cases there is a need to move historical and non-accessible pages to less expensive storage in a timely manner. Historical pages may likewise be moved to less expensive storage immediately.
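A sketch of the threshold-driven behavior of the mixed waterfall: pages are demoted to the next cheaper class only when the current class approaches its configured limit (the structures and the 90% threshold are assumptions for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class StorageClass:
    name: str
    limit: int                                   # configured share of disk space for this class
    used: int = 0
    pages: list = field(default_factory=list)    # page sizes, for illustration

def demote_if_needed(classes, threshold=0.9):
    """classes is ordered from most expensive to least expensive storage."""
    for upper, lower in zip(classes, classes[1:]):
        while upper.pages and upper.used / upper.limit > threshold:
            page = upper.pages.pop()             # in practice, a least-recently-used page
            upper.used -= page
            lower.pages.append(page)
            lower.used += page
```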
Figure 18 shows a flow diagram of one embodiment of a data progression process 1800. Data progression continually examines the access pattern and storage cost of every page in the system to determine whether there is data to move. Data progression may also determine whether storage has reached its maximum allocation.
The data progression process determines whether a page is accessible by any volume. The process checks each volume attached to the historical PITCs to determine whether the page is referenced. If the page is in active use, it is eligible for promotion or for slow demotion. If the page is not accessible by any volume, it is moved to the least-cost storage available. Data progression also takes into account the time remaining before a PITC expires: if a snapshot-scheduled PITC is about to expire, its pages are not progressed. If the page pool is operating in an aggressive mode, the pages may be progressed.
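One way to picture the eligibility decision for a single page during such a pass (purely illustrative; the volume and PITC interfaces are assumptions, not the patent's):

```python
def page_disposition(is_referenced_in, volumes, pitc_expires_soon,
                     recently_accessed, pool_is_aggressive):
    """Illustrative eligibility check for one page in a data-progression pass.
    is_referenced_in(pitc) -> bool tells whether the page is reachable from that PITC."""
    accessible = any(is_referenced_in(vol.active_pitc) for vol in volumes)
    if pitc_expires_soon:
        return "leave in place"                 # the owning PITC is about to expire anyway
    if not accessible:
        return "move to least-cost storage"     # historical and no longer readable by any volume
    if recently_accessed and not pool_is_aggressive:
        return "keep on current (fast) storage"
    return "eligible for gradual demotion"
```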
Data progression's recent-access detection needs to prevent a burst of activity from promoting a page. Data progression tracks write access separately, which allows it to keep data on accessible RAID 5 devices when operations such as virus scans or reporting only read the data. If storage is scarce, data progression changes the qualifications for recent access, allowing it to demote pages more aggressively; this also helps fill the system from the bottom up when storage is running short.
As system resources become scarce, data progression moves data pages more aggressively. In all of these cases, more disks or a configuration change are still ultimately necessary; data progression merely lengthens the amount of time the system can operate in a constrained state. Data progression attempts to keep the system operational for as long as possible, which lasts until all of its storage classes have exhausted their space.
When RAID 10 space is scarce and total free disk space is also scarce, data progression may reclaim RAID 10 disk space and move it into more efficient RAID 5. At the cost of write performance, this increases the overall capacity of the system, although more disks are still eventually required. If a particular storage class is completely used, data progression allows non-preferred page allocations and moves to keep the system running. For example, if a volume is configured to use RAID 10-FC for its accessible information, it may allocate pages from RAID 5-FC or RAID 10-SATA until more RAID 10-FC space becomes available.
Data progression also supports compression to increase the perceived capacity of the system. Compression may be used only for historical pages that are not accessed, or as storage for recovery information. Compression appears as another class of storage near the bottom of the storage cost scale.
As shown in Figure 25, the page pool essentially comprises a free list and device information. The page pool needs to support multiple free lists, an enhanced page allocation scheme, and classification of the free lists. The page pool maintains a separate free list for each class of storage in the system. The allocation scheme allows pages to be allocated from multiple pools while specifying the minimum or maximum class allowed. The classification of a free list is configured from its devices. Each free list provides its own counters for statistics gathering and display, and each free list also provides RAID device efficiency information for reporting the storage efficiency state.
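A sketch of per-class free lists with an allocation call bounded by the best and worst classes a volume will accept (the class names and interface are assumptions for illustration):

```python
class ClassedPagePool:
    """One free list per storage class; allocation bounded by the allowed classes."""
    def __init__(self, classes):
        # classes ordered best (most expensive) to worst (least expensive)
        self.order = list(classes)
        self.free_lists = {c: [] for c in classes}

    def allocate(self, best_allowed, worst_allowed):
        lo = self.order.index(best_allowed)
        hi = self.order.index(worst_allowed)
        for cls in self.order[lo:hi + 1]:        # try the best allowed class first
            if self.free_lists[cls]:
                return cls, self.free_lists[cls].pop()
        raise RuntimeError("no free page in the allowed storage classes")

    def release(self, cls, page):
        self.free_lists[cls].append(page)

pool = ClassedPagePool(["RAID10-FC", "RAID5-FC", "RAID5-SATA"])
pool.release("RAID5-FC", 42)
print(pool.allocate("RAID10-FC", "RAID5-SATA"))   # ('RAID5-FC', 42)
```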
In one embodiment, the device list may require the additional ability to track the cost of the storage class; this combination determines the class of storage. This is needed if the user wants the configured classes to have more or less granularity.
Figure 26 shows an embodiment of a high-performance database in which all accessible data, even if not accessed recently, resides only on 2.5-inch FC drives. Non-accessible historical data is moved to RAID 5 Fibre Channel.
Figure 27 shows an embodiment of an MRI image volume, where the accessible storage for this dynamic volume is SATA RAID 10 and RAID 5. If an image has not been accessed recently, it is moved to RAID 5; new writes initially go to RAID 10. Figure 19 shows an embodiment of a compressed page layout. Data progression implements compression by sub-allocating data pages of fixed size. The sub-allocation information tracks the position of the free portion of the page and of the allocated portions of the page. Data progression does not predict the efficiency of compression and can handle variable-size pages in its sub-allocation.
Compressed pages can significantly affect CPU performance. For write access, a compressed page requires the whole page to be decompressed and recompressed; pages being actively accessed are therefore not compressed and are returned to their uncompressed state. Writing to compressed pages may nevertheless be necessary when storage is extremely limited.
The PITC remapping table points to the sub-allocation information and is marked to indicate a compressed page. Accessing a compressed page may require a higher I/O count than accessing an uncompressed page: the access may read the sub-allocation information to retrieve the location of the real data, and the compressed data is then read from disk and decompressed on the processor.
Data progression may require that the compression support decompressing a sufficient portion of the whole page, which allows data progression to decompress only a small fraction of a page for a read access. The read-ahead feature of the read cache can help offset the delay of decompression, and a single decompression can service multiple server I/Os. Data progression marks pages that are poor candidates for compression so that it does not repeatedly attempt to compress them.
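A sketch of sub-allocating variable-size compressed pages inside one fixed-size data page, with the sub-allocation information tracking the free portion and each allocated portion (zlib stands in for whatever compression the system would actually use; the layout is an assumption):

```python
import zlib

class CompressedDataPage:
    """Illustrative sub-allocation of compressed pages within a fixed-size data page."""
    def __init__(self, page_size):
        self.buffer = bytearray(page_size)
        self.next_free = 0                 # start of the free portion of the page
        self.entries = {}                  # page_index -> (offset, length) of compressed data

    def store(self, page_index, raw):
        blob = zlib.compress(raw)
        if self.next_free + len(blob) > len(self.buffer):
            return False                   # does not fit; caller tries another data page
        self.buffer[self.next_free:self.next_free + len(blob)] = blob
        self.entries[page_index] = (self.next_free, len(blob))
        self.next_free += len(blob)
        return True

    def load(self, page_index):
        offset, length = self.entries[page_index]
        return zlib.decompress(bytes(self.buffer[offset:offset + length]))

page = CompressedDataPage(4096)
page.store(7, b"\x00" * 2048)              # a highly compressible historical page
print(len(page.load(7)))                   # 2048
```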
Figure 20 shows one embodiment of data progression in an advanced disk drive system according to principles of the present invention. Data progression does not change the external behavior of a volume or the operation of its data path. Data progression may require modifications to the page pool. The page pool essentially comprises a free list and device information, and it needs to support multiple free lists, an enhanced page allocation scheme, and classification of the free lists. The page pool maintains a separate free list for each class of storage in the system. The allocation scheme allows pages to be allocated from multiple pools while specifying the minimum or maximum class allowed. The classification of a free list can be configured from its devices. Each free list provides its own counters for statistics gathering and display, and also provides RAID device efficiency information for compiling storage efficiency statistics.
A PITC identifies candidate pages for movement, and I/O to an accessible page is blocked while the page is being moved. Data progression continually examines the PITCs for candidates. The accessibility of pages changes continually because of server I/O, new snapshot page updates, and view volume creation and deletion. Data progression also continually checks for volume configuration changes and summarizes the current list of page classes and counts. This allows data progression to evaluate the summary and determine whether there are pages that could be moved.
Each PITC presents counters for the number of pages in each class of storage. Data progression uses this information to identify PITCs that become good candidates for page movement when a threshold is reached.
RAID allocates devices from a set of disks based on the cost of the disks. RAID also provides an API to retrieve the efficiency of a device or a potential device, and it needs to return information about the number of I/Os required by a write operation. Data progression may also require a RAID NULL in order to use a third-party RAID controller as part of data progression. A RAID NULL may consume an entire disk and serves merely as a pass-through layer.
The disk manager may also automatically determine and store the disk classification. Automatically determining the disk classification may require changes to the SCSI initialization routine.
From the above description and accompanying drawings, those of ordinary skill in the art will understand that the particular embodiments shown and described are for purposes of illustration only and are not intended to limit the scope of the invention. Those of ordinary skill in the art will recognize that the invention may be embodied in other specific forms without departing from its spirit or essential characteristics. References to details of particular embodiments are not intended to limit the scope of the invention.

Claims (12)

1. A disk drive system capable of dynamically allocating data space in a disk storage subsystem, the system comprising:
a disk storage subsystem, the disk storage subsystem maintaining a plurality of user data storage volumes; and
a disk manager, wherein the disk manager is configured to:
maintain a matrix of virtual volumes, the matrix of virtual volumes being a virtual storage abstraction of a plurality of disk storage devices, each of the virtual volumes having a predefined size;
dynamically allocate free virtual volumes from the matrix of virtual volumes to the plurality of user data storage volumes; and
write data to the allocated virtual volumes.
2. The system as claimed in claim 1, characterized in that the disk manager determines whether additional disk drives are required and, if additional disk drives are required, sends a notification.
3. The system as claimed in claim 1, characterized in that the disk manager manages a plurality of disk storage system controllers.
4. The system as claimed in claim 3, characterized by further comprising a plurality of redundant disk storage system controllers to cover a failure of an operating disk storage system controller.
5. The system as claimed in claim 1, characterized in that the disk storage subsystem comprises a combination including at least one RAID type, the at least one RAID type comprising RAID-0, RAID-1, RAID-5 and RAID-10.
6. The system as claimed in claim 5, characterized in that the at least one RAID type further comprises RAID-3, RAID-4, RAID-6 and RAID-7.
7. The system as claimed in claim 1, characterized in that the virtual volumes are a virtual storage abstraction of a plurality of RAID devices.
8. A method of dynamically allocating data space of a disk storage subsystem, comprising the steps of:
maintaining user data storage volumes of a plurality of storage spaces in a plurality of data storage devices of the disk storage subsystem;
managing a matrix of virtual volumes, the matrix of virtual volumes being a virtual storage abstraction of a plurality of disk storage devices, each of the virtual volumes having a predefined size;
dynamically allocating free virtual volumes from the matrix of virtual volumes to the user data storage volumes of the plurality of storage spaces that use the matrix of virtual volumes; and
writing data to the allocated virtual volumes.
9. The method as claimed in claim 8, characterized by further comprising setting the size of the virtual volumes to a default that can be changed by the user.
10. The method as claimed in claim 8, characterized by further comprising determining the occupancy of the disk space of the disk storage subsystem based on the historical occupancy of disk space blocks of the disk storage subsystem.
11. The method as claimed in claim 10, characterized by further comprising:
determining whether additional disk drives are required; and
if additional disk drives are required, sending a notification to the disk storage subsystem.
12. The method as claimed in claim 8, characterized in that the user data storage volumes of the plurality of storage spaces comprise disk space blocks of at least one RAID type, wherein the at least one RAID type comprises RAID-0, RAID-1, RAID-5 and RAID-10.
CN 200910004737 2003-08-14 2004-08-13 Virtual disk drive system and method Active CN101566930B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US49520403P 2003-08-14 2003-08-14
US60/495,204 2003-08-14
US10/918,329 US7613945B2 (en) 2003-08-14 2004-08-13 Virtual disk drive system and method
US10/918,329 2004-08-13

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CNB2004800263088A Division CN100478865C (en) 2003-08-14 2004-08-13 Virtual disk drive system and method

Publications (2)

Publication Number Publication Date
CN101566930A CN101566930A (en) 2009-10-28
CN101566930B true CN101566930B (en) 2013-10-16

Family

ID=37078433

Family Applications (4)

Application Number Title Priority Date Filing Date
CNB2004800263088A Active CN100478865C (en) 2003-08-14 2004-08-13 Virtual disk drive system and method
CN 200910004729 Active CN101566929B (en) 2003-08-14 2004-08-13 Virtual disk drive system and method
CN2009100047280A Active CN101566928B (en) 2003-08-14 2004-08-13 Virtual disk drive system and method
CN 200910004737 Active CN101566930B (en) 2003-08-14 2004-08-13 Virtual disk drive system and method

Family Applications Before (3)

Application Number Title Priority Date Filing Date
CNB2004800263088A Active CN100478865C (en) 2003-08-14 2004-08-13 Virtual disk drive system and method
CN 200910004729 Active CN101566929B (en) 2003-08-14 2004-08-13 Virtual disk drive system and method
CN2009100047280A Active CN101566928B (en) 2003-08-14 2004-08-13 Virtual disk drive system and method

Country Status (1)

Country Link
CN (4) CN100478865C (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8601035B2 (en) * 2007-06-22 2013-12-03 Compellent Technologies Data storage space recovery system and method
JP4375435B2 (en) * 2007-05-23 2009-12-02 株式会社日立製作所 Hierarchical storage system for predictive data migration
JP5080201B2 (en) * 2007-10-22 2012-11-21 京セラドキュメントソリューションズ株式会社 Information processing apparatus and device driver provided therein
JP2011530746A (en) * 2008-08-07 2011-12-22 コンペレント・テクノロジーズ System and method for transmitting data between different RAID data storage formats for current data and playback data
CN102246135B (en) * 2008-11-07 2015-04-22 戴尔康佩伦特公司 Thin import for a data storage system
TWI432959B (en) 2009-01-23 2014-04-01 Infortrend Technology Inc Storage subsystem and storage system architecture performing storage virtualization and method thereof
KR101552753B1 (en) * 2009-01-29 2015-09-11 엘에스아이 코포레이션 Allocate-on-write snapshot mechanism to provide dynamic storage tiering on-line data placement for volumes
US8108646B2 (en) * 2009-01-30 2012-01-31 Hitachi Ltd. Storage system and storage control method that compress and store data elements
US9646039B2 (en) * 2013-01-10 2017-05-09 Pure Storage, Inc. Snapshots in a storage system
CN104424052A (en) * 2013-09-11 2015-03-18 杭州信核数据科技有限公司 Automatic redundant distributed storage system and method
CN107402838A (en) * 2016-05-18 2017-11-28 深圳市深信服电子科技有限公司 A kind of backup method and storage system based on write buffer
CN107832168B (en) * 2017-10-13 2020-10-16 记忆科技(深圳)有限公司 Solid state disk data protection method
CN107766004A (en) * 2017-11-02 2018-03-06 郑州云海信息技术有限公司 A kind of mapping relations implementation method, system and computer equipment
CN113535069A (en) * 2020-04-22 2021-10-22 联想企业解决方案(新加坡)有限公司 Data storage system, computing equipment and construction method of data storage system
CN114048157A (en) * 2021-11-16 2022-02-15 安徽芯纪元科技有限公司 Internal bus address remapping device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0757317A2 (en) * 1995-07-31 1997-02-05 Kabushiki Kaisha Toshiba Hierarchical data storage device and storage method
US6269431B1 (en) * 1998-08-13 2001-07-31 Emc Corporation Virtual storage and block level direct access of secondary storage for recovery of backup data

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5491593A (en) * 1993-09-10 1996-02-13 International Business Machines Corporation Disk drive spindle synchronization apparatus and method
DE10085321T1 (en) * 1999-12-22 2002-12-05 Seagate Technology Llc Buffer management system for managing data transfer to and from a buffer in a disk drive
CN1373402A (en) * 2001-02-28 2002-10-09 廖瑞民 Hard disk data preserving and restoring device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0757317A2 (en) * 1995-07-31 1997-02-05 Kabushiki Kaisha Toshiba Hierarchical data storage device and storage method
US6269431B1 (en) * 1998-08-13 2001-07-31 Emc Corporation Virtual storage and block level direct access of secondary storage for recovery of backup data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
John Wilkes et al..The HP AutoRAID hierarchical storage system.《ACM Transactions on Computer Systems》.1996,第14卷(第1期), *

Also Published As

Publication number Publication date
CN101566928B (en) 2012-06-27
CN101566929B (en) 2013-10-16
CN101566930A (en) 2009-10-28
CN100478865C (en) 2009-04-15
CN101566929A (en) 2009-10-28
CN101566928A (en) 2009-10-28
CN1849577A (en) 2006-10-18

Similar Documents

Publication Publication Date Title
CN101566931B (en) Virtual disk drive system and method
US20120124285A1 (en) Virtual disk drive system and method with cloud-based storage media
CN101566930B (en) Virtual disk drive system and method
EP2385457A2 (en) Virtual disk drive system and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20160520

Address after: American Texas

Patentee after: DELL International Ltd

Address before: American Minnesota

Patentee before: Compellent Technologies