US20020143743A1 - Methods for in-place online reorganization of a database - Google Patents

Methods for in-place online reorganization of a database Download PDF

Info

Publication number
US20020143743A1
US20020143743A1 US10/151,393 US15139302A US2002143743A1 US 20020143743 A1 US20020143743 A1 US 20020143743A1 US 15139302 A US15139302 A US 15139302A US 2002143743 A1 US2002143743 A1 US 2002143743A1
Authority
US
United States
Prior art keywords
list
reorganization
records
user
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US10/151,393
Inventor
Balakrishna Iyer
Gary Sockut
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to US10/151,393 priority Critical patent/US20020143743A1/en
Publication of US20020143743A1 publication Critical patent/US20020143743A1/en
Abandoned legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99951File or database maintenance
    • Y10S707/99952Coherency, e.g. same view to multiple users
    • Y10S707/99953Recoverability

Definitions

  • This invention relates in general to database management systems performed by computers, and in particular, to a method of providing in-place reorganization of a database.
  • Any database management system can need some type of reorganization.
  • Reorganization of a database is defined as changing some aspect of the logical and/or physical arrangement of the database.
  • a tutorial paper referenced in [12] discusses issues in reorganization and types of reorganization. This specification describes the problem in reorganizing offline, and the need for online reorganization. (See, e.g., [11]).
  • Clustering is the practice of storing records near each other if they meet certain criteria.
  • One popular criterion is consecutive values in a column of the records.
  • Clustering should reduce disk input/output for records that users often access together. When users write data into the database, this writing can decrease the amount of clustering and thus degrade performance.
  • Reorganization can restore clustering and performance.
  • the area being reorganized is offline or only partially available; users cannot write (and perhaps cannot even read) data in that area.
  • a highly available database a database that is to be fully available 24 hours per day, 7 days per week
  • Applications that require high availability include reservations, finance (especially global finance), process control, hospitals, police, armed forces, and Internet service.
  • the present invention provides methods for in-place online reorganization (specifically, for restoration of clustering).
  • the data structures are those of IBM's DBMS Database 2 (DB2) for OS/390 [4], but the concepts in the methods presented herein should apply to many DBMS's.
  • the methods perform reorganization in place; i.e., they do not make a new copy of the data being reorganized.
  • the methods track the reorganization's movement of records across a user's position within a scan of data, and they correct the behavior of a user transaction to account for the movement.
  • the present invention discloses a method, system, and article of manufacture for providing in-place reorganization of a database that achieves reasonably accurate results for users during high-throughput concurrent usage of the database.
  • the reorganization's movement of records across a user transaction's position within a scan of the database is tracked.
  • the behavior of the user transaction is corrected to account for the movement of the records.
  • FIG. 1 illustrates an exemplary computer hardware environment that could be used with the preferred embodiment of the present invention
  • FIG. 2 shows the structure of file pages according to the preferred embodiment of the present invention
  • FIG. 3 shows the structure of index pages according to the preferred embodiment of the present invention
  • FIG. 4 shows an example of a clustering index, a table space, and a non-clustering index according to the preferred embodiment of the present invention
  • FIG. 5 shows a state transition diagram for the reorganizer according to the preferred embodiment of the present invention
  • FIG. 6 shows the structure of a forward ordered list (FOL) according to the preferred embodiment of the present invention
  • FIG. 7 shows a state transition diagram for each record that a user will eventually process according to the preferred embodiment of the present invention
  • FIG. 8 illustrates the structure of a differential list according to the preferred embodiment of the present invention
  • FIG. 9 shows a diagram for the states of a record ID (RID) according to the preferred embodiment of the present invention.
  • FIG. 10 shows an associated chart for the states of a pair of record IDs (RIDs) (source and target) according to the preferred embodiment of the present invention
  • FIG. 11 is a flowchart if the main loop used by the reorganizer according to the preferred embodiment of the present invention.
  • FIG. 12 is a flowchart of the CLUSTER_RECORD procedure according to the preferred embodiment of the present invention.
  • FIGS. 13A and 13B together are a flowchart of the MOVE_RECORD procedure according to the preferred embodiment of the present invention.
  • FIG. 14 is a flowchart of the MODIFY_FOL_AND_BOL procedure according to the preferred embodiment of the present invention.
  • FIG. 15 is a flowchart of the MODIFY_DL procedure according to the preferred embodiment of the present invention.
  • FIG. 16 is a flowchart of the logic used by a user performing a table space scan or index scan according to the preferred embodiment of the present invention.
  • FIG. 17 is a flowchart that describes the logic of the PROCESS_BOL_ENTRY procedure according to the preferred embodiment of the present invention.
  • FIG. 18 is a flowchart that describes the logic of the SCAN procedure according to the preferred embodiment of the present invention.
  • FIGS. 19 A- 19 D together are a flowchart that describes the logic of the CORRECT_RID_LISTS procedure according to the preferred embodiment of the present invention.
  • the present invention describes methods for performing one type of reorganization online.
  • the type of reorganization distributes free space evenly, removes overflow, and clusters data.
  • the methods perform reorganization in place; i.e., they do not make a new copy of the data being reorganized.
  • the methods track the reorganization's movement of records across a user's position within a scan of data, and they correct the behavior of a user transaction to account for the movement.
  • the novelty provided by the present invention is online restoration of clustering that is both (1) in place (to obviate extra disk space for a copy of the data) and (2) supportive of high-throughput concurrent usage with reasonably accurate results.
  • FIG. 1 illustrates an exemplary computer hardware environment that could be used with the preferred embodiment of the present invention.
  • a computer system 100 is comprised of one or more processors 102 , each of which is connected to one or more data storage devices 104 .
  • the processor 102 executes database management system (DBMS) software 106 that manages one or more databases having one or more tables 108 and associated indices 110 stored on the data storage devices 104 .
  • DBMS database management system
  • the tables 108 and indices 110 may be comprised of a plurality of partitions, as for example in the table 108 labeled as TS#1 and the index 110 labeled as IN#1, wherein the partitions are labeled as P1, P2, P3.
  • This partitioning scheme which is well known in the art, allows parallelized access and retrieval functions among both the data storage devices 104 .
  • the tables 108 and indices 110 may be comprised of a single contiguous space, as for example in the table 108 labeled as TS#2 and the index 110 labeled as IN#2. In effect, this configuration comprises a single partition.
  • the processor 102 also executes reorganizer software 112 that reorganizes the tables 108 and associated indices 110 stored on the data storage devices 104 .
  • This reorganization comprises a logical and/or physical arrangement of the tables 108 and associated indices 110 .
  • the DBMS 106 , tables 108 , associated indices 110 , and reorganizer 112 comprise instructions and/or data that is embodied in or retrievable from a computer-readable device, medium, or carrier, e.g., a memory, a data storage device 104 , a remote device coupled to the computer system 100 by a data communications device, etc.
  • these instructions and/or data when read, executed, and/or interpreted by the computer system 100 , cause the computer system 100 to perform the steps necessary to implement and/or use the present invention.
  • the present invention may be implemented as a method, system, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof.
  • article of manufacture or alternatively, “computer program carrier”, as used herein is intended to encompass instructions and/or logic and/or data accessible from any computer-readable device, carrier, or media.
  • This section describes a set of DBMS 106 features (storage structures, types of concurrency control, and types of access), which come from System R [2] and DB2 [4, 3]; several other relational DBMS's use comparable features. This description applies to normal usage, i.e., without online reorganization; the types of access are later modified to accommodate online reorganization.
  • a row of a table 108 in a database is a logical unit within the table 108 .
  • a table 108 of employees includes a row for Jones.
  • a row contains columns of data, e.g., for name, job tide, and salary.
  • a column can have a fixed length (which does not change) or a variable length. The length of a variable-length column changes for each row according to the length of the data that users place in that column in that row.
  • the DBMS 106 implements a row by one data record, which is a lower-level (more physical) unit in storage. It will be explained shortly that sometimes the DBMS 106 implements a row by two data records. Users see rows, but do not directly see data records.
  • the DBMS 106 tracks the writing by appending corresponding entries to a collection of entries called the log. Later, it is possible to recover after an accidental loss of data by reloading from a backup copy of data and then applying (performing on the data) the log entries that the DBMS 106 appended after creation of the backup copy.
  • the log record sequence number (LRSN) of a log entry is a number that represents the entry's position in the log.
  • a table 108 space is a region of storage that stores the data records for one or more tables 108 .
  • table 108 space is a region of storage that stores the data records for one or more tables 108 .
  • table 108 space is a region of storage that stores the data records for one or more tables 108 .
  • table 108 space is a region of storage that stores the data records for one or more tables 108 .
  • table 108 space is a region of storage that stores the data records for one or more tables 108 .
  • table 108 space is a region of storage that stores the data records for one or more tables 108 .
  • the DBMS 106 divides a table 108 space into units called file pages.
  • FIG. 2 shows the structure 200 of file pages.
  • the header of each file page includes the LRSN of the most recently written log entry that corresponds to writing of that page.
  • a file page contains zero or more data records, which the DBMS 106 allocates at the beginning of the page (after the header). Deletion of records can cause gaps between remaining records.
  • the end of a file page contains an ID map, which is an array of pointers (offsets of data records within the page).
  • the term “slot” is used to mean the place (if any) to which an ID map entry points.
  • An ID map helps identify records.
  • DB2 and several other DBMS's that use the SQL [1] database language not every table 108 has a unique key (a set of columns that identifies rows). Therefore, file pages, entries in the log, and indices 110 (structures that speed access to records) cannot use a key for identification. Instead, they use a record's record identifier (RID), which contains the record's page number and the offset of the record's entry in the ID map. A record's RID can change only during reorganization.
  • RID record's record identifier
  • the data goes into another page. On an insertion, a new record goes into that other page. On growth by update of existing data, the data moves into a new overflow data record in the other page, and the existing data record in the original page becomes a pointer data record, which contains the RID of the new overflow data record.
  • the DBMS 106 sometimes implements a row by two records (a pointer record and an overflow record). Data records that do not involve overflow (hopefully, most data records) are regular data records.
  • file page 17 contains a regular data record and a pointer data record.
  • File page 22 contains a regular data record and an overflow data record.
  • the pointer record in page 17 contains the RID of the overflow record in page 22 .
  • the two bits in the header of each data record in FIG. 2 indicate whether the record is a pointer and whether it is an overflow, respectively.
  • a table 108 has zero or more indices 110 , each of which uses an associated key (set of columns). For example, an employee table 108 might have an index 110 whose key is the table's 108 department number, an index 110 whose key is social security number, and an index 110 whose key is the combination of last name and first name.
  • the DBMS 106 maintains the key values in sorted order. Defining a key to be unique means that no two rows can have the same values in the key columns.
  • the DBMS 106 divides the storage of an index 110 into units called index 110 pages.
  • FIG. 3 shows the structure 300 of index 110 pages, which the DBMS 106 arranges in a hierarchy.
  • the index 110 pages that are numbered 101 and 109 are leaves of the hierarchy, and index 110 page 137 is a nonleaf.
  • Each entry in a leaf page contains a key value and a list of RIDs whose records have that key value.
  • the DBMS 106 designer (or, in some DBMS's, the database designer) optionally specifies that the DBMS 106 will sort each list by RID.
  • Each entry in a nonleaf page points to another nonleaf page or a leaf page, although this simple figure shows no entries that point to other nonleaf pages.
  • Each entry in a nonleaf page also contains the value of the highest key of the page to which the entry points.
  • the first entry in index 110 page 137 contains 101 (the number of another index 110 page) and the highest key value of page 101 .
  • the second entry contains corresponding information for page 109 .
  • a possible alternative implementation is for each entry in a nonleaf page to contain the value of the lowest key of the page after the page to which the entry points.
  • the database designer declares at most one index 110 as a clustering index 110 .
  • the assignment of data records to file pages reflects the data records' order in the key of the clustering index 110 .
  • the records having the first few values of the key might reside (be stored) in one file page, the records having the next few values might reside in a second file page, etc. This clustering speeds some queries.
  • the database designer optionally declares the clustering index 110 to be a partitioning index 110 .
  • the DBMS 106 divides the table 108 space (and the clustering index 110 ) into partitions according to values of the indexed key.
  • the table 108 space is called a partitioned table 108 space. Partitions reside in separate files; a nonpartitioned table 108 space can reside in one file.
  • FIG. 4 shows an example of the clustering index 400 , the table space 402 , and a non-clustering index 404 (an index that is not the clustering index 400 ).
  • the key of the clustering index 400 is an employee's name
  • partitions might cover names starting with A through F, G through M, and N through Z, respectively, as in the figure.
  • the arrows represent RIDs in indices 400 and 404 .
  • the RIDs for each partition are contiguous.
  • the RIDs for a partition need not be contiguous.
  • the order of RIDs in the index 404 might be the RIDs for accountants (for all partitions), the RIDs for architects (for all partitions), the RIDs for artists (for all partitions), etc.
  • the RIDs for a partition are scattered throughout the index 404 , since some of them might be accountants, some might be architects, etc.
  • FIG. 4 illustrates lack of contiguity via two RIDs for the G-M partition with an intervening RID for the N-Z partition.
  • the RIDs might be sorted. Therefore, the RIDs for a partition might be contiguous within each key value, but they are not contiguous throughout the nonclustering index 404 .
  • the storage structures described above can degrade.
  • One type of degradation occurs when free space becomes unevenly distributed among the file pages of a table 108 space. After subsequent insertions, the order of some records no longer reflects the clustering index 110 . This type of degradation slows some queries.
  • a second type of degradation occurs when variable-length data grows too large to fit in its original file page.
  • the DBMS 106 then creates an overflow in another page and makes the original record a pointer.
  • Indices 110 still contain the RID of the original record. This causes an extra page reference and thus extra time on some queries.
  • Reorganization removes such structural degradation. Specifically, reorganization distributes free space evenly, removes overflows (so that each row uses just one record, not two), and clusters data. Reorganization can move records between file pages. Therefore, the page that contains the record for a row after reorganization might differ from the page or pages that contained a record or records for that row before reorganization.
  • Offline reorganization operates by (1) unloading (copying out) the data, (2) sorting the unloaded data by clustering key, and (3) reloading the data in sorted order. Users have read-only access (i.e., they can read but cannot write the data) during steps 1 and 2; they have no access during step 3.
  • a parameter of the command specifies the name of the table 108 space that the DBMS 106 should reorganize.
  • another parameter which is optional, specifies the partition to reorganize; absence of this parameter signifies reorganization of the entire table 108 space.
  • the term “area being reorganized” (often shortened to just “area”) is used to mean the table 108 space or partition on which reorganization operates.
  • a user can choose one of these levels of isolation:
  • a transaction holds read-locks (if any) on only one page (per cursor [1]) at a time, except when crossing a page boundary, and holds write-locks (if any) until commitment. This dramatically improves concurrency and thus throughput, but several reads of the same data during one transaction will yield different values if another user has changed that data. Also, even if a transaction is scanning an area of the database and reading each record once (e.g., to calculate the sum of all the deposits in a bank), other users' writing (e.g., transferring money from one bank account to another) can cause an inconsistent calculation. Many users choose cursor stability, but this choice requires tolerance of inconsistency or knowledge that the concurrent writing will not cause inconsistency.
  • This section describes three types of access to a single table 108 ; although the simplifications that sometimes apply are omitted.
  • an example query in SQL [1] is provided.
  • the keyword “WHERE” precedes a combination of one or more predicates.
  • a predicate is an expression whose instantiation returns “true,” “false,” or “unknown.”
  • Table 108 space scan An example query (where table 108 T has no index 110 on column C) is:
  • the DBMS 106 reads every page in the table 108 space. For each page of the table 108 , it scans the ID map to find the rows.
  • the DBMS 106 processes an overflow record only during the scan of the page containing its pointer.
  • Index 110 scan An example query (where T has an index 110 whose first column is C) is:
  • the ellipsis represents possible additional predicates in the query.
  • the DBMS 106 scans the relevant subset of the index 110 to select the RIDs whose key values satisfy the indexed part of the predicate on C. For each selected RID, it reads the row, evaluates the remainder (if any) of the WHERE clause for that row, and returns the row if it satisfies the clause.
  • An example query (where T has an index 110 on C1, an index 110 on C2, and an index 110 on C3) is:
  • the WHERE clause includes the conjunctions and/or disjunctions of n predicates, each of which the DBMS 106 can evaluate by just scanning an index 110 .
  • This type of access involves two lists of RIDs. One list (obtained from scanning an index 110 ) is called the recent RID list, and the other list (built incrementally by merging the recent RID list from each index 110 ) is called the composite RID list.
  • the DBMS 106 executes these phases:
  • a survey paper [13] describes issues and surveys work in online reorganization of databases. Most of the work describes types of reorganization other than restoration of clustering (which is the type that is performed by the present invention). Some of the work deals with restoration of clustering, but most of this work in clustering does not discuss the issue of movement of data across a user's position in a scan.
  • Omiecinski et al. [7, 8] describe an method for online reorganization in place. They use repeatable read with page-level locking.
  • the methods of the present invention allow, but do not require, repeatable read; they also allow cursor stability.
  • the typical throughput using cursor stability exceeds the typical throughput using repeatable read, and many customers demand the opportunity to use cursor stability.
  • cursor stability a user transaction can experience some inaccuracy due to concurrent user transactions, but customers could not tolerate the potentially massive amount of inaccuracy that could result from online reorganization in place without mechanisms to correct the inaccuracy.
  • the reorganization's tracking of movement of records and the subsequent correction of user transactions are mechanisms that correct the inaccuracy. They allow cursor stability (with reasonably accurate results) during online reorganization in place.
  • One method for online restoration of clustering in DB2's structures [5, 11, 10] operates by copying data from the original copy of the area being reorganized (which users access) into a shadow copy of that area (in reorganized form).
  • a survey paper [13] describes other methods for online reorganization by copying. Such an approach differs from our methods, which reorganize in place without making a shadow copy.
  • Advantages of reorganization by copying include (1) simplicity without sacrificing cursor stability and (2) reduced contention with users for access to the area being reorganized.
  • the term reorganizer 112 is used to denote a DBMS 106 process that executes in the background to sweep through a table 108 space (or a partition of a table 108 space) and reorganize it.
  • the methods involve any number of users and one reorganizer 112 .
  • the reorganizer 112 usually has the lowest priority in scheduling, but if the reorganizer's 112 locks have blocked a user, then the system should either preempt the reorganizer 112 or raise its allocation priority, to avoid a convoy of users waiting for the reorganizer 112 .
  • the order in which a user's transaction accesses data or index 110 entries is the order in the table 108 space or in the index 110 .
  • the user has reached some position in the order. Online reorganization can move records and thus change the order, and each record that moves between pages gets a new RID. Something that was ahead of the user's position can move behind, or something behind can move ahead. The methods described herein assure that if something moves, the user will access each appropriate piece of data exactly once (not twice and not zero times). Thus, the result of the transaction is consistent with the user's chosen level of isolation. This accommodation of the changed order is considered to be worthwhile, since the rate of movement by the reorganizer 112 might greatly exceed the rate of writing by regular transactions that execute concurrently with a cursor stability transaction.
  • FIG. 5 shows a state transition diagram.
  • the reorganizer 112 (while operating) allocates certain data to track movement of records across the user's position.
  • users can consume data from the database and can consume data that the reorganizer 112 has allocated, and the reorganizer 112 can deallocate its data after users have consumed it.
  • the amount of allocation increases with the reorganizer's 112 activity and decreases with users' activity.
  • the reorganizer 112 can switch between the operating state 500 and suspended state 502 for either of these reasons:
  • Reorganization suspends when a period of peak usage begins and resumes when the period ends. Another possibility is for the reorganizer 112 to pause between its steps.
  • reorganization suspends when its total allocation grows past a parameter-controlled size; it resumes when the allocation shrinks below another parameter-controlled size.
  • the storage allocator (or, in an alternative implementation, the reorganizer 112 ) tracks this allocation and changes the reorganizer's 112 state between the operating state 500 and suspended state 502 .
  • users give a higher priority to consuming data from the database, to take advantage of locality of reference.
  • reorganizer 112 suspension users give a higher priority to consuming the reorganizer's 112 data, to allow deallocation and thus availability of storage.
  • the use of two parameters (for suspension and resumption) instead of one reduces the frequency of state-switching and thus gives users longer periods of database consumption, when they can take advantage of locality of reference.
  • the methods of the present invention refer to a user's current position, whose definition depends on the user's type of access:
  • the position is the RID being processed.
  • the position is the key and the offset within the index's RID list for that key. If the key does not satisfy the predicate, the position is the end of the list for the previous satisfying key, if any.
  • phase 1 of a multiple index 110 access the position for the index 110 being scanned (and the recent RID list) is as in an index 110 scan; the position for already-scanned indices 110 (and the composite RID list) is the end of those indices 110 .
  • the position is the RID in the composite RID list.
  • the record moves forward from a position behind the user (in the already-scanned area) to a position ahead of the user (in the not-yet-scanned area).
  • the RID can move within the index 110 if the DBMS 106 sorts indices' 110 RID lists by RID. The same four terms are used for the directions of movement of a RID within an index 110 .
  • movement can be in any direction, depending on the user's position (RID).
  • movement can be in any direction, depending on the user's position in the index 110 .
  • movement is behind the user or ahead of the user, respectively.
  • phase 1 of a multiple index 110 access for the index 110 being scanned (and the recent RID list), movement is as in an index 110 scan.
  • movement can be in any direction, depending on the user's position (RID) in the composite RID list.
  • the reorganizer 112 For each user, the reorganizer 112 maintains movement lists to track the reorganizer's 112 movement of records that the user might access.
  • the type of access determines the types of movement lists (forward and backward ordered lists or differential lists).
  • the lists' structures and operations are explained below; Section 5 discusses how to use the lists, thus clarifying the reasons for the structures and operations.
  • FIG. 6 shows the structure of a forward ordered list (FOL) 600 , which tracks records that the reorganizer 112 moves forward; the user has already processed them.
  • FOL 600 which is a doubly linked list, ordered by RID and indexed.
  • Each entry contains the new RID of a moved record.
  • the logical head points to the first entry that is not logically deleted; the physical head might point to a logically deleted entry.
  • RID is in a FOL 600
  • the system tries to arrange for the beginning part of the FOL 600 (which probably matches the user's current file page) to be in main storage; the rest of the FOL 600 might be paged out.
  • a backward ordered list tracks records that the reorganizer 112 moves backward; the user has not yet processed them. Each user has at most one BOL. Its structure and operations are like those of a FOL 600 , as illustrated in FIG. 6.
  • a user transaction scans a list of RIDs (in a table 108 space, an index 110 , or a RID list) and processes each record whose RID it finds.
  • the transaction omits redundant processing of records whose RIDs appear in the FOL 600 (since it has already processed them). It also eventually scans the BOL and processes the records whose RIDs appear there (since its scan of data would never find those records).
  • FIG. 7 shows a state transition diagram for each record that a user will eventually process, including the possible states 700 , 702 , 704 of a record.
  • a differential list tracks records that the reorganizer 112 moves forward, backward, or behind the user.
  • DLs 800 are used for users performing phase 1, 2, or 3 of a multiple index 110 access.
  • a user transaction's RID list can become invalid if, for example, the reorganizer 112 moves a record from a slot whose RID is in the RID list (thus changing the record's RID). Applying a differential list to a RID list will correct such invalidity in the RID list.
  • FIG. 8 illustrates the structure of a DL 800 , an entry of which contains:
  • a flag shown in the figure as “O” or “N” to tell whether the RID is an old RID (the source of a movement) or a new RID (the target)
  • O An old entry (without a paired new entry) contains the old RID of a record that the reorganizer 112 moves forward. See the first entry in the figure.
  • N A new entry (without a paired old entry) contains the new RID of a record that the reorganizer 112 moves backward. See the third entry in the figure.
  • O-N—Two paired entries together contain the old and new RIDs of a record that the reorganizer 112 moves behind the user. See the second and fourth entries in the figure.
  • a DL 800 is ordered by RID, with old and new entries interleaved.
  • a user transaction corrects a RID list by making a local copy of the corresponding DL 800 and then inserting new entries and deleting old entries in the RID list. Since a user extracts RIDs from only the part of an index 110 that satisfies an index 110 predicate, forward or backward movement can occur only for a record that satisfies the predicate.
  • an unpaired entry (either old or new) in a DL 800 represents a record that satisfies the predicate.
  • movement behind the user can occur for a satisfying or an unsatisfying record. Therefore, a pair of entries represents a satisfying record only if the user's RID list includes the old RID. This fact requires the pairing; a user transaction should insert a paired new RID into a RID list only if the RID list already contains the paired old RID (which it should delete).
  • any RID e.g., the RID for the slot of the source of the move or the RID for the target
  • the RID for the target is in any of several possible states (with respect to a DL 800 ).
  • a diagram shows the transitions among a RID's states. Each state has a state number.
  • a chart shows the post-move states for each possible pair of (1) pre-move state of the move's source and (2) pre-move state of the move's target.
  • the chart refers to the state numbers.
  • Each operation on a DL 800 has a detailed description, which includes justifying comments and refers (when appropriate) to the chart's rows and columns.
  • a RID can be in any of several states, depending on the entries (if any) for that RID in that DL 800 , whether the RID is behind or ahead of the user's position in the corresponding index 110 , and whether the slot for that RID contains a record.
  • FIG. 9 shows a diagram 900 for the states of a RID.
  • the rectangle for a state shows the state number and the four components of a state:
  • the possible initial states are 1 and 2 (if the DL 800 is for the index 110 that the user is now scanning or will scan) or 5 and 6 (otherwise).
  • States 3-11 are the possible final states (when the user performs the operation that copies the DL 800 into a local copy); these states (i.e., all except 1 and 2) are also the states in which the RID is behind the user's position. Since the user locks an index 110 entry while reading it, the reorganizer 112 cannot concurrently move that entry's record (and thus change its RID's status in a DL 800 ), so the diagram omits states in which the user s position is exactly at the RID. The diagram contains only the states that are reachable from the initial states; unreachable states are ignored, of course. For each transition between states, the diagram shows an arrow, with text that describes the circumstances under which the transition occurs.
  • FIG. 10 shows an associated chart 1000 for the states of a pair of RIDs (source and target).
  • the columns in FIG. 10 represent the possible pre-move states of the target of the move (the states in which the RID's slot does not yet contain a record).
  • the center of the right side of each column header shows whether the RID has an old (O) entry or no entry.
  • the state number appears below this indication. Any paired entry (for another RID) appears above the indication; the arrow goes from O to N (the direction of movement).
  • the RID has an O entry, which is paired with a different RID's N entry.
  • the RID is ahead of the user, and in states 5, 7, and 8, the RID is behind the user.
  • the rows represent the possible pre-move states of the source of the move (the states in which the RID's slot already contains a record).
  • the center of the bottom of each row header shows whether the RID has an old (O) entry, a new (N) entry, both, or neither. If the RID has both, the N is above the O.
  • the state number appears to the right of this indication. Any paired entries (for other RIDs) appear to the left of the indication. In state 10, the O and N at the left are not for the same RID. In state 2, the RID is ahead of the user, and in states 3, 4, 6, 9, 10, and 11, the RID is behind the user.
  • each intersection cell the right side shows the post-move state of the target, and the bottom shows the post-move state of the source.
  • a record has moved from a source slot (whose RID's state, at the bottom, has changed from 6 to 7) into a target slot (whose RID's state, at the right, has changed from 8 to 11).
  • Some intersections in column 1 contain two cells (two pairs of post-move states), since the states depend on whether the RID whose O is paired with the N in the source also had an unpaired N.
  • some intersections in column 7 contain two cells, since the states depend on whether a record is returning to its original RID.
  • the chart does not explicitly show states of RIDs whose entries are paired with the source or target of the move. Also, the chart does not model the user's crossing of a RID during a scan. Therefore, the chart omits some of the state transitions of FIG. 9. For almost every transition of FIG. 9, several intersections of the chart show the transition.
  • Row state 2 column state 1: the only case of movement ahead of the user
  • Row state 2 column state 5: one case of movement backward
  • Row state 6 column state 1: one case of movement forward
  • Row state 6 column state 5: one case of movement behind the user.
  • the operations on a DL 800 modify the DL's 800 order and index 802 when necessary.
  • the operations that reflect a record's movement refer (in parentheses) to state numbers in the chart's 1000 rows and columns and include justifying comments. There is no operation to change the DL 800 to reflect movement ahead of the user (row 2, column 1).
  • the DL 800 contains an unpaired N entry for the same RID as the recently unpaired O entry (left part of column 1), delete both the N and O entries for that RID.
  • Reason The user's RID list will already correctly include that RID, since that RID's original record and current record both satisfy the predicate. This additional deletion is omitted if the DL 800 contains a paired N entry for the same RID as the recently unpaired O entry (right part of column 1), since the current record at that RID might not satisfy the predicate.
  • the DL 800 has a paired O entry for the target RID (column 7) or no O entry for the target RID (column 5), insert an unpaired N entry for the target RID.
  • Reason Movement backward and then behind into a RID that was already the source of a movement behind (i.e., a RID with a paired O entry) resembles the latter movement behind followed by the movement backward. Movement backward and then behind into a RID that had no movement (i.e., a RID with no O entry) resembles one movement backward into that RID.
  • the reorganizer 112 scans the clustering index 110 , and the reorganizer's 112 current position is defined as the current key and the offset within the index's RID list for that key.
  • Section 4.3 contains a similar definition of the position of a user performing an index 110 scan.
  • the description of the reorganizer's 112 method will apply comparison operators (e.g., “>”) to positions; such a comparison involves comparing the keys, and, if the keys are identical, comparing the offsets. Incrementing the reorganizer's 112 position is defined as incrementing the offset, and, if there is no next RID, moving to the next key (if any).
  • FIG. 11 is a flowchart of the main loop used by the reorganizer 112 .
  • Block 1100 is a decision block that represents the main loop.
  • Block 1102 is a decision block that determines whether there is any garbage collection ready to be performed. If so, control transfers to Block 1104 , which performs the garbage collection; otherwise, control transfers to Block 1106 . Note that, if the reorganizer 112 is operating, a possible alternative (instead of checking for garbage collection in every iteration of the main loop) is checking for garbage collection once in every few iterations of the main loop.
  • Block 1106 is a decision block that determines whether the state of the reorganizer 112 is “suspended”. If so, control transfers to Block 1108 , which causes the loop to “sleep” for a parameter-controlled period before returning to Block 1100 ; otherwise, control transfers to Block 1110 .
  • Block 1110 is a decision block that determines whether the state of the reorganizer 112 is “finished”. If so, control transfers to Block 1112 ; otherwise, control transfers to Block 1114 .
  • Block 1112 is a decision block that determines whether any movement lists exist. If so, control transfers to Block 1108 ; otherwise, the loop is exited.
  • Block 1114 is a decision block that determines whether the state of the reorganizer 112 is “operating”. If not, control transfers to Block 1100 ; otherwise, control transfers to Block 1116 .
  • Block 1116 represents a shared lock being applied to the end of the clustering index for the table being reorganized.
  • Block 1118 represents a variable L being assigned the position in the clustering index of the logically last record in the index.
  • Block 1120 represents the unlocking of the end of the clustering index for the table being reorganized.
  • Block 1122 is a decision block that determines whether R>L, where R is the position of the reorganizer 112 ; initially, this is at the beginning of the clustering index. If so, control transfers to Block 1124 ; otherwise, control transfers to Block 1126 .
  • Block 1124 represents the state of the reorganizer 112 being set to “Finished.” Thereafter, control transfers to Block 1100 .
  • Block 1126 represents D_P (Desired Page for R) being calculated, which is based on the size of R, page size, size of earlier records, desired frequency of free pages, desired percent of free space per page, etc.
  • D_P Desired Page for R
  • Block 1128 is a decision block that determines whether D_P is the current page for R. If so, control transfers to Block 1130 ; otherwise, control transfers to Block 1132 .
  • Block 1130 represents R being incremented. Thereafter, control transfers to Block 1100 .
  • Block 1132 represents the procedure CLUSTER_RECORD being called with the RID of R and D_P as its parameters. Thereafter, control transfers to Block 1130 .
  • FIG. 12 is a flowchart of the CLUSTER_RECORD procedure, which has as its parameters, R and D_P. This procedure clusters record R, preferably on desired page D_P.
  • the loop in this procedure initially tries the desired page D_P. For each tried page T_P, it tests for enough space, and if there is not enough space, the next iteration tries page T_P+1. Instead of iterating to the next page, the method could attempt to find a “nearby” page. For example, record R could be put on a page that contains a record with the same key, or if the record has a unique key, a record with a neighboring key. Another alternative is to put all such records at the end of the table space.
  • Block 1202 represents an exclusive lock being applied to T_P.
  • Block 1204 is a decision block that determines whether the largest available space on T_P is greater than or equal to the size of parameter R. If so, control transfers to Block 1228 ; otherwise, control transfers to Block 1206 .
  • Block 1206 represents T_P being compacted to make free space contiguous.
  • Block 1208 is a decision block that determines whether the available space on T_P is greater than or equal to the size of parameter R. If so, control transfers to Block 1228 ; otherwise, control transfers to Block 1210 .
  • Block 1210 is a decision block that determines whether the available space on T_P plus the total space of non-overflow records on T_P that logically follow R is greater than or equal to the size of parameter R. If not, control transfers to Block 1212 ; otherwise, control transfers to Block 1214 .
  • Block 1212 represents T_P being unlocked. This Block is reached when R cannot be put on T_P, and thus an iteration must occur.
  • Block 1214 is a decision block that represents a loop that iterates while available space on T_P is less than the size of parameter R. Upon completion of this loop, control transfers to Block 1224 . This Block is reached when removing records from this page will succeed in providing enough space for R.
  • Block 1216 represents finding a non-overflow record A_R to be removed on T_P.
  • Block 1218 represents finding and applying an exclusive lock on another page A_P with space for record A_R.
  • Block 1220 represents the procedure MOVE_RECORD being called with the parameters A_R and A_P.
  • Block 1222 represents A_P being unlocked. Thereafter, control transfers to Block 1214 .
  • Block 1224 is a decision block that determines whether the largest available space on T_P is less than the size of R. If so, control transfers to Block 1226 ; otherwise, control transfers to Block 1228 .
  • Block 1226 represents T_P being compacted. Thereafter, control transfers to Block 1228 .
  • Block 1228 represents an exclusive lock being applied to the page of R.
  • Block 1230 represents the procedure MOVE_RECORD being called with the parameters R and T_P.
  • Block 1232 represents T_P and the page of R being unlocked.
  • FIGS. 13A and 13B together are a flowchart of the MOVE_RECORD procedure, which has as its parameters, R and P. This procedure moves record R to page P.
  • Block 1300 represents the variable NEW_RID being assigned the first available offset in page P's ID map.
  • Block 1302 is a decision block that represents a loop for each user. Upon completion of the loop, control transfers to FIG. 13B via “A”.
  • Block 1304 is a decision block that determines whether the user's type of access comprises a table space scan. If so, control transfers to Block 1306 ; otherwise, control transfers to Block 1310 .
  • Block 1306 is a decision block that determines whether cursor stability is in effect. If so, control transfers to Block 1308 ; otherwise, control transfers to Block 1302 .
  • Block 1308 represents the procedure MODIFY_FOL_AND_BOL being called with the parameters USER_ID, R, and NEW_RID. Thereafter, control transfers to Block 1302 .
  • Block 1312 is a decision block that determines whether cursor stability is in effect. If so, control transfers to Block 1314 ; otherwise, control transfers to Block 1302 .
  • Block 1314 represents the procedure MODIFY_FOL_AND_BOL being called with the parameters USER_ID, R, and NEW_RID. Thereafter, control transfers to Block 1302 .
  • Block 1316 is a decision block that determines whether the user's type of access comprises Phase 1, 2, or 3 of a multiple index access. If so, control transfers to Block 1318 ; otherwise, control transfers to Block 1322 .
  • Block 1318 is a decision block that determines whether cursor stability is in effect. If so, control transfers to Block 1320 ; otherwise, control transfers to Block 1302 .
  • Block 1320 represents the procedure MODIFY_DL being called with the parameters USER_ID, R, and NEW_RID. Thereafter, control transfers to Block 1302 .
  • Block 1322 is a decision block that determines whether the user's type of access comprises Phase 4 of a multiple index access. If so, control transfers to Block 1324 ; otherwise, control transfers to Block 1302 .
  • Block 1324 is a decision block that determines whether cursor stability is in effect. If so, control transfers to Block 1326 ; otherwise, control transfers to Block 1302 .
  • Block 1326 represents the procedure MODIFY_FOL_AND_BOL being called with the parameters USER_ID, R, and NEW_RID. Thereafter, control transfers to Block 1302 .
  • Block 1328 represents finding all the index entries that reference R.
  • Block 1330 represents finding all the referential integrity structures that reference R.
  • Block 1332 represents moving R to P, modifying the index entries, and (if necessary) modifying the referential integrity structures. If R is a pointer record, the data is moved from the overflow record to a new, regular data record. If deletion and insertion ordinarily trigger certain actions, e.g., deletion via referential integrity, those actions should not occur for record movement.
  • repeatable read prevents movement of a record whose RID the user has already read, so the system omits movement lists.
  • the procedure is presented under the assumption that the DBMS 106 sorts index 110 leaf entries by RID. If the DBMS 106 does not sort them, then the system need not maintain a movement list for an index 110 scan, since it never moves a record across the user's position.
  • FIG. 14 is a flowchart of the MODIFY_FOL_AND_BOL procedure, which has as its parameters, USER_ID, OLD_RID, and NEW_RID. This procedure modifies the FOL 600 and BOL 600 for one user. If the user is doing a table space scan, the reorganizer 112 performs FOL/BOL processing for data records; for an index scan, the reorganizer 112 performs FOL/BOL processing for index entries; and for Phase 4 of a multiple index access, the reorganizer 112 performs FOL/BOL processing for RID list entries.
  • Block 1400 represents the FOL 600 and BOL 600 being latched.
  • Block 1402 is a decision block that determines whether the record movement is forward of the user's position. If so, control transfers to Block 1404 ; otherwise, control transfers to Block 1410 .
  • Block 1404 is a decision block that determines whether the old RID is in the BOL 600 . If so, control transfers to Block 1406 , which deletes the old RID from the BOL 600 ; otherwise, control transfers to Block 1408 , which inserts the new RID into the FOL. Thereafter, control transfers to Block 1430 .
  • Block 1410 is a decision block that determines whether the record movement is backward from the user's position. If so, control transfers to Block 1412 ; otherwise, control transfers to Block 1418 .
  • Block 1412 is a decision block that determines whether the old RID is in the FOL 600 . If so, control transfers to Block 1414 , which deletes the old RID from the FOL 600 ; otherwise, control transfers to Block 1416 , which inserts the new RID into the BOL 600 . Thereafter, control transfers to Block 1430 .
  • Block 1418 is a decision block that determines whether the record movement stays ahead of the user's position. If so, control transfers to Block 1420 ; otherwise, control transfers to Block 1424 .
  • Block 1420 is a decision block that determines whether the old RID is in the FOL 600 . If so, control transfers to Block 1422 , which changes the old RID to the new RID in the FOL 600 . Thereafter, control transfers to Block 1430 .
  • Block 1424 is a decision block that determines whether the record movement stays behind the user's position. If so, control transfers to Block 1426 ; otherwise, control transfers to Block 1430 .
  • Block 1426 is a decision block that determines whether the old RID is in the BOL 600 . If so, control transfers to Block 1428 , which changes the old RID to the new RID in the BOL 600 . Thereafter, control transfers to Block 1430 .
  • Block 1430 unlatches the FOL 600 and BOL 600 .
  • FIG. 15 is a flowchart of the MODIFY_DL procedure, which has as its parameters, USER_ID, OLD_RID, and NEW_RID. This procedure modifies the DL 800 for one user. If the user is doing Phase 1 of a multiple index access, the reorganizer 112 performs DL 800 processing for the index being scanned and the recent RID list; any movement is possible. If the user is doing Phase 2 or 3 of a multiple index access, the reorganizer 112 performs DL 800 processing for the index being scanned and the recent RID list; all movement is behind the user. For Phases 1, 2, or 3 of a multiple index access, the reorganizer 112 performs DL 800 processing for already-scanned indices and the composite RID list; all movement is behind the user.
  • Block 1500 represents the recent DL 800 being latched.
  • Block 1502 is a decision block that determines whether the record movement is forward from the user's position in the index for the recent RID list. If so, control transfers to Block 1504 ; otherwise, control transfers to Block 1506 .
  • Block 1504 represents the recent DL 800 being changed to reflect the forward movement.
  • Block 1506 is a decision block that determines whether the record movement is backward from the user's position in the index for the recent RID list. If so, control transfers to Block 1508 ; otherwise, control transfers to Block 1510 .
  • Block 1508 represents the recent DL 800 being changed to reflect the backward movement.
  • Block 1510 is a decision block that determines whether the record stays ahead of the user's position in the index for the recent RID list. If so, control transfers to Block 1516 ; otherwise, control transfers to Block 1512 .
  • Block 1512 is a decision block that determines whether the record stays behind the user's position in the index for the recent RID list. If so, control transfers to Block 1514 ; otherwise, control transfers to Block 1516 .
  • Block 1514 represents the recent DL 800 being changed to reflect the movement behind.
  • Block 1516 represents the recent DL 800 being unlatched.
  • Block 1518 represents the composite DL 800 being latched.
  • Block 1520 represents the composite DL 800 being changed to reflect movement behind. This composite DL 800 was produced by earlier merges. The record always stays behind the user, who is already at the end of all indices that the RID list represents.
  • Block 1522 represents the composite DL 800 being unlatched.
  • FIG. 16 is a flowchart of the logic used by a user performing a table space scan or index scan.
  • Block 1600 represents the FOL 600 and BOL 600 structures being created.
  • Block 1602 is a decision block that represents a loop for all records.
  • Block 1604 represents the logical head of the BOL 600 being latched.
  • Block 1606 represents the variable H being assigned the RID of the logical head of the BOL 600 .
  • Block 1608 represents the logical head of the BOL 600 being unlatched.
  • Block 1610 represents the state of the reorganizer 112 being latched.
  • Block 1612 represents the variable S being assigned the state of the reorganizer 112 .
  • Block 1614 represents the state of the reorganizer 112 being unlatched.
  • Block 1616 is a decision block that determines whether POS (the position in the scan) has reached the end of the scan. If so, control transfers to Block 1618 ; otherwise, control transfers to Block 1626 .
  • Block 1618 is a decision block that determines whether H is null. If so, control transfers to Block 1620 , which logically deletes the FOL 600 and BOL 600 , and then to Block 1622 where the procedure exits; otherwise, control transfers to Block 1624 , which calls the procedure PROCESS_BOL_ENTRY, and then returns to Block 1602 .
  • Block 1626 is a decision block that determines whether H is null. If so, control transfers to Block 1628 , which calls the procedure SCAN, and then returns to Block 1602 ; otherwise, control transfers to Block 1630 .
  • Block 1630 is a decision block that determines whether S is “Operating”. If so, control transfers to Block 1628 , which calls the procedure SCAN, and then returns to Block 1602 ; otherwise, control transfers to Block 1632 , which calls the procedure PROCESS_BOL_ENTRY, and then returns to Block 1602 .
  • FIG. 17 is a flowchart that describes the logic of the PROCESS_BOL_ENTRY procedure.
  • Block 1700 represents the shared lock being applied to the record that the logical head entry indicates.
  • Block 1702 represents the first entry in the BOL 600 being logically deleted.
  • Block 1704 represents that record being processed, e.g., evaluating the WHERE clause and returning the row if appropriate.
  • Block 1706 represents all locks being unlocked, i.e., if cursor stability is in effect.
  • Block 1708 represents an exit from the procedure.
  • FIG. 18 is a flowchart that describes the logic of the SCAN procedure.
  • Block 1800 represents the POS (position in the scan) being incremented.
  • Block 1802 represents a shared lock being applied to the current record.
  • Block 1806 logically deletes the first entry in the FOL 600 , e.g., because it moved forward. Thereafter, control transfers to Block 1810 .
  • Block 1808 represents that record being processed, e.g., evaluating the WHERE clause and returning the tow if appropriate.
  • Block 1810 represents all locks being unlocked, i.e., if cursor stability is in effect.
  • Block 1812 represents an exit from the procedure.
  • Block 1900 represents the two DLs 800 being latched.
  • Block 1902 is a decision block that represents a loop which tests for emptiness, i.e., it iterates while either DL 800 contains anything. Upon completion of the loop, control transfers to Block 1936 via “C”.
  • Block 1906 represents the DL 800 being copied into a local copy and the DL 800 being logically deleted.
  • Block 1908 represents the creation of an empty new DL 800 .
  • Block 1910 represents the two DLs 800 being unlatched.
  • Block 1914 is a decision block that represents a loop through the entries in the local copy of the DL 800 that corresponds to the RID list indicated by J.
  • Block 1916 is a decision block that determines if an entry in the copy of the DL 800 is “old”. If so, control transfers to Block 1918 ; otherwise, control transfers to Block 1914 .
  • Block 1918 is a decision block that determines if an entry's RID does not exist in the RID list. If so, control transfers to Block 1920 ; otherwise, control transfers to Block 1914 .
  • Block 1920 represents nullifying the entry and its paired entry (if any). Thereafter, control transfers to Block 1914 .
  • Block 1924 is a decision block that represents a loop through the entries in the local copy of the DL 800 that corresponds to the RID list indicated by J. Upon completion of the loop, control transfers to Block 1922 .
  • Block 1926 is a decision block that determines if the local copy of the DL 800 contains an “old” entry for an RID but no “new” entry for that RID. If so, control transfers to Block 1928 ; otherwise, control transfers to Block 1930 .
  • Block 1928 represents deleting the RID from the RID list. Thereafter, control transfers to Block 1924 .
  • Block 1930 is a decision block that determines if the local copy of the DL 800 contains a “new” entry for an RID but no “old” entry for that RID. If so, control transfers to Block 1932 ; otherwise, control transfers to Block 1924 .
  • Block 1932 represents inserting the RID into the RID list (maintaining sorted order). Thereafter, control transfers to Block 1924 .
  • Block 1934 represents the two DLs 800 being latched. Control then transfers to Block 1902 via “D”.
  • Block 1936 indicates that the user transaction is moving to the next phase.
  • Block 1938 is a decision block that determines if the user transaction is moving from phase 3 to phase 4. If so, control transfers to Block 1940 ; otherwise, control transfers to Block 1944 .
  • Block 1940 represents logically deleting DL 800 structures.
  • Block 1942 represents creating the FOL 600 and BOL 600 structures.
  • Block 1944 represents the DLs 800 being unlatched. Thereafter, the procedure exists at 1946 .
  • Logging and recovery could be added (for both the reorganizer 112 and the users) during online reorganization.
  • a new type of log record can simply be added, which means that a record has been moved from the old RID to the new.
  • Movement lists could also be stored in artificial data pages; i.e., not just in main storage. This implies that a transaction that is ordinarily considered read-only (because it just does a SELECT) is not really read-only, since it writes into the movement lists.
  • the reorganizer 112 should abort instead of waiting when the reorganizer's 112 lock request conflicts with a user's lock.
  • a warning bit could be maintained to indicate that the FOL 600 contains an entry for the current page.
  • Any database management system can need some type of reorganization.
  • a solution is to reorganize online. Methods have been described for performing a certain type of reorganization online. The reorganization restores clustering and removes overflows.
  • the methods include synchronization, and they use movement lists to track the reorganization's movement of records across a user's position within a scan of data.
  • the reorganization's tracking of movement of records and the subsequent correction of user transactions are mechanisms that correct inaccuracy in user transactions and thus allow cursor stability during online reorganization in place.
  • the novelty is in allowing high throughput concurrent usage during in-place online restoration of clustering. As the amount of information and dependence on computers both grow, the number of very large or highly available databases will grow. Therefore, the importance of online reorganization will grow.
  • the present invention discloses a method, system, and article of manufacture for in-place reorganization of a database that allows high-throughput concurrent usage by users of the database.
  • the reorganization's movement of records are tracked across a user transaction's position within a scan of the database.
  • the behavior of the user transaction is corrected to account for the movement of the records.

Abstract

An in-place reorganization of a database achieves reasonably accurate results for users during high-throughput concurrent usage of the database. The reorganization's movement of records across a user transaction's position within a scan of the database is tracked. The behavior of the user transaction is corrected to account for the movement of the records.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • This application is a Continuation of application Ser. No. 09/220,084, filed Dec. 23, 1998, entitled “METHODS FOR IN-PLACE ONLINE REORGANIZATION OF A DATABASE,” which application is incorporated herein by reference.[0001]
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention [0002]
  • This invention relates in general to database management systems performed by computers, and in particular, to a method of providing in-place reorganization of a database. [0003]
  • 2. Description of Related Art. [0004]
  • (Note: This application references a number of different publications as indicated throughout the specification by reference numbers enclosed in brackets, e.g., [x]. A list of these different publications ordered according to these reference numbers can be found in [0005] Section 8 of the “Detailed Description of the Preferred Embodiment.” Each of these publications is incorporated by reference herein.)
  • Any database management system (DBMS) can need some type of reorganization. Reorganization of a database is defined as changing some aspect of the logical and/or physical arrangement of the database. A tutorial paper referenced in [12] discusses issues in reorganization and types of reorganization. This specification describes the problem in reorganizing offline, and the need for online reorganization. (See, e.g., [11]). [0006]
  • The type of reorganization described herein involves restoration of clustering. Clustering is the practice of storing records near each other if they meet certain criteria. One popular criterion is consecutive values in a column of the records. Clustering should reduce disk input/output for records that users often access together. When users write data into the database, this writing can decrease the amount of clustering and thus degrade performance. [0007]
  • Reorganization can restore clustering and performance. During most types of reorganization in a typical database, the area being reorganized is offline or only partially available; users cannot write (and perhaps cannot even read) data in that area. However, a highly available database (a database that is to be fully available 24 hours per day, 7 days per week) should not go offline for significant periods, of course. Applications that require high availability include reservations, finance (especially global finance), process control, hospitals, police, armed forces, and Internet service. [0008]
  • Even for less essential applications, many database administrators prefer 24-hour availability. The maximum tolerable period of unavailability is specific to the application. When queried, DBMS customers (not all of whom have highly available databases) state that the maximum tolerable period ranges from 0 to 5 hours. Even without such a preference for 24-hour availability, reorganizing a very large database might require much longer than the maximum tolerable period of unavailability. [0009]
  • As examples of very large databases, a survey paper [6] mentions a database with several terabytes of data and the desire for one with petabytes. The author of one book [14] considers offline reorganization such an important problem for very large databases that he defines a very large database as one “whose reorganization by reloading takes a longer time than the users can afford to have the database unavailable.” These considerations call for the ability to reorganize the database online (concurrently with usage or incrementally within users' transactions), so that users can read and write the database during most or all phases of reorganization. [0010]
  • In the context of papers that do not concentrate on online reorganization, many people have stated the need for the ability to reorganize online. As the amount of information and dependence on computers both grow, the number of very large or highly available databases will grow. Therefore, the importance of online reorganization will grow. [0011]
  • The present invention provides methods for in-place online reorganization (specifically, for restoration of clustering). The data structures are those of IBM's DBMS Database 2 (DB2) for OS/390 [4], but the concepts in the methods presented herein should apply to many DBMS's. The methods perform reorganization in place; i.e., they do not make a new copy of the data being reorganized. To allow high-throughput concurrent usage by users of the database, the methods track the reorganization's movement of records across a user's position within a scan of data, and they correct the behavior of a user transaction to account for the movement. [0012]
  • This specification describes relevant features of a DBMS, discusses the advantages of the present invention over previous research (including the novelty of the methods), presents the concepts in the methods, describes the methods in more detail, and proposes extensions based on the methods. [0013]
  • SUMMARY OF THE INVENTION
  • To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method, system, and article of manufacture for providing in-place reorganization of a database that achieves reasonably accurate results for users during high-throughput concurrent usage of the database. The reorganization's movement of records across a user transaction's position within a scan of the database is tracked. The behavior of the user transaction is corrected to account for the movement of the records.[0014]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • Referring now to the drawings in which like reference numbers represent corresponding parts throughout: [0015]
  • FIG. 1 illustrates an exemplary computer hardware environment that could be used with the preferred embodiment of the present invention; [0016]
  • FIG. 2 shows the structure of file pages according to the preferred embodiment of the present invention; [0017]
  • FIG. 3 shows the structure of index pages according to the preferred embodiment of the present invention; [0018]
  • FIG. 4 shows an example of a clustering index, a table space, and a non-clustering index according to the preferred embodiment of the present invention; [0019]
  • FIG. 5 shows a state transition diagram for the reorganizer according to the preferred embodiment of the present invention; [0020]
  • FIG. 6 shows the structure of a forward ordered list (FOL) according to the preferred embodiment of the present invention; [0021]
  • FIG. 7 shows a state transition diagram for each record that a user will eventually process according to the preferred embodiment of the present invention; [0022]
  • FIG. 8 illustrates the structure of a differential list according to the preferred embodiment of the present invention; [0023]
  • FIG. 9 shows a diagram for the states of a record ID (RID) according to the preferred embodiment of the present invention; [0024]
  • FIG. 10 shows an associated chart for the states of a pair of record IDs (RIDs) (source and target) according to the preferred embodiment of the present invention; [0025]
  • FIG. 11 is a flowchart if the main loop used by the reorganizer according to the preferred embodiment of the present invention; [0026]
  • FIG. 12 is a flowchart of the CLUSTER_RECORD procedure according to the preferred embodiment of the present invention; [0027]
  • FIGS. 13A and 13B together are a flowchart of the MOVE_RECORD procedure according to the preferred embodiment of the present invention; [0028]
  • FIG. 14 is a flowchart of the MODIFY_FOL_AND_BOL procedure according to the preferred embodiment of the present invention; [0029]
  • FIG. 15 is a flowchart of the MODIFY_DL procedure according to the preferred embodiment of the present invention; [0030]
  • FIG. 16 is a flowchart of the logic used by a user performing a table space scan or index scan according to the preferred embodiment of the present invention; [0031]
  • FIG. 17 is a flowchart that describes the logic of the PROCESS_BOL_ENTRY procedure according to the preferred embodiment of the present invention; [0032]
  • FIG. 18 is a flowchart that describes the logic of the SCAN procedure according to the preferred embodiment of the present invention; and [0033]
  • FIGS. [0034] 19A-19D together are a flowchart that describes the logic of the CORRECT_RID_LISTS procedure according to the preferred embodiment of the present invention.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
  • In the following description of the preferred embodiment, reference is made to the accompanying drawings which form a part hereof, and in which is shown by way of illustration a specific embodiment in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural and functional changes may be made without departing from the scope of the present invention. [0035]
  • 1.0 Introduction [0036]
  • The present invention describes methods for performing one type of reorganization online. The type of reorganization distributes free space evenly, removes overflow, and clusters data. The methods perform reorganization in place; i.e., they do not make a new copy of the data being reorganized. To allow high-throughput concurrent usage by users of the database, the methods track the reorganization's movement of records across a user's position within a scan of data, and they correct the behavior of a user transaction to account for the movement. The novelty provided by the present invention is online restoration of clustering that is both (1) in place (to obviate extra disk space for a copy of the data) and (2) supportive of high-throughput concurrent usage with reasonably accurate results. [0037]
  • 1.1 Hardware Environment [0038]
  • FIG. 1 illustrates an exemplary computer hardware environment that could be used with the preferred embodiment of the present invention. In the exemplary environment, a [0039] computer system 100 is comprised of one or more processors 102, each of which is connected to one or more data storage devices 104. The processor 102 executes database management system (DBMS) software 106 that manages one or more databases having one or more tables 108 and associated indices 110 stored on the data storage devices 104.
  • The tables [0040] 108 and indices 110 may be comprised of a plurality of partitions, as for example in the table 108 labeled as TS#1 and the index 110 labeled as IN#1, wherein the partitions are labeled as P1, P2, P3. This partitioning scheme, which is well known in the art, allows parallelized access and retrieval functions among both the data storage devices 104.
  • Alternatively, the tables [0041] 108 and indices 110 may be comprised of a single contiguous space, as for example in the table 108 labeled as TS#2 and the index 110 labeled as IN#2. In effect, this configuration comprises a single partition.
  • The [0042] processor 102 also executes reorganizer software 112 that reorganizes the tables 108 and associated indices 110 stored on the data storage devices 104. This reorganization comprises a logical and/or physical arrangement of the tables 108 and associated indices 110.
  • Generally, the [0043] DBMS 106, tables 108, associated indices 110, and reorganizer 112 comprise instructions and/or data that is embodied in or retrievable from a computer-readable device, medium, or carrier, e.g., a memory, a data storage device 104, a remote device coupled to the computer system 100 by a data communications device, etc. Moreover, these instructions and/or data, when read, executed, and/or interpreted by the computer system 100, cause the computer system 100 to perform the steps necessary to implement and/or use the present invention.
  • Thus, the present invention may be implemented as a method, system, or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term “article of manufacture”, or alternatively, “computer program carrier”, as used herein is intended to encompass instructions and/or logic and/or data accessible from any computer-readable device, carrier, or media. Of course, those skilled in the art will recognize many modifications may be made to this configuration without departing from the scope of the present invention. [0044]
  • Those skilled in the art will recognize that any combination of the above components, or any number of different components, including computer programs, peripherals, and other devices, may be used to implement the present invention, so long as similar functions are performed thereby. [0045]
  • 2.0 Features of a DBMS [0046]
  • This section describes a set of [0047] DBMS 106 features (storage structures, types of concurrency control, and types of access), which come from System R [2] and DB2 [4, 3]; several other relational DBMS's use comparable features. This description applies to normal usage, i.e., without online reorganization; the types of access are later modified to accommodate online reorganization.
  • 2.1 Storage Structures, Structural Degradation, and Offline Reorganization [0048]
  • 2.1.1 Storage Structures for Data [0049]
  • The discussion of storage structures begins with the structures for data. [0050]
  • A row of a table [0051] 108 in a database is a logical unit within the table 108. For example, if Jones is an employee, a table 108 of employees includes a row for Jones. A row contains columns of data, e.g., for name, job tide, and salary. A column can have a fixed length (which does not change) or a variable length. The length of a variable-length column changes for each row according to the length of the data that users place in that column in that row. Ordinarily, the DBMS 106 implements a row by one data record, which is a lower-level (more physical) unit in storage. It will be explained shortly that sometimes the DBMS 106 implements a row by two data records. Users see rows, but do not directly see data records.
  • When users write rows (and thus the [0052] DBMS 106 writes records to implement the writing of rows), the DBMS 106 tracks the writing by appending corresponding entries to a collection of entries called the log. Later, it is possible to recover after an accidental loss of data by reloading from a backup copy of data and then applying (performing on the data) the log entries that the DBMS 106 appended after creation of the backup copy. The log record sequence number (LRSN) of a log entry is a number that represents the entry's position in the log.
  • A table [0053] 108 space is a region of storage that stores the data records for one or more tables 108. For simplicity, only one table 108 per table 108 space is used in the examples described herein, although any number of tables 108 and table 108 spaces could be used.
  • The [0054] DBMS 106 divides a table 108 space into units called file pages. FIG. 2 shows the structure 200 of file pages. The header of each file page includes the LRSN of the most recently written log entry that corresponds to writing of that page. A file page contains zero or more data records, which the DBMS 106 allocates at the beginning of the page (after the header). Deletion of records can cause gaps between remaining records. The end of a file page contains an ID map, which is an array of pointers (offsets of data records within the page). The term “slot” is used to mean the place (if any) to which an ID map entry points.
  • An ID map helps identify records. In DB2 and several other DBMS's that use the SQL [1] database language, not every table [0055] 108 has a unique key (a set of columns that identifies rows). Therefore, file pages, entries in the log, and indices 110 (structures that speed access to records) cannot use a key for identification. Instead, they use a record's record identifier (RID), which contains the record's page number and the offset of the record's entry in the ID map. A record's RID can change only during reorganization.
  • There are effects on storage structures during insertion of a row or during growth by update (modification) of a variable-length column of an existing row. During these operations, if the desired page lacks enough contiguous free space (the space available for insertions and growth), the [0056] DBMS 106 compacts the page to make its free space contiguous. During compaction, when the DBMS 106 moves a record, the DBMS 106 updates the ID map's pointer to the record; compaction does not change the record's RID. If compaction produces enough free space, the data goes into the desired page.
  • If such compaction does not produce enough space, the data goes into another page. On an insertion, a new record goes into that other page. On growth by update of existing data, the data moves into a new overflow data record in the other page, and the existing data record in the original page becomes a pointer data record, which contains the RID of the new overflow data record. Thus, the [0057] DBMS 106 sometimes implements a row by two records (a pointer record and an overflow record). Data records that do not involve overflow (hopefully, most data records) are regular data records.
  • In FIG. 2, [0058] file page 17 contains a regular data record and a pointer data record. File page 22 contains a regular data record and an overflow data record. The pointer record in page 17 contains the RID of the overflow record in page 22. The two bits in the header of each data record in FIG. 2 indicate whether the record is a pointer and whether it is an overflow, respectively.
  • 2.1.2 Storage Structures for Indices [0059]
  • A table [0060] 108 has zero or more indices 110, each of which uses an associated key (set of columns). For example, an employee table 108 might have an index 110 whose key is the table's 108 department number, an index 110 whose key is social security number, and an index 110 whose key is the combination of last name and first name. Within an index 110, the DBMS 106 maintains the key values in sorted order. Defining a key to be unique means that no two rows can have the same values in the key columns.
  • The [0061] DBMS 106 divides the storage of an index 110 into units called index 110 pages. FIG. 3 shows the structure 300 of index 110 pages, which the DBMS 106 arranges in a hierarchy. In this example, the index 110 pages that are numbered 101 and 109 are leaves of the hierarchy, and index 110 page 137 is a nonleaf. Each entry in a leaf page contains a key value and a list of RIDs whose records have that key value. The DBMS 106 designer (or, in some DBMS's, the database designer) optionally specifies that the DBMS 106 will sort each list by RID.
  • Each entry in a nonleaf page points to another nonleaf page or a leaf page, although this simple figure shows no entries that point to other nonleaf pages. Each entry in a nonleaf page also contains the value of the highest key of the page to which the entry points. For example, the first entry in [0062] index 110 page 137 contains 101 (the number of another index 110 page) and the highest key value of page 101. The second entry contains corresponding information for page 109. A possible alternative implementation is for each entry in a nonleaf page to contain the value of the lowest key of the page after the page to which the entry points.
  • For each table [0063] 108, the database designer declares at most one index 110 as a clustering index 110. In reorganization (and, whenever possible, in subsequent insertions), the assignment of data records to file pages reflects the data records' order in the key of the clustering index 110. For example, the records having the first few values of the key might reside (be stored) in one file page, the records having the next few values might reside in a second file page, etc. This clustering speeds some queries.
  • The database designer optionally declares the [0064] clustering index 110 to be a partitioning index 110. Here, the DBMS 106 divides the table 108 space (and the clustering index 110) into partitions according to values of the indexed key. The table 108 space is called a partitioned table 108 space. Partitions reside in separate files; a nonpartitioned table 108 space can reside in one file.
  • For a partitioned table [0065] 108 space, FIG. 4 shows an example of the clustering index 400, the table space 402, and a non-clustering index 404 (an index that is not the clustering index 400). For example, if the key of the clustering index 400 is an employee's name, partitions might cover names starting with A through F, G through M, and N through Z, respectively, as in the figure. The arrows represent RIDs in indices 400 and 404. In the set of leaf pages of the clustering index 400 for a partitioned table space 402, the RIDs for each partition are contiguous.
  • Within a [0066] nonclustering index 404, however, the RIDs for a partition need not be contiguous. For example, suppose that the key of a nonclustering index 404 is an employee's job title. The order of RIDs in the index 404 might be the RIDs for accountants (for all partitions), the RIDs for architects (for all partitions), the RIDs for artists (for all partitions), etc. Thus, the RIDs for a partition are scattered throughout the index 404, since some of them might be accountants, some might be architects, etc.
  • FIG. 4 illustrates lack of contiguity via two RIDs for the G-M partition with an intervening RID for the N-Z partition. Within each value of the [0067] index 404 key (e.g., “architect”), the RIDs might be sorted. Therefore, the RIDs for a partition might be contiguous within each key value, but they are not contiguous throughout the nonclustering index 404.
  • 2.1.3 Structural Degradation and Offline Reorganization [0068]
  • The storage structures described above can degrade. One type of degradation occurs when free space becomes unevenly distributed among the file pages of a table [0069] 108 space. After subsequent insertions, the order of some records no longer reflects the clustering index 110. This type of degradation slows some queries.
  • A second type of degradation occurs when variable-length data grows too large to fit in its original file page. The [0070] DBMS 106 then creates an overflow in another page and makes the original record a pointer. Indices 110 still contain the RID of the original record. This causes an extra page reference and thus extra time on some queries.
  • Reorganization removes such structural degradation. Specifically, reorganization distributes free space evenly, removes overflows (so that each row uses just one record, not two), and clusters data. Reorganization can move records between file pages. Therefore, the page that contains the record for a row after reorganization might differ from the page or pages that contained a record or records for that row before reorganization. Offline reorganization operates by (1) unloading (copying out) the data, (2) sorting the unloaded data by clustering key, and (3) reloading the data in sorted order. Users have read-only access (i.e., they can read but cannot write the data) during [0071] steps 1 and 2; they have no access during step 3.
  • To start offline reorganization, a database administrator issues a command. A parameter of the command specifies the name of the table [0072] 108 space that the DBMS 106 should reorganize. For a partitioned table 108 space, another parameter, which is optional, specifies the partition to reorganize; absence of this parameter signifies reorganization of the entire table 108 space. The term “area being reorganized” (often shortened to just “area”) is used to mean the table 108 space or partition on which reorganization operates.
  • 2.2 Types of Concurrency Control [0073]
  • A user can choose one of these levels of isolation: [0074]
  • For repeatable read, a transaction holds locks until commitment. Several reads of the same data during one transaction will yield the same values. [0075]
  • For cursor stability, a transaction holds read-locks (if any) on only one page (per cursor [1]) at a time, except when crossing a page boundary, and holds write-locks (if any) until commitment. This dramatically improves concurrency and thus throughput, but several reads of the same data during one transaction will yield different values if another user has changed that data. Also, even if a transaction is scanning an area of the database and reading each record once (e.g., to calculate the sum of all the deposits in a bank), other users' writing (e.g., transferring money from one bank account to another) can cause an inconsistent calculation. Many users choose cursor stability, but this choice requires tolerance of inconsistency or knowledge that the concurrent writing will not cause inconsistency. [0076]
  • Variations on these levels are possible in some DBMS's. [0077]
  • 2.3 Types of Access [0078]
  • This section describes three types of access to a single table [0079] 108; although the simplifications that sometimes apply are omitted. For each type, an example query in SQL [1] is provided. In each query, the keyword “WHERE” precedes a combination of one or more predicates. A predicate is an expression whose instantiation returns “true,” “false,” or “unknown.” An example of a predicate is “C=39.”
  • (1) Table [0080] 108 space scan: An example query (where table 108 T has no index 110 on column C) is:
  • SELECT * FROM T WHERE C=39 [0081]
  • The [0082] DBMS 106 reads every page in the table 108 space. For each page of the table 108, it scans the ID map to find the rows.
  • For each row, it evaluates the WHERE clause (if any) and returns the tow (to the application program) if it satisfies the clause. The [0083] DBMS 106 processes an overflow record only during the scan of the page containing its pointer.
  • (2) [0084] Index 110 scan: An example query (where T has an index 110 whose first column is C) is:
  • SELECT * FROM T WHERE (C BETWEEN 47 AND 51) . . . [0085]
  • The ellipsis represents possible additional predicates in the query. The [0086] DBMS 106 scans the relevant subset of the index 110 to select the RIDs whose key values satisfy the indexed part of the predicate on C. For each selected RID, it reads the row, evaluates the remainder (if any) of the WHERE clause for that row, and returns the row if it satisfies the clause.
  • (3) [0087] Multiple index 110 access: An example query (where T has an index 110 on C1, an index 110 on C2, and an index 110 on C3) is:
  • SELECT * FROM T WHERE ((C1=68 AND C2=7) OR C3=101) [0088]
  • The WHERE clause includes the conjunctions and/or disjunctions of n predicates, each of which the [0089] DBMS 106 can evaluate by just scanning an index 110. This type of access involves two lists of RIDs. One list (obtained from scanning an index 110) is called the recent RID list, and the other list (built incrementally by merging the recent RID list from each index 110) is called the composite RID list.
  • The [0090] DBMS 106 executes these phases:
  • For J=1 to n, perform [0091] phases 1, 2, and 3:
  • 1. Initialize the recent RID list to empty. Scan the relevant subset of the [0092] index 110 that enables evaluation of the jth predicate. For each RID whose key value satisfies the jth predicate, append the RID to the recent RID list.
  • 2. Sort the recent RID list (ordered by RID) if it comes from more than one key value. In a [0093] DBMS 106 that does not order RIDs within each entry on an index 110 leaf page, sort the RID list even if it comes from just one key value.
  • 3. If J=1, make the recent RID list the composite RID list. If J>1, merge the recent RID list into the composite RID list; i.e., calculate the intersection (for a conjunction) or union (for a disjunction). [0094]
  • Perform [0095] phase 4 once:
  • 4. For each RID in the composite RID list, read the row, evaluate the remainder (if any) of the WHERE clause (for repeatable read) or the entire WHERE clause (for cursor stability), and return the row if it satisfies the clause. For cursor stability, the evaluation of the entire WHERE clause accommodates other users' possible writing into this table [0096] 108.
  • 3.0 Advantages Over Previous Research [0097]
  • A survey paper [13] describes issues and surveys work in online reorganization of databases. Most of the work describes types of reorganization other than restoration of clustering (which is the type that is performed by the present invention). Some of the work deals with restoration of clustering, but most of this work in clustering does not discuss the issue of movement of data across a user's position in a scan. [0098]
  • Previous research in in-place online restoration of clustering requires repeatable read: [0099]
  • Omiecinski et al. [7, 8] describe an method for online reorganization in place. They use repeatable read with page-level locking. [0100]
  • Salzberg and Dimock [9] describe techniques for online changing of references to a moved record. For user transactions that perform scanning, the authors note that using repeatable read with page-level locking avoids movement of data across a user s position in a scan. [0101]
  • In contrast to the research above, the methods of the present invention allow, but do not require, repeatable read; they also allow cursor stability. The typical throughput using cursor stability exceeds the typical throughput using repeatable read, and many customers demand the opportunity to use cursor stability. With cursor stability, a user transaction can experience some inaccuracy due to concurrent user transactions, but customers could not tolerate the potentially massive amount of inaccuracy that could result from online reorganization in place without mechanisms to correct the inaccuracy. The reorganization's tracking of movement of records and the subsequent correction of user transactions are mechanisms that correct the inaccuracy. They allow cursor stability (with reasonably accurate results) during online reorganization in place. [0102]
  • One method for online restoration of clustering in DB2's structures [5, 11, 10] operates by copying data from the original copy of the area being reorganized (which users access) into a shadow copy of that area (in reorganized form). A survey paper [13] describes other methods for online reorganization by copying. Such an approach differs from our methods, which reorganize in place without making a shadow copy. [0103]
  • Following are disadvantages of reorganization by copying: [0104]
  • It can require more disk space for the area being reorganized, to store (1) the shadow copy of the area and (2) for some DBMS's, a mapping between records' RIDs in the original copy and corresponding records' RIDs in the shadow copy. [0105]
  • It involves a transition between directing users' accesses into the original copy and directing them into the shadow copy. This transition typically requires a period of limited or nonexistent user access. [0106]
  • The benefit for users (by letting them access reorganized data) begins after the transition; reorganization in place might begin the benefit immediately (as the area that users access is reorganized). [0107]
  • Advantages of reorganization by copying include (1) simplicity without sacrificing cursor stability and (2) reduced contention with users for access to the area being reorganized. [0108]
  • The novelty of the approach described herein is online restoration of clustering that is both (1) in place (to obviate extra disk space) and (2) supportive of cursor stability (for high throughput) with reasonably accurate results. There is previous work in online restoration of clustering that has just the first property or just the second, but there is no previous work that has both properties. [0109]
  • 4.0 Concepts in the Methods [0110]
  • 4.1 Overview of the Methods [0111]
  • The [0112] term reorganizer 112 is used to denote a DBMS 106 process that executes in the background to sweep through a table 108 space (or a partition of a table 108 space) and reorganize it. The methods involve any number of users and one reorganizer 112. The reorganizer 112 usually has the lowest priority in scheduling, but if the reorganizer's 112 locks have blocked a user, then the system should either preempt the reorganizer 112 or raise its allocation priority, to avoid a convoy of users waiting for the reorganizer 112.
  • The order in which a user's transaction accesses data or [0113] index 110 entries is the order in the table 108 space or in the index 110. At any time during a transaction, the user has reached some position in the order. Online reorganization can move records and thus change the order, and each record that moves between pages gets a new RID. Something that was ahead of the user's position can move behind, or something behind can move ahead. The methods described herein assure that if something moves, the user will access each appropriate piece of data exactly once (not twice and not zero times). Thus, the result of the transaction is consistent with the user's chosen level of isolation. This accommodation of the changed order is considered to be worthwhile, since the rate of movement by the reorganizer 112 might greatly exceed the rate of writing by regular transactions that execute concurrently with a cursor stability transaction.
  • 4.2 States of the Reorganizer [0114]
  • At any time, the [0115] reorganizer 112 is operating, suspended, or finished; FIG. 5 shows a state transition diagram. For each user, the reorganizer 112 (while operating) allocates certain data to track movement of records across the user's position. In any state, users can consume data from the database and can consume data that the reorganizer 112 has allocated, and the reorganizer 112 can deallocate its data after users have consumed it. Thus, the amount of allocation increases with the reorganizer's 112 activity and decreases with users' activity.
  • The [0116] reorganizer 112 can switch between the operating state 500 and suspended state 502 for either of these reasons:
  • Reorganization suspends when a period of peak usage begins and resumes when the period ends. Another possibility is for the [0117] reorganizer 112 to pause between its steps.
  • To bound the allocation of main storage, reorganization suspends when its total allocation grows past a parameter-controlled size; it resumes when the allocation shrinks below another parameter-controlled size. The storage allocator (or, in an alternative implementation, the reorganizer [0118] 112) tracks this allocation and changes the reorganizer's 112 state between the operating state 500 and suspended state 502. During reorganizer 112 operation, users give a higher priority to consuming data from the database, to take advantage of locality of reference. During reorganizer 112 suspension, users give a higher priority to consuming the reorganizer's 112 data, to allow deallocation and thus availability of storage. The use of two parameters (for suspension and resumption) instead of one reduces the frequency of state-switching and thus gives users longer periods of database consumption, when they can take advantage of locality of reference.
  • 4.3 State Information for a User [0119]
  • The methods of the present invention refer to a user's current position, whose definition depends on the user's type of access: [0120]
  • During a table [0121] 108 space scan, the position is the RID being processed.
  • During an [0122] index 110 scan, if the current key satisfies the query's predicate, the position is the key and the offset within the index's RID list for that key. If the key does not satisfy the predicate, the position is the end of the list for the previous satisfying key, if any.
  • During [0123] phase 1 of a multiple index 110 access, the position for the index 110 being scanned (and the recent RID list) is as in an index 110 scan; the position for already-scanned indices 110 (and the composite RID list) is the end of those indices 110.
  • During [0124] phases 2 and 3 of a multiple index 110 access, the position for all those indices 110 (and both RID lists) is the end of those indices 110.
  • During [0125] phase 4 of a multiple index 110 access (when the user is processing the associated data), the position is the RID in the composite RID list.
  • For each user, this state information (most of which is never paged out) is maintained: [0126]
  • The name(s) of the table(s) [0127] 108 being accessed
  • The type of access [0128]
  • The user's current position [0129]
  • 4.4 Record Movement and Movement Lists [0130]
  • Four terms are used to describe the possible directions of the reorganizer's [0131] 112 movement of a record with respect to a user's position:
  • The record moves forward from a position behind the user (in the already-scanned area) to a position ahead of the user (in the not-yet-scanned area). [0132]
  • It moves backward from a position ahead of the user to a position behind the user. [0133]
  • It moves behind the user from one position behind the user to another position behind the user. [0134]
  • It moves ahead of the user from one position ahead of the user to another position ahead of the user. [0135]
  • When reorganization moves a record and thus changes its RID, the RID can move within the [0136] index 110 if the DBMS 106 sorts indices' 110 RID lists by RID. The same four terms are used for the directions of movement of a RID within an index 110.
  • Determination of the direction of movement depends on the user's type of access: [0137]
  • During a table [0138] 108 space scan, movement can be in any direction, depending on the user's position (RID).
  • During an [0139] index 110 scan, for a record whose key equals that of the user's position in the index 110, movement can be in any direction, depending on the user's position in the index 110. For a record with an earlier or later key, movement is behind the user or ahead of the user, respectively.
  • During [0140] phase 1 of a multiple index 110 access, for the index 110 being scanned (and the recent RID list), movement is as in an index 110 scan.
  • During [0141] phases 2 and 3 of a multiple index 110 access, for the index 110 being scanned (and the recent RID list), movement is behind the user.
  • During [0142] phases 1, 2, and 3 of a multiple index 110 access, for already-scanned indices 110 (and the composite RID list), movement is behind the user. For not-yet-scanned indices 110, movement is ahead of the user.
  • During [0143] phase 4 of a multiple index 110 access, movement can be in any direction, depending on the user's position (RID) in the composite RID list.
  • For each user, the [0144] reorganizer 112 maintains movement lists to track the reorganizer's 112 movement of records that the user might access. The type of access determines the types of movement lists (forward and backward ordered lists or differential lists). The lists' structures and operations are explained below; Section 5 discusses how to use the lists, thus clarifying the reasons for the structures and operations.
  • 4.5 Forward and Backward Ordered Lists [0145]
  • Forward and backward ordered lists are used for users performing a table [0146] 108 space scan, an index 110 scan, or phase 4 of a multiple index 110 access. FIG. 6 shows the structure of a forward ordered list (FOL) 600, which tracks records that the reorganizer 112 moves forward; the user has already processed them. Each user has at most one FOL 600, which is a doubly linked list, ordered by RID and indexed. Each entry contains the new RID of a moved record. The logical head points to the first entry that is not logically deleted; the physical head might point to a logically deleted entry. When a RID is in a FOL 600, it means that the FOL 600 contains an entry that has that RID and is not logically deleted. The system tries to arrange for the beginning part of the FOL 600 (which probably matches the user's current file page) to be in main storage; the rest of the FOL 600 might be paged out.
  • These are the operations on a [0147] FOL 600; they maintain the FOL's 600 order, heads, and index 602 when necessary:
  • Determine whether a RID has an entry in the FOL [0148] 600 (reorganizer 112 invokes, using index 602).
  • Change a RID in an existing entry ([0149] reorganizer 112 invokes).
  • Read the RID of the entry to which the logical head points (user invokes): This returns null if the head is null. [0150]
  • Logically and physically insert an entry ([0151] reorganizer 112 invokes).
  • Logically and physically delete an entry ([0152] reorganizer 112 invokes). This includes deallocation of storage.
  • Logically delete the first entry (user invokes): Advance the logical head. [0153]
  • Collect garbage ([0154] reorganizer 112 invokes):
  • Do until physical head=logical head; [0155]
  • Advance physical head; [0156]
  • Deallocate storage of the formerly physically first entry; [0157]
  • [0158]
  • End; [0159]
  • A backward ordered list (BOL) tracks records that the [0160] reorganizer 112 moves backward; the user has not yet processed them. Each user has at most one BOL. Its structure and operations are like those of a FOL 600, as illustrated in FIG. 6.
  • Without online reorganization, a user transaction scans a list of RIDs (in a table [0161] 108 space, an index 110, or a RID list) and processes each record whose RID it finds. With online reorganization, the transaction omits redundant processing of records whose RIDs appear in the FOL 600 (since it has already processed them). It also eventually scans the BOL and processes the records whose RIDs appear there (since its scan of data would never find those records).
  • FIG. 7 shows a state transition diagram for each record that a user will eventually process, including the [0162] possible states 700, 702, 704 of a record.
  • 4.6 Differential Lists [0163]
  • A differential list (CL [0164] 800) tracks records that the reorganizer 112 moves forward, backward, or behind the user. DLs 800 are used for users performing phase 1, 2, or 3 of a multiple index 110 access. A user transaction's RID list can become invalid if, for example, the reorganizer 112 moves a record from a slot whose RID is in the RID list (thus changing the record's RID). Applying a differential list to a RID list will correct such invalidity in the RID list.
  • For each RID list (recent or composite), a user has a [0165] corresponding DL 800, which is a doubly linked list that the reorganizer 112 maintains. FIG. 8 illustrates the structure of a DL 800, an entry of which contains:
  • A RID [0166]
  • A flag (shown in the figure as “O” or “N”) to tell whether the RID is an old RID (the source of a movement) or a new RID (the target) [0167]
  • In some but not all cases, a link to a paired entry with the opposite flag, to represent these possibilities: [0168]
  • O—An old entry (without a paired new entry) contains the old RID of a record that the [0169] reorganizer 112 moves forward. See the first entry in the figure.
  • N—A new entry (without a paired old entry) contains the new RID of a record that the [0170] reorganizer 112 moves backward. See the third entry in the figure.
  • O-N—Two paired entries together contain the old and new RIDs of a record that the [0171] reorganizer 112 moves behind the user. See the second and fourth entries in the figure.
  • A [0172] DL 800 is ordered by RID, with old and new entries interleaved.
  • A user transaction corrects a RID list by making a local copy of the [0173] corresponding DL 800 and then inserting new entries and deleting old entries in the RID list. Since a user extracts RIDs from only the part of an index 110 that satisfies an index 110 predicate, forward or backward movement can occur only for a record that satisfies the predicate.
  • Therefore, an unpaired entry (either old or new) in a [0174] DL 800 represents a record that satisfies the predicate. However, movement behind the user can occur for a satisfying or an unsatisfying record. Therefore, a pair of entries represents a satisfying record only if the user's RID list includes the old RID. This fact requires the pairing; a user transaction should insert a paired new RID into a RID list only if the RID list already contains the paired old RID (which it should delete).
  • When the [0175] reorganizer 112 moves a record, the corresponding modification of a DL 800 can be complex. At any time, any RID (e.g., the RID for the slot of the source of the move or the RID for the target) is in any of several possible states (with respect to a DL 800).
  • Three formats are used to describe the behavior of a DL [0176] 800:
  • A diagram shows the transitions among a RID's states. Each state has a state number. [0177]
  • A chart shows the post-move states for each possible pair of (1) pre-move state of the move's source and (2) pre-move state of the move's target. The chart refers to the state numbers. [0178]
  • Each operation on a [0179] DL 800 has a detailed description, which includes justifying comments and refers (when appropriate) to the chart's rows and columns.
  • The next three sections present the three formats. Most of the transitions in the state transition diagram are derived from the chart. Almost all of the contents of the chart are derived from the operations. The descriptions of operations contain all the necessary information; the diagram and chart were added only for clarity. [0180]
  • 4.6.1 Transitions Among a RID's States [0181]
  • With respect to a [0182] DL 800, a RID can be in any of several states, depending on the entries (if any) for that RID in that DL 800, whether the RID is behind or ahead of the user's position in the corresponding index 110, and whether the slot for that RID contains a record.
  • FIG. 9 shows a diagram [0183] 900 for the states of a RID. The rectangle for a state shows the state number and the four components of a state:
  • Whether the RID is behind (“B”) or ahead of (“A”) the user's position [0184]
  • Whether the slot for the RID contains a record (“R”) or is empty (“E”) [0185]
  • Whether the [0186] DL 800 contains a paired old entry (“O-”), an unpaired old entry (“O”), or no old entry (no “O”) for the RID
  • Whether the [0187] DL 800 contains a paired new entry (“N-”), an unpaired new entry (“N”), or no new entry (no “N”) for the RID
  • The possible initial states (when the [0188] DBMS 106 creates the DL 800) are 1 and 2 (if the DL 800 is for the index 110 that the user is now scanning or will scan) or 5 and 6 (otherwise). States 3-11 are the possible final states (when the user performs the operation that copies the DL 800 into a local copy); these states (i.e., all except 1 and 2) are also the states in which the RID is behind the user's position. Since the user locks an index 110 entry while reading it, the reorganizer 112 cannot concurrently move that entry's record (and thus change its RID's status in a DL 800), so the diagram omits states in which the user s position is exactly at the RID. The diagram contains only the states that are reachable from the initial states; unreachable states are ignored, of course. For each transition between states, the diagram shows an arrow, with text that describes the circumstances under which the transition occurs.
  • 4.6.2 States of a Pair of RIDs (Source and Target) During Movement [0189]
  • The reorganizer's [0190] 112 movement of a record changes both of these:
  • The state of the RID of the source of the move (with respect to a DL [0191] 800)
  • The state of the RID of the target of the move (with respect to a DL [0192] 800).
  • FIG. 10 shows an associated [0193] chart 1000 for the states of a pair of RIDs (source and target). The columns in FIG. 10 represent the possible pre-move states of the target of the move (the states in which the RID's slot does not yet contain a record). The center of the right side of each column header shows whether the RID has an old (O) entry or no entry. The state number appears below this indication. Any paired entry (for another RID) appears above the indication; the arrow goes from O to N (the direction of movement). Thus, in state 7, the RID has an O entry, which is paired with a different RID's N entry. In state 1, the RID is ahead of the user, and in states 5, 7, and 8, the RID is behind the user. The rows represent the possible pre-move states of the source of the move (the states in which the RID's slot already contains a record). The center of the bottom of each row header shows whether the RID has an old (O) entry, a new (N) entry, both, or neither. If the RID has both, the N is above the O. The state number appears to the right of this indication. Any paired entries (for other RIDs) appear to the left of the indication. In state 10, the O and N at the left are not for the same RID. In state 2, the RID is ahead of the user, and in states 3, 4, 6, 9, 10, and 11, the RID is behind the user.
  • In each intersection cell, the right side shows the post-move state of the target, and the bottom shows the post-move state of the source. For example, in the intersection of [0194] row 6 and column 8, a record has moved from a source slot (whose RID's state, at the bottom, has changed from 6 to 7) into a target slot (whose RID's state, at the right, has changed from 8 to 11). Some intersections in column 1 contain two cells (two pairs of post-move states), since the states depend on whether the RID whose O is paired with the N in the source also had an unpaired N. Similarly, some intersections in column 7 contain two cells, since the states depend on whether a record is returning to its original RID.
  • The chart does not explicitly show states of RIDs whose entries are paired with the source or target of the move. Also, the chart does not model the user's crossing of a RID during a scan. Therefore, the chart omits some of the state transitions of FIG. 9. For almost every transition of FIG. 9, several intersections of the chart show the transition. [0195]
  • In examining the chart, it is easiest to begin with the four simplest cases, in which the RID does not yet appear in the [0196] DL 800 before the movement:
  • [0197] Row state 2, column state 1: the only case of movement ahead of the user,
  • [0198] Row state 2, column state 5: one case of movement backward,
  • [0199] Row state 6, column state 1: one case of movement forward,
  • [0200] Row state 6, column state 5: one case of movement behind the user.
  • [0201]
  • 4.6.3 Operations on a Differential List [0202]
  • The operations on a [0203] DL 800 modify the DL's 800 order and index 802 when necessary. In the following description of FIG. 10, the operations that reflect a record's movement refer (in parentheses) to state numbers in the chart's 1000 rows and columns and include justifying comments. There is no operation to change the DL 800 to reflect movement ahead of the user (row 2, column 1).
  • (1) Change the [0204] DL 800 to reflect forward movement (reorganizer 112 invokes; parameter is source RID):
  • If the [0205] DL 800 has no N entry for the source RID (row 6; column 1), insert an unpaired O entry for the source RID. Reason: This record did not move previously.
  • If the [0206] DL 800 has an unpaired N entry for the source RID ( row 4 or 9; column 1), delete it. Reason: Movement backward and then forward resembles no movement.
  • If the [0207] DL 800 has a paired N entry for the source RID ( row 3, 10, or 11; column 1), delete it, thus unpairing its paired O entry. Reason: Movement behind the user and then forward resembles movement forward without prior movement behind.
  • Also, if the [0208] DL 800 contains an unpaired N entry for the same RID as the recently unpaired O entry (left part of column 1), delete both the N and O entries for that RID. Reason: The user's RID list will already correctly include that RID, since that RID's original record and current record both satisfy the predicate. This additional deletion is omitted if the DL 800 contains a paired N entry for the same RID as the recently unpaired O entry (right part of column 1), since the current record at that RID might not satisfy the predicate.
  • (2) Change the [0209] DL 800 to reflect backward movement (reorganizer 112 invokes; parameter is target RID):
  • If the [0210] DL 800 has a paired O entry for the target RID (row 2; column 7) or no O entry for the target RID (row 2; column 5), insert an unpaired N entry for the target RID. Reason: This record did not move previously.
  • If the [0211] DL 800 has an unpaired O entry for the target RID (row 2; column 8), delete it. Reason: Movement forward and then backward (for the same record or a different satisfying record) resembles no movement. This cannot be done for a paired O entry (column 7), since the original record at that RID might not satisfy the predicate, so the user's RID list might not include that RID.
  • (3) Change the [0212] DL 800 to reflect movement behind the user (reorganizer 112 invokes; parameters are source and target RIDs):
  • If the [0213] DL 800 has no N entry for the source RID (row 6; column 5, 7, or 8), insert a pair of an O entry for the source RID and an N entry for the target RID. Reason: This record did not move previously.
  • If the [0214] DL 800 has an unpaired N entry for the source RID ( row 4 or 9; column 5, 7, or 8), delete it. Also:
  • If the [0215] DL 800 has a paired O entry for the target RID (column 7) or no O entry for the target RID (column 5), insert an unpaired N entry for the target RID. Reason: Movement backward and then behind into a RID that was already the source of a movement behind (i.e., a RID with a paired O entry) resembles the latter movement behind followed by the movement backward. Movement backward and then behind into a RID that had no movement (i.e., a RID with no O entry) resembles one movement backward into that RID.
  • If the [0216] DL 800 has an unpaired O entry for the target RID (column 8), delete the O entry for the target RID. Reason: Movement backward and then behind into a RID that was already the source of a movement forward (i.e., a RID with an unpaired O entry) resembles movement forward and then backward, which resembles no movement.
  • If the [0217] DL 800 has a paired N entry for the source RID ( row 3, 10, or 11; column 5, 7, or 8), delete it. Also:
  • If the deleted N entry's formerly paired O entry is for the target RID (left part of column 7), delete that O entry. Reason: Returning a record to its original RID resembles no movement. [0218]
  • If the formerly paired O entry is not for the target RID ([0219] column 5 or 8 or right part of column 7), insert an N entry for the target RID, paired with the formerly paired O entry. Reason: Two movements behind the user constitute one combined movement behind the user.
  • (4) Copy the [0220] DL 800 into a local copy, and logically delete the original DL 800 (user invokes).
  • (5) Collect garbage ([0221] reorganizer 112 invokes on logically deleted DL 800): Deallocate storage.
  • Following is the reason for maintaining two DLs [0222] 800 (for the recent and composite RID lists): The placement of entries in a DL 800 depends upon the movement of records across a user's current position. A record might move across a user's current position in the index 110 that the user is scanning now, but it cannot move across the user's current position in an already-scanned index 110, since the user's “current” position is at the end of that index 110.
  • 5.0 The Methods [0223]
  • Following is a description of the reorganization methods using flowcharts to illustrate the methods. [0224]
  • [0225] 5.1 The Method for the Reorganizer
  • In effect, the [0226] reorganizer 112 scans the clustering index 110, and the reorganizer's 112 current position is defined as the current key and the offset within the index's RID list for that key. Section 4.3 contains a similar definition of the position of a user performing an index 110 scan. The description of the reorganizer's 112 method will apply comparison operators (e.g., “>”) to positions; such a comparison involves comparing the keys, and, if the keys are identical, comparing the offsets. Incrementing the reorganizer's 112 position is defined as incrementing the offset, and, if there is no next RID, moving to the next key (if any).
  • FIG. 11 is a flowchart of the main loop used by the [0227] reorganizer 112.
  • [0228] Block 1100 is a decision block that represents the main loop.
  • [0229] Block 1102 is a decision block that determines whether there is any garbage collection ready to be performed. If so, control transfers to Block 1104, which performs the garbage collection; otherwise, control transfers to Block 1106. Note that, if the reorganizer 112 is operating, a possible alternative (instead of checking for garbage collection in every iteration of the main loop) is checking for garbage collection once in every few iterations of the main loop.
  • [0230] Block 1106 is a decision block that determines whether the state of the reorganizer 112 is “suspended”. If so, control transfers to Block 1108, which causes the loop to “sleep” for a parameter-controlled period before returning to Block 1100; otherwise, control transfers to Block 1110.
  • [0231] Block 1110 is a decision block that determines whether the state of the reorganizer 112 is “finished”. If so, control transfers to Block 1112; otherwise, control transfers to Block 1114.
  • [0232] Block 1112 is a decision block that determines whether any movement lists exist. If so, control transfers to Block 1108; otherwise, the loop is exited.
  • [0233] Block 1114 is a decision block that determines whether the state of the reorganizer 112 is “operating”. If not, control transfers to Block 1100; otherwise, control transfers to Block 1116.
  • [0234] Block 1116 represents a shared lock being applied to the end of the clustering index for the table being reorganized.
  • [0235] Block 1118 represents a variable L being assigned the position in the clustering index of the logically last record in the index.
  • [0236] Block 1120 represents the unlocking of the end of the clustering index for the table being reorganized.
  • [0237] Block 1122 is a decision block that determines whether R>L, where R is the position of the reorganizer 112; initially, this is at the beginning of the clustering index. If so, control transfers to Block 1124; otherwise, control transfers to Block 1126.
  • [0238] Block 1124 represents the state of the reorganizer 112 being set to “Finished.” Thereafter, control transfers to Block 1100.
  • [0239] Block 1126 represents D_P (Desired Page for R) being calculated, which is based on the size of R, page size, size of earlier records, desired frequency of free pages, desired percent of free space per page, etc.
  • [0240] Block 1128 is a decision block that determines whether D_P is the current page for R. If so, control transfers to Block 1130; otherwise, control transfers to Block 1132.
  • [0241] Block 1130 represents R being incremented. Thereafter, control transfers to Block 1100.
  • [0242] Block 1132 represents the procedure CLUSTER_RECORD being called with the RID of R and D_P as its parameters. Thereafter, control transfers to Block 1130.
  • 5.1.1 Performing the Restoration of Clustering [0243]
  • In the following procedure to cluster a record on a page, sometimes the system must first make room for the record by moving other records off the page. Since it is assumed that an overflow record does not point back to its pointer record, the system could not efficiently find the pointer for an overflow record (to update the pointer if the system moved its overflow). Therefore, the system does not move overflow records off the page when it needs to make room. If the [0244] DBMS 106 uses back pointers, this restriction is omitted.
  • FIG. 12 is a flowchart of the CLUSTER_RECORD procedure, which has as its parameters, R and D_P. This procedure clusters record R, preferably on desired page D_P. [0245]
  • The loop in this procedure initially tries the desired page D_P. For each tried page T_P, it tests for enough space, and if there is not enough space, the next iteration tries [0246] page T_P+1. Instead of iterating to the next page, the method could attempt to find a “nearby” page. For example, record R could be put on a page that contains a record with the same key, or if the record has a unique key, a record with a neighboring key. Another alternative is to put all such records at the end of the table space.
  • [0247] Block 1200 is a decision block that represents a loop comprising T_P=D_P to the last page of the table 108 space. This loop finds the target page, T_P, for clustering. Upon completion of the loop, control transfers to Block 1228; otherwise, control transfers to Block 1202.
  • [0248] Block 1202 represents an exclusive lock being applied to T_P.
  • [0249] Block 1204 is a decision block that determines whether the largest available space on T_P is greater than or equal to the size of parameter R. If so, control transfers to Block 1228; otherwise, control transfers to Block 1206.
  • [0250] Block 1206 represents T_P being compacted to make free space contiguous.
  • [0251] Block 1208 is a decision block that determines whether the available space on T_P is greater than or equal to the size of parameter R. If so, control transfers to Block 1228; otherwise, control transfers to Block 1210.
  • [0252] Block 1210 is a decision block that determines whether the available space on T_P plus the total space of non-overflow records on T_P that logically follow R is greater than or equal to the size of parameter R. If not, control transfers to Block 1212; otherwise, control transfers to Block 1214.
  • [0253] Block 1212 represents T_P being unlocked. This Block is reached when R cannot be put on T_P, and thus an iteration must occur.
  • [0254] Block 1214 is a decision block that represents a loop that iterates while available space on T_P is less than the size of parameter R. Upon completion of this loop, control transfers to Block 1224. This Block is reached when removing records from this page will succeed in providing enough space for R.
  • [0255] Block 1216 represents finding a non-overflow record A_R to be removed on T_P.
  • [0256] Block 1218 represents finding and applying an exclusive lock on another page A_P with space for record A_R.
  • [0257] Block 1220 represents the procedure MOVE_RECORD being called with the parameters A_R and A_P.
  • [0258] Block 1222 represents A_P being unlocked. Thereafter, control transfers to Block 1214.
  • [0259] Block 1224 is a decision block that determines whether the largest available space on T_P is less than the size of R. If so, control transfers to Block 1226; otherwise, control transfers to Block 1228.
  • [0260] Block 1226 represents T_P being compacted. Thereafter, control transfers to Block 1228.
  • [0261] Block 1228 represents an exclusive lock being applied to the page of R.
  • [0262] Block 1230 represents the procedure MOVE_RECORD being called with the parameters R and T_P.
  • [0263] Block 1232 represents T_P and the page of R being unlocked.
  • Finally, the procedure exits at [0264] 1234.
  • FIGS. 13A and 13B together are a flowchart of the MOVE_RECORD procedure, which has as its parameters, R and P. This procedure moves record R to page P. [0265]
  • [0266] Block 1300 represents the variable NEW_RID being assigned the first available offset in page P's ID map.
  • [0267] Block 1302 is a decision block that represents a loop for each user. Upon completion of the loop, control transfers to FIG. 13B via “A”.
  • [0268] Block 1304 is a decision block that determines whether the user's type of access comprises a table space scan. If so, control transfers to Block 1306; otherwise, control transfers to Block 1310.
  • [0269] Block 1306 is a decision block that determines whether cursor stability is in effect. If so, control transfers to Block 1308; otherwise, control transfers to Block 1302.
  • [0270] Block 1308 represents the procedure MODIFY_FOL_AND_BOL being called with the parameters USER_ID, R, and NEW_RID. Thereafter, control transfers to Block 1302.
  • [0271] Block 1310 is a decision block that determines whether the user's type of access comprises an index scan. If so, control transfers to Block 1312; otherwise, control transfers to Block 1316.
  • [0272] Block 1312 is a decision block that determines whether cursor stability is in effect. If so, control transfers to Block 1314; otherwise, control transfers to Block 1302.
  • [0273] Block 1314 represents the procedure MODIFY_FOL_AND_BOL being called with the parameters USER_ID, R, and NEW_RID. Thereafter, control transfers to Block 1302.
  • [0274] Block 1316 is a decision block that determines whether the user's type of access comprises Phase 1, 2, or 3 of a multiple index access. If so, control transfers to Block 1318; otherwise, control transfers to Block 1322.
  • [0275] Block 1318 is a decision block that determines whether cursor stability is in effect. If so, control transfers to Block 1320; otherwise, control transfers to Block 1302.
  • [0276] Block 1320 represents the procedure MODIFY_DL being called with the parameters USER_ID, R, and NEW_RID. Thereafter, control transfers to Block 1302.
  • [0277] Block 1322 is a decision block that determines whether the user's type of access comprises Phase 4 of a multiple index access. If so, control transfers to Block 1324; otherwise, control transfers to Block 1302.
  • [0278] Block 1324 is a decision block that determines whether cursor stability is in effect. If so, control transfers to Block 1326; otherwise, control transfers to Block 1302.
  • [0279] Block 1326 represents the procedure MODIFY_FOL_AND_BOL being called with the parameters USER_ID, R, and NEW_RID. Thereafter, control transfers to Block 1302.
  • Upon completion of the loop at [0280] Block 1302, control transfers to Block 1328 via Block 1328 represents finding all the index entries that reference R.
  • [0281] Block 1330 represents finding all the referential integrity structures that reference R.
  • [0282] Block 1332 represents moving R to P, modifying the index entries, and (if necessary) modifying the referential integrity structures. If R is a pointer record, the data is moved from the overflow record to a new, regular data record. If deletion and insertion ordinarily trigger certain actions, e.g., deletion via referential integrity, those actions should not occur for record movement.
  • 5.1.2 Maintaining the Movement Lists [0283]
  • This section illustrates how to maintain the movement lists. The above procedure sometimes omits the maintenance of movement lists: [0284]
  • For a table [0285] 108 space scan or index 110 scan, repeatable read prevents movement of a record across the user's position, so the system omits movement lists.
  • For a [0286] multiple index 110 access, repeatable read prevents movement of a record whose RID the user has already read, so the system omits movement lists.
  • Also, the procedure is presented under the assumption that the [0287] DBMS 106 sorts index 110 leaf entries by RID. If the DBMS 106 does not sort them, then the system need not maintain a movement list for an index 110 scan, since it never moves a record across the user's position.
  • FIG. 14 is a flowchart of the MODIFY_FOL_AND_BOL procedure, which has as its parameters, USER_ID, OLD_RID, and NEW_RID. This procedure modifies the [0288] FOL 600 and BOL 600 for one user. If the user is doing a table space scan, the reorganizer 112 performs FOL/BOL processing for data records; for an index scan, the reorganizer 112 performs FOL/BOL processing for index entries; and for Phase 4 of a multiple index access, the reorganizer 112 performs FOL/BOL processing for RID list entries.
  • [0289] Block 1400 represents the FOL 600 and BOL 600 being latched.
  • [0290] Block 1402 is a decision block that determines whether the record movement is forward of the user's position. If so, control transfers to Block 1404; otherwise, control transfers to Block 1410.
  • [0291] Block 1404 is a decision block that determines whether the old RID is in the BOL 600. If so, control transfers to Block 1406, which deletes the old RID from the BOL 600; otherwise, control transfers to Block 1408, which inserts the new RID into the FOL. Thereafter, control transfers to Block 1430.
  • [0292] Block 1410 is a decision block that determines whether the record movement is backward from the user's position. If so, control transfers to Block 1412; otherwise, control transfers to Block 1418.
  • [0293] Block 1412 is a decision block that determines whether the old RID is in the FOL 600. If so, control transfers to Block 1414, which deletes the old RID from the FOL 600; otherwise, control transfers to Block 1416, which inserts the new RID into the BOL 600. Thereafter, control transfers to Block 1430.
  • [0294] Block 1418 is a decision block that determines whether the record movement stays ahead of the user's position. If so, control transfers to Block 1420; otherwise, control transfers to Block 1424.
  • [0295] Block 1420 is a decision block that determines whether the old RID is in the FOL 600. If so, control transfers to Block 1422, which changes the old RID to the new RID in the FOL 600. Thereafter, control transfers to Block 1430.
  • [0296] Block 1424 is a decision block that determines whether the record movement stays behind the user's position. If so, control transfers to Block 1426; otherwise, control transfers to Block 1430.
  • [0297] Block 1426 is a decision block that determines whether the old RID is in the BOL 600. If so, control transfers to Block 1428, which changes the old RID to the new RID in the BOL 600. Thereafter, control transfers to Block 1430.
  • [0298] Block 1430 unlatches the FOL 600 and BOL 600.
  • FIG. 15 is a flowchart of the MODIFY_DL procedure, which has as its parameters, USER_ID, OLD_RID, and NEW_RID. This procedure modifies the [0299] DL 800 for one user. If the user is doing Phase 1 of a multiple index access, the reorganizer 112 performs DL 800 processing for the index being scanned and the recent RID list; any movement is possible. If the user is doing Phase 2 or 3 of a multiple index access, the reorganizer 112 performs DL 800 processing for the index being scanned and the recent RID list; all movement is behind the user. For Phases 1, 2, or 3 of a multiple index access, the reorganizer 112 performs DL 800 processing for already-scanned indices and the composite RID list; all movement is behind the user.
  • [0300] Block 1500 represents the recent DL 800 being latched.
  • [0301] Block 1502 is a decision block that determines whether the record movement is forward from the user's position in the index for the recent RID list. If so, control transfers to Block 1504; otherwise, control transfers to Block 1506.
  • [0302] Block 1504 represents the recent DL 800 being changed to reflect the forward movement.
  • [0303] Block 1506 is a decision block that determines whether the record movement is backward from the user's position in the index for the recent RID list. If so, control transfers to Block 1508; otherwise, control transfers to Block 1510.
  • [0304] Block 1508 represents the recent DL 800 being changed to reflect the backward movement.
  • [0305] Block 1510 is a decision block that determines whether the record stays ahead of the user's position in the index for the recent RID list. If so, control transfers to Block 1516; otherwise, control transfers to Block 1512.
  • [0306] Block 1512 is a decision block that determines whether the record stays behind the user's position in the index for the recent RID list. If so, control transfers to Block 1514; otherwise, control transfers to Block 1516.
  • [0307] Block 1514 represents the recent DL 800 being changed to reflect the movement behind.
  • [0308] Block 1516 represents the recent DL 800 being unlatched.
  • [0309] Block 1518 represents the composite DL 800 being latched.
  • [0310] Block 1520 represents the composite DL 800 being changed to reflect movement behind. This composite DL 800 was produced by earlier merges. The record always stays behind the user, who is already at the end of all indices that the RID list represents.
  • [0311] Block 1522 represents the composite DL 800 being unlatched.
  • 5.2 The Method for a User Performing a Table Space Scan [0312]
  • FIG. 16 is a flowchart of the logic used by a user performing a table space scan or index scan. [0313]
  • [0314] Block 1600 represents the FOL 600 and BOL 600 structures being created.
  • [0315] Block 1602 is a decision block that represents a loop for all records.
  • [0316] Block 1604 represents the logical head of the BOL 600 being latched.
  • [0317] Block 1606 represents the variable H being assigned the RID of the logical head of the BOL 600.
  • [0318] Block 1608 represents the logical head of the BOL 600 being unlatched.
  • [0319] Block 1610 represents the state of the reorganizer 112 being latched.
  • [0320] Block 1612 represents the variable S being assigned the state of the reorganizer 112.
  • [0321] Block 1614 represents the state of the reorganizer 112 being unlatched.
  • [0322] Block 1616 is a decision block that determines whether POS (the position in the scan) has reached the end of the scan. If so, control transfers to Block 1618; otherwise, control transfers to Block 1626.
  • [0323] Block 1618 is a decision block that determines whether H is null. If so, control transfers to Block 1620, which logically deletes the FOL 600 and BOL 600, and then to Block 1622 where the procedure exits; otherwise, control transfers to Block 1624, which calls the procedure PROCESS_BOL_ENTRY, and then returns to Block 1602.
  • [0324] Block 1626 is a decision block that determines whether H is null. If so, control transfers to Block 1628, which calls the procedure SCAN, and then returns to Block 1602; otherwise, control transfers to Block 1630.
  • [0325] Block 1630 is a decision block that determines whether S is “Operating”. If so, control transfers to Block 1628, which calls the procedure SCAN, and then returns to Block 1602; otherwise, control transfers to Block 1632, which calls the procedure PROCESS_BOL_ENTRY, and then returns to Block 1602.
  • FIG. 17 is a flowchart that describes the logic of the PROCESS_BOL_ENTRY procedure. [0326]
  • [0327] Block 1700 represents the shared lock being applied to the record that the logical head entry indicates.
  • [0328] Block 1702 represents the first entry in the BOL 600 being logically deleted.
  • [0329] Block 1704 represents that record being processed, e.g., evaluating the WHERE clause and returning the row if appropriate.
  • [0330] Block 1706 represents all locks being unlocked, i.e., if cursor stability is in effect.
  • [0331] Block 1708 represents an exit from the procedure.
  • FIG. 18 is a flowchart that describes the logic of the SCAN procedure. [0332]
  • [0333] Block 1800 represents the POS (position in the scan) being incremented.
  • [0334] Block 1802 represents a shared lock being applied to the current record.
  • [0335] Block 1804 is a decision block that determines whether POS =RID of the FOL 600 logical head. If so, then control transfers to Block 1806; otherwise, control transfers to Block 1808.
  • [0336] Block 1806 logically deletes the first entry in the FOL 600, e.g., because it moved forward. Thereafter, control transfers to Block 1810.
  • [0337] Block 1808 represents that record being processed, e.g., evaluating the WHERE clause and returning the tow if appropriate.
  • [0338] Block 1810 represents all locks being unlocked, i.e., if cursor stability is in effect.
  • [0339] Block 1812 represents an exit from the procedure.
  • 5.3 The Method for a User Performing an Index Scan [0340]
  • This resembles a table [0341] 108 space scan. The remainder (non-indexed part) of the WHERE clause, not the whole WHERE clause, is evaluated. The user has a position in a range (key value) of the index 110. At the end of scanning each range, the FOL 600 is empty, since no record can move beyond the range. Therefore, a FOL 600 can be sorted by RID; the system does not need a major sort by key value and a minor sort by RID. The BOL 600 might not be empty.
  • 5.4 The Method for a User Performing a Multiple Index Access [0342]
  • A user performing a [0343] multiple index 110 access (on n indices 110) does this:
  • [0344] Create DL 800 structures (indices 802, etc.).
  • For J=1 to n, perform [0345] phases 1, 2, and 3:
  • 1. Initialize the recent RID list to empty. Scan the relevant subset of the [0346] index 110 that enables evaluation of the jth predicate. For each RID whose key value satisfies the jth predicate, append the RID to the recent RID list.
  • 2. Sort the recent RID list (ordered by RID) if it comes from more than one key value. Correct both RID lists (as shown herein after) according to the [0347] DLs 800.
  • 3. If J=1, make the recent RID list the composite RID list. If j >1, merge the recent RID list into the composite RID list; i.e., calculate the intersection (for a conjunction) or union (for a disjunction). Correct the composite RID list according to the [0348] composite DL 800; this includes logically deleting the DLs 800 and creating the FOL 600 and BOL 600 structures for phase 4. The system no longer cares about correcting the recent RID list, since the system already merged it into the composite RID list, and any changes in the recent DL 800 will now be identical to changes in the composite DL 800.
  • Perform [0349] phase 4 once:
  • 4. For each RID in the composite RID list, read the row, evaluate the remainder (if any) of the WHERE clause (for repeatable read) or the entire WHERE clause (for cursor stability), and return the row if it satisfies the clause. Use the [0350] FOL 600 and the BOL 600, as in a table 108 space scan; the user has a position in the composite RID list.
  • When finished, logically delete [0351] FOL 600 and BOL 600 structures (indices 602, etc.).
  • Following is a description of how the user transaction corrects its RID lists according to the [0352] DLs 800. It does this at the end of each phase 2 (for both RID lists) and at the end of each phase 3 (for just the composite RID list):
  • FIGS. [0353] 19A-D is a flowchart that describes the logic of the CORRECT_RID_LISTS procedure, which accepts a parameter K, wherein K=1 signifies correction of just a composite RID list and K=2 signifies correction of both composite and recent RID lists.
  • [0354] Block 1900 represents the two DLs 800 being latched.
  • [0355] Block 1902 is a decision block that represents a loop which tests for emptiness, i.e., it iterates while either DL 800 contains anything. Upon completion of the loop, control transfers to Block 1936 via “C”.
  • [0356] Block 1904 is a decision block that represents a loop for J=1 to K, wherein J=1 means composite DL 800, while J=2 means recent DL 800.
  • [0357] Block 1906 represents the DL 800 being copied into a local copy and the DL 800 being logically deleted.
  • [0358] Block 1908 represents the creation of an empty new DL 800.
  • [0359] Block 1910 represents the two DLs 800 being unlatched.
  • [0360] Block 1912 is a decision block that represents a loop for J=1 to K, wherein J=1 means composite RID list and J=2 means recent RID list. This loop makes a first pass at correcting RID lists (to nullify non-qualifying pairs). Upon completion of the loop, control transfers to Block 1922 via “B”.
  • [0361] Block 1914 is a decision block that represents a loop through the entries in the local copy of the DL 800 that corresponds to the RID list indicated by J.
  • [0362] Block 1916 is a decision block that determines if an entry in the copy of the DL 800 is “old”. If so, control transfers to Block 1918; otherwise, control transfers to Block 1914.
  • [0363] Block 1918 is a decision block that determines if an entry's RID does not exist in the RID list. If so, control transfers to Block 1920; otherwise, control transfers to Block 1914.
  • [0364] Block 1920 represents nullifying the entry and its paired entry (if any). Thereafter, control transfers to Block 1914.
  • [0365] Block 1922 is a decision block that represents a loop for J=1 to K, wherein J=1 means composite RID list and J=2 means recent RID list. This loop makes a second pass at correcting RID lists. Upon completion of the loop, control transfers to Block 1934.
  • [0366] Block 1924 is a decision block that represents a loop through the entries in the local copy of the DL 800 that corresponds to the RID list indicated by J. Upon completion of the loop, control transfers to Block 1922.
  • [0367] Block 1926 is a decision block that determines if the local copy of the DL 800 contains an “old” entry for an RID but no “new” entry for that RID. If so, control transfers to Block 1928; otherwise, control transfers to Block 1930.
  • [0368] Block 1928 represents deleting the RID from the RID list. Thereafter, control transfers to Block 1924.
  • [0369] Block 1930 is a decision block that determines if the local copy of the DL 800 contains a “new” entry for an RID but no “old” entry for that RID. If so, control transfers to Block 1932; otherwise, control transfers to Block 1924.
  • [0370] Block 1932 represents inserting the RID into the RID list (maintaining sorted order). Thereafter, control transfers to Block 1924.
  • [0371] Block 1934 represents the two DLs 800 being latched. Control then transfers to Block 1902 via “D”.
  • [0372] Block 1936 indicates that the user transaction is moving to the next phase.
  • [0373] Block 1938 is a decision block that determines if the user transaction is moving from phase 3 to phase 4. If so, control transfers to Block 1940; otherwise, control transfers to Block 1944.
  • [0374] Block 1940 represents logically deleting DL 800 structures.
  • [0375] Block 1942 represents creating the FOL 600 and BOL 600 structures.
  • [0376] Block 1944 represents the DLs 800 being unlatched. Thereafter, the procedure exists at 1946.
  • In the above method, a possible optimization, if the outer loop is executed more than a parameter-controlled number of times, is to suspend the [0377] reorganizer 112.
  • Following is the reason for using two passes rather than one: With one pass, suppose that a new entry and an old entry are paired, and the new entry has a lower RID. The user transaction will find the new entry first. The transaction should insert the new entry's RID in its RID list only if the old entry's RID matches a RID that is already in the RID list. Lack of a match means that the new entry's RID does not represent a qualifying record. To find the old entry, the user transaction must do a searched access (rather than a sequential access) into the local copy of the [0378] DL 800, but there is no index, so there must be a search via another sequential access, which is an extra expense.
  • 6.0 Extensions Based on the Methods [0379]
  • This section describes possible extensions that could be added-to the methods of the present invention. [0380]
  • 6.1 Logging and Recovery [0381]
  • Logging and recovery could be added (for both the [0382] reorganizer 112 and the users) during online reorganization. For logging, a new type of log record can simply be added, which means that a record has been moved from the old RID to the new.
  • Logging could also be included for movement lists. [0383]
  • Movement lists could also be stored in artificial data pages; i.e., not just in main storage. This implies that a transaction that is ordinarily considered read-only (because it just does a SELECT) is not really read-only, since it writes into the movement lists. [0384]
  • 6.2 Multiple-Table Queries [0385]
  • So far, only transactions that operate on one table [0386] 108 have been considered. An alternative would consider transactions that operate on several tables 108.
  • 6.3 Transactions That Write [0387]
  • So far, only “reader” transactions have been considered. This can be generalized in two ways: [0388]
  • Even if a transaction is a reader, consider writers' effect on this transaction's movement lists. [0389]
  • Consider movement lists and synchronization for writers. In general, the same methods that are used for readers can be used, recognizing that locks are held until commit, for serializability. [0390]
  • 6.4 Index Reorganization [0391]
  • Online reorganization of an [0392] index 110 could be added, not just online reorganization of data with updating of the RIDs in an index 110.
  • 6.5 Possible Reorganizations [0393]
  • To give priority to users, the [0394] reorganizer 112 should abort instead of waiting when the reorganizer's 112 lock request conflicts with a user's lock.
  • In addition, a warning bit could be maintained to indicate that the [0395] FOL 600 contains an entry for the current page.
  • 7.0 Summary [0396]
  • Any database management system can need some type of reorganization. To avoid taking a very large or highly available database offline for reorganization, a solution is to reorganize online. Methods have been described for performing a certain type of reorganization online. The reorganization restores clustering and removes overflows. [0397]
  • The methods include synchronization, and they use movement lists to track the reorganization's movement of records across a user's position within a scan of data. The reorganization's tracking of movement of records and the subsequent correction of user transactions are mechanisms that correct inaccuracy in user transactions and thus allow cursor stability during online reorganization in place. The novelty is in allowing high throughput concurrent usage during in-place online restoration of clustering. As the amount of information and dependence on computers both grow, the number of very large or highly available databases will grow. Therefore, the importance of online reorganization will grow. [0398]
  • 8.0 References [0399]
  • The following references are incorporated by reference herein: [0400]
  • 1. Amer. Nad. Standards Institute, “Database Language SQL,” X3.135-1992, New York, 1992. [0401]
  • 2. M. M. Astrahan et al., “System R: Relational Approach to Database Management,” ACM Trans. Database Syst., Vol. 1, No. 2, June 1976, pp. 97-137. [0402]
  • 3. C. J. Date and C. J. White, A Guide to DB2, 4th edition, Addison-Wesley, Reading, Mass., 1993. [0403]
  • 4. D. J. Haderle and R. D. Jackson, “[0404] IBM Database 2 Overview,” IBM Syst. J., Vol. 23, No. 2, 1984, pp. 112-125.
  • 5. IBM Corp., “DB2 for OS/390 [0405] Version 5 Utility Guide and Reference,” SC26-8967-00, June 1997.
  • 6. C. Mohan, “A Survey of DBMS Research Issues in Supporting Very Large Tables,” in D. Lomet, Ed., Foundations of Data Organization and Methods (Proc. 4th Intl. Conf. Foundations of Data Organization and Methods, October 1993), Lecture Notes in Computer Science 730, Springer-Verlag, New York, pp. 279-300. [0406]
  • 7. E. Omiecinski, L. Lee, and P. Scheuermann, “Concurrent File Reorganization for Record Clustering: A Performance Study,” Proc. 8th Intl. Conf Data Engineering, IEEE-CS, Feb. 1992, pp. 265-272. For more details, see their later work [8]. [0407]
  • 8. E. Omiecinski, L. Lee, and P. Scheuermann, “Performance Analysis of a Concurrent File Reorganization Method for Record Clustering,” IEEE Trans. Knowledge and Data Engin., Vol. 6, No. 2, Apt. 1994, pp. 248-257. [0408]
  • 9. B. Salzberg and A. Dimock, “Principles of Transaction-Based On-Line Reorganization,” Proc. 18th Intl. Conf. Very Large Data Bases, Morgan Kaufmann Publishers, San Mateo, CA, Aug. 1992, pp. 511-520. [0409]
  • 10. G. H. Sockut and T. A. Beavin, “Interaction Between Application of a Log and Maintenance of a Table that Maps Record Identifiers During Online Reorganization of a Database,” U.S. Pat. No. 5,721,915, February 1998. [0410]
  • 11. G. H. Sockut, T. A. Beavin, and C.-C. Chang, “A Method for On-line Reorganization of a Database,” IBM Syst. J., Vol. 36, No. 3, 1997, pp. 411-436; erratum in Vol. 37, No. 1, 1998, p. 152. [0411]
  • 12. G. H. Sockut and R. P. Goldberg, “Database Reorganization—Principles and Practice,” Computing Surveys, ACM, Vol. 11, No. 4, Dec. 1979, pp. 371-395. [0412]
  • 13. G. H. Sockut and B. R. Iyer, “Reorganizing Databases Concurrently with Usage: A Survey,” Tech. Report 03.488, IBM Santa Teresa Lab., San Jose, Calif., June 1993. [0413]
  • 14. G. Wiederhold, Database Design, 2nd edition, McGraw-Hill, New York, 1983. [0414]
  • Conclusion
  • This concludes the description of the preferred embodiment of the invention. The following describes some alternative embodiments for accomplishing the present invention. For example, any type of computer, such as a mainframe, minicomputer, or personal computer, could be used with the present invention. In addition, any software program providing database management functions could benefit from the present invention. [0415]
  • In summary, the present invention discloses a method, system, and article of manufacture for in-place reorganization of a database that allows high-throughput concurrent usage by users of the database. The reorganization's movement of records are tracked across a user transaction's position within a scan of the database. The behavior of the user transaction is corrected to account for the movement of the records. [0416]
  • The foregoing description of the preferred embodiment of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto. [0417]

Claims (75)

What is claimed is:
1. A computer-implemented method of providing in-place reorganization of a database to achieve reasonably accurate results for user transactions during high-throughput concurrent usage of the database, comprising:
(a) tracking, in the computer, the reorganization's movement of records across a user transaction's position within a scan of the database; and
(b) correcting, in the computer, a behavior of the user transaction to account for the movement of the records.
2. The method of claim 1, wherein the reorganization clusters data.
3. The method of claim 1, wherein the reorganization provides online restoration of clustering.
4. The method of claim 1, wherein the reorganization distributes free space evenly.
5. The method of claim 1, wherein the reorganization removes overflow.
6. The method of claim 1, wherein the reorganization is performed concurrently with a plurality of concurrent user transactions.
7. The method of claim 6, wherein the concurrent user transactions are allowed to use different types of access to the database.
8. The method of claim 1, wherein the reorganization allows repeatable read for the user transaction.
9. The method of claim 1, wherein the reorganization does not require repeatable read but does allow cursor stability for the user transaction.
10. The method of claim 1, wherein the tracking step comprises maintaining, for each user transaction, at least one movement list to track the reorganization's movement of records that the user transaction might access.
11. The method of claim 10, wherein the tracking step comprises concurrently maintaining different types of movement lists appropriate for different types of access by different user transactions.
12. The method of claim 10, wherein a type of access determines a type of movement list used in the reorganization.
13. The method of claim 10, wherein the movement list comprises at least one of the lists in a group comprising a forward ordered list, backward ordered list, and differential lists.
14. The method of claim 13, wherein the forward ordered list tracks records that the reorganization moves forward.
15. The method of claim 13, wherein the forward ordered list tracks records that the user transaction has already processed.
16. The method of claim 13, further comprising omitting redundant processing of records that appear in the forward ordered list.
17. The method of claim 13, wherein the backward ordered list tracks records that the reorganization moves backward.
18. The method of claim 13, wherein the backward ordered list tracks records that the user transaction has not yet processed.
19. The method of claim 13, further comprising processing records that appear on the backward ordered list.
20. The method of claim 13, wherein the differential list tracks records that the reorganization moves forward, backward, or behind the user transaction.
21. The method of claim 13, further comprising applying a differential list to a list of records maintained by the user transaction to correct the list of records.
22. The method of claim 21, wherein the list of records is derived from an index.
23. The method of claim 21, further comprising correcting the list of records by inserting new entries of the list into the differential list and by deleting old entries of the list from the differential list.
24. The method of claim 13, further comprising modifying the differential list when the reorganization moves a record.
25. The method of claim 13, further comprising maintaining both a recent differential list and a composite differential list.
26. A computer-implemented system for providing in-place reorganization of a database to achieve reasonably accurate results for user transactions during high-throughput concurrent usage of the database, comprising:
(a) a computer;
(b) means, performed by the computer, for tracking a reorganization's movement of records across a user transaction's position within a scan of the database; and
(c) means, performed by the computer, for correcting a behavior of the user transaction to account for the movement of the records.
27. The system of claim 26, wherein the reorganization clusters data.
28. The system of claim 26, wherein the reorganization provides online restoration of clustering.
29. The system of claim 26, wherein the reorganization distributes free space evenly.
30. The system of claim 26, wherein the reorganization removes overflow.
31. The system of claim 26, wherein the reorganization is performed concurrently with a plurality of concurrent user transactions.
32. The system of claim 31, wherein the concurrent user transactions ate allowed to use different types of access to the database.
33. The system of claim 26, wherein the reorganization allows repeatable read for the user transaction.
34. The system of claim 26, wherein the reorganization does not require repeatable read but does allow cursor stability for the user transaction.
35. The system of claim 26, wherein the means for tracking comprises means for maintaining, for each user transaction, at least one movement list to track the reorganization's movement of records that the user transaction might access.
36. The system of claim 35, wherein the means for tracking comprises means for concurrently maintaining different types of movement lists appropriate for different types of access by different user transactions.
37. The system of claim 35, wherein a type of access determines a type of movement list used in the reorganization.
38. The system of claim 35, wherein the movement list comprises at least one of the lists in a group comprising a forward ordered list, backward ordered list, and differential lists.
39. The system of claim 38, wherein the forward ordered list tracks records that the reorganization moves forward.
40. The system of claim 38, wherein the forward ordered list tracks records that the user transaction has already processed.
41. The system of claim 38, further comprising means for omitting redundant processing of records that appear in the forward ordered list.
42. The system of claim 38, wherein the backward ordered list tracks records that the reorganization moves backward.
43. The system of claim 38, wherein the backward ordered list tracks records that the user transaction has not yet processed.
44. The system of claim 38, further comprising means for processing records that appear on the backward ordered list.
45. The system of claim 38, wherein the differential list tracks records that the reorganization moves forward, backward, or behind the user transaction.
46. The system of claim 38, further comprising means for applying a differential list to a list of records maintained by the user transaction to correct the list of records.
47. The system of claim 46, wherein the list of records is derived from an index.
48. The system of claim 46, further comprising means for correcting the list of records by inserting new entries of the list into the differential list and by deleting old entries of the list from the differential list.
49. The system of claim 38, further comprising means for modifying the differential list when the reorganization moves a record.
50. The system of claim 38, further comprising means for maintaining both a recent differential list and a composite differential list.
51. An article of manufacture embodying logic for a computer-implemented method of providing in-place reorganization of a database to achieve reasonably accurate results for user transactions during high-throughput concurrent usage of the database, comprising:
(a) tracking the reorganization's movement of records across a user transaction's position within a scan of the database; and
(b) correcting a behavior of the user transaction to account for the movement of the records.
52. The method of claim 51, wherein the reorganization clusters data.
53. The method of claim 51, wherein the reorganization provides online restoration of clustering.
54. The method of claim 51, wherein the reorganization distributes free space evenly.
55. The method of claim 51, wherein the reorganization removes overflow.
56. The method of claim 51, wherein the reorganization is performed concurrently with a plurality of concurrent user transactions.
57. The method of claim 56, wherein the concurrent user transactions are allowed to use different types of access to the database.
58. The method of claim 51, wherein the reorganization allows repeatable read for the user transaction.
59. The method of claim 51, wherein the reorganization does not require repeatable read but does allow cursor stability for the user transaction.
60. The method of claim 51, wherein the tracking step comprises maintaining, for each user transaction, at least one movement list to track the reorganization's movement of records that the user transaction might access.
61. The method of claim 60, wherein the tracking step comprises concurrently maintaining different types of movement lists appropriate for different types of access by different user transactions.
62. The method of claim 60, wherein a type of access determines a type of movement list used in the reorganization.
63. The method of claim 60, wherein the movement list comprises at least one of the lists in a group comprising a forward ordered list, backward ordered list, and differential lists.
64. The method of claim 63, wherein the forward ordered list tracks records that the reorganization moves forward.
65. The method of claim 63, wherein the forward ordered list tracks records that the user transaction has already processed.
66. The method of claim 63, further comprising omitting redundant processing of records that appear in the forward ordered list.
67. The method of claim 63, wherein the backward ordered list tracks records that the reorganization moves backward.
68. The method of claim 63, wherein the backward ordered list tracks records that the user transaction has not yet processed.
69. The method of claim 63, further comprising processing records that appear on the backward ordered list.
70. The method of claim 63, wherein the differential list tracks records that the reorganization moves forward, backward, or behind the user transaction.
71. The method of claim 63, further comprising applying a differential list to a list of records maintained by the user transaction to correct the list of records.
72. The method of claim 71, wherein the list of records is derived from an index.
73. The method of claim 71, further comprising correcting the list of records by inserting new entries of the list into the differential list and by deleting old entries of the list from the differential list.
74. The method of claim 63, further comprising modifying the differential list when the reorganization moves a record.
75. The method of claim 63, further comprising maintaining both a recent differential list and a composite differential list.
US10/151,393 1998-12-23 2002-05-20 Methods for in-place online reorganization of a database Abandoned US20020143743A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/151,393 US20020143743A1 (en) 1998-12-23 2002-05-20 Methods for in-place online reorganization of a database

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US09/220,084 US6411964B1 (en) 1998-12-23 1998-12-23 Methods for in-place online reorganization of a database
US10/151,393 US20020143743A1 (en) 1998-12-23 2002-05-20 Methods for in-place online reorganization of a database

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US09/220,084 Continuation US6411964B1 (en) 1998-12-23 1998-12-23 Methods for in-place online reorganization of a database

Publications (1)

Publication Number Publication Date
US20020143743A1 true US20020143743A1 (en) 2002-10-03

Family

ID=22821987

Family Applications (2)

Application Number Title Priority Date Filing Date
US09/220,084 Expired - Fee Related US6411964B1 (en) 1998-12-23 1998-12-23 Methods for in-place online reorganization of a database
US10/151,393 Abandoned US20020143743A1 (en) 1998-12-23 2002-05-20 Methods for in-place online reorganization of a database

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US09/220,084 Expired - Fee Related US6411964B1 (en) 1998-12-23 1998-12-23 Methods for in-place online reorganization of a database

Country Status (1)

Country Link
US (2) US6411964B1 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060080324A1 (en) * 2004-10-07 2006-04-13 International Business Machines Corporation Method of Changing the Page Size of a DB2 Table Space While Keeping the Object Available
US20070005631A1 (en) * 2005-06-30 2007-01-04 International Business Machines Corporation Apparatus and method for dynamically determining index split options from monitored database activity
US20070143313A1 (en) * 2005-12-19 2007-06-21 International Business Machines Corporation Peephole DBMS reorganization allowing concurrent data manipulation
US20110153580A1 (en) * 2009-12-17 2011-06-23 Bmc Software, Inc. Index Page Split Avoidance With Mass Insert Processing
US20110208754A1 (en) * 2010-02-22 2011-08-25 International Business Machines Corporation Organization of Data Within a Database
US20110252000A1 (en) * 2010-04-08 2011-10-13 Microsoft Corporation In-memory database system
US8364640B1 (en) * 2010-04-09 2013-01-29 Symantec Corporation System and method for restore of backup data
US20130262609A1 (en) * 2009-02-17 2013-10-03 International Business Machines Corporation Efficient maintenance of a distributed system membership view
US20150026199A1 (en) * 2013-07-19 2015-01-22 International Business Machines Corporation Hardware projection of fixed and variable length columns of database tables
US9235564B2 (en) 2013-07-19 2016-01-12 International Business Machines Corporation Offloading projection of fixed and variable length database columns
US9880776B1 (en) 2013-02-22 2018-01-30 Veritas Technologies Llc Content-driven data protection method for multiple storage devices
US10628452B2 (en) * 2016-08-08 2020-04-21 International Business Machines Corporation Providing multidimensional attribute value information
US10713254B2 (en) 2016-08-08 2020-07-14 International Business Machines Corporation Attribute value information for a data extent

Families Citing this family (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6631366B1 (en) * 1998-10-20 2003-10-07 Sybase, Inc. Database system providing methodology for optimizing latching/copying costs in index scans on data-only locked tables
US6581205B1 (en) * 1998-12-17 2003-06-17 International Business Machines Corporation Intelligent compilation of materialized view maintenance for query processing systems
US6834290B1 (en) 1999-11-15 2004-12-21 Quest Software, Inc. System and method for developing a cost-effective reorganization plan for data reorganization
US7805423B1 (en) 1999-11-15 2010-09-28 Quest Software, Inc. System and method for quiescing select data modification operations against an object of a database during one or more structural operations
US7143092B1 (en) * 1999-12-14 2006-11-28 Samsung Electronics Co., Ltd. Data synchronization system and method of operation
US7065538B2 (en) 2000-02-11 2006-06-20 Quest Software, Inc. System and method for reconciling transactions between a replication system and a recovered database
US6535893B1 (en) * 2000-02-24 2003-03-18 International Business Machines Corporation Method for estimating the elapsed time required for a log apply process
US6826583B1 (en) 2000-05-15 2004-11-30 Sun Microsystems, Inc. Local allocation buffers for parallel garbage collection
US6823351B1 (en) * 2000-05-15 2004-11-23 Sun Microsystems, Inc. Work-stealing queues for parallel garbage collection
US6944607B1 (en) * 2000-10-04 2005-09-13 Hewlett-Packard Development Compnay, L.P. Aggregated clustering method and system
US6691121B1 (en) * 2000-10-27 2004-02-10 Bmc Software, Inc. Method and apparatus for online and dynamic extension of IMS data entry databases
US7231391B2 (en) * 2001-02-06 2007-06-12 Quest Software, Inc. Loosely coupled database clusters with client connection fail-over
US20040162836A1 (en) * 2001-02-06 2004-08-19 Eyal Aronoff System and method for altering database requests and database responses
US20020129146A1 (en) 2001-02-06 2002-09-12 Eyal Aronoff Highly available database clusters that move client connections between hosts
US6633884B2 (en) * 2001-04-03 2003-10-14 Bmc Software, Inc. System and method for analyzing a database for on-line reorganization
US7117229B2 (en) * 2001-05-31 2006-10-03 Computer Associates Think, Inc. Method and system for online reorganization of databases
US6728709B1 (en) * 2001-06-22 2004-04-27 Unisys Corporation Locking partitioned database tables
US7103887B2 (en) * 2001-06-27 2006-09-05 Sun Microsystems, Inc. Load-balancing queues employing LIFO/FIFO work stealing
US7299243B2 (en) * 2001-09-19 2007-11-20 Bmc Software, Inc. System and method for controlling free space distribution by key range within a database
US7228309B1 (en) * 2001-10-19 2007-06-05 Neon Enterprise Software, Inc. Facilitating maintenance of indexes during a reorganization of data in a database
US6879986B1 (en) 2001-10-19 2005-04-12 Neon Enterprise Software, Inc. Space management of an IMS database
US7444338B1 (en) 2001-10-19 2008-10-28 Neon Enterprise Software, Inc. Ensuring that a database and its description are synchronized
JP4130076B2 (en) * 2001-12-21 2008-08-06 富士通株式会社 Database management program and recording medium
JP4214712B2 (en) * 2002-05-16 2009-01-28 株式会社日立製作所 Database page allocation processing method
AU2003278612A1 (en) * 2002-06-24 2004-01-23 Xymphonic Systems As Method for data-centric collaboration
US7200625B2 (en) * 2002-10-18 2007-04-03 Taiwan Semiconductor Manufacturing Co., Ltd. System and method to enhance availability of a relational database
US7430557B1 (en) * 2003-03-19 2008-09-30 Unisys Corporation System and method for improving database reorganization time
US7412465B2 (en) * 2004-04-06 2008-08-12 International Business Machines Corporation Method for append mode insertion of rows into tables in database management systems
US8019779B2 (en) * 2004-05-04 2011-09-13 International Business Machines Corporation Efficient locking protocol for sub-document concurrency control using prefix encoded node identifiers in XML databases
US7519637B2 (en) * 2004-08-13 2009-04-14 Computer Associates Think, Inc. System and method for reorganizing a database
JP4082614B2 (en) * 2004-11-12 2008-04-30 インターナショナル・ビジネス・マシーンズ・コーポレーション Database management system, database management method, and program
US7565217B2 (en) * 2005-04-01 2009-07-21 International Business Machines Corporation Traversal of empty regions in a searchable data structure
JP4683546B2 (en) * 2005-07-15 2011-05-18 国立大学法人 東京大学 Database reorganization method and database reorganization system
US7487171B2 (en) * 2005-12-30 2009-02-03 International Business Machines Corporation System and method for managing a hierarchy of databases
US7890541B2 (en) * 2006-02-17 2011-02-15 International Business Machines Corporation Partition by growth table space
US7945543B2 (en) * 2006-03-30 2011-05-17 International Business Machines Corporation Method and system for deferred maintenance of database indexes
JP4148529B2 (en) * 2006-12-28 2008-09-10 インターナショナル・ビジネス・マシーンズ・コーポレーション System, method and program for checking index consistency in a database
US8027955B2 (en) * 2007-03-19 2011-09-27 Microsoft Corporation Database management using a file to accumulate changes
US8301623B2 (en) * 2007-05-22 2012-10-30 Amazon Technologies, Inc. Probabilistic recommendation system
US7792798B2 (en) * 2007-08-31 2010-09-07 International Business Machines Corporation Dynamic data compaction for data redistribution
US20090319581A1 (en) * 2008-06-19 2009-12-24 International Business Machines Corporation Online Table Move
US8060476B1 (en) 2008-07-14 2011-11-15 Quest Software, Inc. Backup systems and methods for a virtual computing environment
US8135930B1 (en) 2008-07-14 2012-03-13 Vizioncore, Inc. Replication systems and methods for a virtual computing environment
US8046550B2 (en) * 2008-07-14 2011-10-25 Quest Software, Inc. Systems and methods for performing backup operations of virtual machine files
US8429649B1 (en) 2008-09-25 2013-04-23 Quest Software, Inc. Systems and methods for data management in a virtual computing environment
US8996468B1 (en) 2009-04-17 2015-03-31 Dell Software Inc. Block status mapping system for reducing virtual machine backup storage
US9778946B2 (en) * 2009-08-07 2017-10-03 Dell Software Inc. Optimized copy of virtual machine storage files
US8812513B2 (en) 2009-12-25 2014-08-19 International Business Machines Corporation Hash pointer checking for hierarchical database logical relationship
US8453145B1 (en) 2010-05-06 2013-05-28 Quest Software, Inc. Systems and methods for instant provisioning of virtual machine files
US9569446B1 (en) 2010-06-08 2017-02-14 Dell Software Inc. Cataloging system for image-based backup
US9547562B1 (en) 2010-08-11 2017-01-17 Dell Software Inc. Boot restore system for rapidly restoring virtual machine backups
US8898114B1 (en) 2010-08-27 2014-11-25 Dell Software Inc. Multitier deduplication systems and methods
US9239871B2 (en) 2011-07-06 2016-01-19 Ca, Inc. System and method for analyzing sequential data access efficiency
US8301609B1 (en) * 2011-09-16 2012-10-30 International Business Machines Corporation Collision detection and data corruption protection during an on-line database reorganization
US9311375B1 (en) 2012-02-07 2016-04-12 Dell Software Inc. Systems and methods for compacting a virtual machine file
US9230008B2 (en) 2012-04-12 2016-01-05 Ca, Inc. System and method for automated online reorganization of sequential access databases
US9430272B2 (en) 2014-12-17 2016-08-30 Microsoft Technology Licensing, Llc Efficiently providing virtual machine reference points
US9547555B2 (en) 2015-01-12 2017-01-17 Microsoft Technology Licensing, Llc Change tracking using redundancy in logical time
US10558636B2 (en) * 2016-04-27 2020-02-11 Sap Se Index page with latch-free access
US10776428B2 (en) * 2017-02-16 2020-09-15 Nasdaq Technology Ab Systems and methods of retrospectively determining how submitted data transaction requests operate against a dynamic data structure

Citations (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4627019A (en) * 1982-07-08 1986-12-02 At&T Bell Laboratories Database management system for controlling concurrent access to a database
US4648036A (en) * 1985-03-06 1987-03-03 At&T Bell Laboratories Method for controlling query and update processing in a database system
US4823310A (en) * 1987-08-10 1989-04-18 Wang Laboratories, Inc. Device for enabling concurrent access of indexed sequential data files
US4847754A (en) * 1985-10-15 1989-07-11 International Business Machines Corporation Extended atomic operations
US5222235A (en) * 1990-02-01 1993-06-22 Bmc Software, Inc. Databases system for permitting concurrent indexing and reloading of data by early simulating the reload process to determine final locations of the data
US5247672A (en) * 1990-02-15 1993-09-21 International Business Machines Corporation Transaction processing system and method with reduced locking
US5367675A (en) * 1991-12-13 1994-11-22 International Business Machines Corporation Computer automated system and method for optimizing the processing of a query in a relational database system by merging subqueries with the query
US5434994A (en) * 1994-05-23 1995-07-18 International Business Machines Corporation System and method for maintaining replicated data coherency in a data processing system
US5455944A (en) * 1993-03-16 1995-10-03 International Business Machines Corporation Method for managing logging and locking of page free space information in a transaction processing system
US5504888A (en) * 1992-10-23 1996-04-02 Hitachi, Ltd. File updating system employing the temporary connection and disconnection of buffer storage to extended storage
US5596747A (en) * 1991-11-27 1997-01-21 Nec Corporation Method and apparatus for reorganizing an on-line database system in accordance with an access time increase
US5692178A (en) * 1992-08-20 1997-11-25 Borland International, Inc. System and methods for improved file management in a multi-user environment
US5717919A (en) * 1995-10-02 1998-02-10 Sybase, Inc. Database system with methods for appending data records by partitioning an object into multiple page chains
US5721915A (en) * 1994-12-30 1998-02-24 International Business Machines Corporation Interaction between application of a log and maintenance of a table that maps record identifiers during online reorganization of a database
US5778392A (en) * 1996-04-01 1998-07-07 Symantec Corporation Opportunistic tile-pulling, vacancy-filling method and apparatus for file-structure reorganization
US5815415A (en) * 1996-01-19 1998-09-29 Bentley Systems, Incorporated Computer system for portable persistent modeling
US5864849A (en) * 1996-12-16 1999-01-26 Lucent Technologies Inc. System and method for restoring a multiple checkpointed database in view of loss of volatile memory
US5940813A (en) * 1996-07-26 1999-08-17 Citibank, N.A. Process facility management matrix and system and method for performing batch, processing in an on-line environment
US5983225A (en) * 1998-01-26 1999-11-09 Telenor As Parameterized lock management system and method for conditional conflict serializability of transactions
US6023706A (en) * 1997-07-11 2000-02-08 International Business Machines Corporation Parallel file system and method for multiple node file access
US6067545A (en) * 1997-08-01 2000-05-23 Hewlett-Packard Company Resource rebalancing in networked computer systems
US6144970A (en) * 1998-09-24 2000-11-07 International Business Machines Corporation Technique for inplace reorganization of a LOB table space
US6185601B1 (en) * 1996-08-02 2001-02-06 Hewlett-Packard Company Dynamic load balancing of a network of client and server computers
US6519613B1 (en) * 1997-10-01 2003-02-11 International Business Machines Corporation Non-blocking drain method and apparatus for use in processing requests on a resource

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5307481A (en) 1990-02-28 1994-04-26 Hitachi, Ltd. Highly reliable online system
JPH0667950A (en) 1992-08-13 1994-03-11 Hitachi Ltd Database reorganization system
JPH0667944A (en) 1992-08-24 1994-03-11 Hitachi Ltd Database management system
JP3338724B2 (en) 1993-12-20 2002-10-28 日本電気株式会社 Database save method
JPH07244604A (en) 1994-03-04 1995-09-19 Nippon Telegr & Teleph Corp <Ntt> Method and device for data base on-line restoration
JPH07311702A (en) 1994-05-19 1995-11-28 Nec Corp Data base restoration device for deferred processing
JP3239924B2 (en) 1995-07-24 2001-12-17 日本電気株式会社 Relational database access control method

Patent Citations (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4627019A (en) * 1982-07-08 1986-12-02 At&T Bell Laboratories Database management system for controlling concurrent access to a database
US4648036A (en) * 1985-03-06 1987-03-03 At&T Bell Laboratories Method for controlling query and update processing in a database system
US4847754A (en) * 1985-10-15 1989-07-11 International Business Machines Corporation Extended atomic operations
US4823310A (en) * 1987-08-10 1989-04-18 Wang Laboratories, Inc. Device for enabling concurrent access of indexed sequential data files
US5222235A (en) * 1990-02-01 1993-06-22 Bmc Software, Inc. Databases system for permitting concurrent indexing and reloading of data by early simulating the reload process to determine final locations of the data
US5247672A (en) * 1990-02-15 1993-09-21 International Business Machines Corporation Transaction processing system and method with reduced locking
US5596747A (en) * 1991-11-27 1997-01-21 Nec Corporation Method and apparatus for reorganizing an on-line database system in accordance with an access time increase
US5367675A (en) * 1991-12-13 1994-11-22 International Business Machines Corporation Computer automated system and method for optimizing the processing of a query in a relational database system by merging subqueries with the query
US5692178A (en) * 1992-08-20 1997-11-25 Borland International, Inc. System and methods for improved file management in a multi-user environment
US5504888A (en) * 1992-10-23 1996-04-02 Hitachi, Ltd. File updating system employing the temporary connection and disconnection of buffer storage to extended storage
US5455944A (en) * 1993-03-16 1995-10-03 International Business Machines Corporation Method for managing logging and locking of page free space information in a transaction processing system
US5434994A (en) * 1994-05-23 1995-07-18 International Business Machines Corporation System and method for maintaining replicated data coherency in a data processing system
US6026412A (en) * 1994-12-30 2000-02-15 International Business Machines Corporation Interaction between application of a log and maintenance of a table that maps record identifiers during online reorganization of a database
US5721915A (en) * 1994-12-30 1998-02-24 International Business Machines Corporation Interaction between application of a log and maintenance of a table that maps record identifiers during online reorganization of a database
US5717919A (en) * 1995-10-02 1998-02-10 Sybase, Inc. Database system with methods for appending data records by partitioning an object into multiple page chains
US5815415A (en) * 1996-01-19 1998-09-29 Bentley Systems, Incorporated Computer system for portable persistent modeling
US5778392A (en) * 1996-04-01 1998-07-07 Symantec Corporation Opportunistic tile-pulling, vacancy-filling method and apparatus for file-structure reorganization
US5940813A (en) * 1996-07-26 1999-08-17 Citibank, N.A. Process facility management matrix and system and method for performing batch, processing in an on-line environment
US6185601B1 (en) * 1996-08-02 2001-02-06 Hewlett-Packard Company Dynamic load balancing of a network of client and server computers
US5864849A (en) * 1996-12-16 1999-01-26 Lucent Technologies Inc. System and method for restoring a multiple checkpointed database in view of loss of volatile memory
US6023706A (en) * 1997-07-11 2000-02-08 International Business Machines Corporation Parallel file system and method for multiple node file access
US6067545A (en) * 1997-08-01 2000-05-23 Hewlett-Packard Company Resource rebalancing in networked computer systems
US6519613B1 (en) * 1997-10-01 2003-02-11 International Business Machines Corporation Non-blocking drain method and apparatus for use in processing requests on a resource
US5983225A (en) * 1998-01-26 1999-11-09 Telenor As Parameterized lock management system and method for conditional conflict serializability of transactions
US6144970A (en) * 1998-09-24 2000-11-07 International Business Machines Corporation Technique for inplace reorganization of a LOB table space

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8306978B2 (en) 2004-10-07 2012-11-06 International Business Machines Corporation Method of changing the page size of a DB2 table space while keeping the object available
US20080270492A1 (en) * 2004-10-07 2008-10-30 International Business Machines Corporation Method of changing the page size of a db2 table space while keeping the object available
US7447717B2 (en) 2004-10-07 2008-11-04 International Business Machines Corporation Method of changing the page size of a DB2 table space while keeping the object available
US20060080324A1 (en) * 2004-10-07 2006-04-13 International Business Machines Corporation Method of Changing the Page Size of a DB2 Table Space While Keeping the Object Available
US20070005631A1 (en) * 2005-06-30 2007-01-04 International Business Machines Corporation Apparatus and method for dynamically determining index split options from monitored database activity
US20070143313A1 (en) * 2005-12-19 2007-06-21 International Business Machines Corporation Peephole DBMS reorganization allowing concurrent data manipulation
US7603336B2 (en) * 2005-12-19 2009-10-13 International Business Machines Corporation Peephole DBMS reorganization allowing concurrent data manipulation
US20170353547A1 (en) * 2009-02-17 2017-12-07 International Business Machines Corporation Efficient maintenance of a distributed system membership view
US9442882B2 (en) * 2009-02-17 2016-09-13 International Business Machines Corporation Efficient maintenance of a distributed system membership view
US9769257B2 (en) 2009-02-17 2017-09-19 International Business Machines Corporation Efficient maintenance of a distributed system membership view
US20130262609A1 (en) * 2009-02-17 2013-10-03 International Business Machines Corporation Efficient maintenance of a distributed system membership view
US9609055B2 (en) 2009-02-17 2017-03-28 International Business Machines Corporation Efficient maintenance of a distributed system membership view
US10567496B2 (en) * 2009-02-17 2020-02-18 International Business Machines Corporation Efficient maintenance of a distributed system membership view
US8682872B2 (en) * 2009-12-17 2014-03-25 Bmc Software, Inc. Index page split avoidance with mass insert processing
US20110153580A1 (en) * 2009-12-17 2011-06-23 Bmc Software, Inc. Index Page Split Avoidance With Mass Insert Processing
US20110208754A1 (en) * 2010-02-22 2011-08-25 International Business Machines Corporation Organization of Data Within a Database
US9710507B2 (en) 2010-02-22 2017-07-18 International Business Machines Corporation Organization of data within a database
US9176995B2 (en) * 2010-02-22 2015-11-03 International Business Machines Corporation Organization of data within a database
US9251214B2 (en) * 2010-04-08 2016-02-02 Microsoft Technology Licensing, Llc In-memory database system
US10055449B2 (en) * 2010-04-08 2018-08-21 Microsoft Technology Licensing, Llc In-memory database system
US20160147827A1 (en) * 2010-04-08 2016-05-26 Microsoft Technology Licensing, Llc In-memory database system
US10296615B2 (en) * 2010-04-08 2019-05-21 Microsoft Technology Licensing, Llc In-memory database system
US20110252000A1 (en) * 2010-04-08 2011-10-13 Microsoft Corporation In-memory database system
US11048691B2 (en) * 2010-04-08 2021-06-29 Microsoft Technology Licensing, Llc In-memory database system
US9830350B2 (en) * 2010-04-08 2017-11-28 Microsoft Technology Licensing, Llc In-memory database system
US8364640B1 (en) * 2010-04-09 2013-01-29 Symantec Corporation System and method for restore of backup data
US9880776B1 (en) 2013-02-22 2018-01-30 Veritas Technologies Llc Content-driven data protection method for multiple storage devices
US9235564B2 (en) 2013-07-19 2016-01-12 International Business Machines Corporation Offloading projection of fixed and variable length database columns
US9535947B2 (en) 2013-07-19 2017-01-03 International Business Machines Corporation Offloading projection of fixed and variable length database columns
US9317497B2 (en) 2013-07-19 2016-04-19 International Business Machines Corporation Offloading projection of fixed and variable length database columns
US10089352B2 (en) 2013-07-19 2018-10-02 International Business Machines Corporation Offloading projection of fixed and variable length database columns
US9275168B2 (en) 2013-07-19 2016-03-01 International Business Machines Corporation Hardware projection of fixed and variable length columns of database tables
US9268879B2 (en) * 2013-07-19 2016-02-23 International Business Machines Corporation Hardware projection of fixed and variable length columns of database tables
US20150026199A1 (en) * 2013-07-19 2015-01-22 International Business Machines Corporation Hardware projection of fixed and variable length columns of database tables
US10628452B2 (en) * 2016-08-08 2020-04-21 International Business Machines Corporation Providing multidimensional attribute value information
US10713254B2 (en) 2016-08-08 2020-07-14 International Business Machines Corporation Attribute value information for a data extent

Also Published As

Publication number Publication date
US6411964B1 (en) 2002-06-25

Similar Documents

Publication Publication Date Title
US6411964B1 (en) Methods for in-place online reorganization of a database
Graefe Sorting And Indexing With Partitioned B-Trees.
US5430869A (en) System and method for restructuring a B-Tree
Salzberg et al. Comparison of access methods for time-evolving data
Kornacker et al. High-concurrency locking in R-trees
US6792432B1 (en) Database system with methods providing high-concurrency access in B-Tree structures
JP2583010B2 (en) Method of maintaining consistency between local index table and global index table in multi-tier index structure
US7418544B2 (en) Method and system for log structured relational database objects
Jagadish et al. Incremental organization for data recording and warehousing
US6061678A (en) Approach for managing access to large objects in database systems using large object indexes
Quass et al. On-line warehouse view maintenance
EP0303231B1 (en) Method and device for enabling concurrent access of indexed sequential data files
US8010497B2 (en) Database management system with efficient version control
US5758356A (en) High concurrency and recoverable B-tree index management method and system
US5123104A (en) Method and apparatus for concurrent modification of an index tree in a transaction processing system utilizing selective indication of structural modification operations
US5999943A (en) Lob locators
US6631366B1 (en) Database system providing methodology for optimizing latching/copying costs in index scans on data-only locked tables
Lomet et al. Transaction time support inside a database engine
Muth et al. The LHAM log-structured history data access method
Graefe A survey of B-tree logging and recovery techniques
Sockut et al. Online reorganization of databases
Sockut et al. A method for on-line reorganization of a database
US6535895B2 (en) Technique to avoid processing well clustered LOB&#39;s during reorganization of a LOB table space
EP0336548A2 (en) Supporting long fields of recoverable database records in a computer system
Srinivasan et al. On-line index construction algorithms

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION