US20120011144A1 - Aggregation in parallel computation environments with shared memory

Info

Publication number: US20120011144A1
Application number: US12/978,194
Authority: US
Inventors: Frederik Transier; Christian Mathis; Nico Bohnsack; Kai Stammerjohann
Original assignee: Individual
Current assignee: SAP SE
Priority date: 2010-07-12
Filing date: 2010-12-23
Publication date: 2012-01-12
Also published as: US9223829B2; US20120011133A1; US20130138628A1; US9177025B2; US8370316B2; US20120011108A1

Abstract

According to some embodiments, a data structure may be provided by separating an input table into a plurality of partitions; generating, by each of a first plurality of execution threads operating concurrently, a local hash table for each of the threads, each local hash table storing key-index pairs; and merging the local hash tables, by a second plurality of execution threads operating concurrently, to produce a set of disjoint result hash tables. An overall result may be obtained from the set of disjoint result hash tables. The data structure may be used in a parallel computing environment to determine an aggregation.

Description

FIELD

Some embodiments relate to a data structure. More specifically, some embodiments provide a method and system for a data structure and use of same in parallel computing environments.

BACKGROUND

A number of presently developed and developing computer systems include multiple processors in an attempt to provide increased computing performance. Advances in computing performance, including for example processing speed and throughput, may be provided by parallel computing systems and devices as compared to single processing systems that sequentially process programs and instructions.
For parallel shared-memory aggregation processes, a number of approaches have been proposed. However, each of the previous approaches includes sequential operations and/or synchronization operations, such as locking, to avoid inconsistencies or lapses in data coherency. Thus, prior proposed solutions for parallel aggregation in parallel computation environments with shared memory either contain a sequential step or require some sort of synchronization on the data structures.
Accordingly, a method and mechanism for efficiently processing data in parallel computation environments and the use of same in parallel aggregation processes are provided by some embodiments herein.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a system according to some embodiments.

FIG. 2 is a block diagram of an operating environment according to some embodiments.

FIGS. 3A-3D are illustrative depictions of various aspects of a data structure according to some embodiments.

FIG. 4 is a flow diagram of a method relating to a data structure, according to some embodiments herein.

FIGS. 5A-5D provide illustrative examples of some data tables according to some embodiments.

FIG. 6 is an illustrative depiction of an aggregation flow, in some embodiments herein.

FIG. 7 is a flow diagram of a method relating to an aggregation flow, according to some embodiments herein.

DETAILED DESCRIPTION

In an effort to more fully and efficiently use the resources of a particular computing environment, a data structure and techniques of using that data structure may be developed to fully exploit the design characteristics and capabilities of that particular computing environment. In some embodiments herein, a data structure and techniques for using that data structure (i.e., algorithms) are provided for efficiently using the data structure disclosed herein in a parallel computing environment with shared memory.
As used herein, the term parallel computation environment with shared memory refers to a system or device having more than one processing unit. The multiple processing units may be processors, processor cores, multi-core processors, etc. All of the processing units can access a main memory (i.e., a shared memory architecture). All of the processing units can run or execute the same program(s). As used herein, a running program may be referred to as a thread. Memory may be organized in a hierarchy of multiple levels, where faster but smaller memory units are located closer to the processing units. The smaller and faster memory units located nearer the processing units as compared to the main memory are referred to as cache.
FIG. 1 is a block diagram overview of a device, system, or apparatus 100 that may be used in providing an index hash table or hash map in accordance with some aspects and embodiments herein, as well as providing a parallel aggregation based on such data structures. System 100 may be, for example, associated with any of the devices described herein and may include a plurality of processing units 105, 110, and 115. The processing units may comprise one or more commercially available Central Processing Units (CPUs) in the form of one-chip microprocessors or a multi-core processor, coupled to a communication device 120 configured to communicate via a communication network (not shown in FIG. 1) to an end client (not shown in FIG. 1). Device 100 may also include a local cache memory associated with each of the processing units 105, 110, and 115 such as RAM memory modules. Communication device 120 may be used to communicate, for example, with one or more client devices or business service providers. System 100 further includes an input device 125 (e.g., a mouse and/or keyboard to enter content) and an output device 130 (e.g., a computer monitor to display a user interface element).
Processing units 105, 110, and 115 communicate with a shared memory 135 via a system bus 175. System bus 175 also provides a mechanism for the processing units to communicate with a storage device 140. Storage device 140 may include any appropriate information storage device, including combinations of magnetic storage devices (e.g., a hard disk drive), optical storage devices, and/or semiconductor memory devices for storing data and programs.
Storage device 140 stores a program 145 for controlling the processing units 105, 110, and 115 and a query engine application 150 for executing queries. Processing units 105, 110, and 115 may perform instructions of the program 145 and thereby operate in accordance with any of the embodiments described herein. For example, the processing units may concurrently execute a plurality of execution threads to build the index hash table data structures disclosed herein. Furthermore, query engine 150 may operate to execute a parallel aggregation operation in accordance with aspects herein in cooperation with the processing units and by accessing database 155. Program 145 and other instructions may be stored in a compressed, uncompiled and/or encrypted format. Program 145 may also include other program elements, such as an operating system, a database management system, and/or device drivers used by the processing units 105, 110, and 115 to interface with peripheral devices.
In some embodiments, storage device 140 includes a database 155 to facilitate the execution of queries based on input table data. The database may include data structures (e.g., index hash tables), rules, and conditions for executing a query in a parallel computation environment such as that of FIGS. 1 and 2.
In some embodiments, the data structure disclosed herein as being developed for use in parallel computing environments with shared memory is referred to as a parallel hash table. In some instances, the parallel hash table may also be referred to as a parallel hash map. In general, a hash table may be provided and used as index structures for data storage to enable fast data retrieval. The parallel hash table disclosed herein may be used in a parallel computation environment where multiple concurrently executing (i.e., running) threads insert and retrieve data in tables. Furthermore, an aggregation algorithm that uses the parallel hash tables herein is provided for computing an aggregate in a parallel computation environment.
FIG. 2 provides an illustrative example of a computation environment 100 compatible with some embodiments herein. While computation environment 100 may be compatible with some embodiments of the data structures and the methods herein, the data structures and the methods herein are not limited to the example computation environment 100. Processes to store, retrieve, and perform operations on data may be facilitated by a database system (DBS) and a database warehouse (DWH).
As shown in FIG. 2, DBS 210 is a server. DBS 210 further includes a database management system (DBMS) 215. DBMS 215 may comprise software (e.g., programs, instructions, code, applications, services, etc.) that controls the organization of and access to database 225 that stores data. Database 225 may include an internal memory, an external memory, or other configurations of memory. Database 225 may be capable of storing large amounts of data, including relational data. The relational data may be stored in tables. In some embodiments, a plurality of clients, such as example client 205, may communicate with DBS 210 via a communication link (e.g., a network) and specified application programming interfaces (APIs). In some embodiments, the API language provided by DBS 210 is SQL, the Structured Query Language. Client 205 may communicate with DBS 210 using SQL to, for example, create and delete tables; insert, update, and delete data; and query data.
In general, a user may submit a query from client 205 in the form of a SQL query statement to DBS 210. DBMS 215 may execute the query by evaluating the parameters of the query statement and accessing database 225 as needed to produce a result 230. The result 230 may be provided to client 205 for storage and/or presentation to the user.
One type of query is an aggregation query. As will be explained in greater detail below, a parallel aggregation algorithm, process, or operation may be used to compute SQL aggregates. In general with reference to FIG. 2, some embodiments herein may include client 205 wanting to group or aggregate data of a table stored in database 225 (e.g., a user at client 205 may desire to know the average salaries of the employees in all of a company's departments). Client 205 may connect to DBS 210 and issue a SQL query statement that describes and specifies the desired aggregation. DBMS 215 may create an executable instance of the parallel aggregation algorithm herein, provide it with the information needed to run the parallel aggregation algorithm (e.g., the name of a table to access, the columns to group by, the columns to aggregate, the aggregation function, etc.), and run the parallel aggregation operation or algorithm. In the process of running, the parallel aggregation algorithm herein may create an index hash map 220. The index hash map may be used to keep track of intermediate result data. An overall result comprising a result table may be computed based on the index hash map(s) containing the intermediate results. The overall parallel aggregation result may be transmitted to client 205.
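As a hedged illustration of the kind of aggregation query described above, the following sketch uses Python's built-in sqlite3 module to compute the average salary per department. The table name employees, its columns, and the sample rows are assumptions made purely for illustration; SQLite stands in for DBS 210 only as a convenient SQL engine and does not implement the parallel aggregation described herein.

```python
import sqlite3


def average_salary_by_department(rows):
    """Return {department: average salary} computed with a SQL GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema; not taken from the embodiments.
        conn.execute(
            "CREATE TABLE employees (name TEXT, department TEXT, salary REAL)"
        )
        conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT department, AVG(salary) FROM employees "
            "GROUP BY department ORDER BY department"
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


# Hypothetical sample data.
sample_rows = [
    ("Ann", "Sales", 50000.0),
    ("Bob", "Sales", 70000.0),
    ("Cid", "Research", 90000.0),
]
averages = average_salary_by_department(sample_rows)
```

Here the department column plays the role of a group column and AVG is the aggregation function supplied with the query.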
As an extension of FIG. 2, DWHs may be built on top of DBSs. Thus, a use-case of a DWH may be similar in some respects to DBS 210 of FIG. 2.
The computation environment of FIG. 2 may include a plurality of processors that can operate concurrently, in parallel and include a device or system similar to that described in FIG. 1. Additionally, the computation environment of FIG. 2 may have a memory that is shared amongst the plurality of processors, for example, like the system of FIG. 1. In order to fully capitalize on the parallel processing power of such a computation environment, the data structures used by the system may be designed, developed or adapted for being efficiently used in the parallel computing environment.
A hash table is a fundamental data structure in computer science that is used for mapping “keys” (e.g., the names of people) to the associated values of the keys (e.g., the phone number of the people) for fast data look-up. A conventional hash table stores key—value pairs. Conventional hash tables are designed for sequential processing.
However, for parallel computation environments there exists a need for data structures particularly suitable for use in the parallel computing environment. In some embodiments herein, the data structure of an index hash map is provided. In some aspects, the index hash map provides a lock-free cache-efficient hash data structure developed for parallel computation environments with shared memory. In some embodiments, the index hash map may be adapted to column stores.
In a departure from conventional hash tables that store key—value pairs, the index hash map herein does not store key—value pairs. The index hash map herein generates key—index pairs by mapping each distinct key to a unique integer. In some embodiments, each time a new distinct key is inserted in the index hash map, the index hash map increments an internal counter and assigns the value of the counter to the key to produce a key—index pair. The counter may provide, at any time, the cardinality of an input set of keys that have thus far been inserted in the hash map. In some respects, the key—index mapping may be used to share a single hash map among different columns (or value arrays). For example, for processing a plurality of values distributed among different columns, the associated index for the key has to be calculated just once. The use of key—index pairs may facilitate bulk insertion in columnar storages. Inserting a set of key—index pairs may entail inserting the keys in a hash map to obtain a mapping vector containing indexes. This mapping vector may be used to build a value array per value column.
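The key-index behavior described above may be sketched as follows. This is a minimal, single-threaded illustration rather than the disclosed implementation: the class name IndexHashMap and its methods are hypothetical, a Python dict stands in for the underlying hash table, and indexes are assumed to start at zero.

```python
class IndexHashMap:
    """Maps each distinct key to a unique integer index (key-index pairs)."""

    def __init__(self):
        self._index_of = {}  # key -> index; a dict stands in for the hash table
        self._counter = 0    # number of distinct keys inserted so far

    def insert(self, key):
        """Return the index of key, assigning the next counter value if key is new."""
        index = self._index_of.get(key)
        if index is None:
            index = self._counter  # assumption: indexes are zero-based
            self._index_of[key] = index
            self._counter += 1
        return index

    def insert_all(self, keys):
        """Bulk insert: return the mapping vector holding one index per input key."""
        return [self.insert(key) for key in keys]

    @property
    def cardinality(self):
        """Number of distinct keys inserted so far."""
        return self._counter


index_map = IndexHashMap()
mapping_vector = index_map.insert_all(["b", "a", "b", "c", "a"])
```

After the bulk insert, mapping_vector is [0, 1, 0, 2, 1] and index_map.cardinality is 3. The same mapping vector can then be reused for every value column that belongs to the inserted keys.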
Referring to FIGS. 3A-3D, input data is illustrated in FIG. 3A including a key array 305. For each distinct key 315 from key array 305, the index hash map returns an index 320 (i.e., a unique integer), as seen in FIG. 3B. When all of the keys, from a column for example, have been inserted in the hash map, the mapping vector of FIG. 3C results. The entries in the mapping vector of FIG. 3C are the indexes that point to a value array “A” 330 illustrated in FIG. 3D. The mapping vector of FIG. 3C may be used to aggregate the “Kf” columns 310 shown in FIG. 3A. The result of the aggregation of columns 310 is depicted in FIG. 3D at 335.
To achieve maximum parallel processor utilization, the index hash maps herein may be designed to avoid locking when being operated on by concurrently executing threads by producing wide data independence. In some embodiments, index hash maps herein may be described by a framework defining a two-step process. In a first step, input data is split or separated into equal-sized blocks and the blocks are assigned to worker execution threads. These worker execution threads may produce intermediate results by building relatively small local hash tables or hash maps. Each local hash map is private to the thread that produces it. Accordingly, other threads may not see or access the local hash map produced by a given thread.
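The use of a mapping vector to aggregate several value columns may be sketched as follows. The key values and the two value columns are invented for illustration and are not the data of FIGS. 3A-3D; the function names are likewise hypothetical, and summation is assumed as the aggregation function.

```python
def build_mapping_vector(keys):
    """Assign each distinct key an index and return (key -> index, mapping vector)."""
    index_of = {}
    mapping = []
    for key in keys:
        if key not in index_of:
            index_of[key] = len(index_of)  # next unused index
        mapping.append(index_of[key])
    return index_of, mapping


def aggregate_sum(mapping, values, group_count):
    """Sum values into a value array addressed by the indexes of the mapping vector."""
    value_array = [0] * group_count
    for index, value in zip(mapping, values):
        value_array[index] += value
    return value_array


# Hypothetical input: one key column and two value columns.
keys = ["x", "y", "x", "z", "y"]
kf1 = [10, 20, 30, 40, 50]
kf2 = [1, 2, 3, 4, 5]

index_of, mapping = build_mapping_vector(keys)
sums_kf1 = aggregate_sum(mapping, kf1, len(index_of))
sums_kf2 = aggregate_sum(mapping, kf2, len(index_of))
```

The keys are hashed only once; both value columns are aggregated with the same mapping vector, which is the sharing described above.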
In a second step, the local hash maps including the intermediate results may be merged to obtain a global result by concurrently executing merger threads. When accessing and processing the local hash maps, each of the merger threads may only consider a dedicated range of hash values. The merger threads may process hash-disjoint partitions of the local hash maps and produce disjoint result hash tables that may be concatenated to build an overall result.
FIG. 4 is a flow diagram related to a data structure framework 400, in accordance with some embodiments herein. At S405, an input data table is separated or divided into a plurality of partitions. The size of the partitions may relate to or even be the size of a memory unit such as, for example, a cache associated with parallel processing units. In some embodiments, the partitions are equal in size. Furthermore, a first plurality of execution threads running in parallel may each generate a local hash table or hash map. Each of the local hash maps is private to the one of the plurality of threads that generated the local hash map.
The second step of the data structure framework herein is depicted in FIG. 4 at S410. At S410, the local hash maps are merged. The merging of the local hash maps produces a set of disjoint result hash tables or hash maps.
In some embodiments, when accessing and processing the local hash maps, each of the merger threads may only consider a dedicated range of hash values. From a logical perspective, the local hash maps may be considered as being partitioned by their hash value. One implementation may use, for example, some first bits of the hash value to form a range of hash values. The same ranges are used for all local hash maps, thus the “partitions” of the local hash maps are disjoint. As an example, if a value “a” is in range 5 of a local hash map, then the value will be in the same range of other local hash maps. In this manner, all identical values of all local hash maps may be merged into a single result hash map. Since the “partitions” are disjoint, the merged result hash maps may be created without a need for locks. Additionally, further processing on the merged result hash maps may be performed without locks since any execution threads will be operating on disjoint data.
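Partitioning by the first bits of the hash value may be sketched as follows. The sketch makes several assumptions that are not part of the disclosure: a 32-bit CRC from Python's zlib module is used as the hash function, two leading bits form four ranges, and the local hash maps hold partial sums per key rather than key-index pairs.

```python
import zlib

HASH_BITS = 32   # width of the assumed hash function
RANGE_BITS = 2   # assumption: the first 2 bits select one of 4 ranges


def hash_range(key):
    """Return the range number formed by the first RANGE_BITS bits of the hash."""
    hash_value = zlib.crc32(str(key).encode("utf-8"))
    return hash_value >> (HASH_BITS - RANGE_BITS)


def merge_range(local_maps, range_id):
    """Merge, from every local map, only the entries whose key lies in range_id."""
    result = {}
    for local_map in local_maps:
        for key, value in local_map.items():
            if hash_range(key) == range_id:
                result[key] = result.get(key, 0) + value
    return result


# Hypothetical local hash maps produced by two worker threads.
local_maps = [{"a": 1, "b": 2}, {"a": 3, "c": 4}]
results = [merge_range(local_maps, r) for r in range(1 << RANGE_BITS)]
```

Because a given key always falls into the same range, each call to merge_range touches a set of keys that no other range touches, so the calls could run on separate threads without locks.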
In some embodiments, the local (index) hash maps providing the intermediate results may be of a fixed size. Instead of resizing a local hash map, the corresponding worker execution thread may replace its local hash map with a new hash map when a certain load factor is reached and place the current local hash map into a buffer containing hash maps that are ready to be merged. In some embodiments, the size of the local hash maps may be sized such that the local hash maps fit in a cache (e.g., L2 or L3). The specific size of the cache may depend on the sizes of caches in a given CPU architecture.
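Replacing a full local hash map instead of resizing it may be sketched as follows. The class name LocalAggregator, the capacity and load-factor values, and the use of a plain list as the buffer are assumptions for illustration; the sketch simulates a single worker thread and sums values per key.

```python
class LocalAggregator:
    """Worker-side aggregation into a fixed-size local map that is never resized."""

    def __init__(self, capacity, max_load_factor, buffer):
        # Number of distinct keys allowed before the map is declared full.
        self.max_entries = int(capacity * max_load_factor)
        self.buffer = buffer   # holds local maps that are ready to be merged
        self.local_map = {}

    def add(self, key, value):
        """Aggregate one row; start a new local map if the current one is full."""
        if key not in self.local_map and len(self.local_map) >= self.max_entries:
            self.flush()
        self.local_map[key] = self.local_map.get(key, 0) + value

    def flush(self):
        """Move the current local map into the buffer and begin a fresh one."""
        if self.local_map:
            self.buffer.append(self.local_map)
            self.local_map = {}


buffer = []
aggregator = LocalAggregator(capacity=4, max_load_factor=0.5, buffer=buffer)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    aggregator.add(key, value)
aggregator.flush()  # hand over the last, partially filled map
```

With a capacity of 4 and a load factor of 0.5, each local map holds at most two distinct keys, so the buffer ends up with two maps. The same key may appear in several buffered maps; combining those entries is left to the merge step.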
In some aspects, insertions and lookups of keys may largely take place in cache. In some embodiments, over-crowded areas within a local hash map may be avoided by maintaining statistical data regarding the local hash maps. The statistical data may indicate when the local hash map should be declared full (independent of an actual load factor). In some aspects and embodiments, the size of a buffer of a computing system and environment holding local hash maps ready to be merged is a tuning parameter, wherein a smaller buffer may induce more merge operations while a larger buffer will necessarily require more memory.
In some embodiments, a global result may be organized into bucketed index hash maps where each result hash map includes multiple fixed-size physical memory blocks. In this configuration, cache-efficient merging may be realized, and memory allocation may be more efficient and sustainable since allocated blocks may be shared between queries. In some aspects, when a certain load factor within a global result hash map is reached during a merge operation, the hash map may be resized. Resizing a hash map may be accomplished by increasing its number of memory blocks. Resizing a bucketed index hash map may require knowing which entries are to be repositioned. In some embodiments, the maps' hash function may be chosen such that its codomain increases by adding further least significant bits as needed during a resize operation. In an effort to avoid too many resize operations, an estimate of a final target size may be determined before an actual resizing of the hash map.
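One possible reading of a hash function whose codomain grows by further bits is sketched below. It is an interpretation rather than the disclosed design: the bucket number is taken from the first bits of an assumed 32-bit CRC hash, and a resize uses one additional bit, so every entry of old bucket b lands in new bucket 2b or 2b + 1. The function names and the list-of-lists bucket layout are hypothetical.

```python
import zlib

HASH_BITS = 32  # width of the assumed hash function


def bucket_of(key, bits):
    """Bucket number formed from the first `bits` bits of a 32-bit hash."""
    if bits == 0:
        return 0
    return zlib.crc32(str(key).encode("utf-8")) >> (HASH_BITS - bits)


def grow(buckets, bits):
    """Double the number of buckets by using one additional hash bit."""
    new_bits = bits + 1
    new_buckets = [[] for _ in range(1 << new_bits)]
    for entries in buckets:
        for key, index in entries:
            # The new bucket number is 2 * old or 2 * old + 1, so entries of
            # one old bucket are only ever split between two new buckets.
            new_buckets[bucket_of(key, new_bits)].append((key, index))
    return new_buckets, new_bits


# Hypothetical keys placed into 2 buckets, then resized to 4 buckets.
bits = 1
buckets = [[] for _ in range(1 << bits)]
for index, key in enumerate(["a", "b", "c", "d", "e", "f"]):
    buckets[bucket_of(key, bits)].append((key, index))
new_buckets, new_bits = grow(buckets, bits)
```

Under this reading, the entries to be repositioned are known from the old bucket number alone, and each old bucket can be split independently of the others.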
In some embodiments, the index hash map framework discussed above may provide an infrastructure to implement parallelized query processing algorithms or operations. One embodiment of a parallelized query processing algorithm includes a hash-based aggregation, as will be discussed in greater detail below.
In some embodiments, a parallelized aggregation refers to a relational aggregation that groups and condenses relational data stored in tables. An example of a table that may form an input of a parallel aggregation operation herein is depicted in FIG. 5A. Table 500 includes sales data. The sales data is organized in three columns: a Product column 505, a Country column 510, and a Revenue column 515. Table 500 may be grouped and aggregated by, for example, four combinations of columns: by Product and Country, by Product, and by Country. In the following discussion, the columns by which an aggregation groups the data are referred to as group columns.
Aggregation result tables determined by the four different groupings are illustrated in FIGS. 5B-5D. Each of the result tables 520, 540, 555, and 570 contains the distinct values (groups) of the desired group columns and, per group, the aggregated value. For example, table 520 includes the results based on grouping by Product and Country. Columns 525 and 530 include the distinct Product and Country values (i.e., groups) of the desired Product and Country columns (FIG. 5A, columns 505 and 510) and the aggregated value for each distinct Product and Country group is included in column 535. Furthermore, table 540 includes the results based on grouping by Product. Column 545 includes the distinct Product values (i.e., groups) of the desired Product column (FIG. 5A, column 505) and the aggregated value for each distinct Product group. Table 555 includes the results based on grouping by Country where columns 560 and 565 include the distinct Country values (i.e., groups) of the desired Country column (FIG. 5A, column 510) and the aggregated value for each distinct Country group.
In some embodiments, such as the examples of FIGS. 5A-5D, a summation function SUM is used to aggregate values. However, other aggregation functions such as, for example and not as a limitation, a COUNT, a MIN, a MAX, and an AVG aggregation function may be used. The column containing the aggregates may be referred to herein as the aggregate column. Thus, the aggregate columns in FIGS. 5A-5E are columns 535, 550, 560, and 575, respectively.
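Grouping by different combinations of group columns with a SUM aggregate may be sketched sequentially as follows. The rows are invented for illustration and are not the contents of table 500; the function name group_and_sum is hypothetical.

```python
from collections import defaultdict


def group_and_sum(rows, group_columns, aggregate_column):
    """Return {group: sum of aggregate_column} for the given group columns."""
    totals = defaultdict(int)
    for row in rows:
        group = tuple(row[column] for column in group_columns)
        totals[group] += row[aggregate_column]
    return dict(totals)


# Hypothetical sales rows with Product, Country, and Revenue columns.
sales = [
    {"Product": "Car", "Country": "DE", "Revenue": 10},
    {"Product": "Car", "Country": "US", "Revenue": 20},
    {"Product": "Bike", "Country": "DE", "Revenue": 5},
    {"Product": "Car", "Country": "DE", "Revenue": 7},
]

by_product_and_country = group_and_sum(sales, ["Product", "Country"], "Revenue")
by_product = group_and_sum(sales, ["Product"], "Revenue")
by_country = group_and_sum(sales, ["Country"], "Revenue")
```

Each result maps a group (a tuple of group column values) to its aggregated value, which corresponds to one row of a result table with its aggregate column.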
In an effort to fully utilize the resources of parallel computing environments with shared memory, an aggregation operation should be computed and determined in parallel. If the aggregation is not computed in parallel, the processing performance for the aggregation would be bound by the speed of a single processing unit instead of being realized by the multiple processing units available in the parallel computing environment.
FIG. 6 is an illustrative depiction of a parallel aggregation flow, according to some embodiments herein. In some aspects, the parallel aggregation flow 600 uses the index hash table framework discussed hereinabove. In the example of FIG. 6, two degrees of parallelism are depicted and are achieved by the concurrent execution of two execution threads. However, the concepts conveyed by FIG. 6 may be extended to additional degrees of parallelism, including computation environments now known and those that become known in the future. In FIG. 6, input table 605 is separated into a plurality of partitions. Input table 605 is shown divided into partitions 610 and 615. All of or a portion of table 605 may be split into partitions for parallel aggregation. Portions of table 605 not initially partitioned and processed by a parallel aggregation operation may subsequently be partitioned for parallel aggregation processing. Table 605 may be partitioned into equal-sized partitions. Partitions 610 and 615 are but two example partitions, and additional partitions may exist and be processed in the parallel aggregation operations herein.
In some embodiments, a first plurality of execution threads, aggregator threads, are initially running and a second plurality of execution threads are not initially running or are in a sleep state. The concurrently operating aggregator threads operate to fetch an exclusive part of table 605. Partition 610 is fetched by aggregator thread 620 and partition 615 is fetched by aggregator thread 625.
Each of the aggregator threads may read their partition and aggregate the values of each partition into a private local hash table or hash map. Aggregator thread 620 produces private hash map 630 and aggregator thread 625 produces local hash map 635. Since each aggregator thread processes its own separate portion of input table 605, and has its private hash map, the parallel processing of the partitions may be accomplished lock-free.
In some embodiments, the local hash tables may be the same size as the cache associated with the processing unit executing an aggregator thread. Sizing the local hash tables in this manner may function to avoid cache misses. In some aspects, input data may be read from table 605 to aggregate and written to the local hash tables row-wise or column-wise.
When a partition is consumed by an aggregator thread, the aggregator thread may fetch another, unprocessed partition of input table 605. In some embodiments, the aggregator threads move their associated local hash maps into a buffer 640 when the local hash table reaches a threshold size, initiate a new local hash table, and proceed.
In some embodiments, when the number of hash tables in buffer 640 reaches a threshold, the aggregator threads may wake up a second plurality of execution threads, referred to in the present example as merger threads, and the aggregator threads may enter a sleep state. In some embodiments, the local hash maps may be retained in buffer 640 until the entire input table 605 is consumed by the aggregator threads 620 and 625. When the entire input table 605 is consumed by the aggregator threads 620 and 625, the second plurality of execution threads, the merger threads, are awakened and the aggregator threads enter a sleep state.
Each of the merger threads is responsible for a certain partition of all of the private hash maps in buffer 640. The particular data partition each merger thread is responsible for may be determined by assigning distinct, designated key values of the local hash maps to each of the merger threads. That is, the portion of the data for which each merger thread is responsible may be determined by “key splitting” in the local hash maps. As illustrated in FIG. 6, merger thread 1 is responsible for designated keys 665 and merger thread 2 is responsible for keys 670. Each of the merger threads 1 and 2 operates to iterate over all of the private hash maps in buffer 640, read its respective data partition as determined by the key splitting, and merge its respective data partition into a thread-local part hash table (or part hash map).
As further illustrated in FIG. 6, merger thread 1 (662) and merger thread 2 (664) each consider all of the private hash maps in buffer 640 based on the key based partitions they are each responsible for and produce, respectively, part hash map 1 (675) and part hash map 2 (680).
In some embodiments, in the instance a merger thread has processed its data partition and there are additional data partitions in need of being processed, the executing merger threads may acquire responsibility for a new data partition and proceed to process the new data partition as discussed above. In the instance all data partitions are processed, the merger threads may enter a sleep state and the aggregator threads may return to an active, running state. Upon returning to the active, running state, the processes discussed above may repeat.
In the instance there is no more data to be processed by the aggregator threads and the merger threads, the parallel aggregation operation herein may terminate. The results of the aggregation process will be contained in the set of part hash maps (e.g., 675 and 680). In some respects, the part hash maps may be seen as forming a parallel result since the part hash maps are disjoint.
In some embodiments, the part hash maps may be processed in parallel. As an example, a having clause may be evaluated and applied to every group or parallel sorting and merging may be performed thereon.
An overall result may be obtained from the disjoint part hash maps by concatenating them together, as depicted in FIG. 6 at 685.
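The two-phase flow of FIG. 6 may be sketched with Python threads as follows. This is a simplified illustration under stated assumptions: all local hash maps are kept until the whole input is consumed, a 32-bit CRC hash with one leading bit yields two hash ranges, values are summed, and all function names are hypothetical. In CPython the threads illustrate the structure rather than a speedup, because the interpreter serializes bytecode execution.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

RANGE_BITS = 1                 # assumption: one leading hash bit, two ranges
RANGE_COUNT = 1 << RANGE_BITS


def key_range(key):
    """Range number taken from the first RANGE_BITS bits of a 32-bit hash."""
    return zlib.crc32(str(key).encode("utf-8")) >> (32 - RANGE_BITS)


def aggregate_partition(partition):
    """Aggregator step: sum the values of one partition into a private local map."""
    local_map = {}
    for key, value in partition:
        local_map[key] = local_map.get(key, 0) + value
    return local_map


def merge_range(local_maps, range_id):
    """Merger step: merge the entries of one hash range from all local maps."""
    part_map = {}
    for local_map in local_maps:
        for key, value in local_map.items():
            if key_range(key) == range_id:
                part_map[key] = part_map.get(key, 0) + value
    return part_map


def parallel_sum(rows, partition_size, workers=2):
    """Aggregate (key, value) rows in two lock-free phases and concatenate."""
    partitions = [
        rows[start:start + partition_size]
        for start in range(0, len(rows), partition_size)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_maps = list(pool.map(aggregate_partition, partitions))
        part_maps = list(
            pool.map(lambda r: merge_range(local_maps, r), range(RANGE_COUNT))
        )
    overall = {}
    for part_map in part_maps:
        overall.update(part_map)  # part maps are disjoint, so nothing is overwritten
    return overall


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
totals = parallel_sum(rows, partition_size=2)
```

No lock appears in either phase: each aggregator writes only to its own local map, and each merger writes only to its own part hash map while reading the local maps, which are no longer modified.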
FIG. 7 is an illustrative example of a flow diagram 700 relating to some parallel aggregation embodiments herein. At S705, exclusive partitions of an input data table are received or retrieved for aggregating in parallel. At S710 the values of each of the exclusive partitions are aggregated. In some embodiments, the values of each partition are aggregated into a local hash map by one of a plurality of concurrently running execution threads.
At S715, a determination is made whether the aggregating of the input table partitions is complete or whether the buffer is full. In the instance additional partitions remain to be aggregated and buffer 640 is not full, whether at the end of aggregating a current partition and/or for other considerations, process 700 returns to further aggregate partitions of the input data and store the aggregated values in key-index pairs in local hash tables. In the instance aggregating of the partitions is complete or the buffer is full, process 700 proceeds to assign designated parts of the local hash tables or hash maps to a second plurality of execution threads at S720. The second plurality of execution threads work to merge the designated parts of the local hash maps into thread-local part hash maps at S725 and to produce result tables.
At S730, a determination is made whether the aggregating is complete. In the instance the aggregating is not complete, process 700 returns to further aggregate partitions of the input data. In the instance aggregating is complete, process 700 proceeds to S735.
At S735, process 700 operates to generate a global result by assembling the results obtained at S725 into a composite result table. In some embodiments, the overall result may be produced by concatenating the part hash maps of S725 to each other.
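The control flow of process 700, including the alternation between aggregating and merging when the buffer fills, may be sketched as a sequential simulation. The step labels in the comments refer to FIG. 7; the function names, the CRC-based hash, the use of summation, and the buffer capacity are assumptions for illustration, and real embodiments would run the two inner phases on concurrently executing threads.

```python
import zlib


def key_range(key, range_count):
    """Map a 32-bit hash onto range_count contiguous ranges of hash values."""
    return (zlib.crc32(str(key).encode("utf-8")) * range_count) >> 32


def run_process_700(partitions, buffer_capacity, range_count=2):
    """Sequentially simulate the aggregate/merge loop of FIG. 7."""
    if buffer_capacity < 1:
        raise ValueError("buffer_capacity must be at least 1")
    pending = list(partitions)
    buffer = []
    part_maps = [{} for _ in range(range_count)]
    while pending:  # S730: repeat while input remains
        # S705/S710: aggregate partitions into local maps until the input is
        # exhausted or the buffer is full (S715).
        while pending and len(buffer) < buffer_capacity:
            local_map = {}
            for key, value in pending.pop(0):
                local_map[key] = local_map.get(key, 0) + value
            buffer.append(local_map)
        # S720/S725: each merger range merges its share of the buffered maps.
        for range_id in range(range_count):
            part_map = part_maps[range_id]
            for local_map in buffer:
                for key, value in local_map.items():
                    if key_range(key, range_count) == range_id:
                        part_map[key] = part_map.get(key, 0) + value
        buffer.clear()
    # S735: concatenate the disjoint part hash maps into the overall result.
    result = {}
    for part_map in part_maps:
        result.update(part_map)
    return result


partitions = [[("a", 1), ("b", 2)], [("a", 3)], [("c", 4), ("b", 5)]]
overall = run_process_700(partitions, buffer_capacity=2)
```

With a buffer capacity of 2 and three partitions, the loop performs two merge rounds; the part hash maps persist across rounds, so keys seen in different rounds are still combined.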
Each system described herein may be implemented by any number of devices in communication via any number of other public and/or private networks. Two or more of the devices herein may be co-located, may be a single device, or may be located remote from one another and may communicate with one another via any known manner of network(s) and/or a dedicated connection. Moreover, each device may comprise any number of hardware and/or software elements suitable to provide the functions described herein as well as any other functions. Other topologies may be used in conjunction with other embodiments.
All systems and processes discussed herein may be embodied in program code stored on one or more computer-readable media. Such media may include, for example, a floppy disk, a CD-ROM, a DVD-ROM, magnetic tape, and solid state Random Access Memory (RAM) or Read Only Memory (ROM) storage units. According to some embodiments, a memory storage unit may be associated with access patterns and may be independent from the device (e.g., magnetic, optoelectronic, semiconductor/solid-state, etc.). Moreover, in-memory technologies may be used such that databases, etc. may be completely operated in RAM memory at a processor. Embodiments are therefore not limited to any specific combination of hardware and software.
Embodiments have been described herein solely for the purpose of illustration. Persons skilled in the art will recognize from this description that embodiments are not limited to those described, but may be practiced with modifications and alterations limited only by the spirit and scope of the appended claims.

Claims

1. A computer implemented method, comprising:

separating an input table into a plurality of partitions;

generating, by each of a first plurality of execution threads operating concurrently, a local hash table for each of the partitions, each local hash table storing key—index pairs; and

merging the local hash tables, by a second plurality of execution threads operating concurrently, to produce a set of disjoint result hash tables.

2. The method of claim 1, wherein each distinct key is mapped to a unique integer.

3. The method of claim 1, wherein the local hash table generated by each of the first plurality of execution threads is private to the execution thread that generated it and is independent of other local hash tables.

4. The method of claim 1, wherein each of the second plurality of execution threads processes a dedicated range of hash values of all of the local hash tables.

5. The method of claim 1, wherein the local hash tables are of a fixed size.

6. The method of claim 1, further comprising concatenating the set of disjoint result hash tables to obtain an overall result.

7. A computer implemented method, comprising:

retrieving, by concurrently executing a first plurality of execution threads, disjoint partitions of an input table;

aggregating, by each of the first plurality of execution threads, values of each partition into a respective local hash table, each local hash table storing key—index pairs; and

merging the local hash tables, by a second plurality of execution threads operating concurrently, to produce a set of disjoint result hash tables, each of the second plurality of execution threads responsible for a dedicated range of hash values of all of the local hash tables.

8. The method of claim 7, wherein each distinct key is mapped to a unique integer.

9. The method of claim 7, wherein the local hash table generated by each of the first plurality of execution threads is private to the execution thread that generated it and is independent of other local hash tables.

10. The method of claim 7, wherein the plurality of first execution threads further:

retrieves another partition of the input table when a previously retrieved partition is consumed by the plurality of first execution threads; and

moves the local hash tables to a buffer when a hash table becomes a threshold size, initializes a new local hash table, and proceeds to retrieve another partition of the input table.

11. The method of claim 7, further comprising concatenating the set of disjoint result hash tables to obtain an overall result.

12. The method of claim 1, wherein the second plurality of execution threads are further responsible for another dedicated range of hash values of all of the local hash tables in an instance the second plurality of execution threads have processed all of the local hash tables and other ranges of hash values remain unprocessed.

13. A system, comprising:

a plurality of processing units;

a shared memory accessible by all of the plurality of processing units;

a database to store an input table; and

a query engine to execute a query comprising:

separating the input table into a plurality of partitions;

generating, by each of a first plurality of execution threads executing concurrently by the plurality of processing units, a local hash table for each of the partitions, each local hash table storing key—index pairs; and

merging the local hash tables, by a second plurality of execution threads executing concurrently by the plurality of processing units, to produce a set of disjoint result hash tables.

14. The system of claim 13, wherein each distinct key is mapped to a unique integer.

15. The system of claim 13, wherein the local hash table generated by each of the first plurality of execution threads is private to the execution thread that generated it and is independent of other local hash tables.

16. The system of claim 13, wherein each of the second plurality of execution threads processes a dedicated range of hash values of all of the local hash tables.

17. The system of claim 13, wherein the local hash tables are of a fixed size.

18. The system of claim 13, further comprising concatenating the set of disjoint result hash tables to obtain an overall result.

19. A system, comprising:

a plurality of processing units;

a shared memory accessible by all of the plurality of processing units;

a database to store an input table; and

a query engine to execute an aggregation query comprising:

retrieving, by concurrently executing a first plurality of execution threads by the plurality of processing units, disjoint partitions of an input table;

merging the local hash tables, by a second plurality of execution threads executing concurrently by the plurality of processing units, to produce a set of disjoint result hash tables, each of the second plurality of execution threads responsible for a dedicated range of hash values of all of the local hash tables.

20. The system of claim 19, wherein each distinct key is mapped to a unique integer.

21. The system of claim 19, wherein the local hash table generated by each of the first plurality of execution threads is private to the execution thread that generated it and is independent of other local hash tables.

22. The system of claim 19, wherein the plurality of first execution threads further:

retrieve another partition of the input table when a previously retrieved partition is consumed by the plurality of first execution threads; and

move the local hash tables to a buffer when a hash table becomes a threshold size, initialize a new local hash table, and proceed to retrieve another partition of the input table.

23. The system of claim 19, further comprising concatenating the set of disjoint result hash tables to obtain an overall result.