US20090292726A1 - System and Method for Identifying Hierarchical Heavy Hitters in Multi-Dimensional Data - Google Patents


Info

Publication number
US20090292726A1
Authority
US
United States
Prior art keywords
node
nodes
frequency count
hhh
data structure
Legal status
Abandoned
Application number
US12/512,723
Inventor
Graham Cormode
Philip Russell Korn
Shanmugavelayutham Muthukrishnan
Divesh Srivastava
Current Assignee
AT&T AND REGENTS RUTGERS UNIVERSITY
Original Assignee
AT&T AND REGENTS RUTGERS UNIVERSITY
Application filed by AT&T AND REGENTS RUTGERS UNIVERSITY filed Critical AT&T AND REGENTS RUTGERS UNIVERSITY
Priority to US12/512,723 priority Critical patent/US20090292726A1/en
Assigned to AT&T AND THE REGENTS RUTGERS UNIVERSITY reassignment AT&T AND THE REGENTS RUTGERS UNIVERSITY ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: MUTHUKRISHNAN, SHANMUGAVELAYUTHAM, CORMODE, GRAHAM, KORN, PHILLIP RUSSELL, SRIVASTAVA, DIVESH
Publication of US20090292726A1 publication Critical patent/US20090292726A1/en
Assigned to AT&T CORP., RUTGERS, THE STATE UNIVERSITY OF NEW JERSEY reassignment AT&T CORP. CORRECTIVE ASSIGNMENT TO CORRECT THE CORRECT THE ASSIGNEE'S NAME/RECEIVING PARTIES TO: AT&T CORP AND RUTGERS, THE STATE UNIVERSITY OF NEW JERSEY PREVIOUSLY RECORDED ON REEL 023047 FRAME 0622. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT. Assignors: MUTHUKRISHNAN, SHANMUGAVELAYUTHAM, KORN, PHILIP RUSSELL, SRIVASTAVA, DIVESH, CORMODE, GRAHAM

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24553Query execution of query operations
    • G06F16/24554Unary operations; Data partitioning operations
    • G06F16/24556Aggregation; Duplicate elimination
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99941Database schema or data structure
    • Y10S707/99944Object-oriented database structure
    • Y10S707/99945Object-oriented database structure processing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10TECHNICAL SUBJECTS COVERED BY FORMER USPC
    • Y10STECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y10S707/00Data processing: database and file management or data structures
    • Y10S707/99941Database schema or data structure
    • Y10S707/99948Application of database or data structure, e.g. distributed, multimedia, or image


Abstract

A method including receiving a plurality of elements of a data stream, storing a multi-dimensional data structure in a memory, said multi-dimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to the number of elements stored therein, comparing the frequency count of each node to a threshold value based on a total number of the elements stored in the nodes, identifying each node for which the frequency count is at least as great as the threshold value as a hierarchical heavy hitter (HHH) node, and propagating the frequency count of each non-HHH node to its corresponding parent nodes.

Description

    INCORPORATION BY REFERENCE
  • The entire disclosure of U.S. patent application Ser. No. 10/802,605, entitled “Method and Apparatus for Identifying Hierarchical Heavy Hitters in a Data Stream” filed Mar. 17, 2004 is incorporated, in its entirety, herein. The entire disclosure of U.S. Provisional Patent Appln. 60/560,666, entitled “Diamond in the Rough: Finding Hierarchical Heavy Hitters in Multi-Dimensional Data” filed Apr. 8, 2004 is incorporated, in its entirety, herein.
  • BACKGROUND
  • Aggregation along hierarchies is a critical data summarization technique in a large variety of online applications, including decision support (e.g., online analytical processing (OLAP)), network management (e.g., internet protocol (IP) clustering, denial-of-service (DoS) attack monitoring), text (e.g., on prefixes of strings occurring in the text), and extensible markup language (XML) summarization (i.e., on prefixes of root-to-leaf paths in an XML data tree). In such applications, data is inherently hierarchical and it is desirable to monitor and maintain aggregates of the data at different levels of the hierarchy over time in a dynamic fashion.
  • A heavy hitter (HH) is an element of a data set having a frequency which is greater than or equal to a user-defined threshold. A conventional algorithm for identifying the HHs in the data set maintains a summary structure which allows the frequencies of the elements to be estimated within a pre-defined error bound. The conventional HH algorithm, however, does not account for any hierarchy in the data set. It is also possible to store information for each node in a hierarchy and calculate HHs based on this information. However, storing data for all nodes and performing the associated calculations is prohibitively expensive, and this method also provides superfluous results. A need exists for identifying hierarchical heavy hitters (“HHHs”) in data sets having multiple dimensions.
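  • As a brief, non-limiting illustration of the flat (non-hierarchical) heavy hitter notion described above, the following Python sketch flags every element whose frequency is at least a user-defined fraction of the total; the packet values and the fraction of 0.35 are hypothetical and are not part of the claimed method:

    from collections import Counter

    def heavy_hitters(items, phi):
        # Flat heavy hitters: elements whose frequency is >= phi * N (no hierarchy considered).
        counts = Counter(items)
        n = len(items)
        return {e: c for e, c in counts.items() if c >= phi * n}

    # With phi = 0.35 and N = 10, only "10.0.0.1" (frequency 4 >= 3.5) qualifies.
    packets = ["10.0.0.1"] * 4 + ["10.0.0.2"] * 3 + ["10.0.0.3"] * 3
    print(heavy_hitters(packets, 0.35))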
  • SUMMARY OF THE INVENTION
  • A method including receiving a plurality of elements of a data stream, storing a multi-dimensional data structure in a memory, said multi-dimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to the number of elements stored therein, comparing the frequency count of each node to a threshold value based on a total number of the elements stored in the nodes, identifying each node for which the frequency count is at least as great as the threshold value as a hierarchical heavy hitter (HHH) node, and propagating the frequency count of each non-HHH node to its corresponding parent nodes.
  • A system which includes a receiving element receiving a plurality of elements of a data stream, a storage element storing a multi-dimensional data structure in a memory, said multi-dimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to a number of elements stored therein, a comparator element comparing the frequency count of each node to a threshold value based on a total number of the elements stored in the nodes, wherein, when the frequency count is at least as great as the threshold value, the node is identified as a hierarchical heavy hitter (HHH) node, and a propagation element propagating the frequency count of each non-HHH node to its corresponding parent nodes.
  • A computer readable storage medium including a set of instructions executable by a processor, the set of instructions operable to receive a plurality of elements of a data stream, store a multi-dimensional data structure in a memory, said multidimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to a number of elements stored therein, compare the frequency count of each node to a threshold value based on a total number of the elements stored in the plurality of nodes, wherein, when the frequency count is at least as great as the threshold value, the node is identified as a hierarchical heavy hitter (HHH) node and propagate the frequency count of each non-HHH node to its corresponding parent nodes.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 shows an exemplary two-dimensional (“2-D”) data structure.
  • FIGS. 2A-B show an exemplary embodiment of a portion of a data structure for the purpose of demonstrating an exemplary frequency count propagation according to the present invention.
  • FIG. 3 shows an exemplary method for inserting and compressing data elements in a summary data structure for identifying HHHs in a data structure implementing the overlap case for streaming data according to the present invention.
  • FIG. 4 shows an exemplary method for identifying HHHs in a data structure implementing the overlap case for streaming data according to the present invention.
  • DETAILED DESCRIPTION
  • The present invention may be further understood with reference to the following description and the appended drawings, wherein like elements are referred to with the same reference numerals. The exemplary embodiment of the present invention describes a method for identifying hierarchical heavy hitters (“HHHs”) in a multidimensional data structure. The multidimensional data structure and methods for identifying the HHHs therein will be discussed in detail below.
  • In the exemplary embodiments, the exemplary hierarchical data is described as data representing IP addresses in IP traffic data. The IP addresses are by their nature hierarchical, i.e., each individual address is arranged into subnets, which are within networks, which are within the IP address space. Therefore the collection of multiple data points based on IP addresses, and the generalization of these IP addresses, will result in a hierarchical data structure. The concept of generalization will be described in greater detail below.
  • However, those of skill in the art will understand that the use of IP addresses is only exemplary and that the present invention may be applied to any type of data which may be represented hierarchically. Other examples of hierarchical data include data collected based on time (e.g., hour, day, week, etc.) or data collected based on location (e.g., city, county, state, etc.). This type of data may also be stored, arranged and viewed in a hierarchical manner.
  • The hierarchical data may be static or streamed data and the exemplary embodiments of the present invention may be applied to either static or streamed data. For example, the data collected in the IP traffic scenario may be considered streaming data because new data points are continually being added to the set of data points in the data structure. Thus, determining HHHs may be continuous as the data changes. However, it is also possible to take a snapshot of the data at a particular point in time (static data) and perform the HHH analysis on this static data. An example of static hierarchical data may be sales information which is based on time and location. This information may be collected and stored for analysis at a later time. Again, there are any number of examples of hierarchical data that may be streaming, static or either depending on the data collection methods.
  • The general purpose of collecting and storing this data is to mine the data to determine patterns and information from the data. For example, if a specific IP address (or range of IP addresses in the hierarchy) is receiving an unusually high amount of traffic, this may indicate a denial of service attack on the network. In another example, a specific region may show a high number of sales at a particular time indicating that additional salespeople should be staffed at these times. These high traffic points or paths will be indicated by identifying HHHs in the data structure.
  • U.S. patent application Ser. No. 10/802,605, entitled “Method and Apparatus for Identifying Hierarchical Heavy Hitters in a Data Stream” filed Mar. 17, 2004, which is incorporated by reference, in its entirety, herein, describes exemplary methods for identifying HHHs in a one-dimensional hierarchical data structure. The exemplary embodiment of the present invention is directed at identifying HHHs in multi-dimensional data structures. These multi-dimensional data structures present problems for identifying HHHs that are not present in a one-dimensional data structure. For example, one-dimensional data structures do not present the issue of common ancestors that multi-dimensional data structures present (e.g., a child node having two parent nodes with one common grandparent node). The exemplary embodiments will provide solutions for the unique issues presented for identifying HHHs in multi-dimensional data structures.
  • Initially, FIG. 1 shows an exemplary two-dimensional (“2-D”) data structure 1 for which exemplary embodiments of the present invention may be used to determine HHHs. The description of data structure 1 will include terminology and notations that are presented in the formulations that follow. The exemplary 2-D data structure 1 may be used to model two dimensional data associated with IP traffic data. In this example, the data is considered two dimensional because there are two attributes which are being used to populate the data structure, i.e., the source address and the destination address. Those of skill in the art will understand that additional dimensions may be added to the data structure by collecting and storing additional information. For example, if the port numbers associated with the source and destination addresses and a time attribute were collected and stored, a data structure with five (5) dimensions could be created. Thus, even though described with reference to a 2-D data structure, the exemplary embodiments of the present invention may be applied to any multi-dimensional data structure.
  • A typical 32-bit source or destination IP address is in the form of “xxx.xxx.xxx.xxx” with each octet (8 bits) of data (e.g., xxx) representing a sub-attribute of the attribute. Thus, in the example of data structure 1, each level of the hierarchy may be considered to correspond to an octet of the IP address, wherein the source address attribute is represented as 1.2.3.4 and the destination address attribute is represented as 5.6.7.8.
  • The data structure 1 models the collected data as N d-dimensional tuples. A tuple refers to a collection of one or more attributes. As shown in FIG. 1, each node 5-125 of data structure 1 is a tuple. Thus, throughout this description, the terms node and tuple may be used interchangeably to describe a collection of one or more attributes. The maximum depth of the ith dimension is defined as hi. In this example, N is the total number of data points collected (e.g., the number of hits for the particular source/destination nodes in the data structure), d=2 for the two dimensional attribute data (e.g., source address, destination address) and h1=h2=4 since each of the attributes has four sub-attributes.
  • The generalization of any element on an attribute means that the element is rolled up one level in the hierarchy of that attribute. For example, the generalization of the IP address pair 1.2.3.4, 5.6.7.8 (shown as node 5) on the second attribute is 1.2.3.4, 5.6.7.* (shown as node 10). An element is fully general on an attribute if it cannot be generalized further. In the data structure 1, this generalization is denoted by the symbol *. For example, the pair *, 5.6.7.* (shown as node 95) is fully general on the first attribute, but not the second. The root node 125 is fully general. Thus, the act of generalizing over a defined set of hierarchies generates a hierarchical lattice structure as shown by data structure 1.
  • Each node in the data structure 1 may be labeled with a vector of length d whose ith entry is a non-negative integer that is at most hi, indicating the level of generalization of the node. For example, the pair (1.2.3.4, 5.6.7.8) is at a generalization level [4,4] (node 5), the pair (*, 5.6.7.*) is at [0,3] (node 95) and the pair (1.2.*, 5.*) is at [2,1] (node 85). The parents of any node are those nodes where one attribute has been generalized in one dimension. For example, the parents of a node at level [4,4] (node 5) are at levels [3,4] (node 15) and [4,3] (node 10). A node that has one attribute that is fully generalized will only have a single parent, e.g., node 95 at level [0,3] has only one parent node 100 at level [0,2] because the first attribute is fully generalized. For notation purposes, a parent of any element e may be referred to as par(e).
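  • To make the generalization and parent relations concrete, the following Python sketch (an illustration only, not the claimed method) rolls an IP-prefix attribute up one level and enumerates the parents of a two-dimensional node; the string encoding of the prefixes is an assumption made for this example:

    def roll_up(prefix):
        # Generalize one attribute by one level: "1.2.3.4" -> "1.2.3.*" -> "1.2.*" -> "1.*" -> "*".
        if prefix == "*":
            return None                          # already fully general on this attribute
        octets = prefix.split(".")
        if octets[-1] == "*":
            octets = octets[:-2] + ["*"]         # drop the last concrete octet
        else:
            octets = octets[:-1] + ["*"]         # mask the last octet
        return "*" if octets == ["*"] else ".".join(octets)

    def parents(element):
        # par(e): the element generalized one level on each dimension in turn.
        result = []
        for d, prefix in enumerate(element):
            rolled = roll_up(prefix)
            if rolled is not None:
                parent = list(element)
                parent[d] = rolled
                result.append(tuple(parent))
        return result

    print(parents(("1.2.3.4", "5.6.7.8")))   # [('1.2.3.*', '5.6.7.8'), ('1.2.3.4', '5.6.7.*')]
    print(parents(("*", "5.6.7.*")))         # [('*', '5.6.*')] -- a fully general attribute yields one parent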
  • Two nodes are comparable if, on every attribute, the specified portion of the label of one node is a prefix of the corresponding portion of the label of the other. For example, a node having level [3,4] is comparable to a node having level [3,2]. In contrast, a node at level [3,4] is not comparable to a node at level [4,3]. A Level(i) is the ith level in the data structure corresponding to the sum of the values in the level label. For example, Level(8)=[4,4] (node 5); Level(5)=[1,4] (node 50), [2,3] (node 45), [3,2] (node 40) and [4,1] (node 35); and Level(0)=[0,0] (node 125). No pair of nodes with distinct labels in a particular level (e.g., Level(5)) is comparable. These nodes are described as forming an anti-chain. Other nodes which are not comparable can also form an anti-chain. For example, the nodes with labels [2,2] and [1,4] and prefixes (1.2.*, 5.6.*) and (1.*, 5.6.7.8), respectively, are not comparable and therefore also form an anti-chain. The total number of levels in any data structure is given by L=1+Σihi. Thus, in the example of data structure 1, L=1+4+4=9.
  • Finally, a sub-lattice of an element (e) is defined as the set of elements which are related to e under the closure of the parent relation. For example, elements (1.2.3.4, 5.6.7.8), (1.2.3.8, 5.6.4.5) and (1.2.3.*, 5.6.8.*) are all in sub-lattice (1.2.3.*, 5.6.*). Thus, the sub-lattice of a set of elements P is defined as sub-lattice(P)=∪pεP sub-lattice(p). It should be noted that in the above example, element (1.2.3.8, 5.6.4.5) is in a sub-lattice of (1.2.3.*, 5.6.*) and (1.2.3.*, 5.6.8.*) is in a sub-lattice of (1.2.3.*, 5.6.*), but (1.2.3.8, 5.6.4.5) and (1.2.3.*, 5.6.8.*) are in separate sub-lattices.
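  • The sub-lattice relation can be tested with a simple prefix check, as in the hedged Python sketch below (illustrative only; the prefix strings mirror the notation used in this example and the helper names are hypothetical):

    def attr_covers(ancestor, descendant):
        # One attribute: True if the ancestor prefix is at least as general as the descendant.
        if ancestor == "*":
            return True
        if ancestor.endswith(".*"):
            return descendant.startswith(ancestor[:-1])   # e.g. "5.6.*" covers "5.6.4.5" and "5.6.8.*"
        return descendant == ancestor

    def in_sub_lattice(element, p):
        # element lies in sub-lattice(p) when p covers it on every attribute.
        return all(attr_covers(a, d) for a, d in zip(p, element))

    print(in_sub_lattice(("1.2.3.8", "5.6.4.5"), ("1.2.3.*", "5.6.*")))     # True
    print(in_sub_lattice(("1.2.3.*", "5.6.8.*"), ("1.2.3.*", "5.6.*")))     # True
    print(in_sub_lattice(("1.2.3.8", "5.6.4.5"), ("1.2.3.*", "5.6.8.*")))   # False: separate sub-lattices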
  • As elements are collected and nodes are added to the data structure 1, a frequency count is incremented which represents an occurrence of data at the node. The general problem of finding HHHs is to find all items in the structure whose frequency count exceeds a given fraction φ of the total data points. In a one-dimensional data structure, the propagation of frequency counts is fairly straightforward, i.e., add the count of a rolled up node to its one and only parent. However, in the multi-dimensional case, it is not readily apparent how to compute the frequency counts at various nodes within the data structure 1 because, for example, each node may have two or more parents.
  • In a first exemplary embodiment, referred to as the overlap case, the frequency count for any child node is passed to all of its parents, except where the child node has been identified as an HHH. However, as will be described in greater detail below, there are subtleties to the overlap case which prevent overcounting due to the roll-up of frequency counts to both parents. In the overlap case, an HHH is defined as follows:
      • Given a set S of elements e having corresponding frequency counts fe, and let L=Σihi. An HHH may be defined inductively based on a threshold φ. HHHL contains all heavy hitters eεS such that fe≥⌊φN⌋. The overlap count of an element p at Level(l) in the lattice, where l<L, is given by f′(p)=Σfe: eεS∩{sub-lattice(p)−sub-lattice(HHHl+1)}. The set HHHl is defined as the set HHHl+1∪{pεLevel(l): f′(p)≥⌊φN⌋}. The set of HHHs in the overlap case for the set S is the set HHH0.
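  • The inductive definition above can be checked offline on a small example with the following Python sketch, which enumerates the lattice, walks the levels from the leaves up, and applies the overlap-count rule directly. It is a brute-force illustration of the definition (the stream values, the threshold and the helper names are hypothetical), not the streaming algorithm described below:

    import math
    from collections import Counter
    from itertools import product

    def ancestors_1d(prefix, depth=4):
        # All generalizations of one attribute: "1.2.3.4", "1.2.3.*", "1.2.*", "1.*", "*".
        octets = prefix.split(".")
        return [prefix] + [".".join(octets[:k]) + ".*" for k in range(depth - 1, 0, -1)] + ["*"]

    def level_1d(prefix):
        return 0 if prefix == "*" else sum(1 for o in prefix.split(".") if o != "*")

    def covers(p, e):
        # True when leaf e lies in sub-lattice(p).
        return all(a == "*" or (a.endswith(".*") and b.startswith(a[:-1])) or b == a
                   for a, b in zip(p, e))

    def overlap_hhh(stream, phi, depths=(4, 4)):
        counts = Counter(stream)
        n = sum(counts.values())
        thresh = math.floor(phi * n)
        candidates = set().union(*(set(product(*(ancestors_1d(a, d) for a, d in zip(e, depths))))
                                   for e in counts))
        hhh = set()
        for lvl in range(sum(depths), -1, -1):            # from Level(L) down to Level(0)
            new = set()
            for p in (c for c in candidates if sum(map(level_1d, c)) == lvl):
                # f'(p): counts under p that are not already covered by a deeper HHH
                f = sum(cnt for e, cnt in counts.items()
                        if covers(p, e) and not any(covers(h, e) for h in hhh))
                if f >= thresh:
                    new.add(p)
            hhh |= new
        return hhh

    stream = [("1.2.3.4", "5.6.7.8")] * 3 + [("1.2.3.9", "5.6.7.8")] * 3 + [("9.9.9.9", "8.8.8.8")] * 2
    print(overlap_hhh(stream, phi=0.5))   # {('1.2.3.*', '5.6.7.8')}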
  • The methods described herein may be implemented on any computing device which samples and/or processes data in an online or offline state. For example, the computing device may include a central processing unit (CPU), a memory, an input/output (I/O) interface, etc. The I/O interface may be adapted to receive a data stream from a source, such as a network, database, server, etc. The memory may store all or portions of one or more programs and/or data to implement the described methods. In addition, the methods may be implemented in hardware, software, or a combination thereof.
  • FIG. 3 shows an exemplary method 400 for inserting and compressing data elements in a summary data structure for identifying HHHs in a data structure implementing the overlap case for streaming data. As would be understood by those of skill in the art, streaming data means that new data will be continuously added to the data set. Thus, for a streaming case, it is very important that any methods for determining HHHs have a minimal processing time so that the results are current. In addition, since new data is being continuously added, the method 400 compresses the data to eliminate certain data which may be omitted for the purposes of calculating the set of HHHs. While it is possible to maintain multiple independent data structures and information for every label in a lattice data structure in order to calculate the HHHs for a particular point in the lattice, this becomes very expensive in terms of storage space and computation time.
  • Thus, the method 400 presents a single data structure that summarizes the whole lattice. This allows for an approximation of the HHHs for the data structure in a single pass (within a defined error amount). The method 400 uses a very small amount of storage space and updates the set of HHHs as the data stream unfolds. More specifically, a summary structure T consisting of a set of nodes that correspond to samples from the input stream is maintained. Each node teεT consists of an element e from the lattice and a bounded amount of auxiliary information.
  • In the 2-D summary data structure example, auxiliary information fe, Δe, ge and me are maintained, where:
      • fe is a lower bound on the total count that is straightforwardly rolled up (directly or indirectly) into e,
      • Δe is the difference between an upper bound on the total count that is straightforwardly rolled up into e and the lower bound fe,
      • ge is an upper bound on the total compensating count, based on counts of rolled up grandchildren of e, and
      • me=max (fd(e)−gd(e)+Δd(e)), over all descendants d(e) of e that have been rolled up into e.
  • Referring to FIG. 3, the method 400 begins with step 405 where the user supplies an error parameter ε. As described above, the method 400 will take one pass through the summary data structure and approximate the HHHs for the streamed data using a minimal amount of storage space and computation time. The approximation of the HHHs is based on the user supplied error parameter. From the following description and formulations, those of skill in the art will understand that as a user specifies tighter error tolerances, the storage space and computation time requirements may increase. Each user will select an error parameter that suits the particular application. In step 410, the input stream is conceptually divided into buckets of width w=⌈1/ε⌉. The current bucket number is defined as bcurrent=⌊εN⌋.
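  • As a hedged illustration of the per-node bookkeeping and bucket arithmetic just described (the field names simply mirror fe, Δe, ge and me; this is a sketch under assumed example values, not the patented implementation), consider:

    import math
    from dataclasses import dataclass

    @dataclass
    class SummaryNode:
        # Auxiliary information kept for one element e in the summary structure T.
        f: int       # fe: lower bound on the count rolled up into e
        delta: int   # Δe: gap between the upper bound and fe
        g: int       # ge: compensating count from rolled-up grandchildren of e
        m: int       # me: max of (fd - gd + Δd) over descendants rolled up into e

    def bucket_width(eps):
        return math.ceil(1 / eps)          # w = ⌈1/ε⌉

    def current_bucket(eps, n):
        return math.floor(eps * n)         # bcurrent = ⌊εN⌋

    eps = 0.01
    b = current_bucket(eps, 12345)
    print(bucket_width(eps), b)            # 100 123
    # New-node initialization from the insertion step: fe = f, ge = 0, Δe = me = bcurrent - 1.
    print(SummaryNode(f=1, delta=b - 1, g=0, m=b - 1))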
  • The method will then go through two alternating phases of insertion and compression. The following steps are related to the insertion phase. In step 415, an element is received from the data stream. In step 420 it is determined if the node te exists for the element in the summary data structure T. If the node te exists, the process continues to step 425 where the fe count of the node is updated and the process loops back to step 415 to retrieve the next element in the stream.
  • If it was determined in step 420 that the node te did not exist, the process continues to step 430 where a new node te is created for the element and the auxiliary information fe, Δe, ge and me values are stored in the newly created node. Specifically, fe is set to the frequency f of the element, ge is set to 0 and Δe=me=bcurrent−1. However, the two parent elements (if they exist in the data structure) are also used to estimate the values of the auxiliary information. Specifically, if the left parent exists and mlpar(e)<me, then Δe=me=mlpar(e). Similarly, if the right parent exists and mrpar(e)<me, then Δe=me=mrpar(e).
  • This completes the insertion phase of the method 400. The following is exemplary pseudo code for the insertion process:
  • Insert(e, f):
        if te exists then fe += f;
        else {
            if (lpar(e) in domain) then Insert(lpar(e), 0);
            if (rpar(e) in domain) then Insert(rpar(e), 0);
            create te with (fe = f, ge = 0);
            Δe = me = bcurrent − 1;
            if (lpar(e) in domain) and (mlpar(e) < me) {
                Δe = me = mlpar(e); }
            if (rpar(e) in domain) and (mrpar(e) < me) {
                Δe = me = mrpar(e); }}
  • The following steps are related to the compression phase of the method 400. In step 435, fringe nodes are identified. A fringe node is one that does not have any descendants. The compression phase of the method is iterative and is carried out for each of the identified fringe nodes. For each of the identified fringe nodes, in step 440, it is determined whether the upper bound on the total count is smaller than the current bucket number, i.e., whether fe−ge+Δe<bcurrent.
  • If the total count is less than the current bucket number, the fringe node is deleted as part of the compression step 445. However, since the node is deleted, the auxiliary values of the parent elements also need to be updated in the compression step 445. The updating will be described with reference to the left parent, but the same process will be carried out for the right parent. If the left parent exists, flpar(e) is updated using the fe and ge of the deleted node, i.e., flpar(e)+=fe−ge. Similarly, mlpar(e) is updated in the form mlpar(e)=max(mlpar(e), fe−ge+Δe). Finally, it is determined if the left parent has become a fringe node as a result of the deletion of the originally scanned node. If it has become a fringe node, it will be an analyzed node in the iterative compression phase. As described above, the same process will be carried out for the right parent. In addition, the compression step also reduces the compensating count of the common grandparent (ggpar(e)) by the value fe−ge to account for possible overcounting.
  • For non-fringe nodes in the summary structure T, the compensating count ge is speculative and is not taken into account for estimating the upper bound on the total count (e.g., upper bound=fe+Δe). However, for fringe nodes of the summary structure, ge is no longer speculative and a tighter upper bound can be obtained using fe−ge+Δe. As described above, it is this tighter upper bound that is used to determine the fringe nodes to be compressed.
  • FIGS. 2A-2B depict a portion 300 of the 2-D data structure 1 initially shown in FIG. 1. The portion 300 will be used to show an example of propagating frequency counts in the compression phase of the streaming overlap case. The portion 300 of the data structure 1 illustrates the diamond property, i.e., a region of the lattice in which an inclusion-exclusion principle is applied to prevent overcounting of frequency counts. The example shows the principle of maintaining a compensating count ge at the common grandparent depicted at the top of the diamond structure. For the purpose of this example, the node 5 of portion 300 in FIGS. 2A-B will be referred to as a child node, nodes 10 and 15 will be referred to as parent nodes and nodes 20-30 will be referred to as grandparent nodes, with node 25 being referred to as the common grandparent node.
  • As shown in FIG. 2A, the child node 5 has an exemplary frequency count of [4]. As described above, this frequency count should be propagated to the frequency counts of parent nodes 10 and 15. The initial frequency count [0] (shown in FIG. 2A) of each parent node 10 and 15 becomes [4] (shown in FIG. 2B) after the frequency count [4] from the child node 5 is propagated. It should be noted that the initial frequency count of [0] is only exemplary and may be any value based on the actual monitored data.
  • However, the frequency count [4] of the child node 5 is also subtracted from the frequency count [0] of the common grandparent node 25. The initial frequency count [0] (shown in FIG. 2A) of the common grandparent node 25 becomes [−4] (shown in FIG. 2B) after the frequency count [4] of the child node 5 is subtracted therefrom. As described above, the [−4] frequency count of the common grandparent node may be considered the compensating count so that when the frequency counts of the parent nodes 10 and 15 are each propagated to the common grandparent node 25, the frequency count will be equal to [4] (−4+4+4=4). Without the compensating count, propagation of the frequency count [4] of the child node 5 would result in a frequency count of [8] at the common grandparent node 25. This overcounting would lead to erroneous determinations of HHH nodes in the 2-D data structure 1.
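  • The arithmetic of the diamond example can be checked with a few lines of Python (the values are the illustrative counts from FIGS. 2A-B, not data from an actual stream):

    # Overlap case: the child's count of 4 rolls up to BOTH parents, so the common
    # grandparent is pre-charged with a compensating count of -4.
    child = 4
    left_parent = right_parent = 0
    grandparent = 0

    left_parent += child
    right_parent += child
    grandparent -= child                            # compensating count

    # When the parents later roll up, the grandparent nets out to the true total.
    grandparent += left_parent + right_parent
    print(left_parent, right_parent, grandparent)   # 4 4 4 (not 8: overcounting avoided)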
  • This completes the compression phase of the method 400. The following is exemplary pseudo code for the compression process:
  • Compress:
        for each te in fringe do {
            if (fe + Δe ≤ bcurrent) {
                if (lpar(e) in domain) {
                    flpar(e) += fe − ge;
                    mlpar(e) = max(mlpar(e), fe − ge + Δe);
                    if (lpar(e) has no more children) {
                        add lpar(e) to fringe; }}
                if (rpar(e) in domain) {
                    frpar(e) += fe − ge;
                    mrpar(e) = max(mrpar(e), fe − ge + Δe);
                    if (rpar(e) has no more children) {
                        add rpar(e) to fringe; }}
                if (gpar(e) in domain) ggpar(e) += fe − ge;
                delete te; }}
  • FIG. 4 shows an exemplary method 500 for identifying HHHs in a data structure implementing the overlap case for streaming data. The method 500 may be used in conjunction with the method 400 to extract HHHs from the summary structure T at any given time. In the initial step 505, the threshold value (φ) for identifying HHHs is defined by the user. In addition, certain parameters are set for each of the elements. Specifically, hhhfe is set to fe, hhhge is set to ge and two boolean operators identified as lstat(e) and rstat(e) are set to 0 (or not set). The function of lstat(e) and rstat(e) is described in greater detail below.
  • In step 510, the fringe nodes are identified. Similar to the compression phase of method 400, the remainder of the method 500 is carried out iteratively for all of the identified fringe nodes. In step 515, it is determined if both of the boolean operators lstat(e) and rstat(e) are set. If one or both of the boolean operators are not set, the method continues to step 520 where it is determined if the total count of the node is greater than or equal to the threshold value. The total count for the purposes of identifying an HHH is defined as hhhfe−hhhge+Δe. If the total count is greater than or equal to the threshold value, the node is identified as an HHH in step 525. As part of this identification, the two boolean operators are set. In addition, the HHH node may be printed out or displayed to the user including its auxiliary information. If the node is identified as an HHH in step 525, the process loops back to step 515 to begin processing the next fringe node.
  • If in step 520 the total count does not exceed the threshold, the process continues to step 530 where the counts of the parent nodes are reset. As described above, where a child node is not identified as an HHH, the frequency count will be propagated to the parent nodes. For example, the frequency count of the left parent will be reset based on the following: hhhflpar(e)+=max(0, hhhfe−hhhge). The right parent will be reset in a similar manner.
  • If in step 515 it was determined that both boolean operators were set, the process skips forward to step 530 where the parent counts are reset. However, it should be noted that the reset value is different than the reset value described immediately above where the boolean operators are not set. The reset value for the parents in the case where the boolean operators are set is hhhflpar(e)+=max(0, hhhfe). Again, the right parent will be reset in a similar manner. As can be seen from the above, when two elements that share a parent are both HHHs, the compensating count at the parent element should not be used because doing so would result in overcompensation. The boolean operators lstat(e) and rstat(e) assure that this will not occur because when both boolean operators are set, the reset value for the parent does not include the compensating count.
  • After the parent counts have been reset in step 530, the method continues to step 535 where it is determined whether the parent has any additional children. If the parent does not have any additional children, the parent is identified as a fringe node (step 540) and the parent is included as a fringe node to be analyzed in the iterative process. If the parent has additional children (step 535) or after the parent is set as a fringe node (step 540), the method continues to step 545 to reset the common grandparent compensating count. The common grandparent compensating count is reset to hhhggpar(e)+=max(0, hhhfe−hhhge). The method then continues to iteratively go through all of the identified fringe nodes.
  • This completes the HHH identification method 500. The following is exemplary pseudo code for the HHH identification method:
  • Output (φ):
    01 let hhhfe = fe, hhhge = ge for all e;
    02 let lstat(e) = rstat(e) = 0 for all e;
    03 for each te in fringe do {
    04 if ((¬lstat(e) or ¬rstat(e)) and
    05 (hhhfe − hhhge + Δe ≧ ⌊φN⌋)) {
    06 print (e, hhhfe − hhhge, fe − ge, Δe);
    07 lstat(e) = rstat(e) = 1; }
    08 else {
    09 if (lpar(e) in domain) and
    10 (¬lstat(e) or ¬rstat(e)) {
    11 hhhflpar(e) += max(0, hhhfe − hhhge); }
    12 else if (lpar(e) in domain) and
    13 (lstat(e) and rstat(e)) {
    14 hhhflpar(e) += max(0, hhhfe); }
    15 if (lpar(e) in domain) {
    16 if (lpar(e) has no more children) {
    17 add lpar(e) to fringe with
    18 lstat(lpar(e)) = lstat(e); }}
    19 if (rpar(e) in domain) and
    20 (¬lstat(e) or ¬rstat(e)) {
    21 hhhfrpar(e) += max(0, hhhfe − hhhge); }
    22 else if (rpar(e) in domain) and
    23 (lstat(e) and rstat(e)) {
    24 hhhfrpar(e) += max(0, hhhfe); }
    25 if (rpar(e) in domain) {
    26 if (rpar(e) has no more children) {
    27 add rpar(e) to fringe with
    28 lstat(rpar(e)) = lstat(e); }}
    29 if (gpar(e) in domain) {
    30 hhhggpar(e) += max(0, hhhfe − hhhge); }}}
  • The method 500 described above and represented by the above pseudo code computes the HHHs accurately to within εN and uses storage space bounded by O((H/ε)log(εN)). These bounds for the streaming overlap case are comparable to those of a one-dimensional analysis and result in acceptable computation time and storage requirements.
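  • For illustration only, the following is a minimal, runnable Python sketch of the per-node decision made in steps 515 through 545: the threshold test of steps 515-525, the propagation to the left and right parents of step 530, and the update of the common grandparent compensating count of step 545 (the fringe bookkeeping of steps 535-540 is omitted). The dictionaries hhhf, hhhg, delta, lstat, rstat, lpar, rpar and gpar, as well as the function name itself, are illustrative assumptions about how the summary structure T might be represented, not part of the exemplary pseudo code above.
    def process_fringe_node(e, hhhf, hhhg, delta, lstat, rstat,
                            lpar, rpar, gpar, threshold):
        """Process one fringe node e; threshold stands for floor(phi * N).
        All containers are plain dicts keyed by node identifiers (an assumed
        representation).  Returns a tuple when e is reported as an HHH."""
        if (not lstat[e] or not rstat[e]) and \
                hhhf[e] - hhhg[e] + delta[e] >= threshold:
            lstat[e] = rstat[e] = True          # steps 515-525: e is an HHH
            return (e, hhhf[e] - hhhg[e], delta[e])
        # Step 530: not an HHH, so push the adjusted count to both parents.
        # When both boolean operators are set, the compensating count hhhg[e]
        # is skipped so the parents are not over-compensated.
        carry = max(0, hhhf[e]) if (lstat[e] and rstat[e]) \
            else max(0, hhhf[e] - hhhg[e])
        for par in (lpar.get(e), rpar.get(e)):
            if par is not None:
                hhhf[par] += carry
        gp = gpar.get(e)
        if gp is not None:                       # step 545: common grandparent
            hhhg[gp] += max(0, hhhf[e] - hhhg[e])
        return None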
  • As described above, the methods 400 and 500 may be extended to any number of dimensions. In the higher dimensions, a negative compensating count ge(−) (similar to ge defined above) and a positive compensating count ge(+) are maintained. When an element is compressed, some ancestors obtain negative speculative counts, while others obtain positive speculative counts.
  • The above methods described the overlap case for streamed data. However, as described above, the present invention may also be used on static data. In the case of static data, computational speed is not as much of a concern because new data is not being added to the data structure. Thus, the method for determining HHHs may be iterative and make multiple passes over the data to compute the HHHs exactly. In this case, the error parameter may be set to 0, i.e., ε = 0.
  • In another embodiment, the frequency counts are propagated by splitting the frequency counts of child nodes among the parent nodes, referred to as the split case. For example, referring to FIG. 3A, the frequency count [4] of child node 5 may be split among its parent nodes 10 and 15 (e.g., 4-0, 3-1 or 2-2), as illustrated in the sketch below. In this manner, the common grandparent node 25 will only have a frequency count of [4] as a result of the propagation of the frequency counts from the parent nodes 10 and 15. Similar to the overlap case, the split case may also be used for both static and streamed data. The split case results in a simpler determination of HHHs because the splitting of the frequency count resolves the issues related to the overcompensation of common grandparents presented in the overlap case.
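  • The following is a brief, hedged Python sketch of one possible split function, corresponding to s(e,i) in the pseudo code below. The even split and the left_weight parameter are illustrative choices; the only requirement assumed here is that the two shares sum to the original count.
    def split_count(count, left_weight=0.5):
        """Divide a child's frequency count between its two parent nodes.
        Any split whose two shares sum to the full count is acceptable,
        e.g. 4 may be split as 4-0, 3-1 or 2-2."""
        left = int(round(count * left_weight))
        return left, count - left               # shares always sum to count

    # Splitting the frequency count [4] of child node 5 between parents 10 and 15:
    print(split_count(4))         # (2, 2)
    print(split_count(4, 0.75))   # (3, 1)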
  • The following shows the exemplary pseudo code for the insertion phase, the compression phase and the identification phase for the streaming split case:
  • Insert (e, f):
    01 if te exists then fe += f;
    02 else {
    03 for (i = 1; i ≦ d; i++) {
    04 if (par(e, i) in domain) then {
    05 Insert (par(e, i), 0); }}
    06 create te with (fe = f);
    07 Δe = me = bcurrent − 1;
    08 for (i = 1; i ≦ d; i++) {
    09 if ((par(e, i) in domain) and (mpar(e,i) < me)) {
    10 Δe = me = mpar(e,i); }}}
    Compress:
    01 for each te in fringe do {
    02 if (fe + Δe ≦ bcurrent) {
    03 for (i = 1; i ≦ d; i++) {
    04 if (par(e, i) in domain) then {
    05 fpar(e,i) += s(e, i) * fe;
    /* s(e, i) is the split function */
    06 mpar(e,i) = max(mpar(e,i), fe + Δe);
    07 if (par(e, i) has no more children) {
    08 add par(e, i) to fringe; }}}
    09 delete te; }}
    Output (φ):
    01 let hhhfe = fe for all e;
    02 for each te in fringe do {
    03 if (hhhfe + Δe ≧ ⌊φN⌋) {
    04 print (e, hhhfe, fe, Δe); }
    05 else {
    06 for (i = 1; i ≦ d; i++) {
    07 if (par(e, i) in domain) then {
    08 hhhfpar(e,i) += s(e, i) * fe;
    09 if (par(e, i) has no more children) {
    10 add par(e, i) to fringe; }}}}}
  • As will be apparent from a review of the exemplary pseudo code, the insertion, compression and identification methods for the split case are similar to the overlap case, except that there is no compensating count.
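  • For concreteness, the following is a hedged Python transliteration of the Compress listing above for the streaming split case. The node dictionary layout, the parents(e) helper, the split(e, i) function and the children counter are illustrative assumptions about the underlying representation: a fringe node whose count fe + Δe falls within the current bucket distributes s(e,i) * fe to each parent in the domain, updates each parent's m value, promotes parents with no remaining children to the fringe, and is then deleted.
    def compress_split(nodes, fringe, parents, split, b_current):
        """One compression pass for the split case (cf. the Compress listing).
        nodes[e] = {"f": fe, "delta": De, "m": me, "children": <# unmerged children>};
        parents(e) yields e's parent identifiers (None when outside the domain);
        split(e, i) is the split function s(e, i), whose shares are assumed to
        sum to 1 across the parents."""
        work = list(fringe)
        while work:
            e = work.pop()
            n = nodes[e]
            if n["f"] + n["delta"] <= b_current:       # eligible for compression
                for i, p in enumerate(parents(e)):
                    if p is None:
                        continue
                    pn = nodes[p]
                    pn["f"] += split(e, i) * n["f"]     # fpar(e,i) += s(e,i) * fe
                    pn["m"] = max(pn["m"], n["f"] + n["delta"])
                    pn["children"] -= 1
                    if pn["children"] == 0:             # parent has no more children
                        work.append(p)
                del nodes[e]                            # delete te
        return nodes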
  • It will be apparent to those skilled in the art that various modifications may be made in the present invention, without departing from the spirit or scope of the invention. Thus, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

Claims (17)

1-23. (canceled)
24. A method, comprising:
receiving a plurality of elements of a data stream;
storing a multi-dimensional data structure in a memory, said multi-dimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to the number of elements stored therein;
determining, as a function of the frequency count, which nodes correspond to a hierarchical heavy hitter (HHH) node;
propagating the frequency count of each non-HHH node to its corresponding parent nodes;
identifying each node without a descendant as a fringe node; and
deleting each fringe node for which the frequency count is less than a product of an error factor and the total number of the elements stored in the plurality of nodes.
25. The method of claim 24, wherein the frequency count of each HHH node is not propagated to its corresponding parent nodes.
26. The method of claim 24, wherein the multi-dimensional data structure is one of a two-dimensional data structure, a three-dimensional data structure, a four-dimensional data structure and a five-dimensional data structure.
27. The method of claim 24, wherein the frequency count of each node differs from an actual frequency count by less than a specified error factor.
28. A method, comprising:
receiving a plurality of elements of a data stream;
storing a multi-dimensional data structure in a memory, said multi-dimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to the number of elements stored therein;
determining, as a function of the frequency count, which nodes correspond to a hierarchical heavy hitter (HHH) node;
propagating the frequency count of each non-HHH node to its corresponding parent nodes;
determining whether one of the nodes corresponds to one of the received elements;
when a node is determined to correspond to the one of the received elements, inserting the one of the received elements into the corresponding node; and
incrementing the frequency count of the corresponding node by an amount equal to the frequency count of the one of the received elements.
29. The method of claim 28, further comprising:
creating a new node corresponding to each received element for which there is no corresponding node.
30. The method of claim 24, further comprising:
storing, for each element, auxiliary information including data for propagating the frequency counts.
31. The method of claim 24, wherein a total frequency count of each of the non-HHH nodes is propagated to its corresponding parent node.
32. The method of claim 24, wherein a total frequency count of each of the non-HHH nodes is propagated by splitting the total frequency count and propagating a split portion of the total frequency count to its corresponding parent nodes.
33. The method of claim 24, further comprising:
propagating the frequency counts of each of the non-HHH parent nodes to a corresponding common grandparent node.
34. The method of claim 33, wherein the common grandparent node includes a compensating count to prevent overcounting of the frequency counts from the parent frequency counts.
35. A system, comprising:
a receiving element receiving a plurality of elements of a data stream;
a storage element storing a multi-dimensional data structure in a memory, said multi-dimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to a number of elements stored therein;
a determination element determining, as a function of the frequency count, which nodes correspond to a hierarchical heavy hitter (HHH) node; and
a propagation element propagating the frequency count of each non-HHH node to its corresponding parent nodes and propagating the frequency counts of parent nodes to a common grandparent node, wherein the common grandparent node includes a compensating count to prevent overcounting of the frequency counts from the parent nodes.
36. The system of claim 35, wherein the compensating counts include a positive compensating count and a negative compensating count.
37. The system of claim 35, wherein the frequency count of each node differs from an actual frequency count by less than a specified error factor.
38. The system of claim 35, wherein the HHH nodes are identified from one of streaming data and static data.
39. A computer readable storage medium including a set of instructions executable by a processor, the set of instructions configured to:
receive a plurality of elements of a data stream;
store a multi-dimensional data structure in a memory, said multi-dimensional data structure storing the plurality of elements as a hierarchy of nodes, each node having a frequency count corresponding to a number of elements stored therein;
determine, as a function of the frequency count, which nodes correspond to a hierarchical heavy hitter (HHH) node; and
propagate the frequency count of each non-HHH node to its corresponding parent nodes, wherein the frequency count of each non-HHH node is propagated by splitting the total frequency count and propagating a split portion of the total frequency count to its parent nodes.
US12/512,723 2005-06-10 2009-07-30 System and Method for Identifying Hierarchical Heavy Hitters in Multi-Dimensional Data Abandoned US20090292726A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/512,723 US20090292726A1 (en) 2005-06-10 2009-07-30 System and Method for Identifying Hierarchical Heavy Hitters in Multi-Dimensional Data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/149,699 US7590657B1 (en) 2005-06-10 2005-06-10 System and method for identifying hierarchical heavy hitters in a multidimensional environment
US12/512,723 US20090292726A1 (en) 2005-06-10 2009-07-30 System and Method for Identifying Hierarchical Heavy Hitters in Multi-Dimensional Data

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US11/149,699 Continuation US7590657B1 (en) 2005-06-10 2005-06-10 System and method for identifying hierarchical heavy hitters in a multidimensional environment

Publications (1)

Publication Number Publication Date
US20090292726A1 true US20090292726A1 (en) 2009-11-26

Family

ID=41058895

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/149,699 Expired - Fee Related US7590657B1 (en) 2005-06-10 2005-06-10 System and method for identifying hierarchical heavy hitters in a multidimensional environment
US12/512,723 Abandoned US20090292726A1 (en) 2005-06-10 2009-07-30 System and Method for Identifying Hierarchical Heavy Hitters in Multi-Dimensional Data

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US11/149,699 Expired - Fee Related US7590657B1 (en) 2005-06-10 2005-06-10 System and method for identifying hierarchical heavy hitters in a multidimensional environment

Country Status (1)

Country Link
US (2) US7590657B1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7590657B1 (en) * 2005-06-10 2009-09-15 At&T Corp. System and method for identifying hierarchical heavy hitters in a multidimensional environment
US8799754B2 (en) * 2009-12-07 2014-08-05 At&T Intellectual Property I, L.P. Verification of data stream computations using third-party-supplied annotations
US8972404B1 (en) 2011-12-27 2015-03-03 Google Inc. Methods and systems for organizing content

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5293466A (en) * 1990-08-03 1994-03-08 Qms, Inc. Method and apparatus for selecting interpreter for printer command language based upon sample of print job transmitted to printer
US20060074907A1 (en) * 2004-09-27 2006-04-06 Singhal Amitabh K Presentation of search results based on document structure
US7590657B1 (en) * 2005-06-10 2009-09-15 At&T Corp. System and method for identifying hierarchical heavy hitters in a multidimensional environment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7424489B1 (en) * 2004-01-23 2008-09-09 At&T Corp. Methods and apparatus for space efficient adaptive detection of multidimensional hierarchical heavy hitters
US7437385B1 (en) * 2004-01-23 2008-10-14 At&T Corp. Methods and apparatus for detection of hierarchical heavy hitters

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080289039A1 (en) * 2007-05-18 2008-11-20 Sap Ag Method and system for protecting a message from an xml attack when being exchanged in a distributed and decentralized network system
US8316443B2 (en) * 2007-05-18 2012-11-20 Sap Ag Method and system for protecting a message from an XML attack when being exchanged in a distributed and decentralized network system
US8161048B2 (en) 2009-04-24 2012-04-17 At&T Intellectual Property I, L.P. Database analysis using clusters
US20100274785A1 (en) * 2009-04-24 2010-10-28 At&T Intellectual Property I, L.P. Database Analysis Using Clusters
US20110066600A1 (en) * 2009-09-15 2011-03-17 At&T Intellectual Property I, L.P. Forward decay temporal data analysis
US8595194B2 (en) 2009-09-15 2013-11-26 At&T Intellectual Property I, L.P. Forward decay temporal data analysis
US8984013B2 (en) 2009-09-30 2015-03-17 Red Hat, Inc. Conditioning the distribution of data in a hierarchical database
US8996453B2 (en) * 2009-09-30 2015-03-31 Red Hat, Inc. Distribution of data in a lattice-based database via placeholder nodes
US20110161378A1 (en) * 2009-09-30 2011-06-30 Eric Williamson Systems and methods for automatic propagation of data changes in distribution operations in hierarchical database
US20110161374A1 (en) * 2009-09-30 2011-06-30 Eric Williamson Systems and methods for conditioned distribution of data in a lattice-based database using spreading rules
US20110161282A1 (en) * 2009-09-30 2011-06-30 Eric Williamson Systems and methods for distribution of data in a lattice-based database via placeholder nodes
US9031987B2 (en) * 2009-09-30 2015-05-12 Red Hat, Inc. Propagation of data changes in distribution operations in hierarchical database
US8909678B2 (en) 2009-09-30 2014-12-09 Red Hat, Inc. Conditioned distribution of data in a lattice-based database using spreading rules
US20110078199A1 (en) * 2009-09-30 2011-03-31 Eric Williamson Systems and methods for the distribution of data in a hierarchical database via placeholder nodes
US8315174B2 (en) 2009-12-31 2012-11-20 Red Hat, Inc. Systems and methods for generating a push-up alert of fault conditions in the distribution of data in a hierarchical database
US20110158106A1 (en) * 2009-12-31 2011-06-30 Eric Williamson Systems and methods for generating a push-up alert of fault conditions in the distribution of data in a hierarchical database
US9965507B2 (en) 2010-08-06 2018-05-08 At&T Intellectual Property I, L.P. Securing database content
US9798771B2 (en) 2010-08-06 2017-10-24 At&T Intellectual Property I, L.P. Securing database content
US8538938B2 (en) 2010-12-02 2013-09-17 At&T Intellectual Property I, L.P. Interactive proof to validate outsourced data stream processing
US8612649B2 (en) 2010-12-17 2013-12-17 At&T Intellectual Property I, L.P. Validation of priority queue processing
US8499003B2 (en) 2011-02-22 2013-07-30 International Business Machines Corporation Aggregate contribution of iceberg queries
US8495087B2 (en) 2011-02-22 2013-07-23 International Business Machines Corporation Aggregate contribution of iceberg queries
CN103593460A (en) * 2013-11-25 2014-02-19 方正国际软件有限公司 Data hierarchical storage system and data hierarchical storage method
US11100072B2 (en) * 2017-07-31 2021-08-24 Aising Ltd. Data amount compressing method, apparatus, program, and IC chip

Also Published As

Publication number Publication date
US7590657B1 (en) 2009-09-15

Similar Documents

Publication Publication Date Title
US7590657B1 (en) System and method for identifying hierarchical heavy hitters in a multidimensional environment
US7668856B2 (en) Method for distinct count estimation over joins of continuous update stream
Cormode et al. Finding hierarchical heavy hitters in streaming data
Cormode et al. An improved data stream summary: the count-min sketch and its applications
US7657503B1 (en) System and method for generating statistical descriptors for a data stream
Castillo et al. Know your neighbors: Web spam detection using the web topology
US8069210B2 (en) Graph based bot-user detection
Berinde et al. Space-optimal heavy hitters with strong error bounds
Yun et al. An efficient mining algorithm for maximal weighted frequent patterns in transactional databases
US20110270853A1 (en) Dynamic Storage and Retrieval of Process Graphs
US10659486B2 (en) Universal link to extract and classify log data
WO2011139393A1 (en) Dynamic adaptive process discovery and compliance
US20090083266A1 (en) Techniques for tokenizing urls
Cormode et al. Time-decaying aggregates in out-of-order streams
CN113206831A (en) Data acquisition privacy protection method facing edge calculation
US20050131946A1 (en) Method and apparatus for identifying hierarchical heavy hitters in a data stream
Zhao et al. SpaceSaving $^\pm $: An Optimal Algorithm for Frequency Estimation and Frequent items in the Bounded Deletion Model
Preisach et al. Ensembles of relational classifiers
CN112968870A (en) Network group discovery method based on frequent itemset
Tanbeer et al. Mining regular patterns in data streams
US8892490B2 (en) Determining whether a point in a data stream is an outlier using hierarchical trees
US20050044094A1 (en) Expressing frequent itemset counting operations
Saruladha et al. LOMPT: an efficient and scalable ontology matching algorithm
Liang et al. Continuously maintaining approximate quantile summaries over large uncertain datasets
Lahiri et al. Finding correlated heavy-hitters over data streams

Legal Events

Date Code Title Description
AS Assignment

Owner name: AT&T AND THE REGENTS RUTGERS UNIVERSITY, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:CORMODE, GRAHAM;KORN, PHILLIP RUSSELL;MUTHUKRISHNAN, SHANMUGAVELAYUTHAM;AND OTHERS;REEL/FRAME:023047/0622;SIGNING DATES FROM 20050815 TO 20050818

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION

AS Assignment

Owner name: RUTGERS, THE STATE UNIVERSITY OF NEW JERSEY, NEW J

Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE CORRECT THE ASSIGNEE'S NAME/RECEIVING PARTIES TO: AT&T CORP AND RUTGERS, THE STATE UNIVERSITY OF NEW JERSEY PREVIOUSLY RECORDED ON REEL 023047 FRAME 0622. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNORS:CORMODE, GRAHAM;KORN, PHILIP RUSSELL;MUTHUKRISHNAN, SHANMUGAVELAYUTHAM;AND OTHERS;SIGNING DATES FROM 20120423 TO 20130325;REEL/FRAME:030111/0097

Owner name: AT&T CORP., NEW YORK

Free format text: CORRECTIVE ASSIGNMENT TO CORRECT THE CORRECT THE ASSIGNEE'S NAME/RECEIVING PARTIES TO: AT&T CORP AND RUTGERS, THE STATE UNIVERSITY OF NEW JERSEY PREVIOUSLY RECORDED ON REEL 023047 FRAME 0622. ASSIGNOR(S) HEREBY CONFIRMS THE ASSIGNMENT;ASSIGNORS:CORMODE, GRAHAM;KORN, PHILIP RUSSELL;MUTHUKRISHNAN, SHANMUGAVELAYUTHAM;AND OTHERS;SIGNING DATES FROM 20120423 TO 20130325;REEL/FRAME:030111/0097