WO1999039295A1 - Method and apparatus for associations discovery - Google Patents

Method and apparatus for associations discovery

Info

Publication number
WO1999039295A1
Authority
WO
WIPO (PCT)
Prior art keywords
products
transactions
association
transaction
product
Prior art date
Application number
PCT/US1999/002525
Other languages
French (fr)
Inventor
Ivan A. Small
Ph. D. Lounette M. Dyer
Original Assignee
Cogit Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cogit Corporation
Priority to AU25871/99A (patent AU2587199A)
Publication of WO1999039295A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising

Definitions

  • the present invention relates generally to processing data in databases and, more particularly, to computing affinity among products in a transaction database.
  • One component of market research is the analysis of purchase patterns to detect combinations of goods and/or services that are often purchased together.
  • product may refer to goods, services, or other items included in a purchase transaction.
  • computer programs are frequently used to process data representing purchase transactions. These data might be stored in a database as sets of transactions (e.g., purchases at a supermarket), where each transaction contains a number of products. It is of interest to market analysts when a collection of products is present together in a large number of transactions. If these products are together in at least a pre-defined number of transactions, the collection of products is referred to as an "association" and the pre-defined number is referred to as the "minimum support" for the association.
  • a “maximal association” is an association that is not contained in another association. Accordingly, the "associations discovery” problem is to find all associations in a set of transactions having a given minimum support.
  • the brute-force approach to finding all associations is transaction-based: take two products, and go through all transactions to count how many transactions contain these two products. This is the number of 2-product associations. Then continue with larger numbers of products, determining 3-product, 4-product, etc. associations, until there is no larger association left.
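  • The brute-force idea above can be sketched as follows. This is a minimal illustration, not the code of any system described here: the function name `brute_force_associations`, the sample `baskets`, and the choice to start at single products are all assumptions made for the example.

```python
from itertools import combinations

def brute_force_associations(transactions, min_support):
    """Count every k-product combination over all transactions, for growing k."""
    products = sorted(set().union(*transactions)) if transactions else []
    found = {}
    k = 1
    while True:
        level = {}
        for combo in combinations(products, k):
            wanted = set(combo)
            # One full pass over the transactions per combination.
            count = sum(1 for t in transactions if wanted <= t)
            if count >= min_support:
                level[combo] = count
        if not level:
            break  # no association with k products, so none with more
        found.update(level)
        k += 1
    return found

# Hypothetical sample data.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "soda"},
    {"bread", "soda"},
    {"milk", "soda", "chips"},
]
result = brute_force_associations(baskets, min_support=2)
```

  • Every combination triggers another scan of the transactions, which is why this approach becomes expensive as the number of products grows.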
  • a computer program implementing this simple approach may require a lot of memory, especially if the minimum support is low, and there are many collections of products for which there is at least minimum support. In fact, it is difficult to predict how much memory will be needed for a given problem.
  • when this approach is implemented on a fixed-memory (e.g., standard microprocessor-based) computer, processing slows down considerably once the program runs out of memory.
  • AIS is an iterative approach that takes a set of associations, each with k products, and attempts to extend each of those associations with another product, and then
  • a "candidate association” is a set of products that may or may not be an association.
  • L(k) is the set of all associations with exactly k products and C(k) is a set of candidate associations with exactly k products.
  • Pseudocode for an implementation of AIS is presented in Table 1, which is followed by a discussion of the code.
  • AIS starts out with all associations that contain exactly one product, i.e., L(1) (line 101). Identifying these associations requires a pass over all transactions, counting the number of occurrences of each product.
  • AIS processes a set of associations, each of which has k- 1 products.
  • AIS examines each transaction for a possible extension of each association (lines 104 through 116). Specifically, AIS first checks whether an association has all its products in the current transaction (line 105); then the system tries to extend the association by adding products from the current transaction that are not already in the association (line 107);
  • AIS, as presented above, is a simple implementation of a solution to the associations discovery problem.
  • the number of candidate associations that are considered is typically large and includes many more candidate associations than actual associations.
  • extensions of this basic IBM algorithm concentrate on how to reduce the number of candidate associations, and on how to count the support for a candidate association efficiently.
  • the number of candidate associations is usually reduced by observing that if there is a subset of a candidate association that is not an association, the candidate association is not an association, either.
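  • The subset observation above can be illustrated with a short sketch. It assumes the (k-1)-product associations are already known and keeps a k-product candidate only if every (k-1)-product subset is among them. The function name `generate_candidates` and the sample data are hypothetical.

```python
from itertools import combinations

def generate_candidates(prev_level, k):
    """Build k-product candidates whose (k-1)-product subsets are all associations."""
    prev = {frozenset(a) for a in prev_level}
    products = sorted(set().union(*prev)) if prev else []
    candidates = []
    for combo in combinations(products, k):
        # Discard the candidate as soon as one subset is not an association.
        if all(frozenset(sub) in prev for sub in combinations(combo, k - 1)):
            candidates.append(combo)
    return candidates

# Hypothetical 2-product associations.
level2 = [("bread", "milk"), ("bread", "soda"), ("milk", "soda"), ("milk", "chips")]
level3 = generate_candidates(level2, 3)
```

  • Here only the candidate containing bread, milk, and soda survives, because every other 3-product combination has a pair that is not an association.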
  • Intelligent hashing schemes are used to count the number of transactions that form the support for a candidate association.
  • This hashing is the Achilles' heel of AIS because set C(k) can get quite large.
  • the check to determine whether a candidate association is counted in C(k) has to be fast (line 108).
  • the standard method to speed up the check is to create a hash table for C(k).
  • the size of C(k) can be a problem; if there are N products that have at least minimum support, then C(2) contains N*(N-1) candidate associations. If many of these candidate associations are associations, the number of candidate associations at the next level is even larger.
  • the hash table can take up a significant portion of main memory, and if the hash table cannot fit in main memory, performance of the system degrades significantly. Further, whether or not
  • the invention is a software application designed to solve the associations discovery problem.
  • An exemplary embodiment of the invention is referred to herein as the Dervish system.
  • the Dervish system takes an orthogonal approach to that used by traditional implementations. Specifically, the Dervish system approach is product-based rather than transaction-based. For each product, the Dervish system stores the identities of the transactions that contain that product. This takes a fixed amount of memory that is not extended subsequently. The system keeps track of the transactions that have products in common by judiciously swapping the order of the transactions for a given product, and by keeping pointers to identify sets of transactions.
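  • As a rough illustration of the product-based layout described above, the following sketch stores, for each product, the identities of the transactions that contain it. It does not reproduce the swapping and pointer scheme of the Dervish system; the function name `build_product_index` and the sample data are assumptions.

```python
from collections import defaultdict

def build_product_index(transactions):
    """Map each product to the list of transaction ids that contain it."""
    index = defaultdict(list)
    for tid, basket in enumerate(transactions):
        for product in sorted(basket):
            index[product].append(tid)
    return dict(index)

# Hypothetical sample data.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "soda"},
    {"bread", "soda"},
    {"milk", "soda", "chips"},
]
index = build_product_index(baskets)
total_entries = sum(len(tids) for tids in index.values())
```

  • The number of stored entries equals the total number of product occurrences across all transactions, so the size of this structure is known once the data have been read.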
  • the set of products and the set of transactions can both be ordered. This ordering can be chosen by the user. Ordering the set of products and the set of transactions results in a unique index for each product and a unique index for each
  • the main task for the Dervish system is to augment an existing association with a new product, to form a new association.
  • This new product generally has an index that is higher than that of any of the products already in the existing association.
  • the Dervish system does not directly store the identities of the products that constitute the current association; rather, the products in the current association can be inferred from the pointers that identify the set of transactions that form the support for the current association.
  • the transaction and product information is stored in a database. If all data structures that are needed to execute the Dervish system fit in the main memory of the computer on which the Dervish system is running, then the data only need to be read once. Should the main memory not be sufficient to store all the necessary data structures, then there is an embodiment of the Dervish system that segments the associations discovery problem into parts, each of which is analyzed, yielding a number of candidate associations. In this segmented embodiment, there is a second pass through the data to check which of these candidate associations are actual associations. In yet another embodiment of the Dervish system, the problem is also split into parts, where each part is processed in parallel. This processing yields candidate associations and, after a parallel check, a number of actual associations is identified. In any of these embodiments, or combinations thereof, the data are read from the database (or a copy thereof stored on local disk) at most twice.
  • the amount of memory the Dervish system uses is related to the number of products and the number of transactions, which is known in advance.
  • the amount of memory used by traditional systems is based on the number of associations in the set of transactions, which is not known in advance.
  • the performance of the Dervish system is predictable and does not degrade when the number of candidate associations is large.
  • because the Dervish system does not use a hash table to count the number of transactions that support a candidate association, whether or not a hash table will fit in memory is not an issue, and the Dervish system can track an arbitrary number of products.
  • because the Dervish system maintains a list of transactions that support an association, metrics for an association (e.g., total transaction cost or gross profit) can be accumulated without reprocessing all the transactions.
  • the Dervish system produces maximal associations on-the-fly, thereby increasing the efficiencies gained when parallel processing is used.
  • Figures 1A and 1B respectively show a pictorial and corresponding flow diagram for an exemplary embodiment of the present invention, as applied to retail transactions at a supermarket.
  • Figure 2 shows a flow diagram for the major procedures that implement an embodiment of the Dervish 200 system.
  • Figure 3 shows a flow diagram for an implementation of the Metrics 300 procedure of the Dervish 200 system.
  • Figure 4 shows a flow diagram for an implementation of the Whirl 400 procedure of the Dervish 200 system.
  • the Dervish system provides a solution to the associations discovery problem. To present a complete explanation of the Dervish system, first the computing environment in which the system runs is discussed. Next an exemplary application of the Dervish system and other possible applications are discussed. Then the system terminology and theory are discussed. Lastly, a detailed description of an embodiment of the Dervish system is given.
  • the invention may be practiced in the context of an operating system resident on a general purpose computer, for example, the mainframe computers and/or microprocessor-based workstations produced by Digital Equipment Corporation, IBM, Sun Microsystems, and Hewlett-Packard.
  • the computer has resident thereon an operating system, for example, Windows/95 or Windows/NT from Microsoft, or any variant of the Unix operating system.
  • Database storage and access may be provided using a standard relational database system, for example, as provided by Oracle, Sybase, Informix, or IBM, or using a flat file database. Alternatively, data access can be provided by a cash register data storage system.
  • the invention may be implemented in an object-oriented programming language such as C++.
  • the present invention can be implemented in any high-level programming language, either compiled or interpreted, on any appropriate computer having an operating system and a database-access providing mechanism.
  • the functionality of the invention described herein could also be implemented using any kind of computing apparatus, system, or technology, including hardware or with a combination of hardware and software.
  • Such hardware includes a general purpose processor, a micro-processor, a program logic array, an application-specific integrated circuit, and any other devices having sufficient processing capability to perform the functionality.
  • the invention could be practiced in a distributed computing environment, including the Internet.
  • Figures 1A and 1B show an exemplary application of the Dervish system.
  • the management of a supermarket chain wants to find out if lowering the price of carbonated beverages ("sodas") to below cost entices customers to buy more products than they would otherwise, or if the average customer comes into the store only to buy these sodas, at a loss to the store.
  • the latter is called cherry-picking, and is a behavior the supermarket chain wants to avoid.
  • the supermarket chain records the products in each transaction using a bar-code scanner (step 110) and stores the data in a database (step 115). Also stored in the database are the total cost of the transaction and the gross margin of the transaction.
  • the input to the Dervish system could therefore be a database that contains the transaction data ordered by store, and by week of the year.
  • the supermarket management runs Dervish on a computer (step 120) to generate a set of associations with associated metrics.
  • This output is stored in another database for further analysis (step 125).
  • the associations contain sets of products that are frequently bought together, for each store, and for each week of the year.
  • the metrics include, for each such association, the average size of a transaction and the average profit (or loss) for a transaction that contains that association.
  • an analyst organizes the associations and their metrics, by means of a visualization tool, e.g., a pie chart or graph (step 130). From this analysis, the analyst generates a report for each store in the chain (step 135). Based on the findings in the report, the store can take action, such as changing the prices of certain items or changing the layout of items in the store (step 140). Nominally, this is where the process ends. It is, of course, possible to repeat this process, based on the new circumstances in the store.
  • the Dervish system for associations discovery can be applied in any number of fields to analyze the concurrent or sequential existence of items or events, i.e., to discover associations. Without limitation, examples of such applications follow:
  • Businesses can use associations discovery to find patterns of events that occur often. For example, in banking, patterns of opening and closing accounts can be determined. At an even finer level of detail, account transactions information, such as deposit and withdrawal patterns, can be determined.
  • Retailers can determine ordering patterns of large ticket items, for example, an appliance purchase sequence might be a refrigerator, followed by a washer/dryer, followed by a dishwasher.
  • Sales forces in any line of commerce can use the associations that the Dervish system identifies to select a product for marketing purposes. For example, if there is an association that contains products a, b, and c, then salespeople can scan the transaction data to find all of the people who have purchased products a and b, and market product c to them.
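  • The marketing example above can be sketched in a few lines. The function name `prospects_for`, the customer names, and the decision to skip customers who already own the target product are assumptions made for illustration only.

```python
def prospects_for(purchases, association, target):
    """Customers who bought every product in the association except the target."""
    rest = set(association) - {target}
    return sorted(
        customer
        for customer, bought in purchases.items()
        # Assumption: customers who already bought the target are not prospects.
        if rest <= bought and target not in bought
    )

# Hypothetical purchase histories keyed by customer.
purchases = {
    "alice": {"a", "b"},
    "bob": {"a", "b", "c"},
    "carol": {"a"},
    "dave": {"a", "b", "d"},
}
prospects = prospects_for(purchases, association={"a", "b", "c"}, target="c")
```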
  • Marketing departments can use associations based on aggregated purchase data to create behavioral profiles for households. For example, if the transactions/purchases have a customer or household ID, one can take the union of all products purchased in a time period (i.e., multiple market baskets), such as seasonally or yearly, and create associations across the household. In this case one may also want to have a minimum threshold for the associations and/or products based on volume or dollars. This can be done on panel data, such as Nielsen, or on purchase data from a retailer. These behavioral imprints can be used to design marketing programs, or to direct a product offer to particular households.
  • Telecommunications companies can use associations to find patterns of calling/called city pairs, which can be used for designing calling programs, detecting fraud, and doing capacity planning. Patterns of routing can also be detected between the switching points in the network by constraining the associations to be ordered.
  • Insurance companies in diverse insurance market segments are interested in detecting fraud.
  • medical insurance companies can use the Dervish system to detect when unnecessary procedures (e.g., tests, treatments) are routinely added to a common occurrence or diagnosis.
  • the associations would consist of heterogeneous sets that include the diagnosis, along with the treatments.
  • a high occurrence of unnecessary procedures for a particular diagnosis indicates widespread fraud.
  • a broken arm would typically call for an x-ray and a cast.
  • unnecessary procedures e.g., blood typing
  • automobile insurance companies can use the Dervish system to detect when a ring of people, acting in different roles, file false claims. For example, in one claim John Doe is the patient, Jane Smith is the doctor, and Tom Jones is a witness, while in another claim Jane Smith is the patient, Tom Jones the doctor, and John Doe the witness.
  • Physicians can track the symptoms of a set of patients.
  • the Dervish system can be used on a patient database to find out if certain symptoms occur together frequently. For example, researchers can use the Dervish system to discover if multiple minor symptoms occurring together often lead to major surgery.
  • Behavioral investigators can use the Dervish system to discover if certain behaviors often go together. Using a database cataloguing behaviors, the Dervish system can be used to discover clustering of behaviors.
  • Geneticists want to discover if certain traits appear in an organism more frequently than can be expected with random distributions.
  • the traits are stored in a database, from which the Dervish system can discover clustered traits. Using this information, geneticists can then possibly link these sets of traits to genetic information, such as is stored in sequences of DNA, or to mutation characteristics.
  • Demographers classify populations according to gender, race, age and income. Demographers can use the Dervish system to discover how these variables are clustered in order to assess implications for zoning decisions, school districts, etc.
  • City planners can use the Dervish system on a database with historical information on maintenance tasks to find out if certain types of city maintenance tasks occur together more frequently than others. For instance, minor water main leakage and major sewer main repairs, or power failure and graffiti damage. Quickly identifying certain sets of multiple maintenance tasks occurring in short order may lead to a new requirement for preventative maintenance in order to prevent additional tasks from occurring.
  • Electronic commerce (e-commerce) applications on the World Wide Web (WWW) can use the Dervish system in real-time environments. Using historical purchase behaviors of customers, such applications can determine which products are often purchased together, and then use these sets of products to select
  • Path-tracking applications can use the Dervish system to detect common navigation paths through WWW web sites.
  • the Dervish system is supplied with the tracking information for each individual user, consisting of a sequence of web pages.
  • the Dervish system creates associations from which it can be determined, based on the current web page the user is reading, what the next page or pages are likely to be. Based on this information, such path-tracking applications can present custom web pages to the user.
  • Path-tracking applications can also be used with other interactive technologies, such as inbound and outbound telemarketing and voice navigation systems for call centers.
  • T refers to the set of all transactions. This is the set that is stored in the database, from which the associations have to be extracted. A linear order is assigned to the transactions, that is, $T = \{ t_i \mid 0 \le i < M \}$.
  • P refers to the set of all products. A linear order is assigned to the products, that is, $P = \{ p_j \mid 0 \le j < N \}$.
  • a transaction t, which is a set of products, is therefore a subset of P. Also, multiple transactions can have the same set of products.
  • Before running the Dervish system, the user sets the "minimum support" to define the minimum number of transactions in which a collection of products must be present to form an association.
  • Embodiments of the Dervish system can obtain the minimum support (or, more generally, a numerical "transaction threshold") from user input. As those skilled in the art will appreciate, other methods could be used to obtain the minimum support, including reading a value from a file or hardcoding a value in the system itself. As will be apparent to one of ordinary skill in the art, obtaining a numerical transaction threshold is an example of a step that could be implemented by an input module as embodied in hardware and/or software. Minimum support is referred to by a positive integer, s.
  • If Q is a set of products, a subset of P, then Supt(Q) is the set of transactions such that each transaction in the set contains all products in Q. That is, $Supt(Q) = \{ t \in T \mid Q \subseteq t \}$. If Supt(Q) has at least s elements, then Q is an association with minimum support s (or an association, for short), and Supt(Q) is the support for the association.
  • each subset of an association is also an association with the same minimum support.
  • a maximal association is an association that is not contained in any other association with the same minimum support.
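  • The definitions of support, association, and maximal association can be written down directly. The following is a sketch under the assumption that transactions are Python sets and are identified by their position in a list; the function names `supt`, `is_association`, and `maximal_only` are illustrative.

```python
def supt(q, transactions):
    """Indices of the transactions that contain every product in q."""
    q = set(q)
    return {i for i, t in enumerate(transactions) if q <= t}

def is_association(q, transactions, s):
    """True if q is supported by at least s transactions."""
    return len(supt(q, transactions)) >= s

def maximal_only(associations):
    """Keep the associations that are not strictly contained in another one."""
    sets = [frozenset(a) for a in associations]
    return [a for a in sets if not any(a < b for b in sets)]

# Hypothetical sample data.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "soda"},
    {"bread", "soda"},
    {"milk", "soda", "chips"},
]
pair_support = supt({"bread", "milk"}, baskets)
maximal = maximal_only([{"bread"}, {"milk"}, {"bread", "milk"}])
```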
  • the associations discovery problem is, given a positive integer s, to find all associations in T with minimum support s. Given
  • the Dervish system finds all associations with minimum support s in the list of transactions. Each association the Dervish system finds is unique.
  • For each transaction t there is an associated metric, denoted C(t) as a matter of convenience because cost is often (but not necessarily) the quantity being measured.
  • For each association Q, the cost C(Q) of the association could be computed as: $C(Q) = \sum_{t \in Supt(Q)} C(t)$
  • a metric is a measure associated with each association, and is computed by accumulating a measure of each transaction supporting the association. Still more generally, some metrics might not be accumulated over all the products in each transaction, but only over those associated products in each transaction. In any case, a metric related to a transaction may be referred to as a transaction metric.
  • a metric can be a scalar or an array. For instance, total transaction cost could be tracked by itself or with gross profit per transaction at the same time.
  • the metric computation need not be a simple accumulation. The metric computation could be done, for example, by any scalar or vector operation.
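  • A simple accumulation of an array-valued metric over the supporting transactions might look like the sketch below. The pairing of total cost with gross profit follows the example in the text; the function name `association_metrics` and the numbers are assumptions.

```python
def association_metrics(q, transactions, metrics):
    """Sum the per-transaction metric vectors over the transactions that contain q."""
    q = set(q)
    totals = [0.0] * len(metrics[0]) if metrics else []
    for basket, metric in zip(transactions, metrics):
        if q <= basket:
            totals = [a + b for a, b in zip(totals, metric)]
    return totals

# Hypothetical sample data: (total cost, gross profit) per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "soda"},
    {"bread", "soda"},
    {"milk", "soda", "chips"},
]
metrics = [(5.0, 1.0), (9.0, 2.5), (6.0, -0.5), (8.0, 2.0)]
bread_soda = association_metrics({"bread", "soda"}, baskets, metrics)
```

  • A scalar metric is the special case of a one-element vector; other accumulation rules could replace the element-wise sum.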
  • the following is a high-level description of one implementation of the Dervish system, consisting of an initial phase followed by a recursive phase. That is, at the end of the recursive phase, the Dervish system returns to the start of the recursive phase.
  • the associations that consist of a single product are computed as well as their support.
  • these supports are combined to compute supports for candidate associations, where each candidate association consists of more than one product.
  • the following implementation of the Dervish system is one in which the transactions that form the support for each association are computed by means of set intersections.
  • for each product, the Dervish system finds the set of all transactions that contain this product. That is, the Dervish system computes $Supt(\{p_j\})$ for all $j < N$.
  • the Dervish system selects those products that occur in at least s transactions, and thus form a 1-product association, that is, those $p_j$ for which $Supt(\{p_j\})$ has at least s elements.
  • each subset of an association is also an association with the same minimum support.
  • 1-product associations are the smallest possible subsets of all possible associations. Therefore, a product that is not a 1-product association cannot be part of any other associations.
  • Determining a subset of transactions supporting an association (e.g., $Supt(\{p_k, p_l\})$) from within a set of transactions supporting a smaller association (e.g., $Supt(\{p_l\})$) is an example of a step that could be implemented by control logic as embodied in hardware and/or software, as will be apparent to one of ordinary skill in the art.
  • if the Dervish system finds that $Supt(\{p_k, p_v\})$ is the support for an association, for some index v, then there are at least s transactions with both product $p_k$ and $p_v$.
  • the Dervish system has now also generated a sequence of sets
  • the Dervish system continues at the beginning of the recursion.
  • the Dervish system finds the smallest index w (w > v) such that $Supt(\{p_k, p_v, p_w\})$ forms the support of an association.
  • the Dervish system continues this process of augmenting the size of an association until no more products can be added to the association, that is, until the association is a maximal association. Then the Dervish system attempts to augment the remaining associations.
  • when the Dervish system finds an association, with any number of products, the Dervish system only tries to add products to the association whose index is larger than
  • the Dervish system finds a maximal association. Since the initial set of associations is finite, the Dervish system always terminates.
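  • The recursive phase described above, in which supports are combined by set intersection and only products with a larger index are added, can be sketched as follows. This is a simplified illustration that uses Python sets rather than the in-place data structures of the Dervish system; the function name `discover` and the sample data are assumptions.

```python
def discover(transactions, products, s):
    """Depth-first discovery of associations over an ordered list of products."""
    # Initial phase: support of every single product.
    support = [{i for i, t in enumerate(transactions) if p in t} for p in products]
    frequent = [j for j in range(len(products)) if len(support[j]) >= s]
    found = []

    def augment(current, current_support, start):
        # Only products later in the ordering are tried, so each
        # association is generated exactly once.
        for pos in range(start, len(frequent)):
            j = frequent[pos]
            candidate_support = current_support & support[j]
            if len(candidate_support) >= s:
                association = current + [products[j]]
                found.append((tuple(association), len(candidate_support)))
                augment(association, candidate_support, pos + 1)

    augment([], set(range(len(transactions))), 0)
    return found

# Hypothetical sample data.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "soda"},
    {"bread", "soda"},
    {"milk", "soda", "chips"},
]
associations = discover(baskets, ["bread", "chips", "milk", "soda"], s=2)
```

  • The recursion ends because each call can only move forward through the finite, ordered list of products.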
  • the set of transactions (T) is stored in a transaction database.
  • the transaction database is represented as a matrix. Each column of the matrix represents a product, and contains an array of all transactions that contain that product. Each transaction is initially represented by a row in the matrix.
  • One of the operations performed by Dervish 200 is to swap entries within a product column. This means that a single row then no longer represents a transaction; instead, each transaction is represented as a linked list, linking the columns containing all the products in the transaction.
  • the support for each candidate association (i.e., the set of transactions) is preferably represented with a single index into an array; in such a case, the memory use is related to the number of transactions and products.
  • the first procedure is the top-level procedure, Dervish 200, which searches for all associations.
  • Dervish 200 invokes the second procedure, Metrics 300, which returns a list of products in the association and computes the associated metrics.
  • metrics computation is not necessary for all applications, and may be included or omitted as appropriate.
  • Whirl 400 searches for new candidate associations by attempting to augment the association just computed with any of the remaining products.
  • the first structure, item, is a two-dimensional array, whose first dimension is the set of products and whose second dimension is the set of transactions that contain the corresponding product.
  • the second structure, prod, is a one-dimensional array that contains information about each product, including a number of indices into array item, which keep track of candidate associations.
  • The data for the set of products are stored in prod.
  • Each product $p_j$ has a data structure, prod(j), which includes the variables name and trans.
  • the name variable contains the name of the product.
  • the trans variable is an array whose size is no larger than the number of products.
  • if prod(j).trans(k) is defined, it is an index into item(j) (e.g., indicates a location within item(j)), so that, for 0 ≤ n < k, item(j)(n) has a set of pointers that represent a transaction containing the current candidate association containing product $p_j$.
  • a set of pointers may be referred to as, or constitute, a transaction; a transaction product combination may be referred to simply as an item.
  • each entry in prod(j).trans(k) may be referred to as an element of the trans variable.
  • Dervish 200 represents the transaction/product combination in item with two pointers, product and right.
  • the first pointer, product, points to prod(j);
  • the second pointer, right, points to the next item in transaction $t_i$, if any. If there is no next item in transaction $t_i$, the right pointer points to a structure containing the metric for this transaction, $C(t_i)$.
  • Dervish 200 defines prod(j).trans(0) as the number of transactions that have prod(j), for all j.
  • prod(j).trans(0) is the size of the support for the candidate association containing only product $p_j$, as well as the number of transactions containing product $p_j$.
  • the Dervish system creates item(j), an array of items of size prod(j).trans(0).
  • Dervish 200 then fills the array with the transactions that contain product prod(j).
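  • One way to picture the prod and item structures is the sketch below. It follows the description (a product pointer, a right pointer, and trans(0) holding the number of transactions per product), but the class names `Prod` and `Item`, the helper functions, and the sample data are assumptions, and the later swapping of entries is not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Prod:
    name: str
    trans: list = field(default_factory=list)  # trans[0]: number of transactions with this product

@dataclass
class Item:
    product: Prod         # the product this entry belongs to
    right: object = None  # next Item of the same transaction, or the metric C(t)

def build(transactions, names, metrics):
    """Create prod and item; item[j] holds one entry per transaction containing names[j]."""
    prod = [Prod(name=n, trans=[0]) for n in names]
    item = [[] for _ in names]
    for basket, metric in zip(transactions, metrics):
        nxt = metric  # the last item of a transaction points to its metric
        for j in reversed(range(len(names))):
            if names[j] in basket:
                entry = Item(product=prod[j], right=nxt)
                item[j].append(entry)
                nxt = entry
    for j in range(len(names)):
        prod[j].trans[0] = len(item[j])
    return prod, item

def products_of(entry):
    """Follow the right pointers to list a transaction's products and its metric."""
    names = []
    while isinstance(entry, Item):
        names.append(entry.product.name)
        entry = entry.right
    return names, entry

# Hypothetical sample data.
baskets = [{"bread", "milk"}, {"bread", "milk", "soda"}, {"bread", "soda"}]
prod, item = build(baskets, ["bread", "milk", "soda"], [5.0, 9.0, 6.0])
```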
  • the associations discovery process begins.
  • the trans variables for the products could be stored in a single two-dimensional array, or could be incorporated into the item structure.
  • the objective of Dervish 200 is to find all prod(j).trans(k) whose value is at least min_support.
  • the process Dervish 200 uses to achieve this objective is described with reference to Figure 2.
  • Dervish 200 begins by initializing all data structures (as described in the previous section), and setting k to 0, because the initial association has 0 products, that is, no association has been found yet (step 205). Now Dervish 200 starts the main loop (steps
  • Dervish 200 scans through all products in the prod structure, from left to right, to see if Dervish 200 can augment the current association.
  • Dervish 200 starts with the first product, by setting index j to 0 (step 215). If index j is smaller than N (where N is the number of products), that is, if j is the index of a product (step 220), then prod(j).trans(k) is the size of the support for the candidate association that consists of the current association augmented with the j-th product. Now if prod(j).trans(k) is at least min_support (step 230), then this candidate association is an actual association.
  • Dervish 200 can store this association and compute the metrics for this association (step 240).
  • Step 240 consists of computing the products in the current association and accumulating metrics. Step 240 is described in the following section.
  • Dervish 200 sets prod(j).trans(k) to -1, to ensure that this association is found only once.
  • once Dervish 200 finds an association, Dervish 200 calls a procedure, such as Whirl 400, which computes a set of candidate associations consisting of the current association, augmented with each of the products with an index larger than j (step 245).
  • An implementation of such a procedure, Whirl 400, is illustrated in Figure 4 and is discussed in detail below.
  • Dervish 200 increments k (step 250), as the size of the current association is now one larger than the size of the previous association. Dervish 200 then returns to the beginning of the main loop (step 210), to attempt to find even larger associations.
  • Dervish 200 finds that prod ( j ) . trans (k) is not at least as large as min_support (step 230), Dervish 200 moves on to the next candidate association by incrementing j (step 235) and returning to step 220. If, in step 220, index j does not point to a product, then Dervish 200 cannot find any more candidate associations that form an association. Dervish 200 then sets prod ( i ) . trans (k) to 0 for all i and decrements k (step 225) to reduce the size of the current association. Dervish 200 then returns to the start of the main loop (step 210).
  • Dervish 200 now repeats the main loop, attempting to increase the size of the new current association. If k is decremented in step 225 to be negative, then there are no more associations to be discovered and Dervish 200 exits the main loop (step 210), and the system terminates.
  • Dervish 200 starts with an association, and tries to augment the association. If Dervish 200 succeeds, Dervish 200 attempts to augment the association further by adding any of the remaining products. If Dervish 200 does not succeed, Dervish 200 drops the last product that was added to the association and attempts to augment the resulting association with any of the other remaining products.
  • Dervish 200 includes the steps of: (a) forming a candidate association by augmenting a previously determined association with an additional product; (b) testing whether a candidate association has minimum support; and (c) if minimum support exists, accepting as a set, the set of transactions forming the minimum support.
  • steps that could be implemented, respectively, by: (a) augmentation logic; (b) testing logic; and (c) acceptance logic, each as embodied in hardware and/or software.
  • trans (k) = 0 for all i
    d10   k--                  // try smaller size
    d11 } else {               // prod ( j ) .
  • trans (k) = -1            // mark as done
    d15   whirl (k, j)
    d16   k++                  // try larger association
    d17 }
    d18 }
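  • The search in steps 210 through 250 can be sketched in Python. This is a minimal sketch under stated assumptions, not the in-place array implementation described above: each product is mapped to a set of transaction ids (a simplification standing in for the item arrays and the trans (k) counters), recursion stands in for incrementing and decrementing k, and the names find_associations and augment are hypothetical.

```python
from typing import Dict, Iterator, Set, Tuple


def find_associations(
    product_transactions: Dict[str, Set[int]],
    min_support: int,
) -> Iterator[Tuple[Tuple[str, ...], Set[int]]]:
    """Yield (association, supporting transaction ids) in depth-first order."""
    products = sorted(product_transactions)  # fixed product order, indexed by j

    def augment(assn, support, start):
        for j in range(start, len(products)):
            # (a) candidate = current association plus product j
            candidate_support = support & product_transactions[products[j]]
            # (b) test the candidate for minimum support
            if len(candidate_support) >= min_support:
                candidate = assn + (products[j],)
                # (c) accept it, then try larger associations first
                yield candidate, candidate_support
                yield from augment(candidate, candidate_support, j + 1)

    every_transaction = set().union(*product_transactions.values())
    yield from augment((), every_transaction, 0)


# Hypothetical data: product -> ids of the transactions that contain it.
example = {"a": {1, 2, 3}, "b": {1, 2, 3, 4}, "c": {1, 2, 4}}
found = [assn for assn, _ in find_associations(example, min_support=2)]
print(found)
```

  • In this sketch each association is followed immediately by its augmentations, and only after those are exhausted does the search drop the last product and try the remaining ones.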
  • In step 240, Dervish 200 constructs a list of the products in an association and computes the associated metrics. As described herein, the metrics are computed by invoking a separate procedure, Metrics 300.
  • The steps for these processes can be performed in whole or in part by the Dervish or Whirl procedures, or a combination thereof. In other words, the labels and organization of the various procedures are a matter of convenience only and do not affect the underlying functionality to be performed.
  • In step 240, the system is in a state where an association has been found. Since only the support of the association has been stored directly, the system now has to compute the products that comprise the association and the metrics for the association. The following describes how to perform these computations.
  • In step 240, the first prod ( j ) . trans (k) items of prod ( j ) are the transactions that support the association. Finding the products that constitute the association is done as follows. The product with the highest index in the association is
  • Metrics 300 is an implementation of step 240 of Dervish 200 that computes the set of products forming the current association and the metrics associated with this association. Referring now to Figure 3, the following describes an implementation of Metrics 300.
  • the current association is of size k+1, and the last product to be added to the current association is product prod ( j ) .
  • The variable that holds the products in the current association, Assn, is set to include product prod ( j ), variable p is set to j, and variable q is set to k - 1 (step 310).
  • When step 320 succeeds, prod (p) is a product in the current association, and prod (p) is added to variable Assn (step 330). Then Metrics 300 continues to search for the next product in the current association by decrementing q (step 335), and returning to the start of the main loop (step 315).
  • Steps 310-335 are examples of steps that could be implemented by logic configured to identify the associated products as embodied in hardware and/or software, as will be apparent to one of ordinary skill in the art.
  • Metrics 300 accumulates the metrics for the association by computing the metrics for each of the transactions that forms the support for the current association. The metrics accumulation takes place in steps 340 through 370. For each of the transactions that forms the support for the association, Metrics 300 searches for the metrics pointer, then accumulates the associated metrics.
  • The metrics accumulation is initiated by setting variable t to 0 (step 340). As long as t is smaller than prod ( j ) . trans ( k ) , there is another transaction for which the metrics have to be found (step 345). The transaction for which the metrics have to be found is the transaction represented by item ( j ) ( t ) . Metrics 300 sets variable current to item ( j ) ( t ) (step 350). As long as variable current does not point to a metrics pointer (step 355), the value of current is replaced with the value of current . right (step 360).
  • Eventually, current will be a metrics pointer (step 355 succeeds). Then the metrics that this pointer points to are accumulated with previously accumulated metrics (step 365), and the procedure moves on to the next transaction by incrementing the value of t (step 370) and returning to step 345.
  • Metrics 300 ends. Steps 340-370 are examples of steps that could be implemented by measurement logic as embodied in hardware and/or software, as will be apparent to one of ordinary skill in the art.
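  • Steps 340 through 370 can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not the patented data layout: the classes Item and Metrics, their fields, and the function accumulate are hypothetical names, and the profit figures are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Metrics:
    size: int      # number of products in the transaction
    profit: float  # gross profit of the transaction (invented values below)


@dataclass
class Item:
    product: str
    right: Union["Item", Metrics]  # next item in the transaction, or its metrics


def accumulate(support_items: List[Item]) -> Metrics:
    """Follow right pointers from each supporting item to its metrics and sum them."""
    total = Metrics(size=0, profit=0.0)
    for current in support_items:                # steps 340, 345, 350, 370
        while not isinstance(current, Metrics):  # step 355
            current = current.right              # step 360
        total.size += current.size               # step 365
        total.profit += current.profit
    return total


# Two hypothetical transactions stored as singly linked lists ending in metrics.
t1 = Item("a", Item("b", Metrics(size=2, profit=1.5)))
t2 = Item("a", Item("b", Item("c", Metrics(size=3, profit=2.0))))
support_items = [t1.right, t2.right]  # the items for product b in each transaction
total = accumulate(support_items)
print(total)
```

  • The inner while loop is the search that later embodiments shorten by storing a direct metrics pointer with each item.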
  • the Whirl 400 procedure is an implementation of step 245 of Dervish 200 that, given the current association, computes candidate associations that contain the current association.
  • This section first presents a high-level discussion of Whirl 400. This is followed by a discussion of a flow diagram and pseudo-code for an implementation of Whirl 400. In the conceptual basis, several assumptions are made. Assume an association of size k+1 has been found, that is, there are at least min_support transactions that have k+1 products in common. If prod ( j ) is the product among these with the highest index, and Q is the set of the other k products, then Supt(Q ∪ {prod ( j )}) is the support for the association.
  • Whirl 400, which has inputs j and k, finds all transactions, for each m where m > j , containing all products in Q ∪ {prod ( j )}, as well as containing prod (m) .
  • Whirl 400 separates the transactions that have prod (m) into a group that also contains all products in Q ∪ {prod ( j )} and a group that does not. In so doing, Whirl 400 computes new candidate associations.
  • After initializing counter trans (k+1) for each product (step 405), Whirl 400 consists of two nested loops. The outer loop (steps 410 through 440) goes through all transactions that form the support for the current association, where prod ( j ) is the last product to be added to this association. Once all these transactions have been processed (when step 410 fails), execution of Whirl 400 terminates. For each of these transactions (when step 410 succeeds), Whirl 400 considers each of the remaining products (that is, products with an index higher than j ).
  • Whirl 400 follows the right pointer for each of these transactions (step 415) until the pointer points to a field that is not a transaction but a metric pointer (step 420). In this case, there are no more products within the current transaction, so Whirl 400 moves on to the next transaction (step 440) and returns to the start of the outer loop (step 410).
  • Whirl 400 moves this transaction to the list of transactions containing all products in the current association as well as the current product, by means of a swap (step 425).
  • This swap includes updating right pointers to the swapped items as necessary.
  • reordering the transactions by swapping is just one example of a procedure for selectively reordering the transactions.
  • step 425 is an example of a step that could be implemented by ordering logic as embodied in hardware and/or software.
  • Whirl 400 increments the number of transactions that form the support for the current association augmented with the current product (step 430). Whirl 400 then moves on to the next transaction containing the current product (step 435) and returns to the start of the inner loop (step 420). Both loops terminate in all cases.
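  • The reordering performed by Whirl 400 can be sketched as follows. This is a simplified illustration, not the procedure of Figure 4: columns are plain lists of transaction ids rather than items linked by right pointers, the two loops are swapped (products outside, transactions inside), and the names whirl, columns, and counts are hypothetical.

```python
from typing import List


def whirl(columns: List[List[int]], counts: List[int], j: int) -> List[int]:
    """For each product m > j, move the transactions that support the current
    association to the front of column m and return how many there are."""
    support = set(columns[j][:counts[j]])  # transactions supporting the association
    new_counts = [0] * len(columns)
    for m in range(j + 1, len(columns)):
        column = columns[m]
        front = 0  # number of supporting transactions moved up so far
        for pos in range(len(column)):
            if column[pos] in support:
                # swap the supporting transaction into the front group (cf. step 425)
                column[front], column[pos] = column[pos], column[front]
                front += 1  # cf. step 430
        new_counts[m] = front
    return new_counts


# Hypothetical columns for products a, b, c; product a is in transactions 1, 4, 5, 7.
columns = [[1, 4, 5, 7], [1, 2, 4, 6, 7], [2, 3, 5]]
counts = [4, 5, 3]
new_counts = whirl(columns, counts, j=0)
print(new_counts, columns)
```

  • After the call, the first new_counts[m] entries of column m are exactly the transactions that contain the current association and product m, which is the support of the corresponding candidate association.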
  • the Dervish system reads the transaction data into the data structures.
  • the array of items for a product is represented as a column in Table 6.
  • the first element of each array (the first transaction) is at the top of the column.
  • the trans ( 0 ) index for each product is at the bottom of the table; the index is the number of transactions that contain the corresponding product.
  • transactions in a product column are represented by transaction id's, rather than by product and right pointers. (In order to focus on the process, the metrics are not included in this example.)
  • the Dervish system starts at the left-most column, product a. This product is in transactions 1 , 4, 5, and 7. For all columns to the right of a, the Dervish system moves transactions 1, 4, 5, and 7, if they occur, to the top of those columns, by swapping them with other transaction numbers. For instance, transaction 2 does not have product a, and transaction 2 is the second entry in column b, while transaction 4 has product a, and is the third entry in column b. The Dervish system swaps transaction numbers 2 and 4 in column b. This swapping can be done by the Whirl procedure.
  • the Dervish system tracks the number of transactions from 1, 4, 5, and 7 that occur in each column and stores that number in trans ( 1 ) .
  • this index remains 0, as the Dervish system has no use for product a at this point.
  • the Dervish system has processed product a, so the Dervish system can set its index trans ( 0 ) to -1.
  • the result of processing product a is Table 7.
  • the Dervish system scans the indices of trans ( 1 ) left to right, and determines that the first index that is at least the minimum support (3 in this example) is in column b. This means that the Dervish system has found an association, with two products, that contains product b. To determine the other product, the Dervish system scans the indices of trans ( 0 ) from left to right and finds that the last column with index -1 is column a. Hence products a and b form an association, as they occur in four different transactions.
  • the Dervish system scans the trans ( 2 ) indices and notes that product c is part of an association with three products.
  • the other two products in this association are products a and b.
  • the Dervish system can try to extend the association with additional products. However, when the Dervish system scans the indices in trans ( 2 ) , there are no more products with at least minimum support. Therefore, the association of products a, b, and c is maximal. Next, the Dervish system determines that there are no other associations that contain a and b.
  • the Dervish system now goes back to scanning the trans ( 1 ) indices, from left to right.
  • the first product to have minimum support is c, so product a and product c form an association. Since none of the other trans ( 1 ) indices has at least minimum support, the association of a and c is maximal.
  • the Dervish system goes back to scanning the trans ( 0 ) indices, where the
  • the Dervish system can only add product f, as trans ( 2 ) for f is three. But the Dervish system finds that trans ( 3 ) for f is two, so that products b, c, and d form a maximal association. Also, b, c, and f form a maximal association.
  • products c and d form a maximal association, because the association cannot be augmented with the only other candidate, product f.
  • Products c and f also form a maximal association.
  • the Dervish system finds that d and e form an association, which cannot be augmented with f, and that d and f form a maximal association.
  • the Dervish system concludes that e and f do not form an association and that product f forms a maximal association.
  • the Dervish system found that each product forms an association by itself.
  • the Dervish system found the following associations of multiple products, in the order they were generated: {a,b}, {a,b,c}, {a,c}, {b,c}, {b,c,d}, {b,c,f}, {b,d}, {b,e}, {b,f}, {c,d}, {c,f}, {d,e}, and {d,f}.
  • the previous example is used, with the addition of metrics.
  • two types of metrics are accumulated, average transaction size (e.g., number of products) and average gross profit per transaction.
  • Two columns representing the metrics for each transaction are added to the transaction table.
  • Table 15 is the state of the Dervish system when this association is found. It is the same table as Table 9.
  • the metrics information in Table 16 includes, for each transaction, the size of the transaction and the dollar amount for the transaction. In addition, Table 16 includes the cost for each product.
  • the metrics accumulation procedure finds the metrics for each transaction. While the Dervish system accumulates the transaction size and gross profit metrics simultaneously, different approaches are used for each metric.
  • the metrics procedure accumulates the transaction size metric. For transaction 1, there are four products, so the first entry in the "size" column is 4. The metrics procedure stores that one transaction has been processed and that the accumulated size is 4. Then, for transaction 2, there are also four products. The metrics procedure stores that two transactions have been processed and that the accumulated size is 8. Then the procedure processes transaction 6, which has five products, and stores that three transactions have been processed, with an accumulated size of 13. Finally, to get the average transaction size, the procedure divides the accumulated size, 13, by the number of transactions processed, 3, to get an average transaction size of 4.33.
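  • The running accumulation described above can be reproduced in a few lines of Python. The variable names are hypothetical; the sizes are the ones given for transactions 1, 2, and 6.

```python
# Sizes of the three supporting transactions (transactions 1, 2, and 6).
sizes = [4, 4, 5]

processed = 0    # number of transactions processed so far
accumulated = 0  # accumulated transaction size

for size in sizes:
    processed += 1
    accumulated += size

average_size = accumulated / processed
print(processed, accumulated, round(average_size, 2))
```

  • Only the two running totals need to be stored per association; the division is deferred until all supporting transactions have been processed.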
  • the part of the metrics procedure that computes the average profit is somewhat more complicated. Again, the procedure processes the three transactions in order. For transaction 1, the "amount" column is $7.48. To obtain the gross profit on this transaction, the procedure finds the products that constitute the transaction (a, b, c, and d) and their costs in the second table (Table 16). The total cost for these four products is
  • the invention is not limited to the embodiment described above. Any number of optimizations and enhancements are possible. Some of these are useful, for example, to reduce the run-time or the memory requirement for processing actual transaction databases. Other embodiments of the Dervish system are particularly useful for product categories, i.e., products that are organized into a product hierarchy. In some such embodiments, additional variables might be added to the existing data structures. Still
  • the number of operations is reduced if the products are ordered in prod and item, from the product with the smallest number of transactions, to the product with the largest number of transactions. This is because the first product is checked most often for inclusion in an association, and the product that occurs in the fewest transactions is statistically the least likely product to appear in an association.
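  • A minimal sketch of this ordering, assuming each product is mapped to the set of transactions that contain it (the function name order_products and the data are hypothetical):

```python
from typing import Dict, List, Set


def order_products(product_transactions: Dict[str, Set[int]]) -> List[str]:
    """Order products from fewest to most transactions (ties broken by name)."""
    return sorted(
        product_transactions,
        key=lambda product: (len(product_transactions[product]), product),
    )


# Hypothetical data: product -> ids of the transactions that contain it.
example = {"a": {1, 4, 5, 7}, "b": {1, 2, 4, 6, 7}, "c": {2, 3, 5}}
order = order_products(example)
print(order)
```

  • The tie-break on the product name is an addition of this sketch; it only makes the order deterministic.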
  • Another embodiment of the Dervish system takes into account the fact that prod ( j ) . trans ( k ) is always 0 for all k > j . Therefore, there is no need to allocate memory for these items, because no part of the system refers to them.
  • Having a list of pointers to all products in the association also makes it easier to produce the current association (step 240), because it eliminates the procedure of scanning through the set of trans (k) indices.
  • Another embodiment of the Dervish system adds a left pointer to each of the right pointers described previously. Whereas the right pointer points to the next product item in a transaction, the left pointer points to the previous product item in a transaction. Now the list of products in a transaction is a doubly-linked list, rather than a singly-linked list. This facilitates the swapping of items, such as is performed in step 425 of Whirl 400.
  • As will be obvious to one skilled in the art, any data structure described above that is an index into an array can be replaced with a pointer to the address of a location in the array. Similarly, any pointer can be replaced with an index into an array.
  • the inner loop examines products and the outer loop examines transactions. It is possible to swap the two loops.
  • the Whirl and Metrics procedures can be combined. In the embodiments above, step 355 in Metrics 300 and step 420 in Whirl 400 test for the same condition, in a similar loop. Rather than perform each loop once in Metrics and once in Whirl, there can be just a single loop. Such a combination of loops speeds up the system, at no expense in memory requirements.
  • the embodiment of the Dervish system presented in section 4 performs a modified depth-first search. It is possible to implement the Dervish system as a pure depth-first search. That is, the Dervish system attempts to augment the current association with a single product, then augment the resulting association, before attempting to augment the current association with other products.
  • the depth-first search uses a minimal amount of memory (since only one association is stored at any time), and the search mechanism produces associations while the system runs, rather than at the end. It is also possible to implement the Dervish system as a breadth-first search. In such an implementation, the Dervish system attempts to find all products that form an association with the current association, then attempts to augment each of the resulting associations. More generally, those skilled in the art will realize that virtually any heuristic search method can be used as well.
  • Hierarchical Products
  • In a typical application of the associations discovery problem, the set of products can be divided into categories or sub-categories. For instance, a supermarket can have the category "beverages" with a sub-category "fruit juices," both of which contain a particular apple juice product.
  • the associations discovery problem can be generalized to include these "hierarchical products."
  • the term "product” should be understood to include products (per se), categories, and/or sub-categories. For instance, in the above example, it is possible that there is no association that includes apple juice (product) and hamburger patties (product), but that there is an association that includes fruit juices (sub-category) and hamburger patties (product).
  • Another embodiment of the Dervish system could handle associations involving hierarchical products as a simple extension of finding associations involving products. For each hierarchical product, the Dervish system computes all transactions that contain that hierarchical product. The Dervish system then adds this hierarchical product to the list of products, as if the hierarchical product were an actual product. The Dervish system will find each association that involves hierarchical products. Treating hierarchical products as if they were actual products may lead to the generation of trivial associations. For instance, in the above example, if apple juice and hamburger patties form an association, then fruit juices and hamburger patties also form an association. To weed out such associations, a simple check can be added to the condition for the inner loop of Whirl 400, in step 420. Specifically, when a hierarchical product is checked for inclusion in the current association, a check is performed to determine whether the hierarchical product includes any of the products in the current association. If the hierarchical product includes any of the products in the current association, the hierarchical product is skipped.
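  • The handling of hierarchical products described above can be sketched in Python. The sketch assumes a mapping from each hierarchical product to the products it contains; the function names, the mapping, and the orange juice product are hypothetical and only serve the example.

```python
from typing import Dict, Iterable, Set


def category_transactions(
    members: Iterable[str], product_transactions: Dict[str, Set[int]]
) -> Set[int]:
    """Transactions that contain a hierarchical product: the union over its members."""
    result: Set[int] = set()
    for product in members:
        result |= product_transactions.get(product, set())
    return result


def is_trivial(
    candidate: str, association: Iterable[str], hierarchy: Dict[str, Set[str]]
) -> bool:
    """True if the candidate is a hierarchical product that includes a product
    already in the current association, so the candidate should be skipped."""
    contained = hierarchy.get(candidate, set())
    return any(product in contained for product in association)


# Hypothetical data for the supermarket example.
hierarchy = {"fruit juices": {"apple juice", "orange juice"}}
product_transactions = {
    "apple juice": {1, 2},
    "orange juice": {3},
    "hamburger patties": {1, 2, 3},
}
# Add the hierarchical product to the list of products as if it were a product.
product_transactions["fruit juices"] = category_transactions(
    hierarchy["fruit juices"], product_transactions
)
skip = is_trivial("fruit juices", ["apple juice", "hamburger patties"], hierarchy)
keep = not is_trivial("fruit juices", ["hamburger patties"], hierarchy)
print(product_transactions["fruit juices"], skip, keep)
```

  • With this check, fruit juices is still considered together with hamburger patties, but not together with apple juice, which it already contains.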
  • the hierarchical products that contain the product should be later in the order of products. Then, if the list of products is ordered from least-frequent to most-frequent, a hierarchical product is guaranteed to occur in the list of products after all of the products (and hierarchical products) the hierarchical product contains. If the hierarchical product contains exactly one product, care should be taken that the hierarchical product occurs in the list after the product it contains.
  • 7.4. Other Embodiments of Metrics Accumulation
  • There are several ways that metrics can be accumulated for the Dervish system. This is the implementation of step 240 of Dervish 200. Which method should be chosen depends on the complexity of the metric accumulation function, and on the number of associations for which metrics need to be calculated.
  • the Dervish system keeps the data structures representing the sets of products and transactions as before. Each transaction is represented by a linked list; each element of the linked list represents a product in that transaction, and the last element of the list is a pointer to the metric for the transaction. If there are many products in each transaction, traversing the entire list of products for each transaction may take too much time. In that case, an embodiment reduces the metrics accumulation time by using more memory.
  • For each product in each transaction, this embodiment has a pointer directly to the metrics for that transaction. Once the Dervish system identifies an association, and the transactions that form its support, the Dervish system follows the metrics pointer, and accumulates the metrics. This reduces the time to search for the metrics to a constant, as the loop that includes steps 355 and 360 in Metrics 300 is eliminated.
  • Another embodiment only includes the metrics pointer for each 27 product in a transaction. This reduces the memory consumption (compared to the previous embodiment), while maintaining a constant search time for the metrics.
  • Yet another embodiment is to do a cascaded metrics accumulation. This is useful when the user wants to get all associations and their metrics, not just the maximal associations.
  • the Dervish system uses the result for larger associations (that is, with more products), to compute the metrics for smaller ones. For instance, assume an association {a, b, c} was discovered after association {a, b}, and assume the Dervish system has computed the metrics for association {a, b, c}. Now there is a simple split in the transactions for association {a, b} between those transactions that contain product c, and those that do not. For each transaction in association {a, b}, the Dervish system follows the right pointers, as before.
  • the Dervish system does not have to calculate the metrics for this transaction, as the metrics for this transaction are already included in the metrics for association {a, b, c}. If, however, the right pointer never points to an item of product c, then the Dervish system has to calculate its metrics, and accumulate the metrics for product c with the remaining metrics. This requires more bookkeeping, but may be faster if the metrics computation is complicated.
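  • The cascaded accumulation can be illustrated with a short Python sketch for an additive metric (total transaction size). The function cascaded_total and the data are hypothetical; sets of transaction ids stand in for following right pointers.

```python
from typing import Dict, Set


def cascaded_total(
    total_abc: int,
    support_ab: Set[int],
    contains_c: Set[int],
    size_of: Dict[int, int],
) -> int:
    """Total size for {a, b}, reusing the total already computed for {a, b, c}."""
    # Only transactions of {a, b} that lack product c still need to be visited.
    remaining = sum(size_of[t] for t in support_ab if t not in contains_c)
    return total_abc + remaining


# Hypothetical data: transaction id -> transaction size.
size_of = {1: 4, 2: 4, 4: 3, 6: 5}
support_ab = {1, 2, 4, 6}   # transactions containing a and b
contains_c = {1, 2, 6}      # of those, the ones that also contain c
total_abc = sum(size_of[t] for t in support_ab & contains_c)
total_ab = cascaded_total(total_abc, support_ab, contains_c, size_of)
print(total_abc, total_ab)
```

  • The saving comes from skipping the metric computation for every transaction already counted in the larger association, at the cost of keeping the larger association's totals available.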
  • Still another embodiment is to do metrics accumulation on variables that are not pre-defined.
  • the actual metrics accumulation occurs in step 365.
  • the average size and average gross profit per transaction for an association were computed.
  • One can compute such metrics by defining steps or procedures, and the variables they use, when the system is constructed. This is known as static metrics accumulation.
  • steps or procedures for certain operations such as taking the average of a list of numbers
  • the variables that have to be used for metrics accumulation are defined, as well as the specific operations that are to take place on these variables. These specifics can be supplied through a script or by other means. This is known as dynamic metrics accumulation.
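  • One way to picture dynamic metrics accumulation is a table of predefined operations combined with a specification supplied at run time. The sketch below is only an illustration of that idea: the OPERATIONS table, the spec format, the function accumulate_dynamic, and the amounts are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Operations defined when the system is constructed.
OPERATIONS: Dict[str, Callable[[Sequence[float]], float]] = {
    "sum": sum,
    "average": lambda values: sum(values) / len(values),
    "max": max,
}


def accumulate_dynamic(
    spec: Dict[str, Tuple[str, str]],
    transactions: List[Dict[str, float]],
) -> Dict[str, float]:
    """Apply each (variable, operation) pair of the spec to the supporting transactions."""
    results = {}
    for name, (variable, operation) in spec.items():
        values = [transaction[variable] for transaction in transactions]
        results[name] = OPERATIONS[operation](values)
    return results


# The spec could come from a script; here it is written inline.
spec = {"average size": ("size", "average"), "total amount": ("amount", "sum")}
transactions = [
    {"size": 4, "amount": 7.5},
    {"size": 4, "amount": 6.0},
    {"size": 5, "amount": 9.25},
]
results = accumulate_dynamic(spec, transactions)
print(results)
```

  • The variables and the operations applied to them are chosen at run time, while the operations themselves remain fixed in the system.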
  • the Dervish system is memory-efficient and has a maximum memory requirement that can be calculated before running the system. Still, it is possible that the memory requirement is larger than the size of the main memory of the computer. In that case, rather than declaring too much memory, and have the computer slow down, the associations discovery problem can be partitioned into sub-problems, and the Dervish system can aggregate intermediate results.
  • This method of partitioning the problem and then aggregating intermediate results can also be used to parallelize the Dervish system. Rather than loading each sub-problem in main memory, the Dervish system can compute the results for each sub-problem on a different processor. The aggregation of results, too, can be done in parallel.
  • the following describes an implementation of an "out-of-memory" version of the Dervish system and discusses how this is the basis for a "parallel" version.
  • This method can be viewed as including the steps of: (a) creating multiple sub-processes; (b) determining intermediate candidate association results for each sub-process; and (c) aggregating these intermediate results.
  • steps that could be implemented, respectively, by: (a) a partitioning module; (b) an iteration module; and (c) aggregation logic, each as embodied in hardware and/or software.
  • 7.6. Checking Support for a Candidate Association
  • the partitioning of the associations discovery problem is a partitioning of the set of transactions into distinct parts (i.e., partitions or sub-problems).
  • the partitioning process is described in detail below.
  • the Dervish system is run for each partition, and the result of solving each sub-problem is a number of candidate associations (i.e., a list of products that might have minimum support).
  • the Dervish system For each candidate association, the Dervish system counts the number of transactions that contain all products in the candidate association. At this point, the Dervish system only has the count for a particular partition. To count the total number of transactions that contain each product in a candidate association, the Dervish system takes
  • the Dervish system moves the transactions that also contain the first product up and the remaining transactions down.
  • the trans ( 0 ) index for this product is the number of transactions that have both products
  • the first trans ( 0 ) transactions for this product are the transactions that have both products.
  • the Dervish system continues to do this for each remaining product in the candidate association.
  • the trans ( 0 ) index for the last product is the number of transactions that have all the products in the candidate association.
  • the Dervish system makes such a count for each partition and the sum of these counts is the total number of transactions that contain all products in the candidate association. Finally, the Dervish system checks if this sum is at least as large as the minimum support.
  • the Dervish system starts out by scanning the list of products from left to right.
  • the first product in the candidate association is product b, and this product is contained in six transactions.
  • the Dervish system determines which of the six transactions contain the next product, c. There are five such transactions.
  • the Dervish system determines which of these five transactions contain product f.
  • the Dervish system uses Whirl to move the transactions that also have product c up (6, 2, and 7) and the ones that do not have product c down (3 and 8).
  • the resulting table is Table 18.
  • the Dervish system determines that there are three transactions in this partition that contain products b, c, and f.
  • One should note several things about this counting procedure. First, nothing was changed in the columns for products a, d, and e, those products not in the candidate association. Second, the Dervish system only made a single sweep through the columns of the products that were part of the candidate association. In this counting procedure, there is no backtracking, and Whirl is invoked only once for each product, whereas when searching for all candidate associations, Whirl may be invoked multiple times. It follows that this counting procedure is much faster than generating all candidate associations.
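  • The single-sweep count can be sketched as follows. The sketch narrows a set of transaction ids instead of reordering columns with Whirl, and the function names and column contents are hypothetical; the first partition is invented so that it agrees with the counts given above (six transactions with b, five with b and c, three with b, c, and f).

```python
from typing import Dict, List, Sequence


def count_support(columns: Dict[str, List[int]], candidate: Sequence[str]) -> int:
    """Count the transactions in one partition that contain every candidate product."""
    support = None
    for product in candidate:  # one sweep over the candidate's columns, no backtracking
        column = columns.get(product, [])
        if support is None:
            support = set(column)
        else:
            support = {t for t in column if t in support}
    return len(support) if support is not None else 0


def total_support(partitions: List[Dict[str, List[int]]], candidate: Sequence[str]) -> int:
    """Sum the per-partition counts for the candidate association."""
    return sum(count_support(columns, candidate) for columns in partitions)


partition_1 = {"b": [1, 2, 3, 4, 6, 7], "c": [1, 2, 4, 6, 7], "f": [6, 2, 7, 3, 8]}
partition_2 = {"b": [9, 10], "c": [9], "f": [9, 11]}
candidate = ["b", "c", "f"]
count_1 = count_support(partition_1, candidate)
total = total_support([partition_1, partition_2], candidate)
has_min_support = total >= 3  # minimum support of 3, as in the example
print(count_1, total, has_min_support)
```

  • Columns of products outside the candidate association are never touched, which mirrors the first observation above.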
  • the user estimates the worst-case memory requirement for the number of products and transactions. If the memory requirement is larger than the size of the main memory, the Dervish system has to partition the set of transactions. If the total memory requirement is N times the size of the main memory, then the Dervish system partitions the set of transactions into N partitions of roughly equal size.
  • the Dervish system does not depend on any particular partitioning method. While it is necessary that each transaction occur in exactly one partition, it is immaterial which transactions are lumped together in any given partition. It is desirable that the memory requirement for each partition be smaller than the size of the main memory.
  • the Dervish system processes each partitioned set of transactions.
  • s is the minimum support for the entire set of transactions. If there is an association, that is, a set of products that occur in at least s transactions, then there must be a partitioned set for which the same set of products occur in at least s/N transactions.
  • For each partition, the user sets the minimum support to the smallest integer that is at least s/N, and the user runs the Dervish system (or any associations discovery system).
  • Each association found for each partition is a candidate association for the entire set of transactions. A procedure for counting the number of transactions that contain each candidate association was discussed above.
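  • The partition-then-verify scheme can be sketched end to end in Python. This is a sketch under stated assumptions: mine is a small set-based stand-in for an associations discovery system (not the Dervish data structures), the round-robin split is just one admissible partitioning, and all names and data are hypothetical.

```python
import math
from typing import Dict, FrozenSet, Set

Transactions = Dict[int, Set[str]]


def mine(transactions: Transactions, min_support: int) -> Set[FrozenSet[str]]:
    """Depth-first stand-in miner: all product sets with at least min_support."""
    products = sorted({p for items in transactions.values() for p in items})
    found: Set[FrozenSet[str]] = set()

    def augment(assn, support, start):
        for j in range(start, len(products)):
            narrowed = {t for t in support if products[j] in transactions[t]}
            if len(narrowed) >= min_support:
                found.add(frozenset(assn + (products[j],)))
                augment(assn + (products[j],), narrowed, j + 1)

    augment((), set(transactions), 0)
    return found


def mine_partitioned(transactions: Transactions, s: int, n: int) -> Set[FrozenSet[str]]:
    tids = sorted(transactions)
    # Each transaction lands in exactly one partition.
    parts = [{t: transactions[t] for t in tids[i::n]} for i in range(n)]
    local_min = max(1, math.ceil(s / n))  # smallest integer that is at least s/N
    candidates: Set[FrozenSet[str]] = set()
    for part in parts:
        candidates |= mine(part, local_min)
    # Verify every candidate against the entire set of transactions.
    return {
        c for c in candidates
        if sum(1 for items in transactions.values() if c.issubset(items)) >= s
    }


tx = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "b", "c"}, 4: {"b", "c"},
      5: {"a", "c"}, 6: {"b", "c", "d"}, 7: {"a", "b", "c", "d"}, 8: {"d"}}
result = mine_partitioned(tx, s=4, n=2)
print(sorted(sorted(c) for c in result))
```

  • In this data, product d reaches the local minimum support in one partition and so becomes a candidate, but it occurs in only three transactions overall and is rejected by the final count.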
  • While partitioning allows processing of data sets too large to fit in main memory, there is some associated processing overhead.
  • the runtime overhead in partitioning the set of transactions is relatively minor.
  • the Dervish system partitions the set of transactions and processes each partition in turn, yielding a
  • the Dervish system then loads each partition in memory again and counts the number of transactions that contain each candidate association.
  • a Parallel Processing System From the previous out-of-memory version, one can distill an implementation of the Dervish system that will run on a multiprocessor machine.
  • the Dervish system partitions the set of transactions into N sets, as before, where N is at most the number of processors.
  • the Dervish system assigns each partition to a processor and processes each partition, using minimum support s/N, to yield a number of candidate associations.
  • the Dervish system counts, for each candidate association, in how many transactions all products in the candidate association occur. This, too, can be done in parallel, except for the accumulation of every count for each partition. In contrast to the out-of-memory version, it is not necessary to load each partition into memory twice. Because of the overhead for counting candidate associations and possibly creating candidate associations that are not associations, the parallel implementation of the Dervish system will not run N times as fast as the sequential implementation. But for large data sets, there is a significant speedup in executing the system in parallel. Note that the parallel implementation of the Dervish system also works if the memory requirement is larger than the total main memory available on the parallel machine. In that case, the Dervish system can use the out-of-memory implementation on each processor. In effect, the Dervish system can cascade an out-of-memory partition on top of a parallel partition or cascade a parallel partition on top of an out-of-memory partition.
  • a parallel implementation of the Dervish system need not run on a single multiprocessor machine. For example, it is possible to do parallel processing on a network of single processor or multiprocessor machines. Clearly, any interconnection of computers can be used for a parallel implementation of the Dervish system.
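  • The parallel counting phase can be sketched with Python's standard concurrent.futures module. This is an illustration only: a thread pool stands in for the separate processors or machines, and the function names and data are hypothetical. Per-partition counts are computed concurrently and then accumulated sequentially.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import FrozenSet, List, Set


def count_in_partition(
    partition: List[Set[str]], candidates: Set[FrozenSet[str]]
) -> Counter:
    """Count, within one partition, the transactions containing each candidate."""
    return Counter({
        c: sum(1 for items in partition if c.issubset(items)) for c in candidates
    })


def parallel_support(
    partitions: List[List[Set[str]]],
    candidates: Set[FrozenSet[str]],
    min_support: int,
) -> Set[FrozenSet[str]]:
    # One worker per partition; each worker counts independently.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        per_partition = list(
            pool.map(lambda part: count_in_partition(part, candidates), partitions)
        )
    # The accumulation of the per-partition counts is the sequential step.
    totals: Counter = Counter()
    for counts in per_partition:
        totals.update(counts)
    return {c for c in candidates if totals[c] >= min_support}


partitions = [
    [{"a", "b"}, {"a", "b", "c"}],
    [{"a", "b"}, {"b", "c"}],
]
candidates = {frozenset({"a", "b"}), frozenset({"b", "c"})}
associations = parallel_support(partitions, candidates, min_support=3)
print(associations)
```

  • Replacing the thread pool with a process pool or with remote workers would not change the structure: only the final summation needs all per-partition counts in one place.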
  • the Dervish system takes a product-based approach rather than a transaction- based approach and uses far less memory than conventional systems.
  • the Dervish system is therefore capable of solving much larger data sets in less time.
  • should the Dervish system run out of memory, there is an efficient method to partition the problem into sub-problems. While this increases the run-time, running out of memory does not cause an excessive increase in run-time, as it does with the traditional systems.
  • there is a natural division of the associations discovery problem that allows the Dervish system to run on any number of parallel processors.
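By way of illustration only, the partitioned scheme described above (mine each of N partitions with minimum support s/N, then count every candidate association over all transactions) can be sketched in Python as follows. The function names `local_associations` and `partitioned_discovery` are hypothetical, and the per-partition miner is a deliberately simple brute-force stand-in, not the Dervish procedure itself.

```python
from itertools import combinations

def local_associations(part, threshold):
    """Brute-force miner for one partition (illustrative stand-in only)."""
    products = sorted(set().union(*part))
    found = set()
    for size in range(1, len(products) + 1):
        level = {frozenset(c) for c in combinations(products, size)
                 if sum(1 for t in part if set(c) <= t) >= threshold}
        if not level:
            break  # no association of this size, so none of any larger size
        found |= level
    return found

def partitioned_discovery(transactions, s, n):
    """Two-pass discovery over n partitions with global minimum support s."""
    parts = [transactions[i::n] for i in range(n)]
    candidates = set()
    # Pass 1: each partition is mined independently with support s/n;
    # on a multiprocessor machine each call could run on its own processor.
    for part in parts:
        candidates |= local_associations(part, s / n)
    # Pass 2: count each candidate over all transactions and keep the
    # candidates that reach the global minimum support.
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= s}
```

The sketch relies on the observation that an association supported by at least s transactions overall must be supported by at least s/N transactions in at least one of the N partitions, so no association is lost in the first pass.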

Abstract

The technology for discovering associations among products in a database of transaction data defines an association as a set of products that occur together in at least a predetermined number of transactions (115), referred to as the minimum support for the association. The technology starts with an association of products and determines a set of transactions (115) supporting that association. To augment the association, the technology selects from the supporting set of transactions a next set of transactions having minimum support for the original association plus an additional product. The technology can also compute metrics (125) for an association including the average size and the average profit/loss for a transaction containing the association.

Description

METHOD AND APPARATUS FOR ASSOCIATIONS DISCOVERY
REFERENCES CITED
1. U.S. Patents
5,615,341 3/1997 Agrawal et al. 395/210
2. Other Publications
Agrawal, R., Imielinski, T., and Swami, A., "Mining Association Rules between Sets of Items in Large Databases," 1993, Proceedings of the 1993 ACM SIGMOD Conference, pp. 207 - 216.
Agrawal, R., and Srikant, R., "Fast Algorithms for Mining Association Rules," 1994, Proceedings of the 20th International Conference on Very Large Databases, pp. 487 - 499.
Agrawal, R., and Shim, K., "Developing Tightly-Coupled Data Mining Applications on a Relational Database System," 1996, Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and Data Mining, pp. 287 - 290.
TECHNICAL FIELD
The present invention relates generally to processing data in databases and, more particularly, to computing affinity among products in a transaction database.
BACKGROUND OF THE INVENTION
One component of market research is the analysis of purchase patterns to detect combinations of goods and/or services that are often purchased together. The term "product" may refer to goods, services, or other items included in a purchase transaction. To facilitate purchase pattern analysis, computer programs are frequently used to process data representing purchase transactions. These data might be stored in a database as sets of transactions (e.g., purchases at a supermarket), where each transaction contains a number of products. It is of interest to market analysts when a collection of products is present together in a large number of transactions. If these products are together in at least a pre-defined number of transactions, the collection of products is referred to as an "association" and the pre-defined number is referred to as the "minimum support" for the association. A "maximal association" is an association that is not contained in another association. Accordingly, the "associations discovery" problem is to find all associations in a set of transactions having a given minimum support. The brute-force approach to finding all associations is transaction-based: take two products, and go through all transactions to count how many transactions contain these two products. Repeating this count for every pair of products yields the 2-product associations. Then continue with larger numbers of products, determining 3-product, 4-product, etc. associations, until there is no larger association left. A computer program implementing this simple approach may require a lot of memory, especially if the minimum support is low, and there are many collections of products for which there is at least minimum support. In fact, it is difficult to predict how much memory will be needed for a given problem. In addition, when this approach is implemented on a fixed-memory (e.g., standard microprocessor-based) computer, once the program runs out of memory, processing slows down considerably.
1. The IBM Approach to the Associations Discovery Problem
Traditional implementations of solutions to the associations discovery problem are based on algorithms primarily developed at IBM. These algorithms are based on an algorithm developed at the IBM Almaden Research Center, by Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Their paper, "Mining Association Rules between Sets of Items in Large Databases," was published in the Proceedings of the 1993 ACM SIGMOD Conference in Washington, DC. While implementations of IBM-based algorithms improve upon the brute force transaction-based approach sketched above, they nonetheless still suffer from shortcomings.
To understand these shortcomings, consider the following explanation of the initial, and simplest, implementation of an IBM system known as the AIS algorithm ("AIS"). AIS is an iterative approach that takes a set of associations, each with k products, and attempts to extend each of those associations with another product, and then checks if the resulting candidate association has minimum support. A "candidate association" is a set of products that may or may not be an association.
In this explanation of AIS, L(k) is the set of all associations with exactly k products and C(k) is a set of candidate associations with exactly k products. Pseudocode for an implementation of AIS is presented in Table 1, which is followed by a discussion of the code.
TABLE 1. Pseudo-code for an implementation of the IBM system
101 compute L(1); k = 2
102 while (L(k-1) not empty set) {
103   C(k) = empty set
104   foreach (transaction t) {
105     L(t) = subset(L(k-1), t) // associations contained in t
106     foreach (l in L(t)) {
107       C(t) = all 1-product extensions of l in t // candidates contained in t
108       foreach (c in C(t)) {
109         if (c in C(k)) {
110           add 1 to count of c in its entry in C(k)
111         } else {
112           add c to C(k) with count of 1
113         }
114       }
115     }
116   }
117   L(k) = {c in C(k) | c.count >= min_support}; k = k + 1
118 }
119 result: union of all sets L(k)
AIS starts out with all associations that contain exactly one product, i.e., L(1) (line 101). Identifying these associations requires a pass over all transactions, counting the number of occurrences of each product. In the iterative loop (lines 102 through 118), AIS processes a set of associations, each of which has k-1 products. AIS examines each transaction for a possible extension of each association (lines 104 through 116). Specifically, AIS first checks whether an association has all its products in the current transaction (line 105); then the system tries to extend the association by adding products from the current transaction that are not already in the association (line 107); then it adds the resulting candidate associations to C(k), and then it counts how many transactions contain that candidate association (lines 109 through 113). Finally, AIS weeds out those candidate associations that do not have at least minimum support (line 117). Implementations of algorithms based on AIS have the same structure as just discussed. In other words, these implementations attempt to extend the associations already identified by adding one product to each, and then looping through all transactions to count the support for those candidate associations. Therefore, the following discussion of disadvantages inherent in implementation of AIS and other IBM-based algorithms applies equally to these implementations.
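By way of illustration only, the level-wise structure of Table 1 can be sketched in Python as follows. The function name `ais` and the representation of each transaction as a set of product labels are assumptions made for this sketch. One deliberate departure from Table 1 is noted in the comments: the candidates contained in a transaction are collected in a set, so that a candidate reachable from several (k-1)-product associations is counted once per transaction.

```python
def ais(transactions, min_support):
    """Level-wise associations discovery in the style of Table 1 (sketch)."""
    # L(1): count every product in one pass over the transactions.
    counts = {}
    for t in transactions:
        for p in t:
            key = frozenset([p])
            counts[key] = counts.get(key, 0) + 1
    level = {c for c, n in counts.items() if n >= min_support}
    result = set(level)

    while level:  # level plays the role of L(k-1)
        candidates = {}  # C(k): candidate -> number of supporting transactions
        for t in transactions:
            # Candidates contained in t; a set, so each candidate is counted
            # once per transaction (assumption of this sketch).
            in_t = set()
            for assoc in level:
                if assoc <= t:
                    for p in t - assoc:
                        in_t.add(assoc | {p})
            for c in in_t:
                candidates[c] = candidates.get(c, 0) + 1
        # Weed out candidates lacking minimum support to obtain L(k).
        level = {c for c, n in candidates.items() if n >= min_support}
        result |= level
    return result
```

Note that the dictionary `candidates` corresponds to the hash table for C(k), whose size depends on the data rather than on the number of products or transactions.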
AIS, as presented above, is a simple implementation of an associations discovery problem solution. With this approach, the number of candidate associations that are considered is typically large and includes many more candidate associations than actual associations. Because processing time and memory requirements correlate to the number of candidate associations, extensions of this basic IBM algorithm concentrate on how to reduce the number of candidate associations, and on how to count the support for a candidate association efficiently. The number of candidate associations is usually reduced by observing that if there is a subset of a candidate association that is not an association, the candidate association is not an association, either. Intelligent hashing schemes are used to count the number of transactions that form the support for a candidate association.
This hashing is the Achilles' heel of AIS because set C(k) can get quite large. In order to achieve reasonable performance, the check to determine whether a candidate association is counted in C(k) has to be fast (line 109). The standard method to speed up the check is to create a hash table for C(k). However, the size of C(k) can be a problem; if there are N products that have at least minimum support, then C(2) can contain up to N*(N-1)/2 candidate associations. If many of these candidate associations are associations, the number of candidate associations at the next level is even larger. As a result, the hash table can take up a significant portion of main memory, and if the hash table cannot fit in main memory, performance of the system degrades significantly. Further, whether or not the hash table will fit in memory cannot be predicted, which means that an arbitrary number of products cannot be tracked efficiently.
Other shortcomings are inherent in implementations of IBM-based algorithms. First, such algorithms require N passes through the transaction data, where N is the size of the largest association. Because N is not known in advance, performance is not predictable. Second, IBM-based algorithms do not maintain a list of transactions that support an association. Consequently, to accumulate metrics (e.g., total transaction cost or gross profit) for an association, it is necessary to loop once more through all transactions, which degrades performance. Third, IBM-based algorithms only produce maximal associations as the final step, rather than on-the-fly. This means that if parallel processing is used, the benefits are necessarily reduced. For a parallel implementation of IBM-based algorithms, distributed over a number of processors, the hash table has to be duplicated for each of those processors, increasing the already large memory requirements. Accordingly, there is a need for a solution to the associations discovery problem that offers advantages over traditional implementations.
SUMMARY OF THE INVENTION
The invention is a software application designed to solve the associations discovery problem. An exemplary embodiment of the invention is referred to herein as the Dervish system. The Dervish system takes an orthogonal approach to that used by traditional implementations. Specifically, the Dervish system approach is product-based rather than transaction-based. For each product, the Dervish system stores the identities of the transactions that contain that product. This takes a fixed amount of memory that is not extended subsequently. The system keeps track of the transactions that have products in common by judiciously swapping the order of the transactions for a given product, and by keeping pointers to identify sets of transactions.
In the Dervish system, the set of products and the set of transactions can both be ordered. This ordering can be chosen by the user. Ordering the set of products and the set of transactions results in a unique index for each product and a unique index for each transaction. The main task for the Dervish system is to augment an existing association with a new product, to form a new association. This new product generally has an index that is higher than that of any of the products already in the existing association. The Dervish system does not directly store the identities of the products that constitute the current association; rather, the products in the current association can be inferred from the pointers that identify the set of transactions that form the support for the current association.
It is assumed that the transaction and product information is stored in a database. If all data structures that are needed to execute the Dervish system fit in the main memory of the computer on which the Dervish system is running, then the data only need to be read once. Should the main memory not be sufficient to store all the necessary data structures, then there is an embodiment of the Dervish system that segments the associations discovery problem into parts, each of which is analyzed, yielding a number of candidate associations. In this segmented embodiment, there is a second pass through the data to check which of these candidate associations are actual associations. In yet another embodiment of the Dervish system, the problem is also split into parts, where each part is processed in parallel. This processing yields candidate associations and, after a parallel check, a number of actual associations is identified. In any of these embodiments, or combinations thereof, the data are read from the database (or a copy thereof stored on local disk) at most twice.
The major advantages of the Dervish system over those implementing traditional transaction-based approaches are obvious. First, the amount of memory the Dervish system uses is related to the number of products and the number of transactions, which is known in advance. The amount of memory used by traditional systems is based on the number of associations in the set of transactions, which is not known in advance.
Consequently, the performance of the Dervish system is predictable and does not degrade when the number of candidate associations is large. Second, because the Dervish system does not use a hash table to count the number of transactions that support a candidate association, whether or not the hash table will fit in memory is not an issue, and the Dervish system can track an arbitrary number of products. Third, because the Dervish system maintains a list of transactions that support an association, metrics for an association (e.g., total transaction cost or gross profit) can be accumulated without reprocessing all the transactions. Last, the Dervish system produces maximal associations on-the-fly, thereby increasing the efficiencies gained when parallel processing is used.
BRIEF DESCRIPTION OF THE DRAWINGS
Figures 1A and 1B respectively show a pictorial and corresponding flow diagram for an exemplary embodiment of the present invention, as applied to retail transactions at a supermarket.
Figure 2 shows a flow diagram for the major procedures that implement an embodiment of the Dervish 200 system.
Figure 3 shows a flow diagram for an implementation of the Metrics 300 procedure of the Dervish 200 system.
Figure 4 shows a flow diagram for an implementation of the Whirl 400 procedure of the Dervish 200 system.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENT(S)
The Dervish system provides a solution to the associations discovery problem. To present a complete explanation of the Dervish system, first the computing environment in which the system runs is discussed. Next an exemplary application of the Dervish system and other possible applications are discussed. Then the system terminology and theory are discussed. Lastly, a detailed description of an embodiment of the Dervish system is given.
In the following descriptions and pseudo-code, specific steps, procedures, and other specifics are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that the present invention may be practiced without the specific details.
1. System Implementation
The invention may be practiced in the context of an operating system resident on a general purpose computer, for example, the mainframe computers and/or microprocessor-based workstations produced by Digital Equipment Corporation, IBM, Sun Microsystems, and Hewlett-Packard. The computer has resident thereon an operating system, for example, Windows/95 or Windows/NT from Microsoft, or any variant of the Unix operating system. Database storage and access may be provided using a standard relational database system, for example, as provided by Oracle, Sybase, Informix, or IBM, or using a flat file database. Alternatively, data access can be provided by a cash register data storage system. The invention may be implemented in an object-oriented programming language such as C++. More generally, it will be apparent to those skilled in the art that the present invention can be implemented in any high-level programming language, either compiled or interpreted, on any appropriate computer having an operating system and a database-access providing mechanism. As those skilled in the art will appreciate, the functionality of the invention described herein could also be implemented using any kind of computing apparatus, system, or technology, including hardware or with a combination of hardware and software. Such hardware includes a general purpose processor, a micro-processor, a program logic array, an application-specific integrated circuit, and any other devices having sufficient processing capability to perform the functionality. In addition, the invention could be practiced in a distributed computing environment, including the Internet.
2. An Exemplary Application of the Invention for Transaction Processing
Figures 1A and 1B show an exemplary application of the Dervish system. In this example, the management of a supermarket chain wants to find out if lowering the price of carbonated beverages ("sodas") to below cost entices customers to buy more products than they would otherwise, or if the average customer comes into the store only to buy these sodas, at a loss to the store. The latter is called cherry-picking, and is a behavior the supermarket chain wants to avoid.
Referring to Figures 1A or 1B, the supermarket chain records the products in each transaction using a bar-code scanner (step 110) and stores the data in a database (step 115). Also stored in the database are the total cost of the transaction and the gross margin of the transaction. The input to the Dervish system could therefore be a database that contains the transaction data ordered by store, and by week of the year. Now the supermarket management runs Dervish on a computer (step 120) to generate a set of associations with associated metrics. This output is stored in another database for further analysis (step 125). In this example, the associations contain sets of products that are frequently bought together, for each store, and for each week of the year. The metrics include, for each such association, the average size of a transaction and the average profit (or loss) for a transaction that contains that association.
Next, an analyst organizes the associations and their metrics, by means of a visualization tool, e.g., a pie chart or graph (step 130). From this analysis, the analyst generates a report for each store in the chain (step 135). Based on the findings in the report, the store can take action, such as changing the prices of certain items or changing the layout of items in the store (step 140). Nominally, this is where the process ends. It is, of course, possible to repeat this process, based on the new circumstances in the store.
In the case of the example of cherry-picking behavior of sodas, the analysis of the output of the Dervish system might reveal that for the occasional store customer, the average transaction had a moderate profit, and little cherry-picking occurred. For the customers enrolled in a frequent-buyer program, however, Dervish might identify specific stores in specific regions of the country where many frequent buyers bought large quantities of sodas without purchasing many other products; in other words, these customers were cherry-picking sodas.
3. Other Applications of the Invention
The Dervish system for associations discovery can be applied in any number of fields to analyze the concurrent or sequential existence of items or events, i.e., to discover associations. Without limitation, examples of such applications follow:
• Businesses can use associations discovery to find patterns of events that occur often. For example, in banking, patterns of opening and closing accounts can be determined. At an even finer level of detail, account transactions information, such as deposit and withdrawal patterns, can be determined.
• Retailers can determine ordering patterns of large ticket items; for example, an appliance purchase sequence might be a refrigerator, followed by a washer/dryer, followed by a dishwasher.
• Sales forces, in any line of commerce, can use the associations that the Dervish system identifies to select a product for marketing purposes. For example, if there is an association that contains products a, b, and c, then salespeople can scan the transaction data to find all of the people who have purchased products a and b, and market product c to them.
• Marketing departments can use associations based on aggregated purchase data to create behavioral profiles for households. For example, if the transactions/purchases have a customer or household ID, one can take the union of all products purchased in a time period (i.e., multiple market baskets), such as seasonally or yearly, and create associations across the household. In this case one may also want to have a minimum threshold for the associations and/or products based on volume or dollars. This can be done on panel data, such as Nielsen, or on purchase data from a retailer. These behavioral imprints can be used to design marketing programs, or to direct a product offer to particular households.
• Telecommunications companies can use associations to find patterns of calling/called city pairs, which can be used for designing calling programs, detecting fraud, and doing capacity planning. Patterns of routing can also be detected between the switching points in the network by constraining the associations to be ordered.
• Insurance companies are interested in the simultaneous occurrences of multiple damages, such as fire and flood, or fire and earthquake. These companies can use the Dervish system on their claims database to find out if the severity, duration, and claim amounts vary among these different types of disasters.
• Insurance companies in diverse insurance market segments are interested in detecting fraud. For instance, medical insurance companies can use the Dervish system to detect when unnecessary procedures (e.g., tests, treatments) are routinely added to a common occurrence or diagnosis. The associations would consist of heterogeneous sets that include the diagnosis, along with the treatments. A high occurrence of unnecessary procedures for a particular diagnosis indicates widespread fraud. For example, a broken arm would typically call for an x-ray and a cast. However, unnecessary procedures (e.g., blood typing) might also have been performed or billed. Likewise, automobile insurance companies can use the Dervish system to detect when a ring of people, acting in different roles, file false claims. For example, in one claim, John Doe is the patient, Jane Smith is the doctor, and Tom Jones is a witness. In another, Jane Smith is the patient, Tom Jones the doctor, and John Doe the witness. By finding sets of names of people that occur often somewhere in a claim, these rings of fraudulent claimants can be detected.
• Physicians can track the symptoms of a set of patients. The Dervish system can be used on a patient database to find out if certain symptoms occur together frequently. For example, researchers can use the Dervish system to discover if multiple minor symptoms occurring together often lead to major surgery.
• Psychologists can use the Dervish system to discover in their patient database what multiple neuroses occur more frequently than others and how the severity and duration vary.
• In psychometrics, one can tabulate, and store in a database, the errors that are made on standard tests. Researchers can use the Dervish system on this database to detect patterns of errors that occur frequently, which may in turn indicate learning disabilities.
• Behavioral investigators can use the Dervish system to discover if certain behaviors often go together. Using a database cataloguing behaviors, the Dervish system can be used to discover clustering of behaviors.
• Sociologists are interested in discovering patterns among the population. They can use the Dervish system on a population database to identify, for example, attitudes, interests, backgrounds, and family structure that occur together frequently.
• Geneticists want to discover if certain traits appear in an organism more frequently than can be expected with random distributions. The traits are stored in a database, from which the Dervish system can discover clustered traits. Using this information, geneticists can then possibly link these sets of traits to genetic information, such as is stored in sequences of DNA, or to mutation characteristics.
• Demographers classify populations according to gender, race, age and income. Demographers can use the Dervish system to discover how these variables are clustered in order to assess implications for zoning decisions, school districts, etc.
• City planners can use the Dervish system on a database with historical information on maintenance tasks to find out if certain types of city maintenance tasks occur together more frequently than others. For instance, minor water main leakage and major sewer main repairs, or power failure and graffiti damage. Quickly identifying certain sets of multiple maintenance tasks occurring in short order may lead to a new requirement for preventative maintenance in order to prevent additional tasks from occurring.
• In chemistry and pharmacology, multiple compounds may interact in unexpected ways. Researchers can use the Dervish system on a compound database to discover what compounds react and to identify associated reactions.
• Electronic commerce (e-commerce) applications on the World Wide Web (WWW) can use the Dervish system in real-time environments. Using historical purchase behaviors of customers, such applications can determine which products are often purchased together, and then use these sets of products to select marketing messages, offers, or suggestions. Currently, these messages, offers, and suggestions are often implemented using collaborative filtering technology, as is seen on sites such as Amazon (http://www.amazon.com). E-commerce applications can also be used with other interactive technologies, such as inbound and outbound telemarketing and voice navigation systems for call centers.
• Path-tracking applications can use the Dervish system to detect common navigation paths through WWW web sites. In this case, the Dervish system is supplied with the tracking information for each individual user, consisting of a sequence of web pages. The Dervish system creates associations from which it can be determined, based on the current web page the user is reading, what the next page or pages are likely to be. Based on this information, such path-tracking applications can present custom web pages to the user. Path-tracking applications can also be used with other interactive technologies, such as inbound and outbound telemarketing and voice navigation systems for call centers.
4. High-Level System Description and Terminology
4.1. Terminology
The foregoing examples illustrate that the system described herein may be used to determine associations among virtually any group of definable quantities. These can be products, types of services, types of usage patterns, effects, or virtually any other quantity or occurrence that can be categorized or defined. As a matter of convenience, the term "product" will be used throughout to refer to any such quantities, and the term "transaction" to refer to a combination of products occurring together. Such a combination of products could include all possible products or various sub-combinations thereof. Herein, T refers to the set of all transactions. This is the set that is stored in the database, from which the associations have to be extracted. A linear order is assigned to the transactions, that is,
(1) $T = \{ t_i \mid 0 \le i < M \}$
Herein, P refers to the set of all products. A linear order is assigned to the products, that is,
(2) $P = \{ p_j \mid 0 \le j < N \}$
A transaction t, which is a set of products, is therefore a subset of P. Also, multiple transactions can have the same set of products.
Before running the Dervish system, the user sets the "minimum support" to define the minimum number of transactions in which a collection of products must be present to form an association. Embodiments of the Dervish system can obtain the minimum support (or, more generally, a numerical "transaction threshold") from user input. As those skilled in the art will appreciate, other methods could be used to obtain the minimum support, including reading a value from a file or hardcoding a value in the system itself. As will be apparent to one of ordinary skill in the art, obtaining a numerical transaction threshold is an example of a step that could be implemented by an input module as embodied in hardware and/or software. Minimum support is referred to by a positive integer, s. Herein, Q is a set of products, a subset of P, and Supt(Q) is the set of transactions, such that each transaction in the set contains all products in Q. That is,
(3) $Supt(Q) = \{ t \in T \mid q \in Q \Rightarrow q \in t \}$
If Supt(Q) has at least s elements, then Q is an association with minimum support s (or an association, for short), and Supt(Q) is the support for the association. Identifying a set of transactions supporting an association is an example of a step that could be implemented by an identification module as embodied in hardware and/or software, as will be apparent to one of ordinary skill in the art.
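By way of illustration only, relation (3) and the test for an association can be sketched in Python as follows. The function names `supt` and `is_association` are hypothetical, and each transaction is assumed to be represented as a set of product labels held in an ordered list.

```python
def supt(transactions, q):
    """Return the indices of the transactions that contain every product in q."""
    q = set(q)
    return [i for i, t in enumerate(transactions) if q <= set(t)]

def is_association(transactions, q, s):
    """True when q is supported by at least s transactions."""
    return len(supt(transactions, q)) >= s
```

For instance, with the three transactions {soda, chips}, {soda}, and {soda, chips, salsa}, the set {soda, chips} is supported by the first and third transactions and is therefore an association for s = 2 but not for s = 3.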
Note that each subset of an association is also an association with the same minimum support. A maximal association is an association that is not contained in any other association with the same minimum support. The associations discovery problem is, given a positive integer s, to find all associations in T with minimum support s. Given a minimum support s, the Dervish system finds all associations with minimum support s in the list of transactions. Each association the Dervish system finds is unique.
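By way of illustration only, the definition of a maximal association can be sketched in Python as a filter over a collection of associations that has already been found. The function name `maximal` is hypothetical, and the quadratic comparison is chosen for clarity rather than speed; it does not reflect how the Dervish system produces maximal associations on-the-fly.

```python
def maximal(associations):
    """Keep only the associations not strictly contained in another one."""
    assocs = [frozenset(a) for a in associations]
    # a < b is the strict-subset test on sets.
    return {a for a in assocs if not any(a < b for b in assocs)}
```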
For each transaction t there is an associated metric, denoted C(t) as a matter of convenience because cost is often (but not necessarily) the quantity being measured. For each association Q, the cost C(Q) of the association could be computed as:
(4) $C(Q) = \operatorname{Accum}_{t \in Supt(Q)} C(t)$
More generally, a metric is a measure associated with each association, and is computed by accumulating a measure of each transaction supporting the association. Still more generally, some metrics might not be accumulated over all the products in each transaction, but only over those associated products in each transaction. In any case, a metric related to a transaction may be referred to as a transaction metric. A metric can be a scalar or an array. For instance, total transaction cost could be tracked by itself or with gross profit per transaction at the same time. In addition, the metric computation need not be a simple accumulation. The metric computation could be done, for example, by any scalar or vector operation.
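By way of illustration only, relation (4) can be sketched in Python as follows. The function name `accumulate_metric` and the parallel list `metrics` (holding one value C(t) per transaction) are assumptions of this sketch; the `accum` parameter stands in for the accumulation operation, which need not be a simple sum.

```python
def accumulate_metric(transactions, metrics, q, accum=sum):
    """Accumulate the per-transaction metric over the support of q."""
    q = set(q)
    # Only transactions containing every product in q contribute.
    return accum(m for t, m in zip(transactions, metrics) if q <= set(t))
```

Passing a different callable for `accum`, such as `max`, yields a different metric over the same supporting transactions.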
4.2. An Implementation of the Dervish System
The following is a high-level description of one implementation of the Dervish system, consisting of an initial phase followed by a recursive phase. That is, at the end of the recursive phase, the Dervish system returns to the start of the recursive phase. In the initial phase of this implementation, the associations that consist of a single product are computed as well as their support. In the recursive phase, these supports are combined to compute supports for candidate associations, where each candidate association consists of more than one product.
The following implementation of the Dervish system is one in which the transactions that form the support for each association are computed by means of set intersections.
-15- 4.3. Initial Phase of the System
For each product $p_j$ in P, the Dervish system finds the set of all transactions that contain this product. That is, the Dervish system computes, for all j < N,
(5) $\mathrm{Supt}(\{p_j\}) = \{\, t_i \mid p_j \in t_i \,\}$
The Dervish system selects those products that occur in at least s transactions, and thus form a 1-product association, that is
(6) $p_j$ is selected iff $|\mathrm{Supt}(\{p_j\})| \geq s$
In general, each subset of an association is also an association with the same minimum support. Further, 1-product associations are the smallest possible subsets of all possible associations. Therefore, a product that is not a 1-product association cannot be part of any other associations.
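A minimal sketch of the initial phase, assuming transactions are given as a mapping from a transaction id to its products. The data reproduce the eight-transaction example of Section 6; the function names are hypothetical.

```python
from collections import defaultdict

def single_product_supports(transactions):
    """Relation (5): map each product to the set of transaction ids containing it."""
    supt = defaultdict(set)
    for tid, products in transactions.items():
        for p in products:
            supt[p].add(tid)
    return dict(supt)

def select_products(transactions, s):
    """Relation (6): keep only products whose support has at least s transactions."""
    supt = single_product_supports(transactions)
    return {p: tids for p, tids in supt.items() if len(tids) >= s}

# The example of Section 6: each string lists the products of one transaction.
transactions = {
    1: "abcd", 2: "bcdf", 3: "def", 4: "abce",
    5: "abde", 6: "bcdef", 7: "abcf", 8: "df",
}
selected = select_products(transactions, s=3)
```

With a minimum support of three every product is selected; raising the threshold to five would drop products a and e, and by the subset property no association containing a or e would need to be examined.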
4.4. Recursive Phase
Let k be the smallest index such that $\mathrm{Supt}(\{p_k\})$ is the support for an association. The starting point is a sequence of sets:
(7) $\mathrm{Supt}(\{p_t\})$ where $t > k$
These sets have been computed in the initial phase and represent those members of ordered set P (see relation 2) that exhibit at least the minimum support s.
For all such products $p_t$ with $t > k$, the Dervish system computes $\mathrm{Supt}(\{p_k, p_t\})$:
(8) $\mathrm{Supt}(\{p_k, p_t\}) = \{\, t_i \mid p_k \in t_i \text{ and } p_t \in t_i \,\}$
Note that $\mathrm{Supt}(\{p_k, p_t\})$ is a subset of $\mathrm{Supt}(\{p_t\})$, since
(9) $\mathrm{Supt}(\{p_k, p_t\}) = \mathrm{Supt}(\{p_k\}) \cap \mathrm{Supt}(\{p_t\})$
Determining a subset of transactions supporting an association (e.g., $\mathrm{Supt}(\{p_k, p_t\})$) from within a set of transactions supporting a smaller association (e.g., $\mathrm{Supt}(\{p_t\})$) is an example of a step that could be implemented by control logic as embodied in hardware and/or software, as will be apparent to one of ordinary skill in the art.
If there is no index t such that $\mathrm{Supt}(\{p_k, p_t\})$ forms the support for an association, then there is no association containing product $p_k$ other than $\{p_k\}$. In that case, the Dervish system continues with the next smallest index, k' say, such that $\mathrm{Supt}(\{p_{k'}\})$ is the support for an association and attempts to augment association $\{p_{k'}\}$ with other products.
If, on the other hand, the Dervish system finds that $\mathrm{Supt}(\{p_k, p_v\})$ is the support for an association, for some index v, then there are at least s transactions with both product $p_k$ and $p_v$. The Dervish system has now also generated a sequence of sets
(10) $\mathrm{Supt}(\{p_k, p_m\})$ for all $m \geq v$
representing all 2-product associations involving product $p_k$.
At this point, the Dervish system continues at the beginning of the recursion. In other words, the Dervish system finds the smallest index w (w > v) such that $\mathrm{Supt}(\{p_k, p_v, p_w\})$ forms the support of an association. The Dervish system continues this process of augmenting the size of an association until no more products can be added to the association, that is, until the association is a maximal association. Then the Dervish system attempts to augment the remaining associations.
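The recursion described above can be sketched with plain set intersections, as in relation (9). This is a simple depth-first sketch and not the Dervish 200 embodiment of Section 5; the function name `find_associations` and the use of Python sets are assumptions. The data are the example transactions of Section 6.

```python
def find_associations(transactions, products, s):
    """Enumerate all associations with minimum support s.

    `products` is the ordered set P; an association is only ever extended
    with products of a higher index, so each association is found once.
    """
    supt = {p: {t for t, items in transactions.items() if p in items}
            for p in products}
    found = []

    def extend(assoc, support, start):
        for idx in range(start, len(products)):
            # Relation (9): support of the augmented set is an intersection.
            candidate = support & supt[products[idx]]
            if len(candidate) >= s:
                new_assoc = assoc + (products[idx],)
                found.append((new_assoc, candidate))
                extend(new_assoc, candidate, idx + 1)

    extend((), set(transactions), 0)
    return found

transactions = {
    1: "abcd", 2: "bcdf", 3: "def", 4: "abce",
    5: "abde", 6: "bcdef", 7: "abcf", 8: "df",
}
found = find_associations(transactions, "abcdef", s=3)
multi = ["".join(assoc) for assoc, _ in found if len(assoc) > 1]
```

Because a product is only added when its index is larger than every index already in the association, the recursion depth is bounded by the number of products and the search always terminates.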
4.5. Search Strategy
Consider the case where, for a given association Q, the Dervish system finds a product p with which to augment association Q and form association R. Then the Dervish system computes, and keeps track of, all other individual products with which Q can be augmented to form an association, before the Dervish system attempts to augment association R. This search for associations is a combination of breadth-first and depth-first search. This search method is for illustrative purposes only. It will be apparent to one skilled in the art that a breadth-first search or a depth-first search, or any other search strategy, may be employed to compute all associations.
4.6 Termination
Once the Dervish system finds an association, with any number of products, the Dervish system only tries to add products to the association whose index is larger than
that of any of the products already in the association. Hence for each association, the Dervish system finds a maximal association. Since the initial set of associations is finite, the Dervish system always terminates.
5. A Pseudo-Code Embodiment of the Dervish System
This section describes, in more detail (including pseudo-code), an exemplary embodiment of the Dervish system. This will be described with respect to Figure 2 below, as Dervish 200.
5.1. Data Representation
In Dervish 200, the set of transactions (T) is stored in a transaction database. The transaction database is represented as a matrix. Each column of the matrix represents a product, and contains an array of all transactions that contain that product. Each transaction is initially represented by a row in the matrix. One of the operations performed by Dervish 200 is to swap entries within a product column. This means that a single row then no longer represents a transaction; instead, each transaction is represented as a linked list, linking the columns containing all the products in the transaction.
In Dervish 200, the support for each candidate association (i.e., the set of transactions) is preferably represented with a single index into an array; in such a case, the memory use is related to the number of transactions and products.
5.2. Recursion vs. Iteration
The high-level embodiment of the Dervish system described above was detailed as a recursive procedure, using tail recursion. However, those skilled in the art understand that it is possible to replace any recursive process with an iterative process and vice versa. In the Dervish 200 embodiment described here, that tail recursion is replaced with iteration both to illustrate an alternative implementation of one aspect of the system and because iteration is often computationally more efficient than recursion. Therefore, the Dervish system described here consists of two nested loops for an iterative process in which an induction stack is represented by the indices of support for candidate associations.
5.3. The Dervish System
In this exemplary embodiment of the Dervish system, three interacting procedures are used. The first procedure is the top-level procedure, Dervish 200, which searches for all associations. When Dervish 200 finds an association, Dervish 200 invokes the second procedure, Metrics 300, which returns a list of products in the association and computes the associated metrics. Of course, metrics computation is not necessary for all applications, and may be included or omitted as appropriate. Finally Dervish 200 invokes the third procedure, Whirl 400, which searches for new candidate associations by attempting to augment the association just computed with any of the remaining products. These candidate associations are then examined by Dervish 200, and the process is repeated until there are no more candidate associations.
5.4. Data Structures and Initialization
There are two main data structures in this embodiment of the Dervish system, prod and item. The first structure, item, is a two-dimensional array, whose first dimension is the set of products and whose second dimension is the set of transactions that contain the corresponding product. The second structure, prod, is a one-dimensional array that contains information about each product, including a number of indices into array item, which keep track of candidate associations.
The data for the set of products are stored in prod. Each product $p_j$ has a data structure, prod(j), which includes the variables name and trans. The name variable contains the name of the product.
The trans variable is an array whose size is no larger than the number of products. When prod(j).trans(k) is defined, it is an index into item(j) (e.g., indicates a location within item(j)), so that, for 0 < n < k, item(j)(n) has a set of pointers that represent a transaction containing the current candidate association containing product $p_j$. In this embodiment, such a set of pointers may be referred to as, or constitute, a transaction; a transaction/product combination may be referred to simply as an item. Also, each entry in prod(j).trans(k) may be referred to as an element of the trans variable.
If transaction $t_i$ contains product prod(j), Dervish 200 represents the transaction/product combination in item with two pointers, product and right. The first pointer, product, points to prod(j); the second pointer, right, points to the next item in transaction $t_i$, if any. If there is no next item in transaction $t_i$, the right pointer points to a structure containing the metric for this transaction, $C(t_i)$.
Dervish 200 defines prod(j).trans(0) as the number of transactions that have prod(j), for all j. In other words, prod(j).trans(0) is the size of the support for the candidate association containing only product $p_j$, as well as the number of transactions containing product $p_j$. For each product prod(j), the Dervish system creates item(j), an array of items of size prod(j).trans(0). Dervish 200 then fills the array with the transactions that contain product prod(j). Now item(j)(k) contains the k-th transaction (i.e., the set of pointers representing the transaction) that has product $p_j$. Where min_support is a positive integer denoting the minimum support, if prod(j).trans(0) is smaller than min_support, then product prod(j) does not have to be considered in the attempt to augment the current association; in other words, the memory assigned for this product can be deleted, and reused elsewhere.
After Dervish 200 initializes prod and item, the associations discovery process begins. As those skilled in the art will appreciate, other embodiments of the Dervish system could implement these data structures differently. For example, the trans variables for the products could be stored in a single two-dimensional array, or could be incorporated into the item structure.
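One way to picture the prod and item structures is the following sketch. The class names `Prod`, `Item` and `Metric`, the helper `build`, and the per-transaction costs are assumptions made for illustration; the source only specifies the name and trans variables and the product and right pointers.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Metric:
    value: float                       # C(t) for one transaction

@dataclass
class Item:
    product: int                       # index j of the product this item belongs to
    right: Union["Item", Metric, None] = None  # next item of the transaction, or its metric

@dataclass
class Prod:
    name: str
    trans: List[int] = field(default_factory=list)

def build(names, transactions, costs):
    """Create prod and item; item[j] lists the items of all transactions containing product j."""
    prod = [Prod(n, [0] * len(names)) for n in names]
    item = [[] for _ in names]
    for products, cost in zip(transactions, costs):
        nxt = Metric(cost)
        # Link from the highest product index down, so `right` points to the next product.
        for j in sorted((names.index(p) for p in products), reverse=True):
            node = Item(product=j, right=nxt)
            item[j].append(node)
            nxt = node
    for j, column in enumerate(item):
        prod[j].trans[0] = len(column)  # size of the support of the 1-product candidate
    return prod, item

names = list("abcdef")
transactions = ["abcd", "bcdf", "def", "abce", "abde", "bcdef", "abcf", "df"]
costs = [float(len(t)) for t in transactions]  # made-up metric: transaction size
prod, item = build(names, transactions, costs)
```

Following the right pointers from item[0][0] visits products a, b, c and d of the first transaction and ends at that transaction's metric structure.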
5.5. The Dervish Procedure
The objective of Dervish 200 is to find all prod(j).trans(k) whose value is at least min_support. The process Dervish 200 uses to achieve this objective is described with reference to Figure 2.
Dervish 200 begins by initializing all data structures (as described in the previous section), and setting k to 0, because the initial association has 0 products, that is, no association has been found yet (step 205). Now Dervish 200 starts the main loop (steps 210 - 250). If the current association size is non-negative (step 210), then Dervish 200 scans through all products in the prod structure, from left to right, to see if Dervish 200 can augment the current association. Dervish 200 starts with the first product, by setting index j to 0 (step 215). If index j is smaller than N (where N is the number of products), that is, if j is the index of a product (step 220), then prod(j).trans(k) is the size of the support for the candidate association that consists of the current association augmented with the j-th product. Now if prod(j).trans(k) is at least min_support (step 230), then this candidate association is an actual association. Dervish 200 can store this association and compute the metrics for this association (step 240). Step 240 consists of computing the products in the current association and accumulating metrics. Step 240 is described in the following section. In addition, Dervish 200 sets prod(j).trans(k) to -1, to ensure that this association is found only once.
Once Dervish 200 finds an association, Dervish 200 calls a procedure, such as Whirl 400, which computes a set of candidate associations consisting of the current association, augmented with each of the products with an index larger than j (step 245). An implementation of such a procedure, Whirl 400, is illustrated in Figure 4 and is discussed in detail below.
After Whirl 400 has computed a set of candidate associations, Dervish 200 increments k (step 250), as the size of the current association is now one larger than the size of the previous association. Dervish 200 then returns to the beginning of the main loop (step 210), to attempt to find even larger associations.
If Dervish 200 finds that prod(j).trans(k) is not at least as large as min_support (step 230), Dervish 200 moves on to the next candidate association by incrementing j (step 235) and returning to step 220. If, in step 220, index j does not point to a product, then Dervish 200 cannot find any more candidate associations that form an association. Dervish 200 then sets prod(i).trans(k) to 0 for all i and decrements k (step 225) to reduce the size of the current association. Dervish 200 then returns to the start of the main loop (step 210).
Dervish 200 now repeats the main loop, attempting to increase the size of the new current association. If k is decremented in step 225 to be negative, then there are no more associations to be discovered and Dervish 200 exits the main loop (step 210), and the system terminates. To summarize, in each iteration of the main loop (steps 210 - 250), Dervish 200 starts with an association, and tries to augment the association. If Dervish 200 succeeds, Dervish 200 attempts to augment the association further by adding any of the remaining products. If Dervish 200 does not succeed, Dervish 200 drops the last product that was added to the association and attempts to augment the resulting association with any of the other remaining products.
Thus, Dervish 200 includes the steps of: (a) forming a candidate association by augmenting a previously determined association with an additional product; (b) testing whether a candidate association has minimum support; and (c) if minimum support exists, accepting as a set, the set of transactions forming the minimum support. As will be apparent to one of ordinary skill in the art, these are examples of steps that could be implemented, respectively, by: (a) augmentation logic; (b) testing logic; and (c) acceptance logic, each as embodied in hardware and/or software.
Pseudo-code for Dervish 200 is given in Table 2. This code is equivalent to the flow diagram description in Figure 2.
TABLE 2. Pseudo-code for Dervish 200
Dervish:
    k = 0                           // the current association has 0 products
    while (k >= 0) {
        // as long as there are candidate assocs
        j = 0                       // scan products left to right
        while (j < N && prod(j).trans(k) < min_support) j++

        if (j == N) {
            // there are no current assocs of size k
            prod(i).trans(k) = 0 for all i
            k--                     // try smaller size
        } else {                    // prod(j).trans(k) >= min_support
            // success!
            metrics(j, k)           // compute association and metrics
            prod(j).trans(k) = -1   // mark as done
            whirl(k, j)
            k++                     // try larger association
        }
    }
5.6. The Metrics Procedure
As discussed above, in step 240 Dervish 200 constructs a list of the products in an association and computes the associated metrics. As described herein, the metrics are computed by invoking a separate procedure, Metrics 300. However, those skilled in the art will understand that the steps for these processes can be performed in whole or in part by the Dervish or Whirl procedures, or a combination thereof. In other words, the labels and organization of the various procedures are a matter of convenience only and do not affect the underlying functionality to be performed.
When Dervish 200 reaches step 240, the system is in a state where an association has been found. Since only the support of the association has been stored directly, the system now has to compute the products that comprise the association and the metrics for the association. The following describes how to perform these computations.
In step 240, the first prod(j).trans(k) items of prod(j) are the transactions that support the association. Finding the products that constitute the association is done as follows. The product with the highest index in the association is prod(j). The other products are found by going through the prod(j').trans(k') indices for each k' < k and j' < j. For each k', the highest index j' for which prod(j').trans(k') = -1 is a member of the association. To find the metrics for the association, Dervish 200 takes all these transaction items and follows their right pointer until Dervish 200 reaches a metric structure (that is, a non-item). Then Dervish 200 computes and accumulates the metrics for each transaction. The following describes this process in detail.
Metrics 300 is an implementation of step 240 of Dervish 200 that computes the set of products forming the current association and the metrics associated with this association. Referring now to Figure 3, the following describes an implementation of Metrics 300.
The current association is of size k+1, and the last product to be added to the current association is product prod(j). In the initial step of Metrics 300, the variable that holds the products in the current association, Assn, is set to include product prod(j), variable p is set to j, and variable q is set to k - 1 (step 310).
Steps 315 through 335 find all products in the current association. Recall that only the support for the current association is stored directly, not the association itself. The current association itself has to be computed from its support, as follows. As long as q is non-negative (step 315), there are more products to be found in the current association. In that case, Metrics 300 checks if prod(p).trans(q) = -1 (step 320). If the check fails, Metrics 300 inspects the previous product, by decrementing p (step 325) repeatedly, until step 320 succeeds.
When step 320 succeeds, prod(p) is a product in the current association, and prod(p) is added to variable Assn (step 330). Then Metrics 300 continues to search for the next product in the current association by decrementing q (step 335), and returning to the start of the main loop (step 315). Steps 310-335 are examples of steps that could be implemented by logic configured to identify the associated products as embodied in hardware and/or software, as will be apparent to one of ordinary skill in the art.
After all products in the current association are found (i.e., when step 315 fails), Metrics 300 accumulates the metrics for the association by computing the metrics for each of the transactions that forms the support for the current association. The metrics accumulation takes place in steps 340 through 370. For each of the transactions that forms the support for the association, Metrics 300 searches for the metrics pointer, then accumulates the associated metrics.
The metrics accumulation is initiated by setting variable t to 0 (step 340). As long as t is smaller than prod(j).trans(k), there is another transaction for which the metrics have to be found (step 345). The transaction for which the metrics have to be found is the transaction represented by item(j)(t). Metrics 300 sets variable current to item(j)(t) (step 350). As long as variable current does not point to a metrics pointer (step 355), the value of current is replaced with the value of current.right (step 360).
Eventually, current will be a metrics pointer (step 355 succeeds). Then the metrics that this pointer points to are accumulated with previously accumulated metrics (step 365), and the procedure moves on to the next transaction by incrementing the value of t (step 370) and returning to step 345. When there are no more transactions in the support for the current association (when step 345 fails), Metrics 300 ends. Steps 340-370 are examples of steps that could be implemented by measurement logic as embodied in hardware and/or software, as will be apparent to one of ordinary skill in the art.
Pseudo-code for Metrics 300 is given in Table 3. This code is equivalent to the flow diagram description in Figure 3.
TABLE 3. Pseudo-code for Metrics 300
Metrics(j, k):
    Assn = {prod(j)}
    p = j
    q = k - 1
    while (q >= 0) {
        while (prod(p).trans(q) <> -1) p--
        add prod(p) to Assn
        q--
    }

    t = 0
    while (t < prod(j).trans(k)) {
        current = item(j)(t)
        while (current <> "metric pointer") {
            current = current.right
        }
        accumulate current
        t++
    }
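The first loop of Metrics 300, which recovers the products of an association from the -1 markers, can be illustrated in isolation. In this sketch the trans values are held as a list of rows, `trans[q][p]` standing for prod(p).trans(q); that layout and the function name `recover_association` are assumptions. The values are the Trans(0) and Trans(1) rows of Table 10, with j and k pointing at column d on level 2.

```python
def recover_association(trans, j, k):
    """Return the product indices of the association whose last product is j, found at level k."""
    assn, p = [j], j
    for q in range(k - 1, -1, -1):
        # Walk left until the product marked as done (-1) on level q is reached.
        while trans[q][p] != -1:
            p -= 1
        assn.append(p)
    return sorted(assn)

names = "abcdef"
trans = [
    [-1, -1, 5, 6, 4, 5],   # Trans(0)
    [0, 0, -1, 4, 3, 3],    # Trans(1)
    [0, 0, 0, 3, 2, 3],     # Trans(2)
]
bcd = [names[p] for p in recover_association(trans, j=3, k=2)]
bcf = [names[p] for p in recover_association(trans, j=5, k=2)]
```

Because products are marked in increasing index order on each level, the first -1 met while walking left is always the most recently marked product on that level.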
5.7. The Whirl Procedure
The Whirl 400 procedure is an implementation of step 245 of Dervish 200 that, given the current association, computes candidate associations that contain the current association. This section first presents a high level discussion of Whirl 400. This is followed by a discussion of a flow diagram and pseudo-code for an implementation of Whirl 400. In the conceptual basis, several assumptions are made. Assume an association of size k+1 has been found, that is, there are at least min_support transactions that have k+1 products in common. If prod(j) is the product among these with the highest index, and Q is the set of the other k products, then Supt(Q ∪ {prod(j)}) is the support for the association. Furthermore, assume prod(m).trans(k+1) is defined for all m >= j and that for all t < prod(m).trans(k+1) the transaction represented by item(m)(t) contains every product in set Q. Assume also that for all t >= prod(m).trans(k+1) the transaction represented by item(m)(t) does not contain every product in set Q.
Given these assumptions and definitions, Whirl 400, which has inputs j and k, finds all transactions, for each m where m > j, containing all products in Q ∪ {prod(j)}, as well as containing prod(m). In other words, for each m > j Whirl 400 separates the transactions that have prod(m) into a group that also contains all products in Q ∪ {prod(j)} and a group that does not. In so doing, Whirl 400 computes new candidate associations.
Referring now to Figure 4, the following describes an implementation of Whirl 400. After initializing counter trans(k+1) for each product (step 405), Whirl 400 consists of two nested loops. The outer loop (steps 410 through 440) goes through all transactions that form the support for the current association, where prod(j) is the last product to be added to this association. Once all these transactions have been processed (when step 410 fails), execution of Whirl 400 terminates. For each of these transactions (when step 410 succeeds), Whirl 400 considers each of the remaining products (that is, products with an index higher than j). To do this, Whirl 400 follows the right pointer for each of these transactions (step 415) until the pointer points to a field that is not a transaction but a metric pointer (step 420). In this case there are no more products within the current transaction, Whirl 400 moves on to the next transaction (step 440), and returns to the start of the outer loop (step 410).
If there is a product for the current transaction (step 420 succeeds), Whirl 400 moves this transaction to the list of transactions containing all products in the current association as well as the current product, by means of a swap (step 425). This swap includes updating right pointers to the swapped items as necessary. As those skilled in the art will appreciate, reordering the transactions by swapping is just one example of a procedure for selectively reordering the transactions. As those skilled in the art will also appreciate, step 425 is an example of a step that could be implemented by ordering logic as embodied in hardware and/or software.
Since Whirl 400 now has a new transaction, Whirl 400 increments the number of transactions that form the support for the current association augmented with the current product (step 430). Whirl 400 then moves on to the next transaction containing the current product (step 435) and returns to the start of the inner loop (step 420). Both loops terminate in all cases.
Pseudo-code for Whirl 400 is given in Table 4. A swap function (step w07) exchanges the two items in the item array to which the function's two arguments point. This code is equivalent to the flow diagram description in Figure 4.
TABLE 4. Pseudo-code for Whirl 400
Whirl(k, j):
w01     foreach (m) prod(m).trans(k+1) = 0
w02     t = 0
w03     while (t < prod(j).trans(k)) {
w04         current_item = item(j)(t).right
w05         while (current_item != "metric pointer") {
w06             current_prod = current_item.product
w07             swap(current_prod.trans(k+1), current_item)
w08             current_prod.trans(k+1)++
w09             current_item = current_item.right
w10         }
w11         t++
w12     }
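The three procedures of Tables 2 through 4 can be combined into one runnable sketch. Several simplifications here are assumptions rather than the described embodiment: columns hold transaction ids instead of product/right pointer pairs, a `where` map records each transaction's position in each column so a swap can locate it, `trans[k][j]` stands for prod(j).trans(k), and one extra level is allocated for simplicity. Read literally, Table 2 marks prod(j).trans(k) with -1 before calling Whirl, while Whirl uses that value as its loop bound; this sketch therefore calls `whirl` before marking. The cost per transaction is made up (the number of products in it).

```python
def dervish(names, transactions, cost, min_support):
    n = len(names)
    index = {name: j for j, name in enumerate(names)}
    # rows[t]: product indices of transaction t in increasing order (the "right" chain).
    rows = [sorted(index[p] for p in t) for t in transactions]
    item = [[] for _ in range(n)]        # item[j]: transactions containing product j
    where = [{} for _ in range(n)]       # where[j][t]: position of t inside item[j]
    for t, row in enumerate(rows):
        for j in row:
            where[j][t] = len(item[j])
            item[j].append(t)
    trans = [[0] * n for _ in range(n + 1)]
    trans[0] = [len(column) for column in item]
    found = []

    def metrics(j, k):
        assn, p = [j], j
        for q in range(k - 1, -1, -1):   # recover products from the -1 markers
            while trans[q][p] != -1:
                p -= 1
            assn.append(p)
        total = sum(cost[t] for t in item[j][:trans[k][j]])
        return tuple(names[p] for p in reversed(assn)), total

    def whirl(k, j):
        trans[k + 1] = [0] * n
        for t in item[j][:trans[k][j]]:  # transactions supporting the association
            for m in rows[t]:
                if m > j:                # move t to the front block of column m
                    a, b = where[m][t], trans[k + 1][m]
                    other = item[m][b]
                    item[m][a], item[m][b] = other, t
                    where[m][other], where[m][t] = a, b
                    trans[k + 1][m] += 1

    k = 0
    while k >= 0:
        j = 0
        while j < n and trans[k][j] < min_support:
            j += 1
        if j == n:
            trans[k] = [0] * n           # no more associations of this size
            k -= 1
        else:
            found.append(metrics(j, k))
            whirl(k, j)                  # assumption: run before marking, see above
            trans[k][j] = -1             # mark as done
            k += 1
    return found

names = list("abcdef")
transactions = ["abcd", "bcdf", "def", "abce", "abde", "bcdef", "abcf", "df"]
cost = [len(t) for t in transactions]
result = dervish(names, transactions, cost, min_support=3)
```

On the example data of Section 6 the sketch reports the associations in the same order as the walkthrough there, with the accumulated cost for {b, c, d} coming from the three transactions that contain all three products.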
6. An Example of the Operation of the Dervish System
6.1. Associations Discovery
As an example of the operation of the Dervish system, consider a set of eight transactions, and six products, a, b, c, d, e, and f. Minimum support is three. The distribution of the products over the transactions is in Table 5.
TABLE 5. Distributions of products in transactions
Trans. #  prod. a  prod. b  prod. c  prod. d  prod. e  prod. f
1         a        b        c        d
2                  b        c        d                 f
3                                    d        e        f
4         a        b        c                 e
5         a        b                 d        e
6                  b        c        d        e        f
7         a        b        c                          f
8                                    d                 f
As the first step, the Dervish system reads the transaction data into the data structures. The array of items for a product is represented as a column in Table 6. The first element of each array (the first transaction) is at the top of the column. In addition, the trans(0) index for each product is at the bottom of the table; the index is the number of transactions that contain the corresponding product. In this example, transactions in a product column are represented by transaction IDs, rather than by product and right pointers. (In order to focus on the process, the metrics are not included in this example.)
TABLE 6. Initial state of the Dervish system
              a    b    c    d    e    f
              1    1    1    1    3    2
              4    2    2    2    4    3
              5    4    4    3    5    6
              7    5    6    5    6    7
                   6    7    6         8
                   7         8
Trans(0)      4    6    5    6    4    5
The Dervish system starts at the left-most column, product a. This product is in transactions 1, 4, 5, and 7. For all columns to the right of a, the Dervish system moves transactions 1, 4, 5, and 7, if they occur, to the top of those columns, by swapping them with other transaction numbers. For instance, transaction 2 does not have product a, and transaction 2 is the second entry in column b, while transaction 4 has product a, and is the third entry in column b. The Dervish system swaps transaction numbers 2 and 4 in column b. This swapping can be done by the Whirl procedure.
Meanwhile, the Dervish system tracks the number of transactions from 1, 4, 5, and 7 that occur in each column and stores that number in trans(1). For column a, this index remains 0, as the Dervish system has no use for product a at this point. Now the Dervish system has processed product a, so the Dervish system can set its index trans(0) to -1. The result of processing product a is Table 7.
TABLE 7. After processing product a
              a    b    c    d    e    f
              1    1    1    1    4    7
              4    4    4    5    5    3
              5    5    7    3    3    6
              7    7    6    2    6    2
                   6    2    6         8
                   2         8
Trans(0)     -1    6    5    6    4    5
Trans(1)      0    4    3    2    2    1
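The step from Table 6 to Table 7 can be reproduced with a small partitioning sketch. Unlike Whirl 400, which walks the right pointers of each supporting transaction, this simplified version scans every column from top to bottom; the function name `whirl_step` and that scanning order are assumptions made for illustration.

```python
def whirl_step(columns, support):
    """Swap the transactions in `support` to the top of each column and count them."""
    counts = {}
    for name, column in columns.items():
        front = 0
        for pos, tid in enumerate(column):
            if tid in support:
                # Swap the supporting transaction into the next free top slot.
                column[front], column[pos] = column[pos], column[front]
                front += 1
        counts[name] = front
    return counts

# Columns b through f of Table 6 (transaction ids per product).
columns = {
    "b": [1, 2, 4, 5, 6, 7],
    "c": [1, 2, 4, 6, 7],
    "d": [1, 2, 3, 5, 6, 8],
    "e": [3, 4, 5, 6],
    "f": [2, 3, 6, 7, 8],
}
counts = whirl_step(columns, support={1, 4, 5, 7})  # transactions containing product a
```

The returned counts correspond to the Trans(1) row of Table 7 for columns b through f, and in column b the first swap exchanges transactions 2 and 4, as described above.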
The Dervish system scans the indices of trans(1) left to right, and determines that the first index that is at least the minimum support (3 in this example) is in column b. This means that the Dervish system has found an association, with two products, that contains product b. To determine the other product, the Dervish system scans the indices of trans(0) from left to right and finds that the last column with index -1 is column a. Hence products a and b form an association, as they occur in four different transactions.
Next the Dervish system takes the transactions that form the support for this association and checks which of the remaining products are in each of these transactions. This results in Table 8.
TABLE 8. Finding associations containing a and b
              a    b    c    d    e    f
              1    1    1    1    4    7
              4    4    4    5    5    3
              5    5    7    3    3    6
              7    7    6    2    6    2
                   6    2    6         8
                   2         8
Trans(0)     -1    6    5    6    4    5
Trans(1)      0   -1    3    2    2    1
Trans(2)      0    0    3    2    2    1
Now the Dervish system scans the trans(2) indices and notes that product c is part of an association with three products. The other two products in this association are products a and b.
At this point, the Dervish system can try to extend the association with additional products. However, when the Dervish system scans the indices in trans(2), there are no more products with at least minimum support. Therefore, the association of products a, b, and c is maximal. Next, the Dervish system determines that there are no other associations that contain a and b.
The Dervish system now goes back to scanning the trans(1) indices, from left to right. The first product to have minimum support is c, so product a and product c form an association. Since none of the other trans(1) indices has at least minimum support, the association of a and c is maximal.
Next, the Dervish system goes back to scanning the trans(0) indices, where the Dervish system finds that product b has minimum support. The Dervish system attempts to find products that form an association with product b and the result is Table 9.
TABLE 9. Finding associations containing b but not a
              a    b    c    d    e    f
              1    1    1    1    4    2
              4    2    2    2    5    6
              5    4    4    5    6    7
              7    5    6    6    3    3
                   6    7    3         8
                   7         8
Trans(0)     -1   -1    5    6    4    5
Trans(1)      0    0    5    4    3    3
Note that at this point column a is no longer needed, as the Dervish system has found all associations that contain product a. In other words, the Dervish system can guarantee that there are no more associations that can be generated containing product a.
Scanning the trans(1) indices, the Dervish system notes that product c forms an association with product b, as both b and c occur simultaneously in five transactions. Continuing to add to this association yields Table 10.
TABLE 10. Finding associations containing b and c
              a    b    c    d    e    f
              1    1    1    1    4    2
              4    2    2    2    6    6
              5    4    4    6    5    7
              7    5    6    5    3    3
                   6    7    3         8
                   7         8
Trans(0)     -1   -1    5    6    4    5
Trans(1)      0    0   -1    4    3    3
Trans(2)      0    0    0    3    2    3
Since trans(2) in column d has minimum support, the Dervish system has found an association of three products, one of which is product d. Scanning for the rightmost columns in trans(0) and trans(1) with a -1 entry, the Dervish system finds that the other two products in this association are b and c.
If the Dervish system attempts to add products to this association, the Dervish system can only add product f, as trans(2) for f is three. But the Dervish system finds that trans(3) for f is two, so that products b, c, and d form a maximal association. Also, b, c, and f form a maximal association.
Now the Dervish system goes back to scanning each trans(1) index. The next one with minimum support is in column d. The Dervish system executes Whirl for the remaining columns, which yields Table 11.
TABLE 11. Finding associations containing b and d
              a    b    c    d    e    f
              1    1    1    1    6    2
              4    2    2    2    5    6
              5    4    4    6    4    7
              7    5    6    5    3    3
                   6    7    3         8
                   7         8
Trans(0)     -1   -1    5    6    4    5
Trans(1)      0    0   -1   -1    3    3
Trans(2)      0    0    0    0    2    2
Here neither e nor f forms an association with b and d. Back in the trans(1) indices, the Dervish system notes that e forms an association with b, which does not include f, and that f forms an association with b, which is also maximal.
Now the Dervish system goes back to the trans(0) indices, where product c has minimum support. (At this point, the Dervish system no longer needs columns a and b, and column c remains unchanged.) Continuing the Dervish system yields Table 12.
TABLE 12. Finding associations containing c
              a    b    c    d    e    f
              1    1    1    1    6    6
              4    2    2    2    4    2
              5    4    4    6    5    7
              7    5    6    5    3    3
                   6    7    3         8
                   7         8
Trans(0)     -1   -1   -1    6    4    5
Trans(1)      0    0    0    3    2    3
Here, products c and d form a maximal association, because the association cannot be augmented with the only other candidate, product f. Products c and f also form a maximal association.
Returning to the trans(0) indices, the Dervish system finds an association with product d. Attempting to augment this association yields Table 13.
TABLE 13. Finding associations containing d
            a    b    c    d    e    f
            1    1    1    1    6    6
            4    2    2    2    5    2
            5    4    4    6    3    3
            7    5    6    5    4    8
                 6    7    3         7
                 7         8
Trans(0)   -1   -1   -1   -1    4    5
Trans(1)    0    0    0    0    3    4
The Dervish system finds that d and e form an association, which cannot be augmented with f, and that d and f form a maximal association. The Dervish system
subsequently finds that product e forms an association. The final Dervish system table is Table 14.
TABLE 14. Final result for the Dervish system
            a    b    c    d    e    f
            1    1    1    1    6    6
            4    2    2    2    5    3
            5    4    4    6    3    2
            7    5    6    5    4    8
                 6    7    3         7
                 7         8
Trans(0)   -1   -1   -1   -1   -1    5
Trans(1)    0    0    0    0    0    2
The Dervish system concludes that e and f do not form an association and that product f forms a maximal association.
The Dervish system found that each product forms an association by itself. In addition, the Dervish system found the following associations of multiple products, in the order they were generated: {a,b}, {a,b,c}, {a,c}, {b,c}, {b,c,d}, {b,c,f}, {b,d}, {b,e}, {b,f}, {c,d}, {c,f}, {d,e}, and {d,f}. If the associations that are contained in another one are weeded out, the following maximal associations result: {a,b,c}, {b,c,d}, {b,c,f}, {b,e}, {d,e}, and {d,f}. By checking the list of transactions, one can see that there are no other associations.
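The weeding-out step can be illustrated with a short Python sketch. It is not part of the Dervish system as described; it simply takes the thirteen multi-product associations listed above and keeps those that are not a proper subset of another one. The names FOUND and maximal are illustrative assumptions.

```python
# Multi-product associations in the order they were generated above.
FOUND = ["ab", "abc", "ac", "bc", "bcd", "bcf", "bd",
         "be", "bf", "cd", "cf", "de", "df"]


def maximal(associations):
    """Keep only associations that are not strictly contained in another."""
    sets = [frozenset(a) for a in associations]
    return [a for a in sets if not any(a < other for other in sets)]


# Render each surviving association as a sorted string for display.
result = ["".join(sorted(a)) for a in maximal(FOUND)]
print(result)
```

The comparison `a < other` on frozensets tests for a proper subset, so an association is never discarded because of itself.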
6.2. Metrics Accumulation
The following is an example of metrics accumulation. The previous example is used, with the addition of metrics. In this example, two types of metrics are accumulated, average transaction size (e.g., number of products) and average gross profit per transaction. Two columns representing the metrics for each transaction are added to the transaction table.
This example is a computation of the metrics for the association with products b, c, and d. Table 15 is the state of the Dervish system when this association is found. It is the same table as Table 9. The metrics information in Table 16 includes, for each transaction, the size of the transaction and the dollar amount for the transaction. In addition, Table 16 includes the cost for each product.
TABLE 15. Associations containing b, c, and d
            a    b    c    d    e    f
            1    1    1    1    4    2
            4    2    2    2    6    6
            5    4    4    6    5    7
            7    5    6    5    3    3
                 6    7    3         8
                 7         8
Trans(0)   -1   -1    5    6    4    5
Trans(1)    0    0   -1    4    3    3
Trans(2)    0    0    0    3    2    3
TABLE 16. Transaction amounts and product costs
Transaction      1      2      3      4      5      6      7      8
Size             4      4      3      4      4      5      4      2
Amount       $7.48  $6.93  $5.89  $9.27  $9.04  $9.77  $5.68  $3.58

Product          a      b      c      d      e      f
Cost         $1.12  $3.08  $0.99  $1.48  $0.84  $1.28
Three transactions form the support for the association containing products b, c, and d: transactions 1, 2, and 6. The metrics accumulation procedure finds the metrics for each transaction. While the Dervish system accumulates the transaction size and gross profit metrics simultaneously, different approaches are used for each metric.
The following describes how the metrics procedure accumulates the transaction size metric. For transaction 1, there are four products, so the first entry in the "size" column is 4. The metrics procedure stores that one transaction has been processed and that the accumulated size is 4. Then, for transaction 2, there are also four products. The metrics procedure stores that two transactions have been processed and that the accumulated size is 8. Then the procedure processes transaction 6, which has five products, and stores that three transactions have been processed, with an accumulated size of 13. Finally, to get the average transaction size, the procedure divides the accumulated size, 13, by the number of transactions processed, 3, to get an average transaction size of 4.33.
The part of the metrics procedure that computes the average profit is somewhat more complicated. Again, the procedure processes the three transactions in order. For transaction 1, the "amount" column is $7.48. To obtain the gross profit on this transaction, the procedure finds the products that constitute the transaction (a, b, c, and d) and their costs in the second table (Table 16). The total cost for these four products is $6.67, yielding a profit on transaction 1 of $0.81. For transaction 2, the procedure finds a total amount of $6.93 and the total cost of its products of $6.83, yielding a profit of $0.10. The procedure now stores that two transactions have been processed, for a total profit of $0.91. With transaction 6, the amount of the transaction is $9.77, the total cost of its products is $7.67, and the profit is $2.10. The total accumulated profit now is $3.01 for three transactions. As the final operation, the procedure computes that the average gross profit is $1.00 per transaction in this association.
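The two accumulations above can be reproduced with a minimal Python sketch. It is an illustration only: the amounts and costs are those of Table 16, expressed in integer cents to avoid floating-point rounding, the products of each transaction are read off the product columns of Table 15, and the names AMOUNT_CENTS, COST_CENTS, PRODUCTS, and accumulate_metrics are assumptions rather than names used by the Dervish system.

```python
# Table 16, in cents.
AMOUNT_CENTS = {1: 748, 2: 693, 3: 589, 4: 927, 5: 904, 6: 977, 7: 568, 8: 358}
COST_CENTS = {"a": 112, "b": 308, "c": 99, "d": 148, "e": 84, "f": 128}
# Products in each transaction, read off the columns of Table 15.
PRODUCTS = {1: "abcd", 2: "bcdf", 3: "def", 4: "abce",
            5: "abde", 6: "bcdef", 7: "abcf", 8: "df"}


def accumulate_metrics(support):
    """Return (average size, average gross profit in dollars) over `support`."""
    if not support:
        raise ValueError("support must contain at least one transaction")
    count = total_size = total_profit = 0
    for t in support:
        count += 1
        total_size += len(PRODUCTS[t])                     # size metric
        cost = sum(COST_CENTS[p] for p in PRODUCTS[t])     # cost of the products
        total_profit += AMOUNT_CENTS[t] - cost             # gross profit metric
    return total_size / count, total_profit / count / 100


# Support of the association {b, c, d}: transactions 1, 2, and 6.
avg_size, avg_profit = accumulate_metrics([1, 2, 6])
print(f"{avg_size:.2f} {avg_profit:.2f}")
```

Both metrics are accumulated in the same pass over the supporting transactions, as in the description above.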
7. Other Embodiments of the Dervish System
The invention is not limited to the embodiment described above. Any number of optimizations and enhancements are possible. Some of these are useful, for example, to reduce the run-time or the memory requirement for processing actual transaction databases. Other embodiments of the Dervish system are particularly useful for product categories, i.e., products that are organized into a product hierarchy. In some such embodiments, additional variables might be added to the existing data structures. Still
other embodiments of the Dervish system could have different approaches to metrics accumulation. Still other embodiments of the Dervish system are optimized to solve the associations discovery problem on a computer with parallel processing capabilities. Such embodiments can reduce run time for a given data set or allow processing of a data set that cannot completely fit in main memory. The following sections include examples and descriptions of some of these enhancements to, or alternate embodiments of, the Dervish system.
7.1. Other Embodiments of Data Structures
• In one embodiment of the Dervish system, the number of operations is reduced if the products are ordered in prod and item, from the product with the smallest number of transactions, to the product with the largest number of transactions. This is because the first product is checked most often for inclusion in an association, and the product that occurs in the fewest transactions is statistically the least likely product to appear in an association.
• Another embodiment of the Dervish system takes into account the fact that prod ( j ) . trans ( k ) is always 0 for all k > j . Therefore, there is no need to allocate memory for these items, because no part of the system refers to them.
• Still another embodiment of the Dervish system recognizes that if prod ( j ) . trans ( k) < s, where s is the minimum support, then there is no need to compute prod ( j ) . trans ( t ) for all t > k. In other words, in this embodiment there is no need to invoke the Whirl procedure because product pj is not part of an association, and there is no point in attempting to increase the size of the association with other products.
• Yet another embodiment of the Dervish system takes into account that setting prod ( j ) . trans ( k ) to -1 once an association is found with product pj (step 240) is not required. In this embodiment, it is possible to allocate a list of pointers that point to each of the products in the current association. To augment the association, it is sufficient only to scan products with indices greater than the last added product, rather than search through all products (which is what happens after step 215 is executed). Having a list of pointers to all products in the association also makes it easier to produce the current association (step 240), because it eliminates the procedure of scanning through the set of trans ( k ) indices.
• Another embodiment of the Dervish system adds a left pointer to each of the right pointers described previously. Whereas the right pointer points to the next product item in a transaction, the left pointer points to the previous product item in a transaction. Now the list of products in a transaction is a doubly-linked list, rather than a singly-linked list. This facilitates the swapping of items, such as is performed in step 425 of Whirl 400.
• As will be obvious to one skilled in the art, any data structure described above that is an index into an array can be replaced with a pointer to the address of a location in the array. Similarly, any pointer can be replaced with an index into an array.
7.2. Other Embodiments of Function Descriptions
• The embodiments of Whirl and the Dervish system presented in section 5 are iterative. A recursive implementation of either is also possible, as was originally described in section 4. This typically will reduce the amount of code, but often increases the execution time.
• In the embodiment of Whirl presented above, the inner loop examines products and the outer loop examines transactions. It is possible to swap the two loops.
• The Whirl and Metrics procedures can be combined. In the embodiments above, step 355 in Metrics 300 and step 420 in Whirl 400 test for the same condition, in a similar loop. Rather than perform each loop once in Metrics and once in Whirl, there can be just a single loop. Such a combination of loops speeds up the system, at no expense in memory requirements.
• The embodiment of the Dervish system presented in section 4 performs a modified depth-first search. It is possible to implement the Dervish system as a pure depth-first search. That is, the Dervish system attempts to augment the current association with a single product, then augment the resulting association, before attempting to augment the current association with other products. The advantages of the depth-first search are that such a search uses a minimal amount of memory (since only one association is stored at any time), and the search mechanism produces associations while the system runs, rather than at the end. It is also possible to implement the Dervish system as a breadth-first search. In such an implementation, the Dervish system attempts to find all products that form an association with the current association, then attempts to augment each of the resulting associations. More generally, those skilled in the art will realize that virtually any heuristic search method can be used as well.
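A pure depth-first search of the kind described in the last item can be sketched recursively in Python. This is a minimal sketch and not the Dervish procedure itself: set intersections stand in for the Whirl reordering, the transaction sets are read off the product columns of the example in section 6, and the minimum support of 3 is an assumption inferred from that example. The names COLUMNS and depth_first are illustrative.

```python
# Transactions containing each product (example of section 6).
COLUMNS = {
    "a": {1, 4, 5, 7},
    "b": {1, 2, 4, 5, 6, 7},
    "c": {1, 2, 4, 6, 7},
    "d": {1, 2, 3, 5, 6, 8},
    "e": {3, 4, 5, 6},
    "f": {2, 3, 6, 7, 8},
}
MIN_SUPPORT = 3  # assumed value, inferred from the worked example


def depth_first(products, columns, min_support, prefix=(), support=None):
    """Yield (association, supporting transactions) in depth-first order."""
    for i, p in enumerate(products):
        s = columns[p] if support is None else support & columns[p]
        if len(s) < min_support:
            continue  # prune: no larger association through p can qualify
        association = prefix + (p,)
        yield association, s  # produced while the search is still running
        # Augment only with products to the right of the one just added.
        yield from depth_first(products[i + 1:], columns, min_support,
                               association, s)


order = ["".join(a) for a, _ in depth_first(list("abcdef"), COLUMNS, MIN_SUPPORT)
         if len(a) > 1]
print(order)
```

Because the function is a generator, each association is available as soon as it is found, and only the current chain of associations is held in memory. Passing the products sorted by ascending transaction count, for example `sorted(COLUMNS, key=lambda p: len(COLUMNS[p]))`, gives the least-frequent-first ordering suggested in section 7.1.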
7.3. Hierarchical Products
In a typical application of the associations discovery problem, the set of products can be divided into categories or sub-categories. For instance, a supermarket can have the category "beverages" with a sub-category "fruit juices," both of which contain a particular apple juice product. The associations discovery problem can be generalized to include these "hierarchical products." Thus, as used herein, the term "product" should be understood to include products (per se), categories, and/or sub-categories. For instance, in the above example, it is possible that there is no association that includes apple juice (product) and hamburger patties (product), but that there is an association that includes fruit juices (sub-category) and hamburger patties (product).
Another embodiment of the Dervish system could handle associations involving hierarchical products as a simple extension of finding associations involving products. For each hierarchical product, the Dervish system computes all transactions that contain that hierarchical product. The Dervish system then adds this hierarchical product to the list of products, as if the hierarchical product were an actual product. The Dervish system will find each association that involves hierarchical products. Treating hierarchical products as if they were actual products may lead to the generation of trivial associations. For instance, in the above example, if apple juice and hamburger patties form an association, then fruit juices and hamburger patties also form an association. To weed out such associations, a simple check can be added to the condition for the inner loop of Whirl 400, in step 420. Specifically, when a hierarchical product is checked for inclusion in the current association, a check is performed to
determine whether the hierarchical product includes any of the products in the current association. If the hierarchical product includes any of the products in the current association, the hierarchical product is skipped.
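The two steps just described, adding hierarchical products to the transactions and skipping trivial extensions, might be sketched in Python as follows. The catalogue, the transactions, and the names HIERARCHY, add_hierarchical_products, and is_trivial are hypothetical; the sketch also assumes that each hierarchical product is mapped to everything it contains, with nested sub-categories already flattened.

```python
# Hypothetical catalogue: hierarchical product -> everything it contains.
HIERARCHY = {
    "fruit juices": {"apple juice", "orange juice"},
    "beverages": {"fruit juices", "apple juice", "orange juice", "cola"},
}

TRANSACTIONS = [
    {"apple juice", "hamburger patties"},
    {"orange juice", "hamburger patties"},
    {"apple juice", "hamburger patties"},
    {"cola"},
]


def add_hierarchical_products(transactions, hierarchy):
    """Treat each hierarchical product as an actual product of a transaction."""
    extended = []
    for products in transactions:
        products = set(products)
        for category, members in hierarchy.items():
            if products & members:
                products.add(category)
        extended.append(products)
    return extended


def is_trivial(candidate, association, hierarchy):
    """True if `candidate` contains a product already in the association."""
    return bool(hierarchy.get(candidate, set()) & set(association))


extended = add_hierarchical_products(TRANSACTIONS, HIERARCHY)
juice_support = sum(1 for t in extended
                    if {"fruit juices", "hamburger patties"} <= t)
apple_support = sum(1 for t in extended
                    if {"apple juice", "hamburger patties"} <= t)
print(juice_support, apple_support)
```

In this made-up data, fruit juices and hamburger patties occur together more often than apple juice and hamburger patties do, while the check in is_trivial prevents fruit juices from being added to an association that already contains apple juice.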
Note that in this case, for each product, the hierarchical products that contain the product should be later in the order of products. Then, if the list of products is ordered from least-frequent to most-frequent, a hierarchical product is guaranteed to occur in the list of products after all of the products (and hierarchical products) the hierarchical product contains. If the hierarchical product contains exactly one product, care should be taken that the hierarchical product occurs in the list after the product it contains.
7.4. Other Embodiments of Metrics Accumulation
There are several ways that metrics can be accumulated for the Dervish System. This is the implementation of step 240 of Dervish 200. Which method should be chosen depends on the complexity of the metric accumulation function, and on the number of associations for which metrics need to be calculated. For the simple metrics accumulation method, as described in the implementation of Metrics 300, the Dervish system keeps the data structures representing the sets of products and transactions as before. Each transaction is represented by a linked list; each element of the linked list represents a product in that transaction, and the last element of the list is a pointer to the metric for the transaction. If there are many products in each transaction, traversing the entire list of products for each transaction may take too much time. In that case, an embodiment reduces the metrics accumulation time by using more memory. For each product in each transaction, this embodiment has a pointer directly to the metrics for that transaction. Once the Dervish system identifies an association, and the transactions that form its support, the Dervish system follows the metrics pointer, and accumulates the metrics. This reduces the time to search for the metrics to a constant, as the loop that includes steps 355 and 360 in Metrics 300 is eliminated.
Another embodiment only includes the metrics pointer for every n-th product in a transaction. This reduces the memory consumption (compared to the previous embodiment), while maintaining a constant search time for the metrics.
Yet another embodiment is to do a cascaded metrics accumulation. This is useful when the user wants to get all associations and their metrics, not just the maximal associations. In this accumulation, the Dervish system uses the result for larger associations (that is, with more products), to compute the metrics for smaller ones. For instance, assume an association {a, b, c} was discovered after association {a, b}, and assume the Dervish system has computed the metrics for association {a, b, c}. Now there is a simple split in the transactions for association {a, b} between those transactions that contain product c, and those that do not. For each transaction in association {a, b}, the Dervish system follows the right pointers, as before. If a right pointer points to an item of product c, the Dervish system does not have to calculate the metrics for this transaction, as the metrics for this transaction are already included in the metrics for association {a, b, c}. If, however, the right pointer never points to an item of product c, then the Dervish system has to calculate the metrics for this transaction and accumulate them with the metrics already computed for association {a, b, c}. This requires more bookkeeping, but may be faster if the metrics computation is complicated.
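The cascade for {a, b} and {a, b, c} can be illustrated with a short Python sketch using the transaction-size metric of section 6.2. It is a sketch under stated assumptions: membership tests on a string of products stand in for following the right pointers, and the names PRODUCTS, accumulate, and cascaded are illustrative.

```python
# Products in each transaction of the example in section 6.
PRODUCTS = {1: "abcd", 2: "bcdf", 3: "def", 4: "abce",
            5: "abde", 6: "bcdef", 7: "abcf", 8: "df"}


def accumulate(transactions):
    """Direct accumulation: (transactions processed, accumulated size)."""
    transactions = list(transactions)
    return len(transactions), sum(len(PRODUCTS[t]) for t in transactions)


def cascaded(support_small, extra_product, larger_totals):
    """Totals for the smaller association, reusing the larger one's totals."""
    count, total = larger_totals
    for t in support_small:
        if extra_product not in PRODUCTS[t]:  # not yet counted in larger_totals
            count += 1
            total += len(PRODUCTS[t])
    return count, total


abc_totals = accumulate([1, 4, 7])                  # support of {a, b, c}
ab_totals = cascaded([1, 4, 5, 7], "c", abc_totals)  # support of {a, b}
print(ab_totals)
```

Only transaction 5 has its metric computed again; the other three contributions come from the totals for {a, b, c}. With a metric as cheap as transaction size there is no saving, so the benefit appears only when the per-transaction computation is expensive.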
Still another embodiment is to do metrics accumulation on variables that are not pre-defined. In the embodiment of Metrics 300 described above, the actual metrics accumulation occurs in step 365. In this embodiment, the average size and average gross profit per transaction for an association were computed. One can compute such metrics by defining steps or procedures, and the variables they use, when the system is constructed. This is known as static metrics accumulation. Alternatively, one can define steps or procedures for certain operations (such as taking the average of a list of numbers) when the system is constructed, but omit the variables on which these operations are to take place. Then, when the system is run, the variables that have to be used for metrics accumulation are defined, as well as the specific operations that are to take place on these variables. These specifics can be supplied through a script or by other means. This is known as dynamic metrics accumulation.
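Dynamic metrics accumulation might look like the following Python sketch, in which the operations are fixed when the program is written and the variables they apply to are named only at run time. The operation table, the record layout, and the names OPERATIONS and run_metrics are assumptions made for illustration; the text does not specify a script format.

```python
from statistics import mean

# Operations defined when the system is constructed.
OPERATIONS = {"average": mean, "total": sum, "maximum": max}


def run_metrics(spec, records):
    """Apply run-time (operation, variable) pairs to the supporting records."""
    return {
        f"{operation}({variable})": OPERATIONS[operation](
            [record[variable] for record in records])
        for operation, variable in spec
    }


# Records for transactions 1, 2, and 6 of Table 16 (amounts in cents).
records = [
    {"size": 4, "amount_cents": 748},
    {"size": 4, "amount_cents": 693},
    {"size": 5, "amount_cents": 977},
]
# The specification would be supplied at run time, e.g. from a script.
spec = [("average", "size"), ("total", "amount_cents")]
result = run_metrics(spec, records)
print(result)
```

A static accumulation would instead hard-code both the operation and the variable, as in the size and gross profit example of section 6.2.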
7.5. Out-of-Memory and Parallel Versions
The Dervish system is memory-efficient and has a maximum memory requirement that can be calculated before running the system. Still, it is possible that the memory requirement is larger than the size of the main memory of the computer. In that case, rather than allocating more memory than is physically available and letting the computer slow down, the associations discovery problem can be partitioned into sub-problems, and the Dervish system can aggregate the intermediate results.
This method of partitioning the problem and then aggregating intermediate results can also be used to parallelize the Dervish system. Rather than loading each sub-problem in main memory, the Dervish system can compute the results for each sub-problem on a different processor. The aggregation of results, too, can be done in parallel. The following describes an implementation of an "out-of-memory" version of the Dervish system and discusses how this is the basis for a "parallel" version.
This method can be viewed as including the steps of: (a) creating multiple sub-processes; (b) determining intermediate candidate association results for each sub-process; and (c) aggregating these intermediate results. As will be apparent to one of ordinary skill in the art, these are examples of steps that could be implemented, respectively, by: (a) a partitioning module; (b) an iteration module; and (c) aggregation logic, each as embodied in hardware and/or software.
7.6. Checking Support for a Candidate Association
The partitioning of the associations discovery problem is a partitioning of the set of transactions into distinct parts (i.e., partitions or sub-problems). The partitioning process is described in detail below. The Dervish system is run for each partition, and the result of solving each sub-problem is a number of candidate associations (i.e., a list of products that might have minimum support).
For each candidate association, the Dervish system counts the number of transactions that contain all products in the candidate association. At this point, the Dervish system only has the count for a particular partition. To count the total number of transactions that contain each product in a candidate association, the Dervish system takes
the set of transactions that has the first product in the candidate association and then takes the next product in the candidate association. Using Whirl, the Dervish system moves the transactions that also contain the first product up and the remaining transactions down. Now the trans ( 0 ) index for this product is the number of transactions that have both products, and the first trans ( 0 ) transactions for this product are the transactions that have both products. The Dervish system continues to do this for each remaining product in the candidate association. Then the trans ( 0 ) index for the last product is the number of transactions that have all the products in the candidate association.
The Dervish system makes such a count for each partition and the sum of these counts is the total number of transactions that contain all products in the candidate association. Finally, the Dervish system checks if this sum is at least as large as the minimum support.
7.7. An Example of Checking Support
As an example, consider the final table for the example in the previous section, where all trans ( 0 ) indices are reset to 0, as in Table 17.
TABLE 17.
            a    b    c    d    e    f
            1    1    1    1    6    6
            4    2    2    2    5    3
            5    4    4    6    3    2
            7    5    6    5    4    8
                 6    7    3         7
                 7         8
Trans(0)    0    0    0    0    0    0
Assume that there are many more transactions and that the table above is for one of the partitions of all transactions. Now the problem is to count how many transactions contain products b, c, and f because this is the candidate association.
As before, the Dervish system starts out by scanning the list of products from left to right. The first product in the candidate association is product b, and this product is contained in six transactions. Now the Dervish system determines which of the six transactions contain the next product, c. There are five such transactions. Finally, the Dervish system determines which of these five transactions contain product f. For this, the Dervish system uses Whirl to move the transactions that also have product c up (6, 2, and 7) and the ones that do not have product c down (3 and 8). The resulting table is Table 18.
TABLE 18.
            a    b    c    d    e    f
            1    1    1    1    6    6
            4    2    2    2    5    2
            5    4    4    6    3    7
            7    5    6    5    4    8
                 6    7    3         3
                 7         8
Trans(0)    0    6    5    0    0    3
From this, the Dervish system determines that there are three transactions in this partition that contain products b, c, and f. One should note several things about this counting procedure. First, nothing was changed in the columns for products a, d, and e, those products not in the candidate association. Second, the Dervish system only made a single sweep through the columns of the products that were part of the candidate association. In this counting procedure, there is no backtracking, and Whirl is invoked only once for each product, whereas when searching for all candidate associations, Whirl may be invoked multiple times. It follows that this counting procedure is much faster than generating all candidate associations.
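The counting procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: whirl here is a simple stable partition of a product column, whereas the actual Whirl procedure swaps items in place, so the order of the transactions moved down may differ from Table 18. The names whirl, count_support, and total_support are hypothetical, and the candidate's products are assumed to be given in left-to-right column order.

```python
def whirl(column, active):
    """Move transactions in `active` up; return (new column, how many moved up)."""
    up = [t for t in column if t in active]
    down = [t for t in column if t not in active]
    return up + down, len(up)


def count_support(columns, candidate):
    """Single sweep over the candidate's columns; returns trans(0) per product."""
    first, *rest = candidate
    counts = {first: len(columns[first])}
    active = set(columns[first])
    for product in rest:
        columns[product], count = whirl(columns[product], active)
        active = set(columns[product][:count])  # transactions with all products so far
        counts[product] = count
    return counts


def total_support(partitions, candidate):
    """Sum the per-partition counts for the last product of the candidate."""
    return sum(count_support(columns, candidate)[candidate[-1]]
               for columns in partitions)


# Product columns of Table 17 (one partition).
columns = {
    "a": [1, 4, 5, 7],
    "b": [1, 2, 4, 5, 6, 7],
    "c": [1, 2, 4, 6, 7],
    "d": [1, 2, 6, 5, 3, 8],
    "e": [6, 5, 3, 4],
    "f": [6, 3, 2, 8, 7],
}
counts = count_support(columns, "bcf")
print(counts, columns["f"])
```

The columns for a, d, and e are never touched, and each column of the candidate is visited once, which mirrors the two observations made above.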
7.8. Partitioning the Set of Transactions
At the start of the Dervish system, the user estimates the worst-case memory requirement for the number of products and transactions. If the memory requirement is larger than the size of the main memory, the Dervish system has to partition the set of transactions. If the total memory requirement is N times the size of the main memory, then the Dervish system partitions the set of transactions into N partitions of roughly equal size.
The Dervish system does not depend on any particular partitioning method. While it is necessary that each transaction occur in exactly one partition, it is immaterial which transactions are lumped together in any given partition. It is desirable that the memory requirement for each partition be smaller than the size of the main memory.
After the partitioning is done, the Dervish system (or any other system for associations discovery) processes each partitioned set of transactions. As before, s is the minimum support for the entire set of transactions. If there is an association, that is, a set of products that occur in at least s transactions, then there must be a partitioned set for which the same set of products occur in at least s/N transactions.
In other words, for each partition, the user sets the minimum support to the smallest integer that is at least s/N, and the user runs the Dervish system (or any associations discovery system). Each association found for each partition is a candidate association for the entire set of transactions. A procedure for counting the number of transactions that contain each candidate association was discussed above.
While partitioning allows processing of data sets too large to fit in main memory, there is some associated processing overhead. The runtime overhead in partitioning the set of transactions is relatively minor. Also, there is some overhead in having to load every transaction twice and in having to check if a candidate association is, in fact, an association. But, the biggest overhead arises when the system produces many candidate associations that turn out not to be associations.
In summary, if the memory requirement for a dataset is too large, the Dervish system partitions the set of transactions and processes each partition in turn, yielding a
number of candidate associations. The Dervish system then loads each partition in memory again and counts the number of transactions that contain each candidate association.
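The two-pass scheme summarized above can be sketched in Python. Since the text notes that any associations discovery system can process a partition, the sketch uses a brute-force enumerator, mine, as a stand-in for the Dervish system; the round-robin partitioning and the names support, mine, and out_of_memory are likewise assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import ceil


def support(transactions, products):
    """Number of transactions that contain every product in `products`."""
    return sum(1 for t in transactions if products <= t)


def mine(transactions, min_support):
    """Brute-force stand-in for any associations discovery system."""
    items = sorted(set().union(*transactions))
    return {
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
        if support(transactions, frozenset(combo)) >= min_support
    }


def out_of_memory(transactions, s, n):
    """Partition into n parts, mine each with support ceil(s/n), then recount."""
    partitions = [transactions[i::n] for i in range(n)]  # one partition each
    local_support = ceil(s / n)  # smallest integer that is at least s/n
    candidates = set()
    for part in partitions:  # first load of each partition
        candidates |= mine(part, local_support)
    totals = Counter()
    for part in partitions:  # second load of each partition
        for candidate in candidates:
            totals[candidate] += support(part, candidate)
    return {c for c in candidates if totals[c] >= s}


# Transactions of the example in section 6.
T = [frozenset(p) for p in
     ["abcd", "bcdf", "def", "abce", "abde", "bcdef", "abcf", "df"]]
result = out_of_memory(T, 3, 2)
print(len(result))
```

An association with support at least s must reach ceil(s/n) in at least one of the n partitions, so no association is lost; the second pass removes candidates that do not reach s overall.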
7.9. A Parallel Processing System
From the previous out-of-memory version, one can distill an implementation of the Dervish system that will run on a multiprocessor machine. In this implementation, the Dervish system partitions the set of transactions into N sets, as before, where N is at most the number of processors. The Dervish system assigns each partition to a processor and processes each partition, using minimum support s/N, to yield a number of candidate associations.
Now the Dervish system counts, for each candidate association, in how many transactions all products in the candidate association occur. This, too, can be done in parallel, except for the accumulation of every count for each partition. In contrast to the out-of-memory version, it is not necessary to load each partition into memory twice. Because of the overhead for counting candidate associations and possibly creating candidate associations that are not associations, the parallel implementation of the Dervish system will not run N times as fast as the sequential implementation. But for large data sets, there is a significant speedup in executing the system in parallel. Note that the parallel implementation of the Dervish system also works if the memory requirement is larger than the total main memory available on the parallel machine. In that case, the Dervish system can use the out-of-memory implementation on each processor. In effect, the Dervish system can cascade an out-of-memory partition on top of a parallel partition or cascade a parallel partition on top of an out-of-memory partition.
As those skilled in the art will understand, a parallel implementation of the Dervish system need not run on a single multiprocessor machine. For example, it is possible to do parallel processing on a network of single processor or multiprocessor machines. Clearly, any interconnection of computers can be used for a parallel implementation of the Dervish system.
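The parallel counting step can be sketched in Python with a worker pool. This is only an illustration: threads stand in for the processors (in CPython a process pool or separate machines would be needed for real CPU parallelism), and the names count_in_partition and parallel_counts are hypothetical. The per-partition counts are computed concurrently, and only their accumulation is sequential, as noted above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_in_partition(partition, candidates):
    """Count, within one partition, the transactions containing each candidate."""
    return Counter({c: sum(1 for t in partition if c <= t) for c in candidates})


def parallel_counts(partitions, candidates, workers=None):
    """Count candidates per partition concurrently, then sum the counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(count_in_partition, partitions,
                           [candidates] * len(partitions))
        totals = Counter()
        for counts in partial:  # the only sequential step
            totals.update(counts)
    return totals


# Two partitions of the example transactions of section 6.
partition_1 = [frozenset(p) for p in ["abcd", "bcdf", "def", "abce"]]
partition_2 = [frozenset(p) for p in ["abde", "bcdef", "abcf", "df"]]
candidates = [frozenset("bcf"), frozenset("ef")]
totals = parallel_counts([partition_1, partition_2], candidates)
print(totals[frozenset("bcf")], totals[frozenset("ef")])
```

With a minimum support of 3 (an assumed value, inferred from the example), {b, c, f} would be confirmed as an association and {e, f} rejected.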
8. Conclusion
The Dervish system takes a product-based approach rather than a transaction-based approach and uses far less memory than conventional systems. The Dervish system is therefore capable of processing much larger data sets in less time. In addition, should the Dervish system run out of memory, there is an efficient method to partition the problem into sub-problems. While this increases the run-time, running out of memory does not cause an excessive increase in run-time, as it does with the traditional systems. Finally, there is a natural division of the associations discovery problem that allows the Dervish system to run on any number of parallel processors. It will be appreciated by those skilled in the art that further embodiments of the invention may be made without departing from the spirit and scope of the invention as described herein. Such embodiments are intended to be within the scope of the appended claims.

Claims

What is claimed is:
1. A computer-implemented method for discovering associations among products, various combinations of the products occurring among a plurality of transactions, comprising the steps of: (a) accepting a numerical transaction threshold; (b) using a computer, identifying a set of transactions, each transaction in said set including one or more associated products in common, said set having a size at least as large as said transaction threshold; (c) using said computer, determining from within said identified set an additional set of transactions, said additional set having at least as many members as said transaction threshold, each transaction in said additional set including in common: (i) one or more previously determined associated products, and (ii) at least one additional associated product; and (d) using said computer, optionally repeating said step (c) until a desired number of sets of transactions is determined.
2. The method of claim 1 wherein said step (c) further includes identifying said associated products common to said identified set of transactions.
3. The method of claim 2 wherein said step of identifying said associated products is performed using a datum generated during a previous repetition of said step (c).
4. The method of claim 2 wherein said step of identifying said associated products is performed by maintaining pointers to said associated products.
5. The method of claim 2 further including the step of determining an association of products that is contained in no other identified association of products.
6. The method of claim 1 wherein said step (c) includes at least one selective reordering of a representation, of said identified set of transactions, within a data structure containing said identified set of transactions.
7. The method of claim 1 wherein said step (c) is performed in a breadth-first manner.
8. The method of claim 1 wherein said step (c) is performed in a depth-first manner.
9. The method of claim 1 wherein said step (c) is performed in a combination of depth-first and breadth-first manners.
10. The method of claim 1 wherein said step (c) is performed in a recursive manner.
11. The method of claim 1 wherein said steps (c) and (d) are performed in an iterative manner.
12. The method of claim 1 wherein said step (c) further includes computing a metric for said identified set of transactions supporting said associated products.
13. The method of claim 12 wherein each of said transactions is represented as a list of products in said transaction, and wherein at least one product in said list has a connection to a transaction metric used to compute said computed metric.
14. The method of claim 13 wherein said at least one product is at an end of said list.
15. The method of claim 13 wherein said products having connections to said transaction metric occur periodically in said list.
16. The method of claim 12 wherein said step of computing said metric is performed statically.
17. The method of claim 12 wherein said step of computing said metric is performed dynamically.
18. The method of claim 12 wherein said step of computing said metric is performed in a cascaded manner.
19. The method of claim 12 wherein said step of computing said metric includes performing a scalar operation.
20. The method of claim 12 wherein said step of computing said metric includes performing a vector operation.
21. The method of claim 1 wherein said products include at least one hierarchical product.
22. The method of claim 21 wherein: (a) said products are ordered in a data structure; and
(b) said hierarchical product is ordered after any products contained in said hierarchical product.
23. The method of claim 1 wherein: (a) said products are ordered in a data structure; and
(b) said products are ordered from least frequently occurring in said plurality of transactions to most frequently occurring in said plurality of transactions.
24. The method of claim 1 configured for out-of-memory operation in which the total memory required for the method exceeds an available memory of said computer.
25. The method of claim 24: • further comprising the steps of (i) partitioning said method into a plurality of N sub-processes, each sub-process requiring a memory less than said available memory and (ii) replacing said transaction threshold by its value divided by N, before said step (b); and
• wherein said steps (b) - (d) are performed for each of said sub-processes in turn and, for each said sub-process, determining a plurality of candidate associations, each said candidate association including all said associated products common to a corresponding one of said identified transactions within said sub-process.
26. The method of claim 25 further comprising the step of determining a subset of said plurality of candidate associations, each member of said subset cumulatively occurring, among all of said transactions for all of said sub-processes, at least as many times as said transaction threshold accepted in said step (a).
27. The method of claim 24 wherein said out-of-memory operation is achieved in a smooth fall-over from available memory to disk.
28. The method of claim 24 further configured for a parallel processing operation in cascade with said out-of-memory operation.
29. The method of claim 1 configured for use in a distributed computing environment.
30. The method of claim 29 wherein said distributed computing environment includes a parallel processing environment having N processors.
31. The method of claim 30: • further comprising the steps of (i) partitioning said method into N sub-processes corresponding to said N processors and (ii) replacing said transaction threshold by its value divided by N, before said step (b); and • wherein said steps (b) - (d) are performed for each of said sub-processes on its corresponding processor and, for each said sub-process, recording a plurality of candidate associations, each said candidate association including all said associated products common to a corresponding one of said identified transactions within said sub-process.
32. The method of claim 31 further comprising the step of determining a subset of said plurality of candidate associations, each member of said subset cumulatively occurring, among all of said transactions for all of said sub-processes, at least as many times as said transaction threshold accepted in said step (a).
33. The method of claim 30 further configured for an out-of-memory operation in cascade with said parallel processing operation.
34. A computer-implemented method for discovering associations among products, various combinations of the products occurring among a plurality of transactions, comprising the steps of: (a) accepting a numerical transaction threshold; (b) partitioning said plurality of transactions into a plurality of partitions, each partition including a subset of said plurality of transactions, and each transaction being represented by a list of said products occurring therein; (c) processing one of said partitions to generate a first plurality of candidate associations; (d) processing an additional one of said partitions to generate a second plurality of candidate associations; and (e) for each said generated candidate association, determining certain of said plurality of transactions that include said candidate association.
35. The method of claim 34 further comprising the step of identifying as an association certain of said generated candidate associations having said number of transactions that are supported by at least as many transactions as said transaction threshold.
36. The method of claim 34 wherein said steps (c) and (d) are performed sequentially.
37. The method of claim 34 wherein said steps (c) and (d) are performed in parallel.
38. The method of claim 34 wherein each of said partitions fits in an available memory of a computer processing said partition.
39. The method of claim 1 implemented such that its memory requirement is independent of the number of said associations among said products.
40. The method of claim 1 wherein each said identified set of transactions is represented with an index into a data structure including a plurality of locations, each said location storing a representation of an occurrence of one of said products in one of said transactions.
41. The method of claim 1 wherein each said identified set of transactions is represented with a pointer into a data structure including a plurality of locations, each said location storing a representation of an occurrence of one of said products in one of said transactions.
42. The method of claim 1 wherein each said transaction is represented by a list of said products occurring therein.
43. The method of claim 1 wherein said step (c) is performed by computing set intersections.
44. The method of claim 1 wherein said products and said transactions are contained in, and accessed from, a computer database connected to said computer.
45. The method of claim 1 wherein said step (c) includes: (i) augmenting said previously determined associated products with an additional product to form a candidate association of products; (ii) testing whether said candidate association occurs among said identified set of transactions at least as often as said transaction threshold; and (iii) accepting, as said additional identified set of transactions, all said transactions containing said candidate association, provided that the result of the test in said step (ii) is true.
46. The method of claim 45 wherein said step (i) further includes forming a plurality of candidate associations.
47. The method of claim 45 wherein said step (c) further includes identifying said associated products common to said identified set of transactions.
48. The method of claim 47 wherein said step of identifying said associated products is performed using a datum generated during a previous repetition of said step (c).
49. The method of claim 47 wherein said step of identifying said associated products is performed by maintaining pointers to said associated products.
50. The method of claim 45 wherein said step (c) includes at least one selective reordering of a representation, of said identified set of transactions, within a data structure containing said identified set of transactions.
51. The method of claim 45 wherein said step (c) is performed in a combination of depth-first and breadth-first manners.
52. The method of claim 45 wherein said steps (c) and (d) are performed in an iterative manner.
53. The method of claim 45 wherein said step (c) further includes computing a metric for said identified set of transactions supporting said associated products.
54. The method of claim 45 wherein said products include at least one hierarchical product.
55. The method of claim 45 configured for out-of-memory operation in which the total memory required for the method exceeds an available memory of said computer.
56. The method of claim 45 configured for a parallel processing environment including N processors.
57. The method of claim 45 wherein said transactions are represented as a list of products.
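The augment, test, and accept steps recited in claim 45, combined with the set intersections of claim 43 and the least-frequent-first ordering of claim 23, can be illustrated with a short sketch. This is only one possible reading of the claims, not the patented implementation: the function name `discover_associations`, the recursive depth-first traversal, and the use of Python sets of transaction ids are all assumptions made for illustration.

```python
from collections import defaultdict


def discover_associations(transactions, threshold):
    """Return {frozenset(products): set(transaction ids)} for every product
    combination that occurs in at least `threshold` transactions.

    Hypothetical sketch: names and traversal order are illustrative only.
    """
    # For each product, record the ids of the transactions that contain it.
    tids_by_product = defaultdict(set)
    for tid, products in enumerate(transactions):
        for product in products:
            tids_by_product[product].add(tid)

    # Order products from least to most frequently occurring (cf. claim 23).
    ordered = sorted(tids_by_product, key=lambda p: (len(tids_by_product[p]), p))
    found = {}

    def extend(association, supporting, start):
        for i in range(start, len(ordered)):
            product = ordered[i]
            # (i) augment the association with one additional product;
            # (ii) test the candidate by intersecting transaction sets.
            candidate_tids = supporting & tids_by_product[product]
            if len(candidate_tids) >= threshold:
                # (iii) accept the transactions containing the candidate,
                # then repeat from within this smaller identified set.
                candidate = association | {product}
                found[candidate] = candidate_tids
                extend(candidate, candidate_tids, i + 1)

    extend(frozenset(), set(range(len(transactions))), 0)
    return found
```

Because each recursive call searches only within the transaction set accepted by its parent, a candidate that fails the threshold test is never extended further.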
58. A computer-readable medium for facilitating discovery of associations among products, various combinations of the products occurring in a plurality of transactions, the computer-readable medium comprising: (a) a first data structure including a plurality of locations;
(b) each said location used for dynamically storing a representation of an occurrence of one of said products in one of said transactions.
59. The computer-readable medium of claim 58 further comprising a plurality of additional data structures, each said additional data structure corresponding to a particular one of said products.
60. The computer-readable medium of claim 59 wherein each said additional data structure includes a plurality of elements, each said element representing certain of said plurality of transactions.
61. The computer-readable medium of claim 60 wherein each of said certain represented transactions includes one of said combinations of products including said particular product.
62. The computer-readable medium of claim 60 wherein each said additional data structure includes indices for specific ones of said locations in said first data structure.
63. The computer-readable medium of claim 60 wherein each said additional data structure includes pointers to specific ones of said locations in said first data structure.
64. The computer-readable medium of claim 59 wherein all of said corresponding additional data structures occur together in a single data structure.
65. The computer-readable medium of claim 64 wherein said single data structure is said first data structure.
66. The computer-readable medium of claim 58 wherein: (a) said products are ordered in said first data structure;
(b) said products include at least one hierarchical product; and
(c) said hierarchical product is ordered after any products contained in said hierarchical product.
67. The computer-readable medium of claim 58 wherein: (a) said products are ordered in said first data structure; and
(b) said products are ordered from least frequently occurring in said plurality of transactions to most frequently occurring in said plurality of transactions.
68. The computer-readable medium of claim 58 wherein said first data structure includes, for each said transaction, a list of products occurring in said transaction.
69. The computer-readable medium of claim 68 wherein a first product in said list has a connection to a second product in said list.
70. The computer-readable medium of claim 69 wherein said connection is a pointer.
71. The computer-readable medium of claim 69 wherein said connection is an index.
72. The computer-readable medium of claim 69 wherein said second product has a connection to said first product.
73. The computer-readable medium of claim 68 wherein at least one product in said list has a connection to a metric.
74. The method of claim 73 wherein said metric is a scalar.
75. The method of claim 73 wherein said metric is an array.
76. The computer-readable medium of claim 73 wherein said at least one product having a connection to said metric is at an end of said list.
77. The computer-readable medium of claim 73 wherein said products having connections to said metric occur periodically in said list.
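The arrangement recited in claims 58 through 62, a first data structure of locations plus per-product additional data structures holding indices into it, might look roughly like the following. This is a minimal sketch under stated assumptions: the class name `OccurrenceStore`, its field names, and the choice of Python lists and a dictionary are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class OccurrenceStore:
    """Hypothetical layout loosely following claims 58-62."""

    # First data structure: each location stores one occurrence of a
    # product in a transaction, as a (transaction id, product) pair.
    locations: list = field(default_factory=list)
    # Additional data structures, one per product: each holds indices of
    # the locations in `locations` where that product occurs.
    by_product: dict = field(default_factory=dict)

    def add_transaction(self, tid, products):
        """Append one location per product and index it under that product."""
        for product in products:
            self.by_product.setdefault(product, []).append(len(self.locations))
            self.locations.append((tid, product))

    def transactions_with(self, product):
        """Follow the stored indices back to the transaction ids."""
        return [self.locations[i][0] for i in self.by_product.get(product, [])]
```

Storing indices rather than copies means each occurrence is held once, and a set of transactions can be represented compactly by references into the first data structure, as claims 40 and 62 suggest.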
78. An apparatus for discovering associations among products, various combinations of the products occurring among a plurality of transactions, comprising: (a) an input module configured to accept a numerical transaction threshold; (b) an identification module configured to identify a set of transactions, each transaction in said set including one or more associated products in common, said set having a size at least as large as said transaction threshold; (c) control logic configured to determine from within said identified set an additional set of transactions, said additional set having at least as many members as said transaction threshold, each transaction in said additional set including in common: (i) one or more previously determined associated products, and (ii) at least one additional associated product; and (d) said control logic being repeatedly invokable to determine a desired number of sets of transactions.
79. The apparatus of claim 78 wherein said control logic includes logic configured to identify said associated products common to said identified set of transactions.
80. The apparatus of claim 79 wherein said logic configured to identify said associated products is configured to use a datum generated during a previous invocation of said control logic.
81. The apparatus of claim 79 wherein said logic configured to identify said associated products is configured to maintain pointers to said associated products.
82. The apparatus of claim 79 further including logic configured to determine an association of products that is contained in no other identified association of products.
83. The apparatus of claim 78 wherein said control logic includes ordering logic configured to selectively reorder a representation, of said identified set of transactions, within a data structure containing said identified set of transactions.
84. The apparatus of claim 78 wherein said control logic is configured to operate in a breadth-first manner.
85. The apparatus of claim 78 wherein said control logic is configured to operate in a depth-first manner.
86. The apparatus of claim 78 wherein said control logic is configured to operate in a combination of depth-first and breadth-first manners.
87. The apparatus of claim 78 wherein said control logic is configured for recursive operation.
88. The apparatus of claim 78 wherein said control logic is configured for iterative operation.
89. The apparatus of claim 78 wherein said control logic further includes measurement logic configured to compute a metric for said identified set of transactions supporting said associated products.
90. The apparatus of claim 89 configured to represent each of said transactions as a list of products in said transaction, and wherein at least one product in said list is connectable to a transaction metric used to compute said computed metric.
91. The apparatus of claim 90 wherein said at least one product is at an end of said list.
92. The apparatus of claim 90 wherein said products being connectable to said transaction metric occur periodically in said list.
93. The apparatus of claim 89 wherein said measurement logic is configured for static computation of said metric.
94. The apparatus of claim 89 wherein said measurement logic is configured for dynamic computation of said metric.
95. The apparatus of claim 89 wherein said measurement logic is operable in a cascaded manner.
96. The apparatus of claim 89 wherein said measurement logic is configured to perform a scalar operation.
97. The apparatus of claim 89 wherein said measurement logic is configured to perform a vector operation.
98. The apparatus of claim 78 wherein said products include at least one hierarchical product.
99. The apparatus of claim 78 configured for out-of-memory operation in which the total memory required for operation thereof exceeds a memory available thereto.
100. The apparatus of claim 99 • further comprising a partitioning module for (i) creating a plurality of N sub-processes, each sub-process requiring a memory less than said available memory and (ii) replacing said transaction threshold by its value divided by N; and • an iteration module for executing said elements (b) - (d) for each of said sub-processes in turn and, for each said sub-process, determining a plurality of candidate associations, each said candidate association including all said associated products common to a corresponding one of said identified transactions within said sub-process.
101. The apparatus of claim 100 further comprising aggregation logic configured to determine a subset of said plurality of candidate associations, each member of said subset cumulatively occurring, among all of said transactions for all of said sub-processes, at least as many times as said transaction threshold acceptable by said element (a).
102. The apparatus of claim 78 configured for use in a distributed computing environment.
103. The apparatus of claim 102 wherein said distributed computing environment includes a parallel processing environment having N processors.
104. The apparatus of claim 103 • further comprising a partitioning module for (i) creating N sub-processes corresponding to said N processors and (ii) replacing said transaction threshold by its value divided by N; and • an iteration module for executing said elements (b) - (d) for each of said sub-processes on its corresponding processor and, for each said sub-process, recording a plurality of candidate associations, each said candidate association including all said associated products common to a corresponding one of said identified transactions within said sub-process.
105. The apparatus of claim 104 further comprising aggregation logic configured to determine a subset of said plurality of candidate associations, each member of said subset cumulatively occurring, among all of said transactions for all of said sub-processes, at least as many times as said transaction threshold acceptable by said element (a).
106. The apparatus of claim 78 implemented such that a memory requirement thereof is independent of the number of said associations among said products.
107. The apparatus of claim 78 configured to represent each said transaction by a list of said products occurring therein.
108. The apparatus of claim 78 wherein said control logic is configured to operate by computing set intersections.
109. The apparatus of claim 78 configured to access said products and said transactions from a computer database connected thereto.
110. The apparatus of claim 78 wherein said control logic includes: (i) augmentation logic configured to augment said previously determined associated products with an additional product to form a candidate association of products; (ii) testing logic configured to test whether said candidate association occurs among said identified set of transactions at least as often as said transaction threshold; and (iii) acceptance logic configured to accept, as said additional identified set of transactions, all said transactions containing said candidate association, provided that the result of said test is true.
111. The apparatus of claim 110 wherein said augmentation logic further includes logic configured to form a plurality of candidate associations.
112. The apparatus of claim 110 wherein said control logic further includes logic configured to identify said associated products common to said identified set of transactions.
113. The apparatus of claim 112 wherein said logic configured to identify said associated products is configured to use a datum generated during a previous invocation of said control logic configured to determine said additional set of transactions.
114. The apparatus of claim 112 wherein said logic configured to identify said associated products is configured to maintain pointers to said associated products.
115. The apparatus of claim 110 wherein said control logic includes ordering logic configured to selectively reorder a representation, of said identified set of transactions, within a data structure containing said identified set of transactions.
116. The apparatus of claim 110 wherein said control logic operates in a combination of depth-first and breadth-first manners.
117. The apparatus of claim 110 wherein said control logic is configured for iterative operation.
118. The apparatus of claim 110 wherein said control logic further includes measurement logic configured to compute a metric for said identified set of transactions supporting said associated products.
119. The apparatus of claim 110 configured for use in a parallel processing environment having N processors.
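The partitioning scheme shared by claims 25 and 26, 31 and 32, 100 and 101, and 104 and 105 (divide the threshold by N, collect candidate associations per sub-process, then keep the candidates that cumulatively meet the original threshold) can be sketched as follows. The function names, the round-robin split, and the brute-force local enumeration are assumptions chosen to keep the example short. For simplicity every locally frequent product set is treated as a candidate, which is broader than the candidate definition recited in the claims.

```python
from itertools import combinations


def local_candidates(partition, local_threshold):
    """Product sets occurring at least `local_threshold` times in a partition.

    Brute-force enumeration, used here only for illustration; it is
    exponential in transaction length.
    """
    counts = {}
    for products in partition:
        items = sorted(set(products))
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[combo] = counts.get(combo, 0) + 1
    return {frozenset(c) for c, n in counts.items() if n >= local_threshold}


def partitioned_associations(transactions, threshold, n_parts):
    """Hypothetical sketch of the divide-by-N partitioning and aggregation."""
    # Replace the transaction threshold by its value divided by N.
    local_threshold = threshold / n_parts
    # Split the transactions into N partitions (round-robin, an assumption).
    partitions = [transactions[i::n_parts] for i in range(n_parts)]

    # Run each sub-process in turn and pool its candidate associations.
    candidates = set()
    for partition in partitions:
        candidates |= local_candidates(partition, local_threshold)

    # Aggregation: keep candidates whose cumulative support over all
    # transactions reaches the originally accepted threshold.
    result = {}
    for candidate in candidates:
        support = sum(1 for products in transactions if candidate <= set(products))
        if support >= threshold:
            result[candidate] = support
    return result
```

The reduced threshold is safe because a product set that occurs at least T times across N partitions must occur at least T/N times in at least one of them, so it is always found as a candidate before the aggregation pass.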
PCT/US1999/002525 1998-02-03 1999-02-03 Method and apparatus for associations discovery WO1999039295A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU25871/99A AU2587199A (en) 1998-02-03 1999-02-03 Method and apparatus for associations discovery

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US1821598A 1998-02-03 1998-02-03
US09/018,215 1998-02-03

Publications (1)

Publication Number Publication Date
WO1999039295A1 (en) 1999-08-05

Family

ID=21786832

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/002525 WO1999039295A1 (en) 1998-02-03 1999-02-03 Method and apparatus for associations discovery

Country Status (2)

Country Link
AU (1) AU2587199A (en)
WO (1) WO1999039295A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4908761A (en) * 1988-09-16 1990-03-13 Innovare Resourceful Marketing Group, Inc. System for identifying heavy product purchasers who regularly use manufacturers' purchase incentives and predicting consumer promotional behavior response patterns
US5615341A (en) * 1995-05-08 1997-03-25 International Business Machines Corporation System and method for mining generalized association rules in databases
US5794209A (en) * 1995-03-31 1998-08-11 International Business Machines Corporation System and method for quickly mining association rules in databases
US5884305A (en) * 1997-06-13 1999-03-16 International Business Machines Corporation System and method for data mining from relational data by sieving through iterated relational reinforcement

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4908761A (en) * 1988-09-16 1990-03-13 Innovare Resourceful Marketing Group, Inc. System for identifying heavy product purchasers who regularly use manufacturers' purchase incentives and predicting consumer promotional behavior response patterns
US5794209A (en) * 1995-03-31 1998-08-11 International Business Machines Corporation System and method for quickly mining association rules in databases
US5615341A (en) * 1995-05-08 1997-03-25 International Business Machines Corporation System and method for mining generalized association rules in databases
US5884305A (en) * 1997-06-13 1999-03-16 International Business Machines Corporation System and method for data mining from relational data by sieving through iterated relational reinforcement

Also Published As

Publication number Publication date
AU2587199A (en) 1999-08-16

Similar Documents

Publication Publication Date Title
Padmanabhan et al. Unexpectedness as a measure of interestingness in knowledge discovery
Srikant et al. Mining generalized association rules
US7433879B1 (en) Attribute based association rule mining
US5920855A (en) On-line mining of association rules
US6236985B1 (en) System and method for searching databases with applications such as peer groups, collaborative filtering, and e-commerce
US6173280B1 (en) Method and apparatus for generating weighted association rules
Spiliopoulou et al. Data mining for measuring and improving the success of web sites
US6061682A (en) Method and apparatus for mining association rules having item constraints
Maimon et al. Introduction to knowledge discovery in databases
US20060206516A1 (en) Keyword generation method and apparatus
US7809666B2 (en) Method and system for sequential compilation and execution of rules
CA2451076A1 (en) Method of facilitating database access
Pillai et al. User centric approach to itemset utility mining in Market Basket Analysis
Bora Data mining and ware housing
Hilderman et al. Mining market basket data using share measures and characterized itemsets
Yang et al. GHIC: A hierarchical pattern-based clustering algorithm for grouping Web transactions
Zhang et al. Measuring customer similarity and identifying cross-selling products by community detection
US20040049504A1 (en) System and method for exploring mining spaces with multiple attributes
WO1999039295A1 (en) Method and apparatus for associations discovery
Cios et al. Unsupervised learning: association rules
Pradhan et al. Product bundling for ‘efficient’ vs ‘non-efficient’ customers: Market basket analysis employing genetic algorithm
Zhou et al. Raising, to enhance rule mining in web marketing with the use of an ontology
Dinh et al. Mining compact high utility sequential patterns
Giannotti et al. Integration of Deduction and Induction for Mining Supermarket Sales Data.
Wang et al. Efficient roll-up and drill-down analysis in relational database

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AL AM AT AU AZ BA BB BG BR BY CA CH CN CU CZ DE DK EE ES FI GB GE GH GM HR HU ID IL IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG UZ VN YU ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW SD SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
NENP Non-entry into the national phase

Ref country code: KR

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase