WO2003021477A2

WO2003021477A2 - A sampling approach for data mining of association rules

Info

Publication number: WO2003021477A2
Application number: PCT/EP2002/008335
Authority: WO
Inventors: Frank Beekmann; Roland Grund; Andreas Rudolph
Original assignee: International Business Machines Corporation; Ibm Deutschland Gmbh
Priority date: 2001-09-04
Filing date: 2002-07-26
Publication date: 2003-03-13
Also published as: JP2005502130A; CN1578955A; US7289984B2; IL160731A0; US20050027663A1; US7668793B2; WO2003021477A3; CA2459758A1; KR20040029157A; US20080147688A1

Abstract

The current invention relates to a data mining technology for determining association rules within a multitude of N transactions each transaction comprising up to p different items. According to the invention a sample size n of the multitude of N transactions is determined based on precision requirements. The sample size n is chosen such, that it is at least in the order of magnitude of an estimated sample size n*. Finally association rules are computed based on a sample of the multitude of N transactions with sample size n according to any methodology for mining of association rules using the association rules as estimated association rules of the multitude of N transactions.

Description

D E S C R I P T I O N

A Sampling Approach For Data Mining Of Association Rules

1. Background of the Invention

1.1 Field of the Invention

The present invention relates generally to a method, system and program product for uncovering relationships or association rules between items in large databases.

1.2 Description and Disadvantages of Prior Art

Data mining is an emerging technical area whose goal is to extract significant patterns or interesting rules from large databases; in general, the area of data mining comprises all methods which are applicable to extract "knowledge" from large amounts of existing data. The whole process is known as knowledge discovery in databases. Finding association rules is one task for which data mining methods have been developed.

Association rule mining was introduced by Agrawal et al. (refer for instance to R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in Proc. 20th VLDB Conf., Sept. 1994) and was motivated by shopping basket analysis. The rules were generated to find out which articles or items in a shop are bought together. More generally, association rules can be used to discover dependencies among attribute values of records in a database. More specifically, basket data usually consists of a record per customer with a transaction date, along with the items bought by the customer. An example of an association rule over such a database could be that 80% of the customers that bought bread and milk also bought eggs. The data mining task for association rules can be broken into two steps. The first step consists of finding all the sets of items, called itemsets, that occur in the database with a certain user-specified frequency, called minimum support. Such itemsets are called large itemsets. An itemset of k items is called a k-itemset. The second step consists of forming implication rules among the large itemsets found in the first step.

Several algorithms have been developed to generate association rules efficiently. The well known and very successful APRIORI algorithm has been disclosed by Agrawal et al., for instance in the above mentioned document. The most important value with which association rules are measured is the support value, which is the relative frequency of occurrence of one item or several items together in one rule.

Today generating association rules in case of very large data bases (several million records and above) can be extremely time consuming. Many algorithms proposed for data mining of association rules make repeated passes over the database to determine the commonly occurring itemsets (or sets of items). For large databases, the I/O overhead in scanning the database can be extremely high. Processing time is not only required for executing the mining algorithms themselves. A lot of time is also spent during the preprocessing steps. This includes the processing time for import of data and also processing time for transforming data for applying the algorithm. This preparation can take several hours of expensive CPU-time even in case of large MVS-systems. To improve performance it has been suggested, instead of taking the whole database for the generation of association rules, just to draw a sample and generate the association rules on that basis. This teaching has been introduced by H. Toivonen, Sampling Large Databases for Association Rules, Proceedings of the 22nd VLDB Conference, Mumbai (Bombay), India, 1996, as well as Zaki, M. J., Parthasarathy, S., Li, W., Ogihara, M., Evaluation of Sampling for Data Mining of Association Rules, Computer Science Department, Technical Report 617, University of Rochester (1996).

Toivonen et al. stated an algorithm for detecting "exact" (not being based on some sample) association rules. Within this teaching sampling has been used only for the precalculation of the support values of the rules as one step in the algorithm; Toivonen et al. are completely mute about the idea of data mining for "estimated" (approximate) association rules based on some sample. Toivonen et al. also disclosed necessary bounds for sample sizes. Using a univariate approach the support value of an arbitrary association rule has been estimated. Toivonen et al. calculated the probability that an error between the true support value and the estimated support value exceeds a given threshold by using the binomial distribution and applying Chernoff bounds. With this they derived a formula for a sufficient sample size.

Zaki et al. took this idea up and published these bounds for approximate association rules generated under sampling. These bounds were also calculated using the univariate approach suggested by Toivonen including Chernoff bounds. These investigations showed that the bounds are not very efficient, since the required sample size can be very large. As shown by Zaki et al. the required sample sizes can even become greater than the original database (!). Thus the current state of the art teaching is completely unsatisfactory and actually cannot be applied to real world problems.

Therefore, in principle the approach of data mining for association rules based on samples would save processing time in the preprocessing step as well as in the analysis phase. But the fundamental problem which occurs is the accuracy of the generated association rules. If the sample is suitably chosen it is possible to estimate the error which appears by this approach. This error can be controlled by calculating sufficiently large sample sizes. But currently it is completely unclear how to determine reasonable sample sizes.

1.3 Objective of the Invention

The invention is based on the objective to improve the performance of the technologies for data mining of association rules.

2. Summary and Advantages of the Invention

The objectives of the invention are solved by the independent claims. Further advantageous arrangements and embodiments of the invention are set forth in the respective subclaims.

The current invention relates to a data mining technology for determining association rules within a multitude of N transactions, each transaction comprising up to p different items. According to the invention a sample size n of the multitude of N transactions is determined based on precision requirements. The sample size n is chosen such that it is at least in the order of magnitude of an estimated sample size n*.

Finally association rules are computed based on a sample of the multitude of N transactions with sample size n according to any methodology for mining of association rules, using the association rules as estimated association rules of the multitude of N transactions.

The important concept underlying all embodiments of the current invention is the observation that much smaller sample sizes can be determined, which at the same time satisfy the required precision requirements, if further parameters characterizing the multitude of transactions are introduced in the sample size determination. This is in strict contrast to the state of the art knowledge (refer for instance to the above mentioned Chernoff bounds) wherein no properties of the multitude of transactions are being used. As such characterizing properties the current invention suggests to use the size N of the multitude of transactions or the number p of different items occurring within the transactions. Of course, once formulas for these sample sizes including such characterizing properties have been determined additional approximation techniques may be applied to eliminate these characterizing properties again. Even sample sizes based on such additional approximations represent a significant improvement over all state-of-the-art estimations.

As can be seen from examples, the sample sizes determined according to the current invention are much lower than the number of the original transactions and much lower than those of the known state of the art approaches. Therefore, the current teaching results in very significant performance improvements for data mining of association rules.

3. Brief Description of the Drawings

Figure 1 shows an example of a confidence ellipsoid with p=2.

Figure 2 shows an example of a confidence ellipsoid with p=3.

Figure 3 visualizes the process flow for sampling of association rules in multivariate case. This process flow could be applied also to the univariate model accordingly without any further problem.

Figure 4 depicts a distributed processing model for mining of association rules.

4. Description of the Preferred Embodiment

In the drawings and specification there has been set forth a preferred embodiment of the invention and, although specific terms are used, the description thus given uses terminology in a generic and descriptive sense only and not for purposes of limitation. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims.

The present invention can be realized in hardware, software, or a combination of hardware and software. Any kind of computer system - or other apparatus adapted for carrying out the methods described herein - is suited. A typical combination of hardware and software could be a general-purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein. The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which - when being loaded in a computer system - is able to carry out these methods.

Computer program means or computer program in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form.

Within the current specification a transaction record, or transaction for short, refers to just a tuple of items; of course it is not required that such a record has been part of any computer transaction. The wording of transaction records is used for historical reasons only. Moreover an item may be represented by any type of attribute not necessarily related to an item within the real world.

4.1 Introduction

In the area of data mining so called association rules are a methodology to uncover unknown relations or rules in usually very large data sets. This methodology proceeds as follows. A set of so called items is given. These items could be purchases from supermarket basket data. Subsets of this set of items are so called transactions, for example beer and crisps as one transaction, whereas another transaction could consist of bread and butter. A set of items is often called an itemset too. Therefore every transaction contains an itemset.

More formally the problem of mining association rules over basket data, that is a multitude of N transactions, can be stated as follows.

Let I = {i₁, i₂, ..., i_p} be a set of p distinct attribute values, also called items. Each transaction T in the database D of transactions (that is a multitude of N transactions) has a unique identifier TID and contains a set of items, such that T ⊆ I. An association rule is an expression A => B, where itemsets A, B ⊂ I, and A ∩ B = ∅. Each itemset is said to have a support s if s% of the transactions in D contain the itemset (thus, the support measure represents a relative frequency). The association rule is said to have confidence c if c% of the transactions that contain A also contain B, i.e., c = support(A ∪ B)/support(A), i.e., the conditional probability that transactions contain the itemset B, given that they contain itemset A. For example, we may have that 80% of the customers that bought bread and milk also bought eggs. The number 80% is the confidence of the rule; the support of the rule is support(A ∪ B). Data mining of association rules from such databases consists of finding the set of all such rules which meet the user-specified minimum confidence and support values.
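The two measures defined above can be illustrated with a minimal sketch. The toy basket data and the function names `support` and `confidence` are hypothetical and chosen only for illustration; transactions are modelled as sets of item names.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Relative frequency of the transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(a: Iterable[str], b: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Confidence of the rule A => B, i.e. support(A ∪ B) / support(A)."""
    a_set, b_set = frozenset(a), frozenset(b)
    s_a = support(a_set, transactions)
    if s_a == 0:
        raise ValueError("itemset A does not occur in any transaction")
    return support(a_set | b_set, transactions) / s_a


# Hypothetical toy basket data (not taken from the specification).
transactions = [
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"beer", "crisps"}),
]

rule_support = support({"bread", "milk", "eggs"}, transactions)
rule_confidence = confidence({"bread", "milk"}, {"eggs"}, transactions)
print(rule_support, rule_confidence)
```

In this toy data three of five transactions contain bread and milk, and two of those also contain eggs, so the rule {bread, milk} => {eggs} has support 2/5 and confidence 2/3.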

The task of data mining for association rules can be broken into two steps:

1. Find all the large k-itemsets for k = 1, 2, ...

2. Generate rules from these large itemsets. Given that X is a large k-itemset, for every non-empty subset A ⊂ X, a rule of the form A => B is generated, where B = X−A, provided that this rule has the required confidence.
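The two steps can be sketched with a brute-force enumeration. This is not the APRIORI algorithm and is only practical for a handful of items; the function names and the toy data are hypothetical, and the sketch assumes that A is a proper subset of X so that B is non-empty.

```python
from itertools import combinations


def large_itemsets(transactions, min_support):
    """Step 1 (brute force): map every itemset with support >= min_support to its support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, k):
            c = frozenset(candidate)
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                result[c] = s
                found = True
        if not found:
            break  # without a large k-itemset there can be no large (k+1)-itemset
    return result


def generate_rules(large, min_confidence):
    """Step 2: for each large itemset X and proper non-empty subset A, emit A => X - A."""
    rules = []
    for x, s_x in large.items():
        for r in range(1, len(x)):
            for a in combinations(sorted(x), r):
                a_set = frozenset(a)
                conf = s_x / large[a_set]  # every subset of a large itemset is large
                if conf >= min_confidence:
                    rules.append((a_set, x - a_set, s_x, conf))
    return rules


# Hypothetical toy basket data (not taken from the specification).
transactions = [
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"beer", "crisps"}),
]

large = large_itemsets(transactions, min_support=0.4)
rules = generate_rules(large, min_confidence=0.8)
for a, b, s, c in rules:
    print(sorted(a), "=>", sorted(b), f"support={s:.2f} confidence={c:.2f}")
```

The lookup `large[a_set]` relies on the fact that the support of a subset is at least the support of the superset, which is the same property that allows the early exit in step 1.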

Note that all above mentioned measures are determined based on the set of all transactions under consideration.

For the purpose of performance improvements one could at this point think about estimating these desired measures by a well chosen sample out of the set of all transactions. In the language of sampling theory the set of all transactions would correspond to the population, whose properties (i.e. relative frequencies and other measures) should be estimated by a well chosen sample. Limited to this sample one could calculate the relative frequencies of the desired events as estimators for the relative frequencies of these events on the population.

Consequently one then has to solve the following problems:

1. how should the sample be chosen such that possible systematic biases which could be caused by a possible order in the database could be eliminated as far as possible?

2. most important: how should the sample size be chosen to guarantee a desired precision of the estimators (in our case of the relative frequencies)?

3. how should the precision of the estimators be understood?

The first question aims to eliminate possible systematic errors. For instance, selecting every n-th record from a database could cause such serious systematic errors. The second question deals with how many transactions should be taken from the whole population, that is the sample size. Intuitively it is quite clear that this problem is related to the precision which can be achieved with the sample. This means that a sample of 100 transactions can guarantee a much lower precision of the estimators than a sample of 10000 transactions.

The third question deals with the following: Assume the set of all transactions contains 1,000,000 transactions. If one takes a sample of 100 transactions at random one can calculate the relative frequency of an item (say) A as an estimator for the relative frequency of item A on the whole population. If one takes at random a second sample of 100 transactions, one can also calculate the relative frequency of item A based on this second sample as an estimator, but the two calculated frequencies will generally differ. If one repeats this procedure several hundred times then the relative frequencies calculated in this way will more or less scatter around the relative frequency of item A on the whole population.
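The scatter described here can be made visible with a small simulation. The sketch below is an illustration only: the population (item A present in 30% of 1,000,000 transactions), the number of repetitions and the function name `simulate_estimates` are assumptions, not values from the specification.

```python
import random
import statistics


def simulate_estimates(population, sample_size, repetitions, seed=0):
    """Draw `repetitions` samples without replacement and return the
    relative frequency of item A (coded as 1) observed in each sample."""
    rng = random.Random(seed)
    return [
        sum(rng.sample(population, sample_size)) / sample_size
        for _ in range(repetitions)
    ]


# Hypothetical population: 1 marks a transaction containing item A.
population = [1] * 300_000 + [0] * 700_000
estimates = simulate_estimates(population, sample_size=100, repetitions=300)

print("mean of the estimates:", statistics.mean(estimates))
print("standard deviation   :", statistics.stdev(estimates))
print("smallest / largest   :", min(estimates), max(estimates))
```

The individual estimates differ from sample to sample, while their mean lies close to the population frequency of 0.3; the standard deviation quantifies the scatter that the precision requirement has to control.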

4.2 Sampling Methods

One of the most well known sampling schemes is the so called Bernoulli sampling. This assumes that the data is given for example as a sequential file or data base where the records can be numbered from 1 to N and the data base can be traversed along this ordering. Given a probability π with which every element can be chosen, this sampling scheme works as follows: For the i-th element a random experiment is made where this element is chosen with probability π. This can be done by generating a random number on the interval (0,1), where the i-th element is taken if the considered random number is smaller than π; otherwise this element is rejected.
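A minimal sketch of Bernoulli sampling as described above. The function name `bernoulli_sample` is hypothetical; Python's `random.random()` returns values in [0, 1) rather than the open interval (0,1), which does not change the scheme in practice.

```python
import random


def bernoulli_sample(records, pi, seed=None):
    """Traverse the records once and keep each one independently with probability pi."""
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie in [0, 1]")
    rng = random.Random(seed)
    return [record for record in records if rng.random() < pi]


# Hypothetical usage: sample roughly 10% of 100,000 record numbers.
sample = bernoulli_sample(range(1, 100_001), pi=0.1, seed=1)
print(len(sample))  # varies around N * pi = 10,000 from run to run
```

The size of the returned sample is itself random, which is the drawback of this scheme discussed later in this section.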

Remark: In case of so called linear congruential random number generators it is quite important that the period of such a generator is large enough. For example, with a period of 5 the random numbers out of the interval (0,1) repeat after 5 draws, which is of course not very sensible. As a rule of thumb one requires that the period L and the number N of the population should satisfy:

Due to this sampling scheme the sample size is not a fixed number but, in the sense of probability theory, a random variable whose statistical parameters, expectation E and variance Var, are given as follows:

$E_{BE}(n) = N\pi$ and $Var_{BE}(n) = N\pi(1-\pi)$

Additionally it is possible to calculate a confidence interval for a prescribed confidence level (1 - α) where α is the specified probability of an error for the sample size n by the formula:

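Here $u_{1-\alpha/2}$ is the percentile of the standard normal distribution (N(0,1)-distribution) for a given probability α. Given a probability α, the percentile $u_{1-\alpha/2}$ is the value which is exceeded with probability α/2 by the unit normal distributed random variable X (abbreviated by N(0,1)). Hence

$$P(X > u_{1-\alpha/2}) = \frac{\alpha}{2}$$

where the density d of X (as N(0,1)-distributed random variable) is

$$d(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2}$$

(in this density π denotes the circle constant, not the selection probability).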

The inclusion probability $\pi_i$ of the i-th observation to be put into the sample is then given by:

$\pi_i = \pi$

which means that for every element this probability is constant and equal to the prescribed probability π.

The inclusion probability $\pi_{ij}$ for the i-th as well as the j-th element to be put simultaneously into the sample is in this case equal to π².

The big advantage of the sampling scheme consists in the fact that it can be easily implemented on a computer. The disadvantage consists in the fact that the sample size is no longer a fixed quantity but a random variable.

If one is interested in a fixed quantity for the sample size in advance then one has to use so called simple random sampling. This is an additional sampling scheme for which the inclusion probability for every observation to be put into the sample of size n is constant:

$$\pi_i = \frac{n}{N}$$

if n is the sample size and N the size of the population.

For the inclusion probability $\pi_{ij}$ that the i-th as well as the j-th element are put simultaneously into the sample one gets:

$$\pi_{ij} = \frac{n(n-1)}{N(N-1)}$$

Such a sampling scheme can be implemented as follows. Assume that $\varepsilon_1, \varepsilon_2, \ldots$ are independent random variables uniformly distributed on the interval [0,1].

1. If $\varepsilon_1 < \frac{n}{N}$ then take the element k=1, otherwise don't.

2. For the following elements k=2, 3, ... assume that $n_k$ is the number of the elements already taken among the first k-1 elements of the whole population. If $\varepsilon_k < \frac{n - n_k}{N - k + 1}$ for the k-th random number $\varepsilon_k$, then the k-th element is taken, otherwise not.

3. The procedure terminates if $n_k = n$.

The disadvantage of such a sampling scheme consists in the fact that the number of the already chosen elements has to be saved. On the other hand we have the advantage that now the sampling procedure terminates with the desired number of elements.
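The three steps of the sequential simple random sampling scheme can be sketched as follows. The function name `simple_random_sample` is hypothetical, and the records are assumed to be held in a sequence of known length N.

```python
import random


def simple_random_sample(records, n, seed=None):
    """Sequentially draw exactly n of the N records.

    The k-th record is taken if a uniform random number is smaller than
    (n - n_k) / (N - k + 1), where n_k counts the records already taken.
    """
    N = len(records)
    if not 0 <= n <= N:
        raise ValueError("sample size must satisfy 0 <= n <= N")
    rng = random.Random(seed)
    chosen = []
    for k, record in enumerate(records, start=1):
        if len(chosen) == n:  # step 3: the desired sample size is reached
            break
        if rng.random() < (n - len(chosen)) / (N - k + 1):
            chosen.append(record)
    return chosen


# Hypothetical usage: draw exactly 50 of 1,000 record numbers.
sample = simple_random_sample(list(range(1, 1001)), n=50, seed=3)
print(len(sample))
```

When the number of records still needed equals the number of records left, the selection probability becomes 1, so the procedure always ends with exactly n elements. Only the counter of already chosen elements has to be kept as state.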

4.3 A Sampling Approach For Data Mining Of Association Rules

This invention describes a methodology for computing association rules based on a sample instead of the whole population. It is suggested to use these association rules as estimated association rules of the whole population of transactions. Since the actual methodology for mining of association rules can then be limited to the sample, significant performance improvements are achieved.

The important feature of the proposed invention is the technique for determining the sample size, while at the same time achieving prescribed precision requirements. As mentioned above already in contrast to the current state-of-the-art the essential concept of the current invention is the observation that much smaller sample sizes can be determined, which at the same time satisfy the required precision requirements, if further parameters characterizing the multitude of transactions are introduced in the sample size determination. In one embodiment of the current invention it is suggested to use as such characterizing property the size N of the multitude of transactions, in another embodiment of the current invention the number p of different items occurring within the transactions is used as characterizing property. Of course, once formulas for these sample sizes including such characterizing properties have been determined additional approximation techniques may be applied to eliminate these characterizing properties again. Even sample sizes based on such additional approximations represent a significant improvement over all state-of-the-art estimations. Or in other words: the significant advantage of achieving smaller sample sizes by taking into account further parameters characterizing the multitude of transactions will not be lost even if afterwards additional approximations eliminate these parameters again.

Estimations (for instance estimations for the support values) calculated according to the current state-of-the-art are done only based on a univariate analysis. A univariate analysis means that only a single value is estimated. In contrast to that, one embodiment of the current invention suggests a completely new approach by applying a multivariate estimation analysis. A multivariate analysis means that a vector is undergoing an estimation analysis wherein each component of that vector is an estimator and wherein all components are estimated simultaneously. For instance with respect to the support values the idea of this approach is to have a sample size which estimates the support of all of the single items to a specified correctness simultaneously. The proposed multivariate approach is based on confidence ellipsoids to determine the necessary sample sizes and has several advantages. The basic idea behind this method is that not only one support value per item is estimated but all support values simultaneously. If this vector of support values is close enough to the true vector of support values, the structure in the data is also valid in the sample and therefore the rules will have good precision; or in other words, the sample then contains the same structure as the population and hence the same rules.

Such an embodiment based on a multivariate analysis is not even indicated within the teachings of Zaki et al. or Toivonen et al.

Additionally it is described how to select the records in a database such that these records are selected as randomly as possible.

4.3.1 Univariate Model

The basic concept is that the support value of any rule R can be seen as a relative frequency. This value can be approximately measured by an estimator as follows.

Suppose the whole database consists of N sequentially ordered elements (where each element is represented by a record). For each element we can construct a binary attribute which is 1 if the element supports the rule, i.e. when the item(s) of a rule appear(s) in the record, and 0 if the rule is not supported by the element. The mean value of this binary attribute is the support (denoted by p). When drawing a sample without replacement, an unbiased estimator (an estimator is unbiased if the expectation value of the estimator equals the parameter which shall be estimated) for this support value is the mean of the binary attribute measured over all elements in the sample (this mean is denoted by p̂). Moreover a confidence interval for the support value can be constructed. The idea behind a confidence interval is the following. The estimator drawn from the sample will approximate the true value, which means the estimator will not take on the same value every time. But if we draw a lot of samples and calculate an estimator for each, these values are spread around the true value. We now try to find an interval around the calculated estimator such that we know that the true value is in this interval with a given probability 1-α. According to our sampling method and the kind of the estimator, the following formula for constructing a confidence interval can be deduced:

where p̂ is the estimator, $u_{1-\alpha/2}$ the percentile of the standard normal distribution, N the size of the whole population and n the sample size.

Given a probability α, the percentile $u_{1-\alpha/2}$ is the value which is exceeded with probability α/2 by the unit normal distributed random variable X (abbreviated by N(0,1)). Hence

$$P(X > u_{1-\alpha/2}) = \frac{\alpha}{2}$$

where the density d of X (as N(0,1)-distributed random variable) is

$$d(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^{2}/2}$$

That means if we calculate such an interval for a rule, the true value is covered by this interval with probability 1-α. The formula shows that the length of this interval can be controlled by the sample size: the larger the sample size, the shorter the confidence interval. This relation can be used to calculate the sample size when the maximal length of the confidence interval is given.

The problem is that one cannot use the above formula directly since the value p̂ is not known before the sample is drawn. Therefore one needs to consider the following formula where the second term of the sum is substituted. (The difference is that the first formula uses an estimator for the variance of p̂, whereas the second formula uses the true variance of p̂.)

There are two possibilities to define the length of the confidence intervals. One is to give a relative error, the other is to give an absolute error from the true value. Both possibilities are described below.

A user will state that the maximal approximation error should be a factor δ of the true value, hence a relative error. With that we can use the following known formula to calculate the sample size:

$$n = \frac{u_{1-\alpha/2}^{2}\, N\, p(1-p)}{(N-1)\,\delta^{2} p^{2} + u_{1-\alpha/2}^{2}\, p(1-p)} \qquad \text{(eq. 1)}$$
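The step from the confidence interval to eq. 1 can be sketched as follows. The sketch assumes, as is usual for sampling without replacement, that the half-width of the interval based on the true variance is $u_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}\cdot\frac{N-n}{N-1}}$; this form is an assumption, since the interval formula itself is not reproduced in the text. Requiring the half-width to equal the relative error bound δp and solving for n gives:

```latex
\begin{align*}
\delta\, p &= u_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}\cdot\frac{N-n}{N-1}} \\
n\,(N-1)\,\delta^{2} p^{2} &= u_{1-\alpha/2}^{2}\; p(1-p)\,(N-n) \\
n\left[(N-1)\,\delta^{2} p^{2} + u_{1-\alpha/2}^{2}\; p(1-p)\right] &= u_{1-\alpha/2}^{2}\; N\, p(1-p) \\
n &= \frac{u_{1-\alpha/2}^{2}\; N\, p(1-p)}{(N-1)\,\delta^{2} p^{2} + u_{1-\alpha/2}^{2}\; p(1-p)}
\end{align*}
```

For N much larger than n the factor (N-n)/(N-1) tends to 1, and the last line reduces to $n \approx u_{1-\alpha/2}^{2}(1-p)/(\delta^{2} p)$, which no longer depends on N.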

The problem in this formula is that we need to know the true value p. Seeing that this value exceeds a given threshold Minsup and that the function is decreasing in p, we can determine a bound for the minimum sample size by using the formula:

$$n = \frac{u_{1-\alpha/2}^{2}\, N\, \mathit{Minsup}\,(1-\mathit{Minsup})}{(N-1)\,\delta^{2}\, \mathit{Minsup}^{2} + u_{1-\alpha/2}^{2}\, \mathit{Minsup}\,(1-\mathit{Minsup})} \qquad \text{(eq. 2)}$$

The following example shall demonstrate the use of the formula:

Given a database of 4000000 records. The specified Minsup value is 0.01 and the estimator for a rule R shall not differ from the true value by more than 1% with a probability of 90%. Then it is necessary to draw a sample of 1415204 elements.

Zaki suggested using the following formulae for bounding the probability that the estimated value p̂ derived from a sample of size n is less {greater} than (1-δ) {(1+δ)} times the true value p, which means that the estimate deviates from p by a relative error of at least δ:

$$P\big(\hat{p} \le (1-\delta)\,p\big) \le \exp\!\left(-\delta^{2} n p / 2\right)$$

$$P\big(\hat{p} \ge (1+\delta)\,p\big) \le \exp\!\left(-\delta^{2} n p / 3\right)$$

From that he derived necessary sample sizes by comparing the right hand sides with a given error probability α. By that he obtains:

$$n = \frac{-2\ln(\alpha)}{p\,\delta^{2}} \quad \text{for the lower bound}$$

$$n = \frac{-3\ln(\alpha)}{p\,\delta^{2}} \quad \text{for the upper bound.}$$

By that Zaki does not consider a closed confidence interval as we did above. He only stated probabilities for open intervals, with which we will deal later. At this point we can state that our approach leads to better accuracy by taking a smaller sample. To show that our approach outperforms Zaki's approach consider again the calculation of the sample size. We can give the approximate formula for the sample size (representing an approximation of eq. 1):

$$n \approx \frac{u_{1-\alpha/2}^{2}\,(1-p)}{\delta^{2}\, p}$$

This will be compared with the lower bound formula of Zaki (which gives the smallest sample size according to Zaki).

Both formulae have the same denominator, hence it suffices to show that

$$-2\ln(\alpha) > u_{1-\alpha/2}^{2}\,(1-p)$$

In practice the values 0.1, 0.05 and 0.01 are usually chosen for α. The following table shows the values of $-2\ln(\alpha)$ and $u_{1-\alpha/2}^{2}$ for the above mentioned values of α (rounded to three decimals).

α       -2ln(α)    u²_{1-α/2}
0.1     4.605      2.706
0.05    5.991      3.841
0.01    9.210      6.635

Hence $-2\ln(\alpha)$ is always larger than $u_{1-\alpha/2}^{2}$ for the stated values, from which we can conclude, since 1-p ≤ 1, that the above given inequality holds for these values.

The other possibility to calculate the sample size is to specify an absolute error d between the estimator and the true value. Based on the absolute error measure d the following formula may be deduced:

$$n = \frac{u_{1-\alpha/2}^{2}\, N\, p(1-p)}{(N-1)\, d^{2} + u_{1-\alpha/2}^{2}\, p(1-p)}$$

This formula again needs knowledge about the true parameter p. But an analysis shows that the stated formula has the biggest value when p = 0.5. Hence one possibility to calculate the sample size is to plug in p = 0.5, resulting in the sample size:

$$n = \frac{u_{1-\alpha/2}^{2}\, N}{4\,(N-1)\, d^{2} + u_{1-\alpha/2}^{2}}$$

The following example will illustrate the calculation. Given a population size of 7000000, a confidence level of 99% and an absolute error of 0.01, we obtain a sample size of 16551 which marks a significant improvement over the state-of-the-art.
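The calculation in this example can be sketched in code. The sketch assumes that the absolute-error formula has the same structure as eq. 1 with the term δ²p² replaced by d², and that the result is rounded up to the next integer; the function name `sample_size_absolute` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_absolute(N, d, alpha, p=0.5, u=None):
    """Sample size for an absolute error d at confidence level 1 - alpha.

    Assumed formula: n = u^2 N p(1-p) / ((N-1) d^2 + u^2 p(1-p)),
    where u is the (1 - alpha/2) percentile of the standard normal distribution.
    A tabulated percentile can be passed explicitly via `u`.
    """
    if u is None:
        u = NormalDist().inv_cdf(1 - alpha / 2)
    v = u * u * p * (1 - p)
    return math.ceil(v * N / ((N - 1) * d * d + v))


# Values of the example: N = 7,000,000, confidence level 99%, d = 0.01.
n_tabulated = sample_size_absolute(7_000_000, d=0.01, alpha=0.01, u=2.576)
n_exact = sample_size_absolute(7_000_000, d=0.01, alpha=0.01)
print(n_tabulated, n_exact)
```

With the commonly tabulated percentile 2.576 the sketch yields 16551, the figure stated in the example; with the unrounded percentile the result is a few units smaller. Because p = 0.5 maximizes p(1-p), the default gives the most conservative sample size.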

Toivonen et al. suggest to take the following sample size given an absolute error d and an error probability α:

$$n = \frac{1}{2d^{2}}\,\ln\frac{2}{\alpha}$$

As above we can show that our approach yields a smaller sample size. Again we use an approximate formula:

$$n \approx \frac{u_{1-\alpha/2}^{2}\; p(1-p)}{d^{2}}$$

such that it suffices to show that

$$\frac{1}{2}\,\ln\frac{2}{\alpha} > u_{1-\alpha/2}^{2}\; p(1-p).$$

Noting that p(1-p) ≤ 0.25 for 0 ≤ p ≤ 1 it suffices to show:

$$2\,\ln\frac{2}{\alpha} > u_{1-\alpha/2}^{2}$$

As above we show that this inequality holds at least for the common values of α.

With that it is shown that our approach yields smaller sample sizes than that of Toivonen et al.
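Both comparisons can be checked numerically for the common values of α. The sketch below is a plain verification of the two inequalities $-2\ln(\alpha) > u_{1-\alpha/2}^{2}$ and $2\ln(2/\alpha) > u_{1-\alpha/2}^{2}$; the function name `bound_comparison` is hypothetical.

```python
import math
from statistics import NormalDist


def bound_comparison(alphas=(0.1, 0.05, 0.01)):
    """Return rows (alpha, u^2, -2 ln(alpha), 2 ln(2/alpha)), where u is the
    (1 - alpha/2) percentile of the standard normal distribution."""
    rows = []
    for alpha in alphas:
        u_squared = NormalDist().inv_cdf(1 - alpha / 2) ** 2
        rows.append((alpha, u_squared, -2 * math.log(alpha), 2 * math.log(2 / alpha)))
    return rows


rows = bound_comparison()
for alpha, u_squared, zaki_term, toivonen_term in rows:
    print(f"alpha={alpha:<5} u^2={u_squared:.3f} "
          f"-2ln(alpha)={zaki_term:.3f} 2ln(2/alpha)={toivonen_term:.3f}")
```

For each of the three values the squared percentile is the smallest of the three quantities, which is what the two comparisons require.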

We now state another result which can be applied to calculate necessary sample sizes. If we consider the confidence interval above, there are two kinds of errors which can occur. One is that the true support value is larger than the calculated upper bound, the other is that the true support value is less than the lower bound. In practical circumstances there are situations where one only needs a confidence interval for the true value which is bounded on one side only (compare Zaki et al.).

If one is interested only in obtaining a bound which will be exceeded by the true value only with a probability α one can use the confidence interval:

This means that, with probability 1-α, the true support value of a rule will not be greater than the upper bound. This can be of importance if one wants to control the error that a rule which has a support value greater than the Minsup threshold does not have this property in the sample. For example assume the Minsup value is given and the support value of a rule in the sample yields a confidence interval as stated above with an upper bound less than Minsup. Then the probability that this rule has a true support value on the population which is greater than Minsup is less than α.

The other case which can be of interest uses the following confidence interval which has only a lower bound:

This confidence interval can be used if one is only interested that the error that the true support value is less than the lower border is less than the error probability α. This can be the case when the error shall be controlled that a rule has a support value greater than the Minsup threshold whereas the true value is less than this threshold. For example if in the sample a rule will have a support value such that the lower bound of the corresponding confidence interval is greater than the Minsup threshold then the true value is smaller than this bound only with an error probability of at most α.

From both formulae we can derive the sample sizes in the same way as we did above. The only thing which changes in the formulae is that $u^2_{1-\alpha/2}$ is substituted by $u^2_{1-\alpha}$.

Hence the sample size for a single bounded confidence interval is

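$$n = \frac{u^2_{1-\alpha}\, N\, p(1-p)}{(N-1)\,\delta^2 p^2 + u^2_{1-\alpha}\, p(1-p)} \qquad \text{(eq. 5)}$$

when a relative error of δ of the true value is given. Substituting p by Minsup as explained above yields

$$n = \frac{u^2_{1-\alpha}\, N\, \mathit{Minsup}\,(1-\mathit{Minsup})}{(N-1)\,\delta^2\, \mathit{Minsup}^2 + u^2_{1-\alpha}\, \mathit{Minsup}\,(1-\mathit{Minsup})} \qquad \text{(eq. 6)}$$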

When an absolute error is given one can use the following formula for calculating the sample size:

where p can be chosen as 0.5, the value for which p(1-p) and hence n is largest, resulting in

The sample sizes obtained by these formulae are smaller than the sample size calculated for the corresponding closed confidence interval. Since the latter sample sizes have been shown to be smaller than the sample sizes suggested by Zaki et al. and Toivonen et al., so are the sample sizes derived here.
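The sample size formulae of this section can be evaluated numerically as sketched below. The relative-error function follows eq. 5 (and eq. 6 when p is set to Minsup). The absolute-error function is an assumption: it is obtained from eq. 5 by substituting d = δ·p, so that the term (N-1)δ²p² becomes (N-1)d². The function names and the `one_sided` switch are hypothetical and only serve the illustration.

```python
from math import ceil
from statistics import NormalDist


def _u_squared(alpha, one_sided):
    # u_{1-alpha} for a single bounded interval, u_{1-alpha/2} for a closed one
    q = 1.0 - alpha if one_sided else 1.0 - alpha / 2.0
    return NormalDist().inv_cdf(q) ** 2


def sample_size_relative(N, p, delta, alpha, one_sided=True):
    """Sample size per eq. 5 for a relative error delta at support p."""
    u2 = _u_squared(alpha, one_sided)
    numerator = u2 * N * p * (1.0 - p)
    denominator = (N - 1) * delta ** 2 * p ** 2 + u2 * p * (1.0 - p)
    return ceil(numerator / denominator)


def sample_size_absolute(N, d, alpha, p=0.5, one_sided=True):
    """Assumed absolute-error analogue of eq. 5 (d = delta * p)."""
    u2 = _u_squared(alpha, one_sided)
    numerator = u2 * N * p * (1.0 - p)
    denominator = (N - 1) * d ** 2 + u2 * p * (1.0 - p)
    return ceil(numerator / denominator)


if __name__ == "__main__":
    N = 1_000_000
    print(sample_size_relative(N, p=0.01, delta=0.1, alpha=0.05))
    print(sample_size_relative(N, p=0.01, delta=0.1, alpha=0.05, one_sided=False))
    print(sample_size_absolute(N, d=0.01, alpha=0.05))
```

With these example inputs the single bounded interval needs fewer transactions than the closed interval, which is the relationship stated in the text.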

4.3.2 Multivariate Model

In the preceding section we showed how to use confidence intervals for estimating the support of an item or a set of items which indicates the support of a rule. As explained above, the meaning of a confidence interval with confidence level (1-α) is that in (1-α)×100 percent of all samples the relative frequency of an item A on the whole population is covered by the corresponding confidence interval.

A generalization (in the sense of considering p items simultaneously) of this idea consists in the construction of a so-called confidence ellipsoid at confidence level (1-α). A confidence ellipsoid in p dimensions defines a region in p dimensions, such that the true value is enclosed in this region with a certain probability (1-α).

In case of p=2 items this ellipsoid is an ellipse. Fig. 1 shows an example of a confidence ellipsoid with p=2. In case of p=3 items an example of a confidence ellipsoid is shown within Fig. 2.

On the other hand the width (the area or the volume respectively) is a measure for the precision. Therefore if one requires a certain precision one could choose the sample size such that the width (the area or the volume respectively) does not exceed a prescribed bound (for a desired confidence level).

To estimate the supports of single items simultaneously it is necessary to transform every transaction into a binary vector. Every component of such a vector then corresponds to an item, where a value 1 means that the considered item exists in the considered transaction and a value 0 that the item is not present. Note that the dimension of the binary vectors is implied by the number p of all possible single items.
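The transformation of transactions into binary vectors can be sketched as follows. The function name `to_binary_vectors` and the example items are hypothetical; the ordering of the p components (alphabetical here) is an arbitrary choice made for the illustration.

```python
def to_binary_vectors(transactions):
    """Map each transaction (a set of items) to a 0/1 vector of dimension p."""
    # p is implied by the number of distinct single items
    items = sorted({item for transaction in transactions for item in transaction})
    index = {item: k for k, item in enumerate(items)}
    vectors = []
    for transaction in transactions:
        row = [0] * len(items)
        for item in transaction:
            row[index[item]] = 1  # 1: item is present in this transaction
        vectors.append(row)
    return items, vectors


if __name__ == "__main__":
    items, vectors = to_binary_vectors([{"bread", "milk"}, {"milk"}, {"beer", "bread"}])
    print(items)
    print(vectors)
```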

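Now assume that we took a sample from a population of p-dimensional vectors denoted by

$$Y_i = \left(Y_i^{(1)}, \dots, Y_i^{(p)}\right)'$$

for i=1,...,N. Further we define:

$$\bar Y_\cdot^{(k)} = \frac{1}{N}\sum_{i=1}^{N} Y_i^{(k)} \quad \text{for } k=1,\dots,p; \qquad \bar Y_\cdot = \left(\bar Y_\cdot^{(1)},\dots,\bar Y_\cdot^{(p)}\right)'$$

$$S_{Y^{(k)}Y^{(l)}} = \frac{1}{N-1}\sum_{i=1}^{N}\left(Y_i^{(k)}-\bar Y_\cdot^{(k)}\right)\left(Y_i^{(l)}-\bar Y_\cdot^{(l)}\right) \quad \text{for } k \neq l = 1,\dots,p$$

$$\Sigma_Y = \left(S_{Y^{(k)}Y^{(l)}}\right)_{k,l=1,\dots,p}$$

The vectors in the sample will be denoted by $y_i$, such that we get the corresponding quantities on the sample where Y is replaced by y and N by n, with the exception that the covariance matrix of the sample is denoted by

$$s = \left(s_{y^{(k)}y^{(l)}}\right)_{k,l=1,\dots,p}$$

where $s_{y^{(k)}y^{(l)}}$ are the corresponding estimators for the covariances based on the sample.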

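With this notation one can prove the following theorems for a simple random sample:

Theorem 1:

$\bar y_\cdot$ is an unbiased estimator of $\bar Y_\cdot$.

$$\mathrm{Cov}(\bar y_\cdot) = \frac{1}{n}\left(1-\frac{n}{N}\right)\Sigma_Y$$ is the covariance matrix of $\bar y_\cdot$.

$$\widehat{\mathrm{Cov}}(\bar y_\cdot) = \frac{1}{n}\left(1-\frac{n}{N}\right) s$$ is an unbiased estimator of $\mathrm{Cov}(\bar y_\cdot)$.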
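The estimators of Theorem 1 can be computed from a sample of binary vectors as sketched below. The sketch assumes the usual sample covariance with divisor n-1 and the factor (1/n)(1 - n/N) for the covariance of the sample mean under simple random sampling; the function names are hypothetical.

```python
def sample_mean(vectors):
    """Component-wise mean of the sample vectors (estimated item supports)."""
    n, p = len(vectors), len(vectors[0])
    return [sum(v[k] for v in vectors) / n for k in range(p)]


def sample_covariance(vectors):
    """Sample covariance matrix s with divisor n - 1."""
    n, p = len(vectors), len(vectors[0])
    mean = sample_mean(vectors)
    return [
        [
            sum((v[k] - mean[k]) * (v[l] - mean[l]) for v in vectors) / (n - 1)
            for l in range(p)
        ]
        for k in range(p)
    ]


def estimated_cov_of_mean(vectors, N):
    """Estimate of Cov(sample mean): (1/n) * (1 - n/N) * s."""
    n = len(vectors)
    factor = (1.0 / n) * (1.0 - n / N)
    return [[factor * s_kl for s_kl in row] for row in sample_covariance(vectors)]


if __name__ == "__main__":
    sample = [[1, 0], [1, 1], [0, 1], [1, 0]]
    print(sample_mean(sample))
    print(estimated_cov_of_mean(sample, N=8))
```

When the sample is the whole population (n = N) the factor vanishes, which reflects that the mean is then known without sampling error.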

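Additionally we can state a central limit theorem for the estimator $\bar y_\cdot$:

Theorem 2:

Assume in the situation of a simple random sampling the following: $n_\nu \to \infty$ and $(N_\nu - n_\nu) \to \infty$ if $\nu \to \infty$. Let $I_\nu = \{1,\dots,N_\nu\}$,

$$Y_{\nu i} = \left(Y_{\nu i}^{(1)},\dots,Y_{\nu i}^{(p)}\right)' \quad \text{for } i \in I_\nu$$

and the corresponding sample vectors $y_{\nu i}$ for $i = 1,\dots,n_\nu$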

for all τ > 0 and k=1,...,p,

$$\limsup_{\nu\to\infty}\ \rho_\nu^{(k)} < 1 \quad \text{for all } k=1,\dots,p$$

with

$$\rho_\nu^{(k)} = \max_{a}\ \rho\!\left(Y^{(k)}, \sum_{j\neq k} a_j Y^{(j)}\right)$$

the multiple correlation coefficient between the k-th and the remaining p-1 variables,

$\mathrm{Cov}(\bar y_\nu)$ the covariance matrix of $\bar y_\nu$.

Then we have: the convergence in distribution of the quantity

$$\mathrm{Cov}(\bar y_\nu)^{-1/2}\left(\bar y_\nu - E(\bar y_\nu)\right)$$

against an $N(0, \mathrm{Id}_p)$-distribution is equivalent with the condition

0

where the p-dimensional $N(0, \mathrm{Id}_p)$-distributed random variable Y has the density function f

$$f(y) = (2\pi)^{-p/2}\exp\left(-\tfrac{1}{2}\,y'y\right)$$ (note that y is also p-dimensional)

This theorem opens the possibility to construct at least approximate confidence ellipsoids for the vector $\bar Y_\cdot$. Because we are interested in multivariate considerations (contrary to the state of the art and the improvement given in the preceding section) we now want to consider confidence ellipsoids and their construction.

Let us consider first the situation that the observations are independent and identically multivariate normal-distributed p-dimensional vectors with expectation vector μ₀ and covariance matrix Σ, whose inverse exists.

Now let us assume that we want to construct the desired confidence ellipsoid for the unknown expectation vector μ₀ based on a sample of size n. Such an ellipsoid is given by:

$$K_\Sigma = \left\{\mu \in \mathbb{R}^p : n(\bar x-\mu)'\,\Sigma^{-1}(\bar x-\mu) \le \chi^2_{1-\alpha;p}\right\}$$

where $\bar x$ is given by $\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\chi^2_{1-\alpha;p}$ is the value such that given a probability α then

$$P\left(Y \ge \chi^2_{1-\alpha;p}\right) = \alpha$$

where Y has a χ²-distribution with p degrees of freedom, i.e. the density function

$$f(y) = \frac{y^{p/2-1}\,e^{-y/2}}{2^{p/2}\,\Gamma(p/2)} \quad \text{for } y \ge 0$$

where Γ denotes the Gamma-function.

Thus $\chi^2_{1-\alpha;p}$ is the percentile of the χ²-distribution with p degrees of freedom. This is due to the fact that the quantity $n(\bar x-\mu_0)'\,\Sigma^{-1}(\bar x-\mu_0)$ is χ²-distributed with p degrees of freedom, if the data is multivariate normal-distributed.
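For p=2 the membership test for such a confidence ellipsoid can be written out by hand, as sketched below. The sketch assumes a known 2×2 covariance matrix and uses the fact that for 2 degrees of freedom the χ² percentile has the closed form -2·ln(α); the function name `in_confidence_ellipse` is hypothetical.

```python
from math import log


def in_confidence_ellipse(x_bar, mu, sigma, n, alpha):
    """Check n * (x_bar - mu)' Sigma^{-1} (x_bar - mu) <= chi2_{1-alpha;2}.

    Only valid for p = 2; sigma is a 2x2 covariance matrix given as nested lists.
    """
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx, dy = x_bar[0] - mu[0], x_bar[1] - mu[1]
    # quadratic form with the explicit inverse (1/det) * [[d, -b], [-c, a]]
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    chi2_value = -2.0 * log(alpha)  # exact percentile for 2 degrees of freedom
    return n * quad <= chi2_value


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(in_confidence_ellipse([0.2, 0.1], [0.0, 0.0], identity, n=100, alpha=0.05))
    print(in_confidence_ellipse([0.3, 0.0], [0.0, 0.0], identity, n=100, alpha=0.05))
```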

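In case that the covariance matrix Σ is unknown this matrix has to be estimated from the data. A possible estimator for Σ is given by:

$$S = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar x)(x_i-\bar x)'$$

The corresponding confidence ellipsoid is then given by:

$$K_S = \left\{\mu \in \mathbb{R}^p : n(\bar x-\mu)'\,S^{-1}(\bar x-\mu) \le \frac{p(n-1)}{n-p}\,F_{1-\alpha;p,n-p}\right\}$$

where now $F_{1-\alpha;p,n-p}$ is the value such that given a probability α then

$$P\left(Y \ge F_{1-\alpha;p,n-p}\right) = \alpha$$

where Y has an F-distribution with m₁ (= p) and m₂ (= n-p) degrees of freedom, i.e. the density function

$$f(y) = \frac{\Gamma\left(\frac{m_1+m_2}{2}\right)}{\Gamma\left(\frac{m_1}{2}\right)\Gamma\left(\frac{m_2}{2}\right)}\left(\frac{m_1}{m_2}\right)^{m_1/2} y^{m_1/2-1}\left(1+\frac{m_1}{m_2}\,y\right)^{-(m_1+m_2)/2} \quad \text{for } y \ge 0$$

where Γ denotes the Gamma-function.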

In case that the condition that the data is multivariate normal-distributed is not fulfilled, and if instead a central limit theorem for the estimator of the expectation vector holds, then the above given confidence ellipsoids are only approximately valid. Based on such an approximation the following replacements can be introduced:
the expectation vector μ₀ by the vector $\bar Y_\cdot$,
the estimator $\bar x$ for the expectation vector μ₀ by the estimator $\bar y_\cdot$,
the covariance matrix Σ respectively its estimator S by the covariance matrix $\Sigma_Y$ respectively its estimator s.

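If we assume that the central limit theorem holds then the given confidence ellipsoids remain as presented. Therefore we get for the ellipsoid $K_\Sigma$ the ellipsoid

$$\left\{\bar Y_\cdot \in \mathbb{R}^p : n(\bar y_\cdot-\bar Y_\cdot)'\,\Sigma_Y^{-1}(\bar y_\cdot-\bar Y_\cdot) \le \chi^2_{1-\alpha;p}\right\}$$

and for $K_S$

$$\left\{\bar Y_\cdot \in \mathbb{R}^p : n(\bar y_\cdot-\bar Y_\cdot)'\,s^{-1}(\bar y_\cdot-\bar Y_\cdot) \le \frac{p(n-1)}{n-p}\,F_{1-\alpha;p,n-p}\right\}$$

Note that

$$\frac{p(n-1)}{n-p}\,F_{1-\alpha;p,n-p} \to \chi^2_{1-\alpha;p} \quad \text{for } n \to \infty, \qquad \frac{n-1}{n-p} > 1 \quad \text{for } p > 1,$$

which gives the following approximate confidence ellipsoid:

$$K = \left\{\bar Y_\cdot \in \mathbb{R}^p : n(\bar y_\cdot-\bar Y_\cdot)'\,s^{-1}(\bar y_\cdot-\bar Y_\cdot) \le \chi^2_{1-\alpha;p}\right\}$$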

To determine the necessary sample size we fix the desired confidence level and the desired maximal volume of the confidence ellipsoid. Note that the volume of a p-dimensional ellipsoid is given by

$$V = \mathrm{const}(p)\prod_{k=1}^{p} h_k$$

where $h_k$, k=1,...,p denote the semiaxes of the ellipsoid and const(p) is a constant depending on the dimension p.

If one defines a maximal admissible confidence volume V*, then it follows that

$$V^* = \mathrm{const}(p)\; n^{-p/2}\left(\chi^2_{1-\alpha;p}\right)^{p/2}\sqrt{\det s}$$

From this equation we get for the necessary n as sample size:

$$n = \chi^2_{1-\alpha;p}\left(\frac{\mathrm{const}(p)\sqrt{\det s}}{V^*}\right)^{2/p}$$

Related with this equation for the necessary sample size are the following two problems:
a. the constant const(p) depends on the dimension p
b. one needs an a-priori estimate of the covariance matrix

To solve these problems one can propose the following: One encloses the confidence ellipsoid with semiaxes $d_k$, k=1,...,p by a brick with side lengths $2d_k$, k=1,...,p.

Consequently with respect to the volume the largest ellipsoid enclosed by that brick has the volume:

From this formula it is possible to deduce the following necessary sample size n:

where $s_k^2$ is the k-th diagonal element of the covariance matrix s.

If we define quantities $\varepsilon_k$ by the relation:

$$2d_k = 2\varepsilon_k s_{y_k} \iff d_k = \varepsilon_k s_{y_k} \qquad \text{(eq. 10)}$$

then the necessary sample size would be given in terms of relative precision; that means the lengths of the semiaxes are a fraction of the standard deviations $s_{y_k}$.

Therefore the necessary sample size can be expressed in terms of the $\varepsilon_k$:

With such a necessary sample size n the corresponding confidence ellipsoid would be enclosed by a brick of side lengths $2d_k$ with $2d_k = 2\varepsilon_k s_{y_k}$.

Choosing all $\varepsilon_k$ equal to a desired relative precision ε, we get for the necessary sample size:

which can serve as a formula for practitioners, especially in cases where p is quite large.
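A numerical sketch of a sample size for a common relative precision ε is given below. It rests on two assumptions that should be read as such: first, that the sample size takes the form n ≈ χ²_{1-α;p}/ε², which follows if the half side length sqrt(χ²_{1-α;p}·s_k²/n) of the enclosing brick is set equal to ε·s_k; second, that the χ² percentile is approximated by the Wilson-Hilferty formula so that only the Python standard library is needed. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def chi2_quantile(q, p):
    """Wilson-Hilferty approximation of the q-quantile of chi-square, p d.o.f.

    The approximation is good for moderate and large p and rougher for small p.
    """
    z = NormalDist().inv_cdf(q)
    c = 2.0 / (9.0 * p)
    return p * (1.0 - c + z * sqrt(c)) ** 3


def multivariate_sample_size(p, epsilon, alpha):
    """Assumed form n = chi2_{1-alpha;p} / epsilon^2 for a common precision."""
    return ceil(chi2_quantile(1.0 - alpha, p) / epsilon ** 2)


if __name__ == "__main__":
    for p in (10, 100, 1000):
        print(p, multivariate_sample_size(p, epsilon=0.1, alpha=0.05))
```

Under these assumptions the sample size grows with the number p of items through the χ² percentile only, and it does not depend on the population size N.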

Finally we will deal with the following problem: Because we took a sample from the whole population, we can only compare the estimates for a support with the desired minimal support chosen by the user. Therefore we encounter the problem that due to random variations we might get an estimate which is just by chance below the given minimal support. This means that we should be interested in a statistical measure of how significant this situation is for a considered item or variable. From the statistical point of view this leads to the theory of simultaneous confidence intervals, which we will deal with next.

These intervals can be obtained from a constructed confidence ellipsoid as follows.

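For arbitrary $x \in \mathbb{R}^p$ and A assumed as a positive definite p×p matrix we get:

$$x'A^{-1}x = \sup_{v \in \mathbb{R}^p\setminus\{0\}} \frac{(v'x)^2}{v'Av}$$

Based on this expression we get for the ellipsoid K:

$$K = \left\{\bar Y_\cdot \in \mathbb{R}^p : \sup_{v \in \mathbb{R}^p\setminus\{0\}} \frac{\left(v'(\bar y_\cdot-\bar Y_\cdot)\right)^2}{v'sv} \le \frac{1}{n}\,\chi^2_{(1-\alpha);p}\right\}$$

Therefore we can embed K directly into a p-dimensional brick which is given as the product of the intervals obtained if v is the k-th unit vector (k=1,...,p).

Each of these intervals is denoted as a simultaneous confidence interval. Componentwise we get an interval for $\bar Y_\cdot^{(k)}$:

$$\bar y_\cdot^{(k)} - \sqrt{\frac{\chi^2_{1-\alpha;p}\, s_k^2}{n}} \;\le\; \bar Y_\cdot^{(k)} \;\le\; \bar y_\cdot^{(k)} + \sqrt{\frac{\chi^2_{1-\alpha;p}\, s_k^2}{n}} \qquad \text{(eq. 13)}$$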
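The componentwise intervals can be computed as sketched below, assuming that choosing v as the k-th unit vector in the ellipsoid K yields the half-width sqrt(χ²_{1-α;p}·s_k²/n) around the k-th sample mean. The χ² percentile is passed in by the caller (for example from a table); the function name `simultaneous_intervals` is hypothetical.

```python
from math import sqrt


def simultaneous_intervals(means, variances, n, chi2_value):
    """Return one (lower, upper) interval per item.

    means[k] is the sample support of item k, variances[k] the k-th diagonal
    element of the sample covariance matrix, chi2_value the chi-square
    percentile with p = len(means) degrees of freedom.
    """
    intervals = []
    for mean, variance in zip(means, variances):
        half_width = sqrt(chi2_value * variance / n)
        intervals.append((mean - half_width, mean + half_width))
    return intervals


if __name__ == "__main__":
    # two items, chi-square percentile for alpha = 0.05 and p = 2 is about 5.991
    print(simultaneous_intervals([0.30, 0.05], [0.21, 0.0475], n=1000, chi2_value=5.991))
```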

4.3.3 Process Flow For Sampling Of Association Rules

Fig. 3 visualizes the process flow for sampling of association rules in the multivariate case outlined in the previous chapter. This process flow could also be applied to the univariate model accordingly without any further problem.

In step 301 a decision is made whether data mining for association rules should be performed based on the complete multitude of transaction records (choosing path 302) or based on a sample (choosing path 303). In case path 302 is selected the methodology for mining of association rules is applied within step 304 followed by a step 305 to visualize the computed association rules.

If the association rules are to be computed based on a sample by taking path 303, first the sample size has to be determined within step 306. One approach would consist in specifying the sample size directly. Another approach would consist in a step to calculate the sample size. In the multivariate approach the sample size would be calculated based on:
a. the number p of different items occurring within the multitude of transactions, as parameter for more thoroughly characterizing the multitude of transactions,
b. further precision requirements for a quality of the approximation comprising for example:
b1. the confidence (1-α) for an estimation based on a sample,
b2. relative precision requirements ε_k for the individual items or a common precision requirement ε for all items according to eq. 10. If some items need to be estimated with a higher degree of precision than others then the approach to specify relative precision requirements for the individual items has to be selected.

Based on these specifications an estimated sample size will be calculated within step 307 according to the approximation formulas eq. 11 or 12. This estimated sample size can be used directly as sample size or may be used as an orientation only. In the latter case the final sample size would have to be chosen at least in the order of magnitude of the estimated sample size.

Based on the multitude of records and the calculated sample size the final sample will be drawn by random sampling in step 308.
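Drawing the final sample in step 308 can be sketched with the Python standard library as follows. The sketch assumes simple random sampling without replacement from transactions held in a list; the function name `draw_sample` and the `seed` parameter are hypothetical.

```python
import random


def draw_sample(transactions, n, seed=None):
    """Draw a simple random sample of size n without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(transactions, n)


if __name__ == "__main__":
    population = list(range(1000))  # stand-in for N transaction records
    print(draw_sample(population, 5, seed=1))
```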

Using this sample as input the state of the art methodologies for mining of association rules may be applied within step 304 to determine the estimated association rules followed by a step 305 to visualize the estimated association rules.

If step 306 also comprises the specification of a required minimal support value, then in step 305 even a decision would be possible whether a considered association rule is of interest or not. For this purpose the simultaneous confidence intervals calculated within eq. 13 can be exploited. The following decision process would be applied:

1. if a confidence interval according to eq. 13 is completely on the left side of this minimal support value (i.e. the upper bound of this interval is smaller than the minimal support value) then the considered item is not of interest because its support is, at the given confidence, below the minimal support.

2. if a confidence interval according to eq. 13 for the considered item encloses the minimal support value or is completely on the right side then the item is treated as interesting because a support at or above the minimal support cannot be ruled out (remember that one defines an item as interesting if its support value is larger than or equal to the predefined minimal support value).

Due to the construction of these confidence intervals one can be sure that one gets all interesting rules with common confidence (1-α).
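The decision process above can be sketched as a simple filter over the simultaneous confidence intervals. The intervals are taken as given (lower, upper) pairs; the function name `interesting_items` and the example numbers are hypothetical.

```python
def interesting_items(intervals, minsup):
    """Return the indices of the items that are kept as interesting.

    An item is dropped only if its whole interval lies to the left of minsup
    (case 1); it is kept if the interval encloses minsup or lies to the right
    of it (case 2).
    """
    return [k for k, (lower, upper) in enumerate(intervals) if upper >= minsup]


if __name__ == "__main__":
    intervals = [(0.010, 0.020), (0.025, 0.040), (0.050, 0.070)]
    print(interesting_items(intervals, minsup=0.03))
```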

4.4 Applications

According to the state of the art it was necessary, because of the huge size of the data making up the multitude of transaction records and the extremely long processing time for calculating the association rules, that the computer system computing the association rules be identical with the computer system storing the multitude of transaction records. As the current invention reduces the amount of data, the actual mining technology is applied only to a very small sample of transaction records (extremely small compared to the complete multitude of transaction records). Therefore a new distributed processing model comprising a client computer and a server computer connected by some communication network like the Internet is suggested.

Fig. 4 depicts a distributed processing model for mining of association rules.

Within Fig. 4 a client computer 401 is shown for controlling determination of association rules. The client stores the multitude of N transaction records 402. The client computer draws within step 403 a sample 404 from the multitude of N transactions with a sample size n. The sample size may be determined by any of the previously disclosed approaches.

Using the communication network 405 the sample is transmitted to the server computer 406 providing a specific service for mining of association rules. Within step 407 the association rules are calculated based on the provided sample and returned to the client computer across the communication network. Since the time for the analysis is now small (being based on a small sample only) it is possible to send back the resulting approximate rules very quickly. Finally these rules may then be analyzed for further activities on the client system within step 408.

Depending on the extent of the provided service for mining of association rules on the server system two different embodiments are possible: either the client computer itself is determining the sample size, or the server computer is responsible for determining the sample size. In any case the technology disclosed within the current specification for determining the sample size is exploited.
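The division of work between client and server in Fig. 4 can be sketched in a single process as follows. This is an illustration under stated assumptions only: the network transfer is replaced by a plain function call, and the server-side service is a brute-force count of frequent single items and item pairs rather than a full mining methodology such as APRIORI. The names `server_mine` and `client_run` are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def server_mine(sample, minsup):
    """Hypothetical server-side service (step 407): frequent 1- and 2-itemsets."""
    counts = Counter()
    for transaction in sample:
        items = sorted(transaction)
        counts.update((item,) for item in items)
        counts.update(combinations(items, 2))
    n = len(sample)
    # keep itemsets whose support in the sample reaches the minimal support
    return {itemset: c / n for itemset, c in counts.items() if c / n >= minsup}


def client_run(transactions, n, minsup, seed=0):
    """Client side: draw the sample (step 403) and hand it to the server."""
    sample = random.Random(seed).sample(transactions, n)
    return server_mine(sample, minsup)  # stands in for steps 405 to 407


if __name__ == "__main__":
    records = [{"a", "b"}] * 60 + [{"a"}] * 30 + [{"c"}] * 10
    print(client_run(records, n=50, minsup=0.5))
```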

Claims

C L A I M S

1. A computerized data mining method for determining association rules within a multitude of N transactions each transaction comprising up to p different items,

said method comprising a first step of determining a sample size n of said multitude of N transactions

wherein said sample size n is determined based on precision requirements, and

wherein said sample size n is determined based on a multivariate estimation analysis for achieving said precision requirements, and

said method comprising a second step of computing association rules based on a sample of said multitude of N transactions with sample size n according to any methodology for mining of association rules using said association rules as estimated association rules of said multitude of N transactions.

2. A computerized data mining method for determining association rules according to claim 1,

wherein said precision requirements comprise

a confidence (1-α) for an estimation based on a sample, and

a relative precision ε_k for an item k of a sample, said relative precision ε_k defining an acceptable deviation of the support value of item k within a sample compared to within said multitude of N transactions, said relative precision ε_k being measured relative to the standard deviation of the support value of item k, and

wherein said sample size n is at least in the order of magnitude of an estimated sample size n*

with $\chi^2_{1-\alpha;p}$ being the percentile of the χ²-distribution with p degrees of freedom, and

with p being said number of different items for characterizing said multitude of N transactions.

3. A computerized data mining method for determining association rules according to claim 2,

wherein said relative precision ε_k=ε is identical for all items k, and

4. A computerized data mining method for determining association rules within a multitude of N transactions each transaction comprising up to p different items,

said method comprising a first step of determining a sample size n of said multitude of N transactions, wherein said sample size n is determined based on precision requirements for association rules,

wherein said precision requirements comprise a confidence (1-α) for an estimation based on a sample, and

wherein said precision requirements comprise a relative precision δ defining an acceptable deviation of the support value of a certain rule within a sample compared to within said multitude of N transactions, said relative precision δ measured relative to the support value of said certain rule, and

wherein said precision requirements comprise a lower boundary p for an expected support value, and

with $u_{1-\alpha/2}$ being the percentile of the standard normal distribution, and

5. A computerized data mining method for determining association rules within a multitude of N transactions each transaction comprising up to p different items,

said method comprising a first step of determining a sample size n of said multitude of N transactions,

wherein said sample size n is determined based on precision requirements for association rules,

wherein said precision requirements comprise a relative precision δ defining an acceptable, either positive or negative deviation of the support value of a certain rule within a sample compared to within said multitude of N transactions, said relative precision δ measured relative to the support value of said certain rule, and

with $u_{1-\alpha}$ being the percentile of the standard normal distribution, and

6. A computerized data mining method for determining association rules according to claim 4 or 5,

wherein said lower boundary p for an expected support value is a minimum support value p=Minsup used by said methodology for mining of association rules.

7. A computerized data mining method for determining association rules within a multitude of N transactions each transaction comprising up to p different items,

wherein said precision requirements comprise an absolute precision d defining an acceptable deviation of the support value of a certain rule within a sample compared to within said multitude of N transactions, and

wherein said precision requirements comprise an upper boundary p for an expected support value, and

with u being the percentile of the standard normal distribution, and

said method comprising a second step of computing association rules based on a sample of said multitude of N transactions with size n according to any methodology for mining of association rules using said association rules as estimated association rules of said multitude of N transactions.

8. A computerized data mining method for determining association rules within a multitude of N transactions each transaction comprising up to p different items,

wherein said precision requirements comprise an absolute precision d defining an acceptable, either positive or negative deviation of the support value of a certain rule within a sample compared to within said multitude of N transactions, and

wherein said precision requirements comprise an upper boundary p for an expected support value, and

with $u_{1-\alpha}$ being the percentile of the standard normal distribution, and

said method comprising a second step of computing association rules based on a sample of said multitude of N transactions with size n according to any methodology for mining of association rules using said association rules as estimated association rules of said multitude of N transactions.

9. A computerized data mining method for determining association rules according to claim 7 or 8,

wherein said upper boundary p for an expected support value is p=0.5.

10. A computerized data mining method for determining association rules according to any one of claims 1, 4, 5, 7, or

wherein said methodology for mining association rules is the APRIORI methodology.

11. A computerized data mining method for determining association rules according to any one of claims 1, 4, 5, 7, or

wherein said sample is drawn by random sampling.

12. A client computer for controlling determination of association rules within a multitude of N transactions each transaction comprising up to p different items,

said client computer drawing a sample from said multitude of N transactions with a sample size n of at least in the order of magnitude of an estimated sample size n*, said estimated sample size being determined by the method according to any one of claims 1 to 11,

said client computer sending said sample to a server computer for determination of association rules, and

said client computer receiving from said server computer said determined association rules using said association rules as estimated association rules of said multitude of N transactions.

13. A client computer for controlling determination of association rules according to claim 12,

wherein said client computer is determining said estimated sample size n*, or

wherein said server computer is determining said estimated sample size n* on behalf of said client computer.

14. A data processing program for execution in a data processing system comprising software code portions for performing a method according to any one of the preceding claims 1 to 11 when said program is run on said computer.

15. A computer program product stored on a computer usable medium, comprising computer readable program means for causing a computer to perform a method according to any one of the preceding claims 1 to 11 when said program is run on said computer.