WO2008022341A2 - Seeding method for k-means clustering and other clustering algorithms - Google Patents

Seeding method for k-means clustering and other clustering algorithms

Info

Publication number
WO2008022341A2
WO2008022341A2 (PCT/US2007/076258)
Authority
WO
WIPO (PCT)
Prior art keywords
seeds
selecting
seed
data points
points
Prior art date
Application number
PCT/US2007/076258
Other languages
French (fr)
Other versions
WO2008022341A3 (en)
Inventor
Rafail Ostrovsky
Yuval Rabani
Leonard J. Schulman
Chaitanya Swamy
Original Assignee
The Regents Of The University Of California
The Technion Research & Development Foundation Ltd
California Institute Of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Regents Of The University Of California, The Technion Research & Development Foundation Ltd, California Institute Of Technology filed Critical The Regents Of The University Of California
Publication of WO2008022341A2 publication Critical patent/WO2008022341A2/en
Publication of WO2008022341A3 publication Critical patent/WO2008022341A3/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/23 - Clustering techniques
    • G06F 18/232 - Non-hierarchical techniques
    • G06F 18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Definitions

  • Clustering is performed in many applications, including the areas of pattern recognition, image analysis, data mining, search engines, bioinformatics, fraud detection, and the like, in order to associate data into groups for further analysis, including detecting items that are similar and also items that are dissimilar.
  • clustering it is assumed that there are a number of data items, whatever the items may represent, and the data items are to be organized in some fashion.
  • each data item is transformed into an analyzable form that may be best understood as a single point in a multi-dimensional space, and then clusters of points, sometimes called clouds or groups of points, are identified within such space. That is, given enough points, distinct groups of points are identified, where each group has a number of points in relatively close proximity such that the group is a clustering of such points.
  • the number of points derived from a data set can be relatively large, perhaps even on the order of millions, tens of millions, and perhaps even hundreds of millions. Accordingly, identifying clusters within a space of such data points is not a trivial matter.
  • One method for doing so includes the use of Lloyd-type k-means clustering.
  • Finding the center of each cluster is not a trivial matter. Accordingly, Lloyd-type k-means clustering may be employed to do so. Briefly, in classic Lloyd-type k-means clustering, and again presuming two clusters merely for the sake of explanation, a potential center for each cluster is defined, based on some set of parameters, or even randomly, where each such potential center is in effect a seed to initialize an iterative process. In such iterative process, some number of points closest to each seed are located, and for the located points for each seed, the center of such located points is found in a known manner, and the seed is relocated to such found center. The iterative process is repeated a number of times until some terminating event or condition occurs.
  • Such event or condition can for example be that the movement of each seed effectively settles down to a more-or-less stationary location or settled center, or can be a fixed number of iterations. Of course, it may be that at least some of the seeds never settle down to a settled center, in which case the iterations may be stopped after some predetermined number of iterations. In such a case, it may be that the seeds were initially poorly chosen, in which case new seeds may be picked and the process repeated.
  • A need exists for systems and methods for performing Lloyd-type k-clustering that more accurately define an initial seed for each cluster prior to performing the iterative process of such Lloyd-type k-clustering. More particularly, a need exists for such systems and methods that employ a procedure to at least roughly identify the initial location of each true cluster based on distances between at least a sampling of all the points in the data set.
  • a method of seeding that is performed with regard to performing clustering on a data set of data items, each of which has been transformed into a point in a multi-dimensional space.
  • a number of clusters are defined within the space, and a center of each cluster is initially located by selecting an initial seed as a potential center for each cluster.
  • An iterative process is then performed to move each seed to what eventually becomes the located center of a cluster.
  • a first initial seed for a first cluster is selected by:
  • the selection may, in some embodiments, be performed randomly with a uniform probability of picking any point, or in other embodiments may be based on an assigned probability distribution with, for example, the probability of picking a point D_i as μ_1 being:
  • the first initial seed may be selected as the centroid of the data points or the centroid of a sampling of the data points.
  • an initial seed for each successive cluster may be selected by: (i) receiving an integer k between 2 and n - 1 as the number of seeds that are to be produced; and
  • more than one initial seed may be selected before undertaking the iterative process described above for determining remaining seeds.
  • an abundance of seeds may be selected and later culled according to processes such as described below.
  • FIG. 1 is a block diagram of an example of a computing environment within which various embodiments of the present invention may be implemented;
  • FIG. 2 is a conceptual diagram showing a space within which a number of data points from a data set are located;
  • FIG. 3 is a conceptual diagram showing a space within which a number of data points from a data set are located and a partition of the points into clusters;
  • Fig. 1 is set forth herein as an exemplary computing environment in which various embodiments of the present invention may be implemented.
  • the computing system environment is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality. Numerous other general purpose or special purpose computing system environments or configurations may be used. Examples of well known computing systems, environments, and/or configurations that may be suitable for use include, but are not limited to, personal computers (PCs), server computers, handheld or laptop devices, multi-processor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, embedded systems, distributed computing environments that include any of the above systems or devices, and the like.
  • Computer-executable instructions such as program modules executed by a computer may be used.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • Distributed computing environments may be used where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium.
  • program modules and other data may be located in both local and remote computer storage media including memory storage devices.
  • an exemplary system for implementing aspects described herein includes a computing device, such as computing device 100.
  • computing device 100 typically includes at least one processing unit 102 and memory 104.
  • memory 104 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two.
  • This most basic configuration is illustrated in Fig. 1 by dashed line 106.
  • Computing device 100 may have additional features/functionality.
  • computing device 100 may include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in Fig. 6 by removable storage 108 and non-removable storage 110.
  • Computing device 100 typically includes or is provided with a variety of computer-readable media.
  • Computer readable media can be any available media that can be accessed by computing device 100 and includes both volatile and non-volatile media, removable and non-removable media.
  • Computer readable media may comprise computer storage media and communication media.
  • Computer storage media includes volatile and non-volatile, removable and nonremovable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Memory 104, removable storage 108, and non-removable storage 110 are all examples of computer storage media.
  • Computer storage media includes, but is not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD- ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can accessed by computing device 100. Any such computer storage media may be part of computing device 100.
  • Computing device 100 may also contain communications connection(s) 112 that allow the device to communicate with other devices.
  • Each such communications connection 112 is an example of communication media.
  • Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared and other wireless media.
  • computer readable media as used herein includes both storage media and communication media.
  • Computing device 100 may also have input device(s) 114 such as keyboard, mouse, pen, voice input device, touch input device, etc.
  • Output device(s) 116 such as a display, speakers, printer, etc. may also be included. All these devices are generally known to the relevant public and therefore need not be discussed in any detail herein except as provided.
  • computing device 100 may be one of a plurality of computing devices 100 inter-connected by a network 118, as is shown in Fig. 1.
  • the network 118 may be any appropriate network
  • each computing device 100 may be connected thereto by way of a connection 112 in any appropriate manner
  • each computing device 100 may communicate with one or more of the other computing devices 100 in the network 118 in any appropriate manner.
  • the network 118 may be a wired or wireless network within an organization or home or the like, and may include a direct or indirect coupling to an external network such as the Internet or the like.
  • In the case of program code execution on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • One or more programs may implement or utilize the processes described in connection with the presently disclosed subject matter, e.g., through the use of an application- program interface (API), reusable controls, or the like.
  • Such programs may be implemented in a high level procedural or object oriented programming language to communicate with a computer system.
  • the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language, and combined with hardware implementations.
  • exemplary embodiments may refer to utilizing aspects of the presently disclosed subject matter in the context of one or more stand-alone computer systems, the subject matter is not so limited, but rather may be implemented in connection with any computing environment, such as a network 118 or a distributed computing environment. Still further, aspects of the presently disclosed subject matter may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices in a network 118. Such devices might include personal computers, network servers, and handheld devices, for example.
  • clustering is performed in many applications in order to associate data into groups for further analysis, including detecting items that are similar and also items that are dissimilar.
  • clustering it is assumed that there are a number of data items, whatever the items may represent, and the data items are to be organized in some fashion.
  • fraud detection such as that which may be performed by a bank or other financial institution to find transactions that are improper and which should be prevented.
  • patterns of proper transactions are identified, and transactions that fall outside of or are dissimilar to such proper transactions are identified and then scrutinized for possible identification as being fraudulent.
  • clustering may be performed with regard to any other type of data set without departing from the spirit and scope of the present invention.
  • each data item is transformed into an analyzable form that may be best understood as a single point in a multi-dimensional space.
  • Fig. 2 is an example of a collection of points 10 and a set of two clusters 12, 14 delineated by the dotted curves.
  • a goal of a clustering algorithm is to identify within a collection of points 10 a set of clusters 12, 14 of points.
  • the set of clusters may partition the collection of points 10 into a set of distinct clusters 12, 14. Note that by its nature the drawing of Fig. 2 is two-dimensional. Nevertheless, it is to be understood that the objects shown in the two-dimensional drawing of Fig. 2 can readily be extended to a greater number of dimensions.
  • Distinct clusters 12, 14 are identified, where each cluster 12, 14 has a number of points in relatively close proximity to each other. As seen in Fig. 2, it may be that there are some points that are relatively close to each other, in which case such points and the data items represented by them are relatively similar in some sense. Likewise, it may be that there are points 10 that are relatively far from each other, in which case such points and the data items represented by them are relatively dissimilar.
  • Points that are relatively closer to the center of a cluster 12, 14 tend to be more closely identifiable as being part of such cluster 12, 14, while points that are relatively farther from the center of a cluster 12, 14 tend to be less identifiable as being part of such cluster 12, 14.
  • points that are relatively far from the center of any cluster that represents legitimate transactions may be more closely scrutinized as being at least potentially fraudulent.
  • Lloyd-type k-means clustering is used to find a center of each cluster, such that each point can then be associated with a particular cluster based on distance within such space between the point and the center of the cluster.
  • each point is associated with the cluster whose center is nearest the point. Finding distances within high-dimensional space and associating a point with a particular cluster is known or should be apparent to the relevant public and therefore need not be set forth herein in detail. Accordingly, any method for finding such distance within such space and associating such point with a particular cluster may be employed without departing from the spirit and scope of the present invention.
  • classic Lloyd-type k-means clustering involves selecting an initial collection of seeds as potential centers for each cluster. The initial selection may be based on some set of parameters, or may even be performed randomly. Various techniques for this initial selection are described below. An iterative process is then performed wherein the following steps are repeated: each point is assigned to a cluster associated with the potential cluster center nearest the point and then the centroid of all points (or some sampling of points) in each cluster is computed and a new cluster center for that cluster is located at the centroid of that cluster. The iterative process is repeated until some termination condition is met. For example, the termination condition might be that the cluster centers move less than a predetermined amount from one iteration to the next, or that some predetermined number of iterations is performed.
  • Fig. 3 provides an example of an initial stage of a Lloyd-type k-means clustering algorithm for two clusters.
  • a collection of points 30 is provided.
  • Initial seeds μ_1 and μ_2 have been chosen and are designated as initial potential centers for the two clusters 32, 34. Techniques for selecting the initial seeds are discussed below.
  • Each point in the collection of points 30 is then associated with the cluster whose center is closest to the point.
  • points nearer to μ_1 than μ_2, i.e., points above line 36
  • points nearer to μ_2 than μ_1, i.e., points below line 36
  • the centroid of the points in each cluster 32, 34 is computed and new cluster centers, ν_1 and ν_2 respectively, are located at those centroids.
  • each point in the collection of points 30 is once again associated with a cluster whose center is closest to the point.
  • points nearer to ν_1 than ν_2, i.e., points above and to the right of line 38
  • points nearer to ν_2 than ν_1, i.e., points below and to the left of line 38
  • the process is repeated, determining new cluster centers and perforce new clusters, until some termination condition is met.
  • Such Lloyd-type k-clustering is generally expected to work well if the initial seeds are well-chosen. However, if, for example, multiple seeds are chosen within one 'true' cluster (i.e., a cluster that ought to be found) or if multiple seeds iteratively migrate to one true cluster, the process can result in multiple cluster centers within the one true cluster and no cluster centers in another true cluster, and the points in the true clusters may end up being assigned to the clusters in some false and perhaps even bizarre fashion. [0038] Clustering in general is a venerable statistical problem of both practical and theoretical interest. There is presently a wide and unsatisfactory gap between the practical and theoretical clustering literatures.
  • a near-optimal k-clustering can be achieved by a partition into fewer than k clusters, then that smaller value of k should be used to cluster the data. If near-optimal k-clusterings are hard to find only when they provide ambiguous classification or marginal benefit (i.e., in the absence of a meaningful k-clustering), then such hardness should not be viewed as an acceptable obstacle to algorithm development. Instead, the performance criteria should be revised.
  • each center is associated with all data in its Voronoi region, and is located at the center of mass of this data.
  • Lloyd's method which consists of the following two phases: (a) 'Seed' the process with some initial centers (the literature contains many competing suggestions of how to do this); (b) Iterate the following Lloyd step until the clustering is 'good enough': cluster all the data in the Voronoi region of a center together, and then move the center to the centroid of its cluster.
  • the objective is to partition X into k clusters X_1, ..., X_k and assign each point in every cluster X_i to a common center c_i ∈ R^d, so as to minimize the k-means cost, defined as Σ_{i=1..k} Σ_{x ∈ X_i} ‖x − c_i‖²,
  • cluster X_i simply consists of all points x ∈ X for which c_i is the nearest center, breaking ties arbitrarily.
  • clustering X_1, ..., X_k, the best centers corresponding to this clustering are obtained by setting c_i to be the center of mass (centroid) of cluster X_i.
  • c_i is the centroid of cluster X_i, and each point in X_i has c_i as its nearest center.
  • Theorem 3.4 The above algorithm returns a clustering of cost at most —
  • both techniques proceed in two stages.
  • the first stage is a seeding stage, which performs the bulk of the work and guarantees that at the end of this stage there are k seed centers positioned at nearly the right locations.
  • each optimal center has a (distinct) initial center located in close proximity.
  • This is precisely the leverage that we obtain from the k-means separation condition (as in the 2-means case).
  • We consider a natural generalization of the sampling procedure used for the 2-means case and show that this picks the k initial centers from the cores of the optimal clusters. This sampling procedure runs in linear time but it succeeds with probability that is exponentially small in k.
  • Running time. The sampling procedure consists of k iterations, each of which takes O(nd) time. This is because after sampling a new point c_{i+1}, we can update the quantity min_{1≤j≤i+1} ‖x − c_j‖ for each point x in O(d) time. So the overall running time is O(nkd).
  • Analysis. Let ⁇ 2 « p ⁇ 1 be a parameter that we will set later. As in the 2-
  • X° ut [ x e X 1 : Il x - C 1 Il 2 ⁇ ⁇ - ⁇ . Note that X ⁇ c X 1 '
  • N h- — - — -r where 0 ⁇ ⁇ ⁇ 1 is a desired error tolerance.
  • Lemma 4.3 Suppose that we have sampled i points {x_1, ..., x_i} from X. Let X_1, ..., X_m be all the clusters whose outer cores contain some sampled point x_j. Then
  • the initial abundant collection of cluster centers may be determined by any of the methods described herein, or by a hybrid of two or more of the methods. Some of the initial cluster centers may be selected by random methods and some may be selected by deterministic methods. Some may be chosen with respect to previously selected cluster centers and some may be chosen independently of previously selected cluster centers.
  • centers may be deleted or merged, for example by the techniques described below, until the desired number of cluster centers remain.
  • the Δ_k²(·) cost of the sampled instance is much smaller than its Δ_{k−1}²(·) cost, and that the optimal centers for the sampled instance lie near the optimal centers for X.
  • Lemma 4.10 For each center c_i, there is a center c̄_i such that .
  • Ball-k-means step. Let B_i be the points of X in a ball of radius D_i/3 around c̄_i, and ĉ_i be the centroid of B_i. Return ĉ_1, ..., ĉ_k as the final centers.
  • a linear-time constant-factor approximation technique may be accomplished by: 1 - Execute the seeding procedure above to obtain k initial centers c̄_1, ..., c̄_k. 2 - Run the ball-k-means step to obtain the final centers.
  • the running time is O(nkd + k³d).
  • Lemma 4.10 we know that with probability 1 − O(√δ), for each c_i, there is a distinct center c̄_i such that ‖c̄_i − c_i‖ ≤ D_i/10. Combined with Lemma 4.11, this yields the following theorem.
  • Theorem 4.13 If Δ_k²(X)/Δ_{k−1}²(X) ≤ ε² for a small enough ε, the above algorithm
  • the PTAS combines the sampling procedure with the centroid estimation step: 1 - Use the procedure above to pick k initial centers c 1 ,..., c k . 2 - Run the centroid estimation procedure to obtain the final centers.
  • cost(x_1, ..., x_k) denotes the cost of clustering X around the centers x_1, ..., x_k in R^d.
  • We use R(x) to denote the Voronoi region of point x (the centers will be clear from the context).
  • S_1 ⊕ S_2 = (S_1 \ S_2) ∪ (S_2 \ S_1) denotes the symmetric difference of S_1 and S_2.
  • Theorem 5.2 Let α ≤ 1/3. Suppose that for every k-clustering X_1, ..., X_k of X of cost at most α²Δ_k²(X),
  • the various embodiments of the present invention are directed in particular toward data analysis and methods of clustering data points into clusters that represent groups of interest.
  • One method of performing clustering is the so-called k-means clustering heuristic.
  • The k-means clustering heuristic tends to be sensitive to the way initial cluster centers or seeds are chosen, which is significant given that some methods for randomly selecting seeds can lead to results of variable quality.
  • a procedure for choosing initial seeds to find centers of clusters is provided which leads to more effective clustering procedures, both in terms of the quality of results and running time.
  • the seeding procedure and clustering procedure are performed with regard to a data set of items, each of which must be in an analyzable form that may be best understood as a single point in a multi-dimensional space. Accordingly, such data set is procured, obtained, or otherwise provided, as a collection of points 500. As should be understood, such procurement and transformation may be performed in any appropriate manner without departing from the spirit and scope of the present invention.
  • the seeding procedure receives as an input an integer k between 2 and n − 1 which represents the number of seeds that are to be produced for the clustering procedure.
  • the seeding procedure will output k points out of the n input points that will serve as seeds.
  • the first seeds may be determined by selecting one or more points at random from the collection of data points. The random selection may be made from any probability distribution assigned to the data points. In some embodiments the first seeds are selected uniformly at random from the data points.
  • the center of gravity C of the n points may be calculated, for example by taking
  • this center of gravity C is used as the first seed.
  • the first seed may be chosen by the following procedure:
  • Such distance C_i⁰ may be calculated in any appropriate manner without departing from the spirit and scope of the teachings herein.
  • the distance may be calculated as the squared Euclidean distance between such point D_i and the center of gravity C.
  • An average squared distance is then calculated as:
  • a point μ_1 from among the n points is selected at random as the first seed, with the probability of picking any point D_i
  • each distance may be calculated in any appropriate manner, such as the
  • steps 504 and 506 are repeated until all of the required seeds μ_1, ..., μ_k have been selected. Thereafter, μ_1, ..., μ_k as the selected seeds are then supplied (step 510) to a clustering procedure, and the clustering procedure iteratively moves the seeds within the space until a termination condition is reached and cluster centers are defined.
  • more than one seed may be selected by techniques such as those described in connection with “Picking the first seed” above before proceeding to the techniques of "Picking subsequent seeds.”
  • an abundant number of seeds may be initially selected. That is, some number of seeds greater than the number to be used by a clustering algorithm is initially selected. The collection of seeds is then culled, such as for example by the deletion / merger techniques described above (a hypothetical sketch of one such culling step appears at the end of this section).
  • the initial abundant collection of seeds may be chosen by any suitable method, such as those described above for example, or by a combination of methods. For example, some number of seeds may be chosen by the procedure described in connection with Fig. 5 above and then the collection of seeds may be augmented by selecting additional seeds at random. As another example, some number of seeds may be picked randomly, and then the collection of seeds augmented by the procedure described in connection with steps 504 to 508 of Fig. 5.
  • the seeds that are selected by the techniques disclosed herein will be communicated to a clustering algorithm for subsequent determination of a desirable clustering of given data.
  • a Lloyd-type iteration algorithm may take the selected seeds as a starting point for an iterative process that determines a clustering for the data.
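The deletion/merger culling referred to above is only referenced, not spelled out, in this extract. Purely as a hypothetical illustration (not the patent's specific technique), the sketch below reduces an over-abundant seed set to k seeds by repeatedly merging the closest pair of seeds at their weighted centroid; the function name, the merge rule, and the weighting are assumptions made for the example.

```python
import numpy as np

def cull_seeds(seeds, k):
    """Hypothetical culling step: reduce an over-abundant seed set to k seeds
    by repeatedly merging the closest pair (one illustrative rule among many)."""
    seeds = [np.asarray(s, dtype=float) for s in seeds]
    weights = [1.0] * len(seeds)   # how many original seeds each entry represents
    while len(seeds) > k:
        # Find the closest pair of current seeds (brute force is fine for small sets).
        best = None
        for i in range(len(seeds)):
            for j in range(i + 1, len(seeds)):
                d = float(((seeds[i] - seeds[j]) ** 2).sum())
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge seed j into seed i at their weighted centroid, then drop seed j.
        w = weights[i] + weights[j]
        seeds[i] = (weights[i] * seeds[i] + weights[j] * seeds[j]) / w
        weights[i] = w
        del seeds[j], weights[j]
    return np.array(seeds)
```

In this illustration a merged seed sits at the weighted average of the seeds it absorbed; a deletion-based rule (dropping whichever seed least increases the clustering cost) would be an equally plausible reading of the culling described above.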

Abstract

Techniques for determining a collection of seeds for use in a clustering algorithm. After one or more seeds are selected by any of various methods, additional seeds are selected by randomly selecting a point from a set of data points with the probability of selecting a given point being proportional to the distance from the given point to the nearest previously selected seed. An abundant number of seeds may be initially selected, and the collection of seeds culled by an iterative process.

Description

SEEDING METHOD FOR K-MEANS CLUSTERING AND OTHER CLUSTERING
ALGORITHMS
CROSS REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of U.S. Provisional Application No. 60/822,903, filed August 18, 2006 and entitled "SEEDING METHOD FOR K-MEANS CLUSTERING AND OTHER CLUSTERING ALGORITHMS", hereby incorporated by reference in its entirety.
BACKGROUND
[0002] Clustering is performed in many applications, including the areas of pattern recognition, image analysis, data mining, search engines, bioinformatics, fraud detection, and the like, in order to associate data into groups for further analysis, including detecting items that are similar and also items that are dissimilar. Generally, in clustering, it is assumed that there are a number of data items, whatever the items may represent, and the data items are to be organized in some fashion.
[0003] To perform clustering on a data set of data items, each data item is transformed into an analyzable form that may be best understood as a single point in a multi-dimensional space, and then clusters of points, sometimes called clouds or groups of points, are identified within such space. That is, given enough points, distinct groups of points are identified, where each group has a number of points in relatively close proximity such that the group is a clustering of such points. [0004] Notably, the number of points derived from a data set can be relatively large, perhaps even on the order of millions, tens of millions, and perhaps even hundreds of millions. Accordingly, identifying clusters within a space of such data points is not a trivial matter. One method for doing so includes the use of Lloyd-type k-means clustering.
[0005] In a grossly simplified example in connection with such Lloyd-type k-means clustering, it is to be assumed that a data set produces two clusters. Note, however, that the number of clusters is typically more, and can either be pre-defined or can be derived from the data set in multiple fashions. Regardless, with the assumed two clusters, the center of each such cluster within the high-dimensional space is found, and then each point is associated with one or the other of the clusters based on distance within such space to the found center of each cluster. Typically each point is associated with the cluster whose center is closest to such point. Distance within the space may be computed by any suitable metric. We will often make use of the squared Euclidean distance as the measure of the distance between two points.
[0006] Finding the center of each cluster is not a trivial matter. Accordingly, Lloyd-type k-means clustering may be employed to do so. Briefly, in classic Lloyd-type k-means clustering, and again presuming two clusters merely for the sake of explanation, a potential center for each cluster is defined, based on some set of parameters, or even randomly, where each such potential center is in effect a seed to initialize an iterative process. In such iterative process, some number of points closest to each seed are located, and for the located points for each seed, the center of such located points is found in a known manner, and the seed is relocated to such found center. The iterative process is repeated a number of times until some terminating event or condition occurs. Such event or condition can for example be that the movement of each seed effectively settles down to a more-or-less stationary location or settled center, or can be a fixed number of iterations. Of course, it may be that at least some of the seeds never settle down to a settled center, in which case the iterations may be stopped after some predetermined number of iterations. In such a case, it may be that the seeds were initially poorly chosen, in which case new seeds may be picked and the process repeated.
SUMMARY
[0007] An issue arises in connection with Lloyd-type k-clustering in that it may be that two seeds as chosen are effectively in one 'true' cluster and none are in the other 'true' cluster, where each true cluster is defined as the cluster that ought to be found. In such a case, and as may be appreciated, the seeds may result in two settled centers within the one true cluster and no settled centers in the other true cluster, and the points in the true clusters likely end up being assigned to the two settled centers in some undesirable fashion. [0008] Accordingly, a need exists for systems and methods for performing Lloyd-type k-clustering that more accurately define an initial seed for each cluster prior to performing the iterative process of such Lloyd-type k-clustering. More particularly, a need exists for such systems and methods that employ a procedure to at least roughly identify the initial location of each true cluster based on distances between at least a sampling of all the points in the data set.
[0009] The aforementioned needs are satisfied at least in part by a method of seeding that is performed with regard to performing clustering on a data set of data items, each of which has been transformed into a point in a multi-dimensional space. A number of clusters are defined within the space, and a center of each cluster is initially located by selecting an initial seed as a potential center for each cluster. An iterative process is then performed to move each seed to what eventually becomes the located center of a cluster.
[0010] In some embodiments of the method, a first initial seed for a first cluster is selected by:
(i) receiving a number n of the points, each of which is numbered D_i for i = 1, 2, 3, ..., n;
(ii) calculating a center of gravity, or centroid, C of the n points by taking the average on each coordinate over the n points;
(iii) calculating, for each of the n points, a squared distance C_i⁰ of such point D_i to C;
(iv) calculating an average squared distance as:
C̄ = (1/n) Σ_{i=1..n} C_i⁰ ;
(v) selecting a point μ_1 from among the n points at random as the first initial seed. The selection may, in some embodiments, be performed randomly with a uniform probability of picking any point, or in other embodiments may be based on an assigned probability distribution with, for example, the probability of picking a point D_i as μ_1 being:
(C̄ + C_i⁰) / (2nC̄).
In still other embodiments, the first initial seed may be selected as the centroid of the data points or the centroid of a sampling of the data points.
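For illustration only, the following Python sketch renders steps (i) through (v) above, assuming the reconstructed probability formula (C̄ + C_i⁰) / (2nC̄) shown in paragraph [0010]; the function name, the NumPy usage, and the random-number generator argument are illustrative choices, not part of the claimed method.

```python
import numpy as np

def pick_first_seed(points, rng=None):
    """Sketch of steps (i)-(v) above: pick the first seed mu_1 from an (n, d) array."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(points)
    centroid = points.mean(axis=0)                      # step (ii): center of gravity C
    sq_dists = ((points - centroid) ** 2).sum(axis=1)   # step (iii): C_i^0 for each D_i
    avg_sq = sq_dists.mean()                            # step (iv): average squared distance
    # step (v): probability of picking D_i is (avg_sq + C_i^0) / (2 * n * avg_sq),
    # which sums to 1 over all n points; renormalize to absorb floating-point error.
    probs = (avg_sq + sq_dists) / (2 * n * avg_sq)
    probs /= probs.sum()
    return int(rng.choice(n, p=probs))                  # index of the chosen first seed
```

Picking μ_1 as the plain centroid, or uniformly at random, as the other embodiments above describe, would simply replace the weighted draw in the last two lines.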
[0011] Thereafter, an initial seed for each successive cluster may be selected by: (i) receiving an integer k between 2 and n − 1 as the number of seeds that are to be produced; and
(ii) selecting points μ_2, ..., μ_k as seeds according to an iterative process, where, supposing μ_1, ..., μ_{j−1} have already been chosen, μ_j is selected by:
- for each 1 ≤ i ≤ n, calculating C_i′ as a squared distance between D_i and μ_{j−1};
- for each 1 ≤ i ≤ n, letting C_i denote a minimum of C_i and C_i′ (that is, the squared distance from D_i to the nearest seed chosen so far);
- updating C̄ to be an average of C_i over all 1 ≤ i ≤ n; and
- picking μ_j = D_i with assigned probability:
C_i / (nC̄).
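Again for illustration only, here is one way the iterative selection of μ_2, ..., μ_k in paragraph [0011] could be coded, assuming the reconstructed weighting C_i / (nC̄) above (probability proportional to the squared distance to the nearest seed chosen so far); function and parameter names are illustrative. Maintaining the running minima keeps each iteration at O(nd), for O(nkd) overall, matching the running-time discussion elsewhere in the document.

```python
import numpy as np

def pick_remaining_seeds(points, first_seed_idx, k, rng=None):
    """Sketch of steps (i)-(ii) above: choose seeds mu_2, ..., mu_k iteratively."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(points)
    seed_indices = [first_seed_idx]
    # C_i: squared distance from D_i to the nearest seed chosen so far.
    min_sq = ((points - points[first_seed_idx]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        # Probability of picking D_i is C_i / (n * C_bar), i.e. C_i / sum_j C_j.
        probs = min_sq / min_sq.sum()
        idx = int(rng.choice(n, p=probs))
        seed_indices.append(idx)
        # Update each C_i against the newly chosen seed in O(nd) time total.
        new_sq = ((points - points[idx]) ** 2).sum(axis=1)
        min_sq = np.minimum(min_sq, new_sq)
    return seed_indices
```

For example, `seeds = points[pick_remaining_seeds(points, pick_first_seed(points), k=5)]` combines this sketch with the earlier one to produce five seed points ready to hand to a Lloyd-type iteration.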
[0012] In some embodiments, more than one initial seed may be selected before undertaking the iterative process described above for determining remaining seeds. In some embodiments, an abundance of seeds may be selected and later culled according to processes such as described below.
BRIEF DESCRIPTION OF THE DRAWINGS
[0013] The foregoing summary, as well as the following detailed description of various embodiments of the present invention, will be better understood when read in conjunction with the appended drawings. For the purpose of illustrating the embodiments, there are shown in the drawings embodiments which are presently preferred. As should be understood, however, the embodiments of the present invention are not limited to the precise arrangements and instrumentalities shown. In the drawings:
[0014] Fig. 1 is a block diagram of an example of a computing environment within which various embodiments of the present invention may be implemented;
[0015] Fig. 2 is a conceptual diagram showing a space within which a number of data points from a data set are located;
[0016] Fig. 3 is a conceptual diagram showing a space within which a number of data points from a data set are located and a partition of the points into clusters;
[0017] Fig. 4 is a conceptual diagram showing a space within which a number of data points from a data set are located and a partition of the points into clusters; and [0018] Fig. 5 is a flow diagram showing steps performed when selecting the initial seeds for clusters in accordance with various embodiments presented herein.
DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS
Example Computing Environment
[0019] Fig. 1 is set forth herein as an exemplary computing environment in which various embodiments of the present invention may be implemented. The computing system environment is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality. Numerous other general purpose or special purpose computing system environments or configurations may be used. Examples of well known computing systems, environments, and/or configurations that may be suitable for use include, but are not limited to, personal computers (PCs), server computers, handheld or laptop devices, multi-processor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, embedded systems, distributed computing environments that include any of the above systems or devices, and the like.
[0020] Computer-executable instructions such as program modules executed by a computer may be used. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Distributed computing environments may be used where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules and other data may be located in both local and remote computer storage media including memory storage devices.
[0021] With reference to Fig. 1, an exemplary system for implementing aspects described herein includes a computing device, such as computing device 100. In its most basic configuration, computing device 100 typically includes at least one processing unit 102 and memory 104. Depending on the exact configuration and type of computing device, memory 104 may be volatile (such as random access memory (RAM)), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination of the two. This most basic configuration is illustrated in Fig. 1 by dashed line 106. Computing device 100 may have additional features/functionality. For example, computing device 100 may include additional storage (removable and/or non-removable) including, but not limited to, magnetic or optical disks or tape. Such additional storage is illustrated in Fig. 6 by removable storage 108 and nonremovable storage 110. [0022] Computing device 100 typically includes or is provided with a variety of computer-readable media. Computer readable media can be any available media that can be accessed by computing device 100 and includes both volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media.
[0023] Computer storage media includes volatile and non-volatile, removable and nonremovable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Memory 104, removable storage 108, and non-removable storage 110 are all examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD- ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can accessed by computing device 100. Any such computer storage media may be part of computing device 100.
[0024] Computing device 100 may also contain communications connection(s) 112 that allow the device to communicate with other devices. Each such communications connection 112 is an example of communication media. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.
[0025] Computing device 100 may also have input device(s) 114 such as keyboard, mouse, pen, voice input device, touch input device, etc. Output device(s) 116 such as a display, speakers, printer, etc. may also be included. All these devices are generally known to the relevant public and therefore need not be discussed in any detail herein except as provided.
[0026] Notably, computing device 100 may be one of a plurality of computing devices 100 inter-connected by a network 118, as is shown in Fig. 1. As may be appreciated, the network 118 may be any appropriate network, each computing device 100 may be connected thereto by way of a connection 112 in any appropriate manner, and each computing device 100 may communicate with one or more of the other computing devices 100 in the network 118 in any appropriate manner. For example, the network 118 may be a wired or wireless network within an organization or home or the like, and may include a direct or indirect coupling to an external network such as the Internet or the like.
[0027] It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, where appropriate, with a combination of both. Thus, the methods and apparatus of the presently disclosed subject matter, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the presently disclosed subject matter. In the case of program code execution on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. One or more programs may implement or utilize the processes described in connection with the presently disclosed subject matter, e.g., through the use of an application-program interface (API), reusable controls, or the like. Such programs may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language, and combined with hardware implementations.
[0028] Although exemplary embodiments may refer to utilizing aspects of the presently disclosed subject matter in the context of one or more stand-alone computer systems, the subject matter is not so limited, but rather may be implemented in connection with any computing environment, such as a network 118 or a distributed computing environment. Still further, aspects of the presently disclosed subject matter may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices in a network 118. Such devices might include personal computers, network servers, and handheld devices, for example.
Clustering - Overview
[0029] It is to be appreciated that clustering is performed in many applications in order to associate data into groups for further analysis, including detecting items that are similar and also items that are dissimilar. Generally, in clustering, it is assumed that there are a number of data items, whatever the items may represent, and the data items are to be organized in some fashion. One example of the use of clustering is fraud detection such as that which may be performed by a bank or other financial institution to find transactions that are improper and which should be prevented. Typically, in such fraud detection, patterns of proper transactions are identified, and transactions that fall outside of or are dissimilar to such proper transactions are identified and then scrutinized for possible identification as being fraudulent. Of course, clustering may be performed with regard to any other type of data set without departing from the spirit and scope of the present invention.
[0030] To perform clustering on a data set of data items such as the aforementioned transactions, each data item is transformed into an analyzable form that may be best understood as a single point in a multi-dimensional space. Fig. 2 is an example of a collection of points 10 and a set of two clusters 12, 14 delineated by the dotted curves. A goal of a clustering algorithm is to identify within a collection of points 10 a set of clusters 12, 14 of points. The set of clusters may partition the collection of points 10 into a set of distinct clusters 12, 14. Note that by its nature the drawing of Fig. 2 is two-dimensional. Nevertheless, it is to be understood that the objects shown in the two-dimensional drawing of Fig. 2 can readily be extended to a greater number of dimensions.
[0031] Distinct clusters 12, 14 are identified, where each cluster 12, 14 has a number of points in relatively close proximity to each other. As seen in Fig. 2, it may be that there are some points that are relatively close to each other, in which case such points and the data items represented by them are relatively similar in some sense. Likewise, it may be that there are points 10 that are relatively far from each other, in which case such points and the data items represented by them are relatively dissimilar.
[0032] Points that are relatively closer to the center of a cluster 12, 14 tend to be more closely identifiable as being part of such cluster 12, 14, while points that are relatively farther from the center of a cluster 12, 14 tend to be less identifiable as being part of such cluster 12, 14. With regard to the aforementioned fraud detection example, points that are relatively far from the center of any cluster that represents legitimate transactions may be more closely scrutinized as being at least potentially fraudulent.
[0033] Lloyd-type k-means clustering is used to find a center of each cluster, such that each point can then be associated with a particular cluster based on distance within such space between the point and the center of the cluster. In particular, each point is associated with the cluster whose center is nearest the point. Finding distances within high-dimensional space and associating a point with a particular cluster is known or should be apparent to the relevant public and therefore need not be set forth herein in detail. Accordingly, any method for finding such distance within such space and associating such point with a particular cluster may be employed without departing from the spirit and scope of the present invention.
[0034] Briefly, classic Lloyd-type k-means clustering involves selecting an initial collection of seeds as potential centers for each cluster. The initial selection may be based on some set of parameters, or may even be performed randomly. Various techniques for this initial selection are described below. An iterative process is then performed wherein the following steps are repeated: each point is assigned to a cluster associated with the potential cluster center nearest the point and then the centroid of all points (or some sampling of points) in each cluster is computed and a new cluster center for that cluster is located at the centroid of that cluster. The iterative process is repeated until some termination condition is met. For example, the termination condition might be that the cluster centers move less than a predetermined amount from one iteration to the next, or that some predetermined number of iterations is performed.
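The following is a minimal sketch of the iterative process described in paragraph [0034], assuming a movement-threshold termination condition; the function name, the tolerance, and the empty-cluster handling are illustrative choices rather than anything prescribed by the patent.

```python
import numpy as np

def lloyd_iterations(points, seeds, max_iters=100, tol=1e-6):
    """Repeat the assign-to-nearest-center / move-center-to-centroid steps above."""
    centers = np.array(seeds, dtype=float)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iters):
        # Assign each point to the cluster of its nearest current center (squared Euclidean).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Relocate each center to the centroid of the points assigned to it.
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = points[labels == j]
            if len(members) > 0:          # leave a center in place if its cluster emptied
                new_centers[j] = members.mean(axis=0)
        # Terminate when the centers move less than a predetermined amount.
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```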
[0035] Fig. 3 provides an example of an initial stage of a Lloyd-type k-means clustering algorithm for two clusters. A collection of points 30 is provided. Initial seeds μ_1 and μ_2 have been chosen and are designated as initial potential centers for the two clusters 32, 34. Techniques for selecting the initial seeds are discussed below. Each point in the collection of points 30 is then associated with the cluster whose center is closest to the point. Thus points nearer to μ_1 than μ_2, i.e., points above line 36, are associated with one cluster 32 and points nearer to μ_2 than μ_1, i.e., points below line 36, are associated with the other cluster 34. The centroid of the points in each cluster 32, 34 is computed and new cluster centers, ν_1 and ν_2 respectively, are located at those centroids.
[0036] Referring now to Fig. 4, each point in the collection of points 30 is once again associated with a cluster whose center is closest to the point. Thus points nearer to ν_1 than ν_2, i.e., points above and to the right of line 38, are associated with one cluster 40 and points nearer to ν_2 than ν_1, i.e., points below and to the left of line 38, are associated with the other cluster 42. The process is repeated, determining new cluster centers and perforce new clusters, until some termination condition is met.
[0037] Such Lloyd-type k-clustering is generally expected to work well if the initial seeds are well-chosen. However, if, for example, multiple seeds are chosen within one 'true' cluster (i.e., a cluster that ought to be found) or if multiple seeds iteratively migrate to one true cluster, the process can result in multiple cluster centers within the one true cluster and no cluster centers in another true cluster, and the points in the true clusters may end up being assigned to the clusters in some false and perhaps even bizarre fashion. [0038] Clustering in general is a venerable statistical problem of both practical and theoretical interest. There is presently a wide and unsatisfactory gap between the practical and theoretical clustering literatures. For decades, practitioners have been using heuristics of great speed but uncertain merit; the latter should not be surprising since the problem is hard in almost any formulation. However, in the last few years, algorithms researchers have made considerable innovations, and even obtained polynomial-time approximation schemes (PTAS's) for some of the most popular clustering formulations. Yet these contributions have not had a noticeable impact on practice. Practitioners instead continue to use a variety of iterative heuristics (Lloyd, EM, agglomerative methods, etc.) that have no known performance guarantees.
[0039] There are two ways to approach this disjuncture. One is to continue developing new techniques until they are so good— down to the implementations— that they displace entrenched methods. The other is to look toward popular heuristics and ask whether there are reasons that justify their extensive use, but elude the standard theoretical criteria; and in addition, whether theoretical scrutiny suggests improvements in their application. This is the approach taken in embodiments of the present invention.
[0040] As in other prominent cases such an analysis involves some abandonment of the worst-case inputs criterion. In fact, part of the challenge is to identify simple conditions on the input which allow one to prove a performance guarantee of wide applicability. Our starting point is the notion that one should be most concerned with clustering data that possesses a meaningful k-clustering. Two examples of settings where one would intuitively not consider the data to possess a meaningful k-clustering are as follows. If nearly optimum cost can be achieved by two very different k-way partitions of the data then the identity of the optimal partition carries little meaning. For example, if the data was generated by random sampling from a source, then the optimal cluster regions might shift drastically upon resampling. Alternatively, if a near-optimal k-clustering can be achieved by a partition into fewer than k clusters, then that smaller value of k should be used to cluster the data. If near-optimal k-clusterings are hard to find only when they provide ambiguous classification or marginal benefit (i.e., in the absence of a meaningful k-clustering), then such hardness should not be viewed as an acceptable obstacle to algorithm development. Instead, the performance criteria should be revised.
[0041] Specifically, we consider the following k-means formulation of clustering: given a finite set X ⊂ R^d, find k points or centers to minimize the sum over all points x ∈ X of the squared distance between x and the center to which it is assigned. This sum will sometimes be referred to as the cost of the clustering. [0042] In an optimal solution, each center is associated with all data in its Voronoi region, and is located at the center of mass of this data. Perhaps the most popular heuristic used for this problem is Lloyd's method, which consists of the following two phases: (a) 'Seed' the process with some initial centers (the literature contains many competing suggestions of how to do this); (b) Iterate the following Lloyd step until the clustering is 'good enough': cluster all the data in the Voronoi region of a center together, and then move the center to the centroid of its cluster.
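As a small illustration of this objective (names illustrative, not from the patent), the k-means cost of a candidate set of centers can be computed directly from the definition:

```python
import numpy as np

def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the nearest candidate center."""
    dists = ((points[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())
```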
[0043] Although Lloyd-style methods are widely used, to our knowledge, there is no known mathematical analysis that attempts to explain or predict the performance of these heuristics on high-dimensional data. If the data is well-clusterable according to a certain 'clusterability' or 'separation' condition that we discuss below, then various Lloyd-style methods do indeed perform well and return a near-optimal clustering. Our contributions are threefold:
[0044] (1) We introduce a separation condition and justify it as a reasonable abstraction of well-clusterability for the analysis of k-means clustering algorithms. Our condition is simple, and abstracts a notion of well-clusterability alluded to earlier: letting Δ_k²(X) denote the cost of an optimal k-means solution of input X, we say that X is ε-separated for k-means if Δ_k²(X) / Δ_{k−1}²(X) ≤ ε².
[0045] One motivation for proposing this condition is that a significant drop in the k-clustering cost is already used in practice as a diagnostic for choosing the value of k. Furthermore, we show below that: (i) The data satisfies our separation condition if and only if it satisfies the other intuitive notion of well-clusterability suggested earlier, namely that any two low-cost k-clusterings disagree on only a small fraction of the data; and (ii) The condition is robust under noisy (even adversarial) perturbation of the data.
[0046] (2) We present a novel and efficient sampling process for seeding Lloyd's method with initial centers. The disclosed seeding procedures will allow us to prove the effectiveness of these methods.
[0047] (3) We demonstrate the effectiveness of our variants of the Lloyd heuristic under the separation condition. Specifically: (i) Our simplest variant uses only the new seeding procedure, requires a single Lloyd-type descent step, and achieves a constant-factor approximation in time linear in |X|. This algorithm has success probability exponentially small in k, but we show that (ii) a slightly more complicated seeding process based on our sampling procedure yields a constant-factor approximation guarantee with constant probability, again in linear time. Since only one run of seeding+descent is required in both algorithms, these are candidates for being faster in practice than currently used Lloyd variants, which are used with multiple re-seedings and many Lloyd steps per re-seeding. (iii) We also give a PTAS (polynomial time approximation scheme) by combining our seeding process with a sampling procedure whose running time is linear in |X| and the dimension, and exponential in k. This PTAS is significantly faster, and also simpler.
Literature And Problem Formulation
[0048] Let X ⊆ R^d be the given point set and n = |X|. In the k-means problem, the objective is to partition X into k clusters X_1, ..., X_k and assign each point in every cluster X_i to a common center c_i ∈ R^d, so as to minimize the k-means cost, defined as Σ_{i=1}^{k} Σ_{x ∈ X_i} ||x - c_i||², where ||·|| denotes the l_2 norm. We let Δ_k²(X) denote the optimum k-means cost. Observe that given the centers c_1, ..., c_k, it is easy to determine the best clustering corresponding to these centers: cluster X_i simply consists of all points x ∈ X for which c_i is the nearest center, breaking ties arbitrarily. Conversely, given a clustering X_1, ..., X_k, the best centers corresponding to this clustering are obtained by setting c_i to be the center of mass (centroid) of cluster X_i, that is, setting c_i = (1/|X_i|) Σ_{x ∈ X_i} x. It follows that both of these properties simultaneously hold in an optimal solution, that is, c_i is the centroid of cluster X_i, and each point in X_i has c_i as its nearest center.
Preliminaries
[0049] We use the following notation throughout. For a point set S, we use ctr(S) to denote the center of mass of S. Let the partition X_1 ∪ ... ∪ X_k = X be an optimal k-means clustering of the input X, and let c_i = ctr(X_i) and c = ctr(X). So Δ_k²(X) = Σ_{i=1}^{k} Σ_{x ∈ X_i} ||x - c_i||² and Δ_1²(X) = Σ_{x ∈ X} ||x - c||². Let n_i = |X_i|, n = |X|, and r_i² = Δ_1²(X_i) / n_i, that is, r_i² is the 'mean squared error' in cluster X_i. Define D_i = min_{j≠i} ||c_i - c_j||. We assume throughout that X is ε-separated for k-means, that is, Δ_k²(X) ≤ ε² Δ_{k-1}²(X), where 0 < ε ≤ ε_0 with ε_0 being a suitably small constant. We use the following basic lemmas quite frequently.
[0050] Lemma 2.1: For every x, Σ_{y ∈ X} ||x - y||² = Δ_1²(X) + n ||x - c||². Hence Σ_{{x,y} ⊆ X} ||x - y||² = n Δ_1²(X).

[0051] Lemma 2.2: Let S ⊆ R^d, and let S_1 ∪ S_2 be a partition of S with S_1 ≠ ∅. Let s, s_1, s_2 denote respectively ctr(S), ctr(S_1), ctr(S_2). Then,

(i): Δ_1²(S) = Δ_1²(S_1) + Δ_1²(S_2) + (|S_1| |S_2| / |S|) ||s_1 - s_2||², and

(ii): ||s_1 - s||² ≤ Δ_1²(S) |S_2| / (|S_1| |S|).
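A quick numerical check of the identities in Lemma 2.1 (illustrative only; the data here is random and the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # an arbitrary point set
c = X.mean(axis=0)                            # centroid of X
n = len(X)
delta1 = ((X - c) ** 2).sum()                 # Delta_1^2(X)

x = X[17]                                     # any fixed point x
lhs = ((X - x) ** 2).sum()                    # sum over y of ||x - y||^2
rhs = delta1 + n * ((x - c) ** 2).sum()
assert np.isclose(lhs, rhs)                   # first identity of Lemma 2.1

# sum over unordered pairs equals n * Delta_1^2(X)
pair_sum = 0.5 * sum(((X - y) ** 2).sum() for y in X)
assert np.isclose(pair_sum, n * delta1)
```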
The 2-Means Problem
[0052] We first consider the 2-means case. We assume that the input X is ε-separated for 2-means. We present an algorithm that returns a solution of cost at most (1 + f(ε)) Δ_2²(X) in linear time, for a suitably defined function f that satisfies lim_{ε→0} f(ε) = 0. An appealing feature of our technique is its simplicity, both in description and analysis. When we consider the k-means case below, we will build upon this technique to obtain both a linear time constant-factor (of the form 1 + f(ε)) approximation algorithm and a PTAS with running time exponential in k, but linear in n and d.
[0053] Our 2-means technique provides a novel non-uniform sampling process to pick two seed centers. In our sampling process we pick the pair x, y ∈ X with probability proportional to ||x - y||². This biases the distribution towards pairs that contribute a large amount to Δ_1²(X) (noting that n Δ_1²(X) = Σ_{{x,y} ⊆ X} ||x - y||²). This seeding method is new to the vast literature on the k-means problem. By putting more weight on pairs that contribute a lot to Δ_1²(X), the sampling process aims to pick the initial centers from the cores of the two optimal clusters. We define the core of a cluster precisely later, but loosely speaking, it consists of points in the cluster that are significantly closer to this cluster center than to any other center. Lemmas 3.1 and 3.2 below make the benefits of this approach precise. Thus, in essence, we are able to leverage the separation condition to nearly isolate the optimal centers. Once we have the initial centers within the cores of the two optimal clusters, we show that a simple Lloyd-like step, which is also simple to analyze, yields a good performance: we consider a suitable ball around each center and move the center to the centroid of this ball to obtain the final centers.
[0054] In particular, randomly select a pair of points from the set X to serve as the initial centers, picking the pair x, y in X with probability proportional to ||x - y||². Let ĉ_1, ĉ_2 denote the two picked centers. For each ĉ_i, consider the ball of radius ||ĉ_1 - ĉ_2|| / 3 around ĉ_i and compute the centroid c̄_i of the portion of X in this ball. Return c̄_1, c̄_2 as the final centers.

[0055] The entire algorithm runs in time O(nd). Step 2 clearly takes only O(nd) time. We show that the sampling step can be implemented to run in O(nd) time. Consider the following two-step sampling procedure:

(a) first pick center ĉ_1 by choosing a point x ∈ X with probability equal to (Δ_1²(X) + n ||x - c||²) / (2n Δ_1²(X)) (using Lemma 2.1);

(b) then pick the second center ĉ_2 by choosing a point y ∈ X with probability equal to ||y - ĉ_1||² / (Δ_1²(X) + n ||c - ĉ_1||²).
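A sketch of this two-step O(nd) sampling, with the ball step of paragraph [0054] included for completeness (our own illustrative code, assuming the squared Euclidean distance):

```python
import numpy as np

def seed_two_means(X, rng=np.random.default_rng()):
    """Pick a pair (c1, c2) with probability proportional to ||x - y||^2,
    via the two-step procedure, then apply the ball step."""
    n = len(X)
    c = X.mean(axis=0)                       # centroid of X
    d2c = ((X - c) ** 2).sum(axis=1)         # ||x - c||^2 for every x
    delta1 = d2c.sum()                       # Delta_1^2(X)
    # step (a): P(x) = (Delta_1^2 + n ||x - c||^2) / (2 n Delta_1^2)
    p1 = (delta1 + n * d2c) / (2 * n * delta1)
    c1 = X[rng.choice(n, p=p1)]
    # step (b): P(y) = ||y - c1||^2 / (Delta_1^2 + n ||c - c1||^2)
    d2c1 = ((X - c1) ** 2).sum(axis=1)
    p2 = d2c1 / (delta1 + n * ((c - c1) ** 2).sum())
    c2 = X[rng.choice(n, p=p2)]
    # ball step: centroid of the points within ||c1 - c2|| / 3 of each center
    r = np.linalg.norm(c1 - c2) / 3.0
    finals = []
    for ci in (c1, c2):
        ball = X[((X - ci) ** 2).sum(axis=1) <= r * r]
        finals.append(ball.mean(axis=0) if len(ball) else ci)
    return np.array(finals)
```

Both distributions sum to one by Lemma 2.1, so the pair (x, y) is indeed drawn with probability proportional to ||x - y||².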
[0056] The analysis hinges on the important fact that under the separation condition, the radius r_i of each optimal cluster is substantially smaller than the inter-cluster separation ||c_1 - c_2|| (Lemma 3.1 below). This allows us to show in Lemma 3.2 that with high probability, each initial center ĉ_i lies in the core (suitably defined) of a distinct optimal cluster, say X_i, and hence ||ĉ_1 - ĉ_2|| is much larger than the distances ||ĉ_i - c_i|| for i = 1, 2. Assuming that ĉ_1, ĉ_2 lie in the cores of the clusters, we show in Lemma 3.3 that the ball around ĉ_i contains only, and most of, the mass of cluster X_i, and therefore the centroid c̄_i of this ball is very 'close' to c_i. This in turn implies that the cost of the clustering around c̄_1, c̄_2 is relatively small.
[0057] Lemma 3.1: max_{i=1,2} r_i² = O(ε²) ||c_1 - c_2||².

Let ρ = 100 ε² / (1 - ε), where ε is sufficiently small so that ρ < 1. We define the core of cluster X_i as the set X_i^cor = { x ∈ X_i : ||x - c_i||² ≤ r_i² / ρ }. By Markov's inequality, |X_i^cor| ≥ (1 - ρ) n_i.
[0058] Lemma 3.2: Pr[{ĉ_1, ĉ_2} ∩ X_1^cor ≠ ∅ and {ĉ_1, ĉ_2} ∩ X_2^cor ≠ ∅] = 1 - O(ρ).
[0059] Lemma 3.3: For each i, we have X_i^cor ⊆ B_i ⊆ X_i, where B_i is the ball considered around ĉ_i. Hence ||c̄_i - c_i||² ≤ (ρ / (1 - ρ)) r_i².
[0060] Theorem 3.4: The above algorithm returns a clustering of cost at most Δ_2²(X) / (1 - ρ) with probability at least 1 - O(ρ) in time O(nd), where ρ = Θ(ε²).

The k-Means Problem
[0061] We now consider the k-means setting. We assume that Δ_k²(X) ≤ ε² Δ_{k-1}²(X). We describe a linear time constant-factor approximation algorithm, and a PTAS that returns a (1+ω)-optimal solution in time O(2^{O(k/ω)} nd). The techniques consist of various ingredients, which we describe separately first for ease of understanding, before gluing them together to obtain the final algorithm.
[0062] Conceptually both techniques proceed in two stages. The first stage is a seeding stage, which performs the bulk of the work and guarantees that at the end of this stage there are k seed centers positioned at nearly the right locations. By this we mean that if we consider distances at the scale of the inter-cluster separation, then at the end of this stage, each optimal center has a (distinct) initial center located in close proximity. This is precisely the leverage that we obtain from the k-means separation condition (as in the 2-means case). We employ three simple seeding procedures, with varying time vs. quality guarantees, that exploit this separation condition to seed the k centers at locations very close to the optimal centers. We consider a natural generalization of the sampling procedure used for the 2-means case, and show that this picks the k initial centers from the cores of the optimal clusters. This sampling procedure runs in linear time but it succeeds with probability that is exponentially small in k.
[0063] We present a very simple deterministic greedy deletion procedure, where we start off with all points in X as the centers and then greedily delete points (and move centers) until there are k centers left. The running time here is O(n³d). For the k-means problem, if Δ_k²(X) ≤ ε² Δ_{k-1}²(X), we show that our greedy deletion procedure followed by a clean-up step (in the second stage) yields a (1+f(ε))-approximation algorithm. Finally, we combine the sampling and deletion procedures to obtain an O(nkd + k³d)-time initialization procedure. We sample O(k) centers, which ensures that every cluster has an initial center in a slightly expanded version of the core, and then run the deletion procedure on an instance of size O(k) derived from the sampled points to obtain the k seed centers.
[0064] Once the initial centers have been positioned sufficiently close to the optimal centers, we can proceed in two ways in the second stage. One option is to use a ball-k-means step, as in 2-means, which yields a clustering of cost (1 + f(ε)) Δ_k²(X) due to exactly the same reasons as in the 2-means case. Thus, combined with the initialization procedure, this yields a constant-factor approximation algorithm with running time O(nkd + k³d).
[0065] Another option, which yields a PTAS, is, for each initial center, to compute a list of candidate centers for the corresponding optimal cluster as follows: we sample a small set of points uniformly at random from a slightly expanded Voronoi region of the initial center, and consider the centroid of every subset of the sampled set of a certain size as a candidate. We exhaustively search for the k candidates (picking one candidate per initial center) that yield the least cost solution, and output these as our final centers. The fact that each optimal center c_i has an initial center in close proximity allows us to argue that the entire optimal cluster X_i is contained in the expanded Voronoi region of this initial center, and moreover that |X_i| is a significant fraction of the total mass in this region. Given this property, a random sample from the expanded Voronoi region also (essentially) yields a random sample from X_i, which allows us to compute a good estimate of the centroid of X_i, and hence of Δ_1²(X_i).
[0066] We obtain a (1+ω)-optimal solution in time O(2^{O(k/ω)} nd) with constant probability. Since we incur an exponential dependence on k anyway, we just use the simple sampling procedure in the first stage to pick the k initial centers. Although the running time is exponential in k, it is significantly better than that of other schemes, and we obtain a simpler PTAS. Both of these features can be traced to the separation condition, which enables us to nearly isolate the positions of the optimal centers in the first stage.
Seeding Procedures Used in Stage I
[0067] Sampling. We pick k initial centers as follows: first pick two centers ĉ_1, ĉ_2 as in the 2-means case, that is, choose x, y ∈ X with probability proportional to ||x - y||². Suppose we have already picked i ≥ 2 centers ĉ_1, ..., ĉ_i. Now pick a random point x ∈ X with probability proportional to min_{j=1,...,i} ||x - ĉ_j||² and set that as center ĉ_{i+1}.
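A sketch of this stage-I sampling procedure (our own illustrative code; it maintains, for every point, the squared distance to its nearest already-chosen center, which is the incremental update discussed under 'Running time' below):

```python
import numpy as np

def seed_k_means(X, k, rng=np.random.default_rng()):
    """Pick k initial centers: the first pair as in the 2-means case, then
    each further center with probability proportional to the squared
    distance to its nearest previously chosen center."""
    seeds = list(seed_two_means_pair(X, rng))        # first two centers
    mind2 = np.minimum(((X - seeds[0]) ** 2).sum(axis=1),
                       ((X - seeds[1]) ** 2).sum(axis=1))
    while len(seeds) < k:
        nxt = X[rng.choice(len(X), p=mind2 / mind2.sum())]
        seeds.append(nxt)
        # O(nd) update of the nearest-center distances
        mind2 = np.minimum(mind2, ((X - nxt) ** 2).sum(axis=1))
    return np.array(seeds)

def seed_two_means_pair(X, rng):
    """Pick the first pair (x, y) with probability proportional to ||x - y||^2
    via the two-step procedure of the 2-means case."""
    n = len(X)
    c = X.mean(axis=0)
    d2c = ((X - c) ** 2).sum(axis=1)
    delta1 = d2c.sum()
    x = X[rng.choice(n, p=(delta1 + n * d2c) / (2 * n * delta1))]
    d2x = ((X - x) ** 2).sum(axis=1)
    y = X[rng.choice(n, p=d2x / (delta1 + n * ((c - x) ** 2).sum()))]
    return x, y
```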
[0068] Running time. The sampling procedure consists of k iterations, each of which takes O(nd) time. This is because after sampling a new point ĉ_{i+1}, we can update the quantity min_{j=1,...,i+1} ||x - ĉ_j||² for each point x in O(d) time. So the overall running time is O(nkd).

[0069] Analysis. Let ε² « ρ < 1 be a parameter that we will set later. As in the 2-means case, we define the core of cluster X_i as X_i^cor = { x ∈ X_i : ||x - c_i||² ≤ r_i² / ρ }. We show that under our separation assumption, the above sampling procedure will pick the k initial centers to lie in the cores of the clusters X_1, ..., X_k with probability (1 - O(ρ))^k. We also show that if we sample more than k, but still O(k), points, then with constant probability, every cluster will contain a sampled point that lies in a somewhat larger core that we call the outer core of the cluster. This analysis will be useful below.
[0070] Lemma 4.1: With probability 1 - O(ρ), the first two centers ĉ_1, ĉ_2 lie in the cores of different clusters, that is, Pr[∪_{i≠j} (ĉ_1 ∈ X_i^cor and ĉ_2 ∈ X_j^cor)] = 1 - O(ρ).

Now inductively suppose that the first i centers picked, ĉ_1, ..., ĉ_i, lie in the cores of clusters X_{j_1}, ..., X_{j_i}. We show that conditioned on this event, ĉ_{i+1} lies in the core of some cluster X_t where t ∉ {j_1, ..., j_i} with probability 1 - O(ρ). Given a set S of points, we use d(x, S) to denote min_{y ∈ S} ||x - y||.

[0071] Lemma 4.2: Pr[ĉ_{i+1} ∈ ∪_{t ∉ {j_1,...,j_i}} X_t^cor | ĉ_1, ..., ĉ_i lie in the cores of X_{j_1}, ..., X_{j_i}] = 1 - O(ρ).
[0072] Next, we analyze the case when more than k points are sampled. Let ρ_1 = ρ³. Define the outer core of X_i to be X_i^out = { x ∈ X_i : ||x - c_i||² ≤ r_i² / ρ_1 }. Note that X_i^cor ⊆ X_i^out. Let N = k / (1 - 5ρ) plus a lower-order term (depending on δ) proportional to 1 / (1 - 5ρ)², where 0 < δ < 1 is a desired error tolerance. One can show that at every sampling step, there is a constant probability that the sampled point lies in the core of some cluster whose outer core does not contain a previously sampled point. Observe that Lemma 4.3 below gives an unconditional bound. This allows us to argue, via a straightforward martingale analysis, that if we sample N points from X, then with some constant probability, each outer core X_i^out will contain a sampled point. Lemma 4.4 and Corollary 4.5 below summarize the results.
[0073] Lemma 4.3: Suppose that we have sampled i points {x̂_1, ..., x̂_i} from X. Let X_1, ..., X_m be all the clusters whose outer cores contain some sampled point x̂_j. Then Pr[x̂_{i+1} ∈ ∪_{t=m+1}^{k} X_t^cor] ≥ 1 - 5ρ.
[0074] Lemma 4.4: Suppose we sample N points x̂_1, ..., x̂_N from X using the above sampling procedure. Then Pr[for every i = 1, ..., k there exists some x̂_j ∈ X_i^out] ≥ 1 - δ.

[0075] Corollary 4.5:

(i) If we sample k points ĉ_1, ..., ĉ_k, then with probability (1 - O(ρ))^k, where ρ = Ω(ε^{2/3}), each point ĉ_i lies in a distinct core X_i^cor, so ||ĉ_i - c_i|| ≤ r_i / √ρ.

(ii) If we sample N points ĉ_1, ..., ĉ_N (with N as above) and ρ = √ε, then with probability 1 - O(ρ), each outer core X_i^out contains some ĉ_j, so ||ĉ_j - c_i|| ≤ r_i / √(ρ³).
Deletion / Merger Techniques
[0076] In some embodiments, it may be desirable to initially determine some abundant number of cluster centers, i.e., some number greater than k, the number of cluster centers desired at the end of the process, and then delete (sometimes called "merge") some cluster centers until k cluster centers remain. The initial abundant collection of cluster centers may be determined by any of the methods described herein, or by a hybrid of two or more of the methods. Some of the initial cluster centers may be selected by random methods and some may be selected by deterministic methods. Some may be chosen with respect to previously selected cluster centers and some may be chosen independently of previously selected cluster centers.
[0077] Once some abundant number of cluster centers has been selected, centers may be deleted or merged, for example by the techniques described below, until the desired number of cluster centers remain.
[0078] Greedy deletion procedure. We maintain a set of centers C that are used to cluster X. For purposes of the analysis below, we initialize C ← X, i.e., every point of X initially serves as a center, and delete centers until k centers remain. It should be understood, however, that for purposes of the techniques disclosed herein, the initial set of centers C may be selected in a variety of manners, including, by way of example, the ways described above and the sampling procedure described below. For any point x ∈ R^d, let R(x) ⊆ X denote the points of X in the Voronoi region of x (given the set of centers C). We call R(x) the Voronoi set of x. Repeat the following steps until |C| = k.

(i) Compute T = Σ_{x ∈ C} Σ_{y ∈ R(x)} ||y - x||², the cost of clustering X around C. For every x ∈ C, compute T_x = Σ_{z ∈ C \ {x}} Σ_{y ∈ R_{-x}(z)} ||y - z||², where R_{-x}(z) is the Voronoi set of z given the center set C \ {x}.

(ii) Pick the center y ∈ C for which T_y - T is minimum and set C ← C \ {y}.

(iii) Recompute the Voronoi sets R(x) ⊆ X for each (remaining) center x ∈ C. Now we "move" the centers to the centroids of their respective (new) Voronoi sets, that is, for every set R(x), we update C ← C \ {x} ∪ {ctr(R(x))}.

[0079] Running time. There are n - k iterations of the previous three steps. Each iteration takes O(n²d) time: computing T and the sets R(x) for each x takes O(n²d) time, and we can then compute each T_x in O(|R(x)|d) time (since while computing T, we can also compute for each point its second-nearest center in C). Therefore the overall running time is O(n³d).
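A straightforward (unoptimized) rendering of this deletion procedure follows (our own sketch; it recomputes Voronoi sets from scratch in each iteration rather than using the second-nearest-center bookkeeping that yields the O(n³d) bound):

```python
import numpy as np

def cluster_cost(X, centers):
    """Cost of clustering X around the given centers (squared Euclidean)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def greedy_deletion(X, k, initial_centers=None):
    """Start from an abundant center set (all of X by default), greedily delete
    the center whose removal increases the cost the least, and move the
    surviving centers to the centroids of their new Voronoi sets."""
    C = np.array(X if initial_centers is None else initial_centers, dtype=float)
    while len(C) > k:
        T = cluster_cost(X, C)
        # T_x - T for each candidate center x to delete
        increases = [cluster_cost(X, np.delete(C, j, axis=0)) - T
                     for j in range(len(C))]
        C = np.delete(C, int(np.argmin(increases)), axis=0)
        # move each remaining center to the centroid of its Voronoi set
        assign = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(len(C)):
            members = X[assign == j]
            if len(members) > 0:
                C[j] = members.mean(axis=0)
    return C
```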
[0080] A linear time seeding procedure. We now combine the sampling idea with the deletion procedure to obtain a seeding procedure that runs in time O(nkd + k³d) and succeeds with high probability. We first sample O(k) points from X using the sampling procedure. Then we run the deletion procedure on an O(k)-size instance consisting of the centroids of the Voronoi regions of the sampled points, with each centroid having a weight equal to the mass of X in its corresponding Voronoi region. The sampling process ensures that with high probability, every cluster X_i contains a sampled point that is close to its center c_i. This will allow us to argue that the Δ_k²(·) cost of the sampled instance is much smaller than its Δ_{k-1}²(·) cost, and that the optimal centers for the sampled instance lie near the optimal centers for X. By the analysis of the previous section, one can then show that after the deletion procedure the k centers are still close to the optimal centers for the sampled instance, and hence also close to the optimal centers for X.
[0081] Sampling. Sample N points from X using the sampling procedure above, with N as defined in the analysis of the previous section. Let S denote the set of sampled points.

[0082] Deletion phase. For each x ∈ S, let R(x) = { y ∈ X : ||y - x|| = min_{z ∈ S} ||y - z|| } be its Voronoi set (with respect to the set S). We now ignore X, and consider a weighted instance with point-set S̄ = { x̄ = ctr(R(x)) : x ∈ S }, where each x̄ has a weight w(x̄) = |R(x)|. Run the deletion procedure above on S̄ to obtain k centers ĉ_1, ..., ĉ_k.

[0083] Running time. Sampling takes O(nNd) = O(nkd) time. The run-time analysis of the deletion phase above shows that the deletion phase takes O(N³d) = O(k³d) time. So the overall running time is O(nkd + k³d).
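One way to realize this two-phase seeding is sketched below (our own code and naming, not the only possible implementation; the set S of sampled points is assumed to come from the sampling sketch given earlier, and the deletion step is a weight-aware rendering of the greedy deletion procedure above):

```python
import numpy as np

def build_weighted_instance(X, S):
    """Collapse X onto the Voronoi regions of the sampled points S:
    one weighted point (the region's centroid) per sampled point."""
    S = np.array(S, dtype=float)
    assign = ((X[:, None, :] - S[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    pts, wts = [], []
    for j in range(len(S)):
        members = X[assign == j]
        if len(members) > 0:
            pts.append(members.mean(axis=0))   # ctr(R(x))
            wts.append(len(members))           # w(x_bar) = |R(x)|
    return np.array(pts), np.array(wts, dtype=float)

def weighted_cost(P, w, centers):
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float((w * d2.min(axis=1)).sum())

def weighted_greedy_deletion(P, w, k):
    """Greedy deletion on the weighted O(k)-size instance."""
    C = P.copy()
    while len(C) > k:
        T = weighted_cost(P, w, C)
        inc = [weighted_cost(P, w, np.delete(C, j, axis=0)) - T
               for j in range(len(C))]
        C = np.delete(C, int(np.argmin(inc)), axis=0)
        assign = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(len(C)):
            m = assign == j
            if w[m].sum() > 0:
                C[j] = (w[m][:, None] * P[m]).sum(axis=0) / w[m].sum()
    return C
```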
[0084] Lemma 4.9: Δ_k²(S̄) = O(ε²) Δ_{k-1}²(S̄). Moreover, for each optimal center c_i, there is some s̄ ∈ S̄ lying close to c_i.

[0085] Lemma 4.10: For each optimal center c_i, there is a distinct center ĉ_j such that ||ĉ_j - c_i|| ≤ D_i / 10.
Procedures Used in Stage II
[0086] Given k seed centers ĉ_1, ..., ĉ_k located sufficiently close to the optimal centers after stage I, we use one of two procedures in stage II to obtain a near-optimal clustering: the ball-k-means step, which yields a (1+f(ε))-approximation algorithm, or the centroid estimation step, which yields a PTAS with running time exponential in k. Define d̂_i = min_{j≠i} ||ĉ_j - ĉ_i||.

[0087] Ball-k-means step. Let B_i be the points of X in a ball of radius d̂_i / 3 around ĉ_i, and let c̄_i be the centroid of B_i. Return c̄_1, ..., c̄_k as the final centers.
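A sketch of the ball-k-means step (our own naming; the radius for each seed is a third of the distance to its nearest other seed):

```python
import numpy as np

def ball_k_means_step(X, seeds):
    """Move each seed to the centroid of the points of X lying within one
    third of the distance to the nearest other seed."""
    seeds = np.array(seeds, dtype=float)
    finals = seeds.copy()
    for i, ci in enumerate(seeds):
        others = np.delete(seeds, i, axis=0)
        radius = np.linalg.norm(others - ci, axis=1).min() / 3.0
        ball = X[((X - ci) ** 2).sum(axis=1) <= radius * radius]
        if len(ball) > 0:
            finals[i] = ball.mean(axis=0)
    return finals
```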
[0088] Centroid estimation. For each i, we will obtain a set of candidate centers for cluster X_i. Fix β = 25 / (25 + 256ε²). Let R_i' = { x ∈ X : ||x - ĉ_i|| ≤ min_j ||x - ĉ_j|| + d̂_i / 4 } be the expanded Voronoi region of ĉ_i. Sample 4 / (βω) points independently and uniformly at random from R_i', where ω is a given input parameter, to obtain a random subset S_i ⊆ R_i'. Compute the centroid of every subset of S_i of size 2 / ω; let T_i be the set consisting of all these centroids. Select the candidates c̃_1 ∈ T_1, ..., c̃_k ∈ T_k that yield the least-cost solution, and return these as the final centers.
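A sketch of this centroid-estimation step (illustrative only; subset and sample sizes are rounded to integers, the exhaustive search over candidate combinations is written naively, and it is practical only for very small k and large ω):

```python
import numpy as np
from itertools import combinations, product
from math import ceil

def centroid_estimation(X, seeds, omega, eps, rng=np.random.default_rng()):
    """For each seed, sample from its expanded Voronoi region, form candidate
    centroids from small subsets, and exhaustively pick the best combination."""
    seeds = np.array(seeds, dtype=float)
    k = len(seeds)
    beta = 25.0 / (25.0 + 256.0 * eps * eps)
    dist = np.sqrt(((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2))
    nearest = dist.min(axis=1)
    sample_size = ceil(4.0 / (beta * omega))
    subset_size = max(1, ceil(2.0 / omega))
    candidates = []
    for i in range(k):
        di = np.delete(np.linalg.norm(seeds - seeds[i], axis=1), i).min()
        region = X[dist[:, i] <= nearest + di / 4.0]   # expanded Voronoi region
        if len(region) == 0:
            candidates.append([seeds[i]])
            continue
        idx = rng.choice(len(region), size=min(sample_size, len(region)), replace=True)
        S_i = region[idx]
        T_i = [S_i[list(sub)].mean(axis=0)
               for sub in combinations(range(len(S_i)), min(subset_size, len(S_i)))]
        candidates.append(T_i)
    best, best_cost = None, np.inf
    for combo in product(*candidates):
        centers = np.stack(combo)
        cost = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1).sum()
        if cost < best_cost:
            best, best_cost = centers, cost
    return best
```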
[0089] Lemma 4.11: Let Y_i = { x ∈ X_i : ||x - c_i||² ≤ r_i² / ρ }. If ||ĉ_i - c_i|| ≤ D_i / 10 for each i, then Y_i ⊆ B_i ⊆ X_i, and ||c̄_i - c_i||² ≤ (ρ / (1 - ρ)) r_i².

[0090] Lemma 4.12: Suppose ||ĉ_i - c_i|| ≤ D_i / 10 for each i. Then X_i ⊆ R_i', where R_i' is as defined above, and |X_i| ≥ β |R_i'|.
A linear time constant-factor approximation technique
[0091] A linear time constant-factor approximation technique may be accomplished by: 1 - Execute the seeding procedure above to obtain k initial centers ĉ_1, ..., ĉ_k. 2 - Run the ball-k-means step to obtain the final centers.
[0092] The running time is O(nkd + k³d). By Lemma 4.10, we know that with probability 1 - O(√ε), for each c_i, there is a distinct center ĉ_i such that ||ĉ_i - c_i|| ≤ D_i / 10. Combined with Lemma 4.11, this yields the following theorem.

[0093] Theorem 4.13: If Δ_k²(X) / Δ_{k-1}²(X) ≤ ε² for a small enough ε, the above algorithm returns a solution of cost at most Δ_k²(X) / (1 - 37ε²) with probability 1 - O(√ε) in time O(nkd + k³d).
A PTAS for any fixed k
[0094] The PTAS combines the sampling procedure with the centroid estimation step: 1 - Use the procedure above to pick k initial centers ĉ_1, ..., ĉ_k. 2 - Run the centroid estimation procedure to obtain the final centers.
[0095] The running time is dominated by the exhaustive search in the centroid estimation procedure, which takes time O(2^{4k/(βω)}). We show that the cost of the final solution is at most (1+ω) Δ_k²(X) with probability γ^k for some constant γ. By repeating the procedure O(γ^{-k}) times, we can boost this to a constant probability of success.

[0096] Theorem 4.14: Assuming that Δ_k²(X) ≤ ε² Δ_{k-1}²(X) for a small enough ε, there is a PTAS for the k-means problem that returns a (1+ω)-optimal solution with constant probability in time O(2^{O(k(1+ε²)/ω)} nd).
The separation condition
[0097] We show that our separation condition implies, and is implied by, the condition that any two near-optimal k-clusterings disagree on only a small fraction of the data.
[0098] Let cost(x_1, ..., x_k) denote the cost of clustering X around the centers x_1, ..., x_k in R^d. We use R(x) to denote the Voronoi region of point x (the centers will be clear from the context). Let S_1 ⊕ S_2 = (S_1 \ S_2) ∪ (S_2 \ S_1) denote the symmetric difference of S_1 and S_2.
[0099] Theorem 5.1: Let α² = (1 - 401ε²) / 400. Suppose that X ⊆ R^d is ε-separated for k-means for a small enough ε. The following hold:

(i) If there are centers ĉ_1, ..., ĉ_k such that cost(ĉ_1, ..., ĉ_k) ≤ α² Δ_{k-1}²(X), then for each ĉ_i there is a distinct optimal center c_{σ(i)} such that |R(ĉ_i) ⊕ X_{σ(i)}| ≤ 28ε² |X_{σ(i)}|;

(ii) If X̃ is obtained by perturbing each point x in X_i by a sufficiently small distance, then X̃ remains ε'-separated for k-means for a correspondingly small ε'; that is, the separation condition is robust under such (even adversarial) perturbations.

[0100] Notice that part (i) also yields that if X is ε-separated for k-means, then any k-clustering of cost at most (α²/ε²) Δ_k²(X) has small Hamming distance to the optimal k-clustering (more strongly, each cluster has small Hamming distance to a distinct optimal cluster). We now show the converse.
(more strongly, each cluster has small Hamming distance to a distinct optimal cluster). We now show the converse.
[0101] Theorem 5.2: Let ε ≤ 1 / 3. Suppose that for every ^-clustering X1 ,..., X k of X of cost at most α2 A2 k (X) ,
(i) there exists a bijection σ such that for all i, \ Xt ® Xσ(ι) I ≤ ε I Xσ{ι) I. Then, X is α- separated for fc-means;
(ii) there is a bijection σ such that
Figure imgf000023_0001
X1 Θ Xσ([) I < 2ε I {k-l } IXI. Then, X is α- separated for k- means.
Proofs
[0102] Proofs relating to the above Lemmas, Corollaries, and Theorems are believed to be ancillary to the various embodiments of the present invention as set forth in the claims, and therefore not necessary for enablement of such various embodiments of the present invention. Accordingly, such proofs have been omitted from the present disclosure. However, such proofs are available from U.S. Provisional Application No. 60/822,903, filed August 18, 2006 and entitled "SEEDING METHOD FOR k-MEANS CLUSTERING AND OTHER CLUSTERING ALGORITHMS", hereby incorporated by reference in its entirety. Additionally, such proofs are also available from "The Effectiveness of Lloyd-Type Methods for the k-Means Problem" as published in the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06) at pages 165-176, also hereby incorporated by reference in its entirety.
[0103] The various embodiments of the present invention are directed in particular toward data analysis and methods of clustering data points into clusters that represent groups of interest. One method of performing clustering is the so-called k-means clustering heuristic. However, such heuristic tends to be sensitive to the way initial cluster centers or seeds are chosen, which is significant given that some methods for randomly selecting seeds can lead to results of variable quality. Accordingly, in various embodiments of the present invention, a procedure for choosing initial seeds to find centers of clusters is provided which leads to more effective clustering procedures, both in terms of the quality of results and running time.
[0104] The details of various seeding procedures for the clustering procedures and the clustering procedure itself, including justifications and method steps, are set forth in great detail above. We set forth below and in Fig. 5 detailed steps for implementing some embodiments of the teachings herein.
[0105] Preliminarily it is to be understood that the seeding procedure and clustering procedure are performed with regard to a data set of items, each of which must be in an analyzable form that may be best understood as a single point in a multi-dimensional space. Accordingly, such data set is procured, obtained, or otherwise provided, as a collection of points 500. As should be understood, such procurement and transformation may be performed in any appropriate manner without departing from the spirit and scope of the present invention.
[0106] With such collection of points it is to be appreciated that the seeding procedure for the clustering procedure receives as inputs a number n of the points from the collection, each of which is denoted D_i for i = 1, 2, 3, ..., n, and each of which resides in an m-dimensional space. In addition, the seeding procedure receives as an input an integer k between 2 and n-1 which represents the number of seeds that are to be produced for the clustering procedure. As should be understood, based thereon, the seeding procedure will output k points out of the n input points that will serve as seeds.
Picking the first seed
[0107] In a seeding procedure for picking an initial collection of cluster centers, various techniques may be employed for selecting the at least one initial cluster center, i.e., seed 502. In some embodiments, the first seeds may be determined by selecting one or more points at random from the collection of data points. The random selection may be made from any probability distribution assigned to the data points. In some embodiments the first seeds are selected uniformly at random from the data points.
[0108] The center of gravity C of the n points may be calculated, for example by taking the average over the n points of each of the m coordinates, i.e., C = (1/n) Σ_{i=1}^{n} D_i. In some embodiments, this center of gravity C, sometimes called the centroid, is used as the first seed.
[0109] In some embodiments, the first seed may be chosen by the following procedure:

(i) For each of the n points, the distance C_i⁰ of such point D_i to C is calculated. Such distance C_i⁰ may be calculated in any appropriate manner without departing from the spirit and scope of the teachings herein. For example, the distance may be calculated as the squared Euclidean distance between such point D_i and C.

(ii) An average squared distance is then calculated as C̄⁰ = (1/n) Σ_{i=1}^{n} C_i⁰, and with such distances C_1⁰, ..., C_n⁰, and C̄⁰, a point μ_1 from among the n points is selected at random as the first seed, with the probability of picking any point D_i as μ_1 given by: Pr[μ_1 = D_i] = (C̄⁰ + C_i⁰) / (2n C̄⁰).
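Expressed in code, the first-seed selection of steps (i)-(ii) might look as follows (an illustrative sketch; names such as `pick_first_seed` are ours, and the squared Euclidean distance is assumed):

```python
import numpy as np

def pick_first_seed(D, rng=np.random.default_rng()):
    """Pick the first seed with Pr[mu_1 = D_i] = (Cbar0 + C_i0) / (2 n Cbar0),
    where C_i0 = ||D_i - C||^2 and Cbar0 is the average of the C_i0."""
    n = len(D)
    C = D.mean(axis=0)                      # center of gravity of the n points
    C0 = ((D - C) ** 2).sum(axis=1)         # C_i^0 for each point
    Cbar0 = C0.mean()
    prob = (Cbar0 + C0) / (2 * n * Cbar0)   # sums to 1 over all points
    return rng.choice(n, p=prob)            # index of the chosen point mu_1
```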
Picking subsequent seeds
[0110] With μ_1 selected, points μ_2, ..., μ_k are then selected as seeds according to the following iterative process. Supposing μ_1, ..., μ_{j-1} have already been chosen (initially j = 2 and only μ_1 has been chosen), we then pick μ_j by:

(i) For each 1 ≤ i ≤ n and each 1 ≤ t ≤ j-1, calculate C_i^t, the distance between D_i and μ_t. As before, each distance may be calculated in any appropriate manner, such as the aforementioned squared Euclidean distance. Let C_i denote the minimum of {C_i^t}_{t=1}^{j-1}, i.e., the distance between point D_i and its closest previously selected seed (step 504).

(ii) Update C̄ to be the average of C_i over all 1 ≤ i ≤ n, and pick μ_j = D_i with assigned probability Pr[μ_j = D_i] = C_i / (n C̄) (step 506).
[0111] If more seeds are required (the "Yes" branch out of decision 508), steps 504 and 506 are repeated until all of the required seeds μ_1, ..., μ_k have been selected. Thereafter, μ_1, ..., μ_k as the selected seeds are then supplied (step 510) to a clustering procedure, and the clustering procedure iteratively moves the seeds within the space until a termination condition is reached and cluster centers are defined.
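The full flow of steps 502 through 510 can be sketched as follows (our own code and names; the centroid-based first-seed rule of paragraph [0109] and the minimum-distance rule of paragraph [0110] are both assumed, and any Lloyd-type procedure can consume the returned seeds):

```python
import numpy as np

def select_seeds(D, k, rng=np.random.default_rng()):
    """Steps 502-508: pick the first seed from the centroid-based distribution,
    then pick each further seed with probability proportional to its distance
    to the closest previously selected seed."""
    n = len(D)
    C0 = ((D - D.mean(axis=0)) ** 2).sum(axis=1)          # C_i^0
    first = rng.choice(n, p=(C0.mean() + C0) / (2 * n * C0.mean()))
    seeds = [first]
    nearest = ((D - D[first]) ** 2).sum(axis=1)           # C_i: distance to closest seed
    while len(seeds) < k:                                 # decision 508
        j = rng.choice(n, p=nearest / nearest.sum())      # Pr[mu_j = D_i] = C_i / (n Cbar)
        seeds.append(j)
        nearest = np.minimum(nearest, ((D - D[j]) ** 2).sum(axis=1))
    return D[seeds]                                       # step 510: hand off to clustering

# e.g. seeds = select_seeds(points, k); the seeds can then initialize a Lloyd-type iteration
```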
[0112] In some embodiments, more than one seed may be selected by techniques such as those described in connection with "Picking the first seed" above before proceeding to the techniques of "Picking subsequent seeds."
[0113] In some embodiments, an abundant number of seeds may be initially selected. That is, some number of seeds greater than that to be used by a clustering algorithm are initially selected. The collection of seeds is then culled, such as for example by the deletion / merger techniques described above. The initial abundant collection of seeds may be chosen by any suitable method, such as those described above for example, or by a combination of methods. For example, some number of seeds may be chosen by the procedure described in connection with Fig. 5 above and then the collection of seeds may be augmented by selecting additional seeds at random. As another example, some number of seeds may be picked randomly, and then the collection of seeds augmented by the procedure described in connection with steps 504 to 508 of Fig. 5.
[0114] Generally, the seeds that are selected by the techniques disclosed herein will be communicated to a clustering algorithm for subsequent determination of a desirable clustering of given data. For example, a Lloyd-type iteration algorithm may take the selected seeds as a starting point for an iterative process that determines a clustering for the data.
CONCLUSION
[0115] The programming necessary to effectuate the processes performed in connection with the various embodiments of the present invention is straight-forward and should be apparent to the relevant programming public. Accordingly, such programming is not attached hereto. Any particular programming, then, may be employed to effectuate the various embodiments of the present invention without departing from the spirit and scope thereof.
[0116] In the present invention, systems and methods are provided for performing Lloyd-type k-means clustering that more accurately define the initial seed for each cluster prior to performing the iterative process of such Lloyd-type k-means clustering. An initial procedure is performed to at least roughly identify the location of each true cluster based on distances between at least a sampling of all the points in the data set. Once the first seed μ_1 is picked, each successive seed μ_2, ..., μ_k is then picked proportional to the distances of all other points to the previously picked seeds. Thus, each point in effect competes to be the next seed with probability proportional to the distance of the point to the previously picked seeds. It should be appreciated that changes could be made to the embodiments described above without departing from the inventive concepts thereof. It should be understood, therefore, that this invention is not limited to the particular embodiments disclosed, but it is intended to cover modifications within the spirit and scope of the present invention as defined by the appended claims.

Claims

What is Claimed:
1. A method for selecting a set of seeds for use in clustering data points, the data points represented by points in a multi-dimensional space, the method comprising: selecting at least one seed; and selecting at least one additional seed by selecting a point at random from the set of data points wherein the probability of selecting a given point from the set of data points is proportional to a distance from the given point to the closest previously selected seed.
2. The method as recited in claim 1, wherein selecting at least one seed comprises: selecting a point at random from a set of data points wherein the probability of selecting a given point from the set of data points is determined by the distance of the given point from the centroid of the set of data points.
3. The method as recited in claim 2, wherein the probability of selecting a given point from the set of data points is proportional to a constant plus the ratio of the distance between the given point and the centroid to the average over all points in the data set of the distance between each point and the centroid.
4. The method as recited in claim 1, wherein selecting at least one seed comprises: selecting the centroid of a set of at least some of the points in the set of data points as a seed.
5. The method as recited in claim 1, wherein selecting at least one seed comprises: selecting at least one point uniformly at random from the set of data points.
6. The method as recited in claim 1 wherein the distance between two points is computed as the squared Euclidean distance between the two points in the multi-dimensional space.
7. A method for selecting a set of seeds for use in clustering data points, the data points represented by points in a multi-dimensional space, the method comprising: determining a desired number of seeds; selecting a set of seeds wherein the number of seeds in the set of seeds is greater than the desired number of seeds, and wherein at least one of the seeds in the initial set of seeds is selected by selecting a point at random from the set of data points wherein the probability of selecting a given point from the set of data points is proportional to distance from the given point to the closest previously selected seed; and iteratively removing seeds from the set of seeds until the number of seeds in the set of remaining seeds is the desired number of seeds.
8. The method as recited in claim 7, wherein iteratively removing seeds comprises: initializing a current set of seeds as the selected set of seeds; and repeating the steps of: determining for each seed x in the current set of seeds a cost of clustering in relation to a clustering based on a set of seeds consisting of the current set of seeds except x; identifying a seed x0 for which its determined cost of clustering was minimum among the determined costs; removing the seed x0 from the current set of seeds to obtain an updated set of seeds; determining a clustering based on the updated set of seeds; and relocating each seed in the updated set of seeds to the centroid of the cluster in the clustering based on the updated set of seeds containing the seed; until the updated set of seeds contains the desired number of seeds.
9. A computer readable medium comprising computer executable instructions for selecting a set of seeds for use in clustering data points, the data points represented by points in a multidimensional space, the computer executable instructions comprising computer executable instructions for: selecting at least one seed; and selecting at least one additional seed by selecting a point at random from the set of data points wherein the probability of selecting a given point from the set of data points is proportional to distance from the given point to a closest previously selected seed.
10. The computer readable medium as recited in claim 9, wherein the computer executable instructions for selecting a first seed comprise computer executable instructions for: selecting a point at random from a set of data points wherein the probability of selecting a given point from the set of data points is determined by the distance of the given point from the centroid of the set of data points.
11. The computer readable medium as recited in claim 10, wherein the probability of selecting a given point from the set of data points is proportional to a constant plus the ratio of the distance between the given point and the centroid to the average over all points in the data set of the distance between each point and the centroid.
12. The computer readable medium as recited in claim 9, wherein the computer executable instructions for selecting at least one seed comprise computer executable instructions for: selecting the centroid of a set of at least some of the points in the set of data points.
13. The computer readable medium as recited in claim 9, wherein the computer executable instructions for selecting at least one seed comprise computer executable instructions for: selecting a point uniformly at random from the set of data points.
14. A computer readable medium comprising computer executable instructions for use in selecting a set of seeds for use in clustering data points, the data points represented by points in a multi-dimensional space, the computer executable instructions comprising computer executable instructions for: determining a desired number of seeds; selecting a set of seeds wherein the number of seeds in the set of seeds is greater than the desired number of seeds, and wherein at least one of the seeds in the initial set of seeds is selected by selecting a point at random from the set of data points wherein the probability of selecting a given point from the set of data points is proportional to distance from the given point to a closest previously selected seed; and iteratively removing seeds from the set of seeds until the number of seeds in the set of remaining seeds is the desired number of seeds.
16. The computer readable medium as recited in claim 15, wherein the computer executable instructions for iteratively removing seeds comprise computer executable instructions for: initializing a current set of seeds as the selected set of seeds; and repeating the steps of: determining for each seed x in the current set of seeds a cost of clustering in relation to a clustering based on a set of seeds consisting of the current set of seeds except x; identifying a seed x0 for which its determined cost of clustering was minimum among the determined costs; removing the seed x0 from the current set of seeds to obtain an updated set of seeds; determining a clustering based on the updated set of seeds; and relocating each seed in the updated set of seeds to the centroid of the cluster in the clustering based on the updated set of seeds containing the seed; until the updated set of seeds contains the desired number of seeds.