Cluster analysis in Chameleon Statistics
Cluster analysis is a commonly used technique (or set of techniques) for identifying structure in data when such structure is unknown a priori.
More specifically, cluster analysis is the classification of sets of multivariate data into groups or ‘clusters’ of similar samples. Most standard clustering methods fall into one of two categories, namely (i) partition methods, and (ii) hierarchical methods.
In partition clustering, every data sample is initially assigned to a cluster in some (possibly random) way. Samples are then iteratively transferred from cluster to cluster until some criterion function is minimized. Once the process is complete, the samples will have been partitioned into separate compact clusters. Examples of partition clustering methods are k-means and Lloyd's method.
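As an illustration, here is a minimal pure-Python sketch of Lloyd's k-means iteration (the function name and the tiny two-cluster dataset are our own illustrative choices, not part of Chameleon Statistics):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate an assignment step and an update step
    until the centroids stop moving (the criterion stops decreasing)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initial (random) assignment
    for _ in range(iters):
        # Assignment step: each sample joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new = [tuple(sum(vals) / len(vals) for vals in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                   # converged: a stable partition
            break
        centroids = new
    return centroids, clusters

# Two obvious 2-D clusters, around (0, 0) and (10, 10).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(data, k=2)
```

On well-separated data like this, the iteration settles quickly into the two compact clusters regardless of the random initialization; on harder data, k-means can converge to a local minimum of the criterion, which is why multiple restarts are common practice.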
In hierarchical clustering, each sample is initially considered a member of its own cluster, after which clusters are recursively combined in pairs according to some predetermined condition until eventually every point belongs to a single cluster. The resulting hierarchical structure may be represented by a binary tree or "dendrogram", from which the desired clusters may be extracted. Examples of hierarchical clustering methods are the single-link, complete-link, group-average, centroid, median, Ward's, and parametric Lance-Williams methods.
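The agglomerative process can be sketched directly. The following naive single-link implementation (our own illustrative code, with a hypothetical two-row dataset) merges the pair of clusters whose closest members are nearest, stopping when the requested number of clusters remains rather than building the full dendrogram:

```python
def single_link(points, num_clusters):
    """Naive agglomerative single-link clustering.

    Every point starts in its own cluster; the two clusters whose
    closest cross-cluster pair of points is nearest are merged,
    repeatedly, until num_clusters clusters remain."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    clusters = [[p] for p in points]           # each sample is its own cluster
    while len(clusters) > num_clusters:
        # Single-link distance between clusters i and j is the minimum
        # distance over all cross-cluster point pairs.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist2(p, q)
                               for p in clusters[ij[0]] for q in clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))    # merge cluster j into cluster i
    return clusters

# Two rows of points, 10 units apart: single-link merges along each row first.
chain = [(x, 0) for x in range(5)] + [(x, 10) for x in range(5)]
clusters = single_link(chain, 2)
```

A production implementation would record each merge (and its distance) to build the dendrogram, and would use a smarter update scheme than this O(n³) search; the sketch only shows the merge logic.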
The best way to gain some intuition into the process of clustering is through some simple examples:
Human beings have an inherent ability to identify clusters in two dimensions. Shown a scatter plot of, say, two well-separated groups of points, we can immediately see that there are two clusters. Our aim is to teach computers how to emulate this intuition, and then to generalize such capabilities to any number of dimensions.
While this may seem quite simple in principle, it is often very difficult in practice.
Simple clustering methods typically assume equally-sized compact spheroidal clusters. Problems can occur when actual clusters are (a) not of equal size, (b) have complex shapes, or (c) have complex topologies. Agreement with human intuition in such cases generally requires use of more density-based techniques. Of the standard methods, single-link hierarchical clustering is often most suitable.
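One way to see why single-link can follow elongated or irregular shapes is its well-known equivalence to cutting the longest edges of a Euclidean minimum spanning tree: cutting the k-1 longest MST edges yields the k single-link clusters. A self-contained sketch (the two-strip dataset and function names are our own illustrative choices):

```python
def mst_single_link(points, k):
    """Single-link clusters via the minimum spanning tree: deleting the
    k-1 longest MST edges leaves exactly the k single-link clusters."""
    n = len(points)
    def d(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    # Prim's algorithm: grow the MST from point 0, collecting its edges.
    edges = []
    best = {i: (d(0, i), 0) for i in range(1, n)}   # cheapest link into the tree
    while best:
        i = min(best, key=lambda v: best[v][0])
        w, src = best.pop(i)
        edges.append((w, src, i))
        for j in best:
            if d(i, j) < best[j][0]:
                best[j] = (d(i, j), i)

    # Keep the n-k shortest edges, i.e. cut the k-1 longest ones.
    edges.sort()
    parent = list(range(n))
    def find(x):                                    # union-find root lookup
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, a, b in edges[: n - k]:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())

# Two long, thin parallel strips: a shape that centroid-based methods
# tend to split lengthwise, but that single-link recovers correctly.
strips = [(x, 0) for x in range(10)] + [(x, 3) for x in range(10)]
clusters = mst_single_link(strips, 2)
```

Because each strip is a chain of closely spaced points, every within-strip MST edge is shorter than the single edge bridging the strips, so cutting the longest edge separates the two strips exactly.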
Other standard methods, such as k-means and Ward's, are also valuable. Many common problems, such as categorizing customer types, partitioning land into convenient "postal regions", or defining "codebooks" for compressing data, call for compact, equally-sized, evenly-spaced clusters.
Realistic data sets sometimes contain clusters within clusters. Partition methods, as well as some hierarchical methods, are unable to detect subclusters. Density-based hierarchical clustering methods such as the single-link method and its relatives are well-suited to correctly identifying subclusters.
Sometimes care must be taken in selecting the best clustering method(s) for a particular dataset, and some experimentation may be required. Applying inappropriate methods to non-trivial data sets such as those described above can give undesirable or unexpected results.
Seventh Sense Software actively carries out in-house research into the development of state-of-the-art clustering algorithms and methods. For example, we have developed the world's fastest single-link clustering algorithm for Euclidean datasets. Descriptions of some of our results and discoveries are available on the Algorithms page.