Cluster analysis in Chameleon Statistics

Cluster analysis is a commonly used technique (or set of techniques) for identifying structure in data when such structure is unknown a priori. More specifically, cluster analysis is the classification of sets of multivariate data into groups or ‘clusters’ of similar samples. Most standard clustering methods fall into one of two categories, namely (i) partition methods, and (ii) hierarchical methods.

In partition clustering, every data sample is initially assigned to a cluster in some (possibly random) way. Samples are then iteratively transferred from cluster to cluster until some criterion function is minimized. Once the process is complete, the samples will have been partitioned into separate compact clusters. Examples of partition clustering methods are k-means and Lloyd's method.
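The partition procedure described above can be illustrated with a minimal k-means sketch in plain Python. It is not the Chameleon Statistics implementation. The function names (`sq_dist`, `kmeans`, `within_cluster_ss`), the choice of random samples as starting centres, and the use of the within-cluster sum of squares as the criterion function are all assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative k-means: returns (labels, centres).

    Assumption: starting centres are k samples drawn at random.
    """
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Transfer step: each sample moves to the cluster with the nearest centre.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centres[j])) for p in points
        ]
        if new_labels == labels:
            break  # no sample changed cluster, so the iteration has converged
        labels = new_labels
        # Update step: each centre becomes the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centres


def within_cluster_ss(points, labels, centres):
    """Criterion function: total squared distance of samples to their centres."""
    return sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))
```

Calling `kmeans(points, 2)` on a list of coordinate tuples returns one cluster label per sample plus the final centres. Like most partition methods, this sketch only finds a local minimum of the criterion, so different seeds can give different partitions on harder data.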
The best way to gain some intuition into the process of clustering is through some simple examples.
While this may seem quite simple in principle, it is often very difficult in practice. Simple clustering methods typically assume equally-sized, compact, spheroidal clusters. Problems can occur when the actual clusters (a) are not of equal size, (b) have complex shapes, or (c) have complex topologies. Agreement with human intuition in such cases generally requires density-based techniques. Of the standard methods, single-link hierarchical clustering is often the most suitable.
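To make the single-link idea concrete, here is a minimal sketch in plain Python. It repeatedly merges the two clusters whose closest members are nearest to each other, stopping when a requested number of clusters remains. The function name `single_link` and the `n_clusters` stopping rule are assumptions for illustration. It sorts all pairwise distances, which is adequate for small data sets only and is not the fast algorithm used in Chameleon Statistics.

```python
from itertools import combinations
from math import dist


def single_link(points, n_clusters):
    """Illustrative single-link clustering: returns one label per point."""
    parent = list(range(len(points)))  # union-find forest, one tree per cluster

    def find(i):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, shortest first.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    remaining = len(points)
    for _, i, j in edges:
        if remaining <= n_clusters:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            # The closest pair of points in different clusters: merge the clusters.
            parent[ri] = rj
            remaining -= 1

    # Relabel cluster roots as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(len(points))]
```

For example, given two long parallel rows of points whose spacing within a row is smaller than the gap between rows, `single_link(points, 2)` assigns each row to its own cluster, because merging follows chains of near neighbours rather than distance to a cluster centre.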
Other standard methods such as k-means or Ward's method are also valuable. In many common problems, such as categorizing customer types, partitioning land into convenient "postal regions", or defining "codebooks" for compressing data, compact, equally-sized and evenly-spaced clusters are desired.

Realistic data sets sometimes contain clusters within clusters. Partition methods, as well as some hierarchical methods, are unable to detect subclusters. Density-based hierarchical clustering methods such as the single-link method and its relatives are well-suited to identifying subclusters.

Care must sometimes be taken in selecting the best clustering method(s) for a particular dataset, and some experimentation may be required. Applying inappropriate methods to non-trivial data sets such as those described above can give undesirable or unexpected results.

Seventh Sense Software actively carries out in-house research into the development of state-of-the-art clustering algorithms and methods. For example, we have developed the world's fastest single-link clustering algorithm for Euclidean datasets. Descriptions of some of our results and discoveries are available on the Algorithms page.
Ordering information:
Payment can be made by American Express, MasterCard, VISA, money order, or checks drawn on US banks. Institutional purchase orders are also accepted within the USA. Call, FAX, or e-mail to place your order today! Orders will be sent out promptly.
e-mail: sales@exetersoftware.com