Exeter Software

Density estimation in Chameleon Statistics

Data ‘samples’ are by definition representatives of some unknown ‘population’ (e.g. surveys). The goal of density estimation is to model the distribution of data in the population based upon the distribution of the samples.

Large sample sets will more accurately reflect the population than small ones. Accurate estimation of population distributions from samples of limited size is an important but challenging problem.

Consider the sample data set on the left. We would like to estimate the population distribution from which these samples are derived. Most density-estimators are based upon one or more of the following techniques:

(a) Spatial binning methods.
(b) Nearest-neighbor methods.
(c) (Gaussian) mixture-models.
(d) Kernel-based methods.

Spatial binning methods simply partition the space into regular blocks and count the number of samples in each block. The population density estimate within each block is given by the number of samples per unit volume for the block.

Properties: Only discrete density estimates are obtained, but they are normalized. There is an unpleasant trade-off between bin size and quality of the density estimate, leaving the problem of how to find the optimal bin size. Very poor quality is obtained with small samples or when the number of variables is large. The method is only really suitable for purely categorical data, which are already naturally discrete.
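As a rough illustration of spatial binning in one dimension, the sketch below counts samples in equal-width bins and divides each count by the sample size times the bin width. The function name `histogram_density` and its parameters are hypothetical, chosen only for this example, and the result is normalized only under the assumption that every sample falls inside the range [lo, hi).

```python
def histogram_density(samples, lo, hi, n_bins):
    """Per-bin density estimates on [lo, hi): count / (n * bin_width).

    Illustrative sketch only; assumes 1-D data and equal-width bins.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        if lo <= x < hi:
            # Clamp guards against floating-point round-up at the top edge.
            index = min(int((x - lo) / width), n_bins - 1)
            counts[index] += 1
    n = len(samples)
    # Dividing by n makes the bin densities integrate to 1
    # when all samples lie inside [lo, hi).
    return [c / (n * width) for c in counts]
```

Changing `n_bins` on the same data shows the trade-off described above: few bins hide the shape of the distribution, while many bins leave most blocks empty.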

In nearest-neighbor methods, the population density estimate for a test point is obtained by measuring the volume V of the smallest ball, centered on the test point, that contains its k nearest sample points. The associated density estimate is given by the ratio k/V.

Properties: The overall shapes of population distributions are generally modeled well. The density estimate tends to be good inside clusters, where sample points are plentiful, but it is overestimated in the tails; as a result, the overall estimate is non-integrable. The density estimates also show sudden sharp fluctuations. As with the spread in kernel methods, small values of k tend to overfit the sample while large values oversmooth, leaving the problem of selecting the optimal value of k.
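The k/V ratio can be sketched in one dimension, where the "ball" around a test point is simply an interval of length 2r and r is the distance to the k-th nearest sample. The function name `knn_density` is hypothetical, and dividing by the sample size n is an added assumption (a common scaling so that values are comparable with other estimators), not something stated above.

```python
import math

def knn_density(x, samples, k):
    """k-nearest-neighbor density estimate at x for 1-D samples.

    Illustrative sketch only: returns k / (n * V), where V = 2 * r and
    r is the distance from x to its k-th nearest sample.
    """
    distances = sorted(abs(x - s) for s in samples)
    r = distances[k - 1]          # radius of the ball holding k samples
    volume = 2.0 * r              # length of the interval [x - r, x + r]
    if volume == 0.0:
        return math.inf           # k or more samples coincide with x
    # Division by n is an assumed normalization, see the note above.
    return k / (len(samples) * volume)
```

Far from the data, r grows roughly like the distance to the sample set, so in this 1-D sketch the estimate decays only like 1/r. That slow decay is why the tails are overestimated and the total does not integrate to a finite value.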

(Gaussian) mixture models attempt to find the superposition of Gaussians which best accounts for the sample data.

Properties: Continuous and robust density estimates are obtained with good asymptotic properties. The method can in principle model any shape of cluster, and works best when the population is described well by a mixture of Gaussians. The method typically requires large sample sizes for accuracy. Serious degradation of results can occur as the number of variables increases. In practice the method also has difficulty modeling complex geometries and topologies.
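One common way to fit such a superposition is the expectation-maximization (EM) algorithm; the text above does not say which fitting procedure is intended, so the following 1-D sketch is an assumption for illustration only. The names `fit_gmm_1d` and `gmm_density`, the unit starting variances, and the variance floor `min_var` are all hypothetical choices.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_gmm_1d(samples, initial_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM (illustrative sketch only)."""
    k, n = len(initial_means), len(samples)
    means = list(initial_means)
    variances = [1.0] * k         # assumed starting variances
    weights = [1.0 / k] * k       # start with equal mixing weights
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in samples:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            if total == 0.0:      # underflow: fall back to equal shares
                resp.append([1.0 / k] * k)
            else:
                resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, samples)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, samples)) / nj
            variances[j] = max(var, min_var)  # floor avoids collapse to zero
    return weights, means, variances

def gmm_density(x, weights, means, variances):
    """Mixture density: weighted sum of the component Gaussians."""
    return sum(w * normal_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))
```

The result depends on the starting means and on the number of components chosen, which is one practical reason the method needs reasonably large samples to be reliable.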

In kernel-based methods, each point is spread out over a region determined by the “kernel” function (usually flat or bell-shaped).

Properties: Continuous normalized density estimates are obtained. The estimates have good asymptotic properties. The estimation quality depends on wise selection of the local spread. Too small a spread generates estimates which undulate greatly, while too large a spread oversmooths the estimate and loses the shape of the distribution.
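A minimal 1-D sketch of a kernel estimate with a bell-shaped (Gaussian) kernel is shown below. The function name `kde_gaussian` is hypothetical, and the single global `bandwidth` parameter stands in for the "spread" discussed above; a locally varying spread would need a more elaborate scheme.

```python
import math

def kde_gaussian(x, samples, bandwidth):
    """Gaussian kernel density estimate at x for 1-D samples.

    Illustrative sketch only: each sample is spread out as a Gaussian
    bump of standard deviation `bandwidth`, and the bumps are averaged.
    """
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                      for s in samples)
```

Because every bump integrates to 1 and the bumps are averaged, the estimate is continuous and normalized for any positive bandwidth. Evaluating it on a grid with a very small and then a very large bandwidth reproduces the undulating and oversmoothed behavior described above.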

Seventh Sense Software actively researches advanced density estimation techniques, and some of our discoveries are summarized on the Algorithms page.
