**Density estimation in Chameleon Statistics**

Data ‘samples’ are by definition representatives of some unknown
‘population’ (e.g. surveys). The goal of density estimation is to model the
distribution of data in the population based upon the distribution of the
samples.

Large sample sets will more accurately reflect the population than small
ones. Accurate estimation of population distributions from samples of limited
size is an important but challenging problem.

Consider the sample data set on the left. We would like to estimate the
population distribution from which these samples are derived. Most
density-estimators are based upon one or more of the following techniques:

(a) Spatial binning methods.

(b) Nearest-neighbor methods.

(c) (Gaussian) mixture-models.

(d) Kernel-based methods.

**Spatial binning methods** simply partition the space into regular blocks
and count the number of samples in each block. The population density estimate
within each block is given by the number of samples per unit volume for the
block.

**Properties**:
Only discrete density estimates are obtained, but they are normalized.
There is an unpleasant trade-off between bin size and the quality of the density
estimate, which leaves the problem of finding the optimal bin size. Very poor
quality is obtained with small samples or when the number of variables is large.
The method is only really suitable for purely categorical data that is already
naturally discrete.
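
The binning idea can be sketched in a few lines for the one-dimensional case. This is only an illustration: the function name `histogram_density`, the range arguments and the sample values are hypothetical, and the count in each bin is divided by the sample size as well as the bin width so that the estimate is normalized.

```python
def histogram_density(samples, lo, hi, n_bins):
    """Per-bin density estimates on [lo, hi) using equal-width bins (1-D sketch)."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        if lo <= x < hi:
            # Guard against floating-point rounding pushing the index to n_bins.
            index = min(int((x - lo) / width), n_bins - 1)
            counts[index] += 1
    n = len(samples)
    # Samples per unit length, divided by n so the bins integrate to 1
    # (assuming every sample lies inside [lo, hi)).
    return [c / (n * width) for c in counts]


# Hypothetical sample values, chosen only for illustration.
samples = [0.1, 0.2, 0.25, 0.4, 0.45, 0.5, 0.55, 0.9]
density = histogram_density(samples, 0.0, 1.0, 4)
```

Changing `n_bins` in this sketch shows the trade-off directly: many narrow bins leave most bins empty, while a few wide bins hide the shape of the distribution.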

In **nearest-neighbor methods**, the population density estimate for a test
point is obtained by measuring the volume V of the ball containing its k
nearest points. The associated density estimate is given by the ratio k/V.

**Properties**:
The overall shapes of population distributions are generally modeled
well. The density estimate tends to be good inside the clusters, where sample
points are plentiful, but it overestimates in the tails, so the overall
distribution is non-integrable. The density estimates also show sudden sharp
fluctuations. As with the spread in kernel methods, small values of k tend to
overfit the sample, while large values oversmooth, leaving the problem of
selecting the optimal value for the parameter.
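
A minimal one-dimensional sketch of the nearest-neighbor estimate follows. The function name `knn_density` and the sample values are hypothetical. In one dimension the "ball" is an interval of length 2r, and the ratio k/V is additionally divided by the sample size n here, which is a common convention assumed for this sketch rather than something stated above.

```python
def knn_density(x, samples, k):
    """k-nearest-neighbor density estimate at x for 1-D samples (sketch)."""
    # Distances from the test point to every sample, smallest first.
    dists = sorted(abs(x - s) for s in samples)
    radius = dists[k - 1]      # distance to the k-th nearest sample
    volume = 2.0 * radius      # a 1-D ball is an interval of length 2r
    # Note: radius is zero if x coincides with k or more samples;
    # this sketch does not handle that case.
    return k / (len(samples) * volume)


# Hypothetical samples: a small cluster plus one distant point.
samples = [0.0, 1.0, 2.0, 3.0, 10.0]
inside_cluster = knn_density(1.5, samples, k=2)
in_tail = knn_density(10.0, samples, k=2)
```

Because the estimate decays only like 1/|x| far from the data in this sketch, its integral over the whole line diverges, which is the non-integrability noted in the properties.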

**(Gaussian) mixture models** attempt to find the superposition of Gaussians
which best accounts for the sample data.

**Properties**:
Continuous and robust density estimates are obtained with good asymptotic
properties. The method can in principle model any shape of cluster, and works
best when the population is described well by a mixture of Gaussians. The method
typically requires large sample sizes for accuracy. Serious degradation of
results can occur as the number of variables increases. In practice the method
also has difficulty modeling complex geometries and topologies.
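
One common way to fit such a mixture is the expectation-maximization (EM) algorithm. The text above does not say which fitting procedure is used, so the following is only a one-dimensional sketch under that assumption. The names `fit_gmm_1d` and `gmm_density`, the initial means, the iteration count and the variance floor are all hypothetical choices.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(samples, initial_means, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM, starting from the given means (sketch)."""
    k = len(initial_means)
    n = len(samples)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in samples:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, mean and variance of each component.
        # (This sketch assumes no component loses all of its responsibility.)
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, samples)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, samples)) / nj
            variances[j] = max(var, min_var)  # floor avoids collapse to zero width
    return weights, means, variances


def gmm_density(x, weights, means, variances):
    """Evaluate the fitted mixture density at x."""
    return sum(w * normal_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Hypothetical samples forming two well-separated clusters.
samples = [-0.2, 0.0, 0.2, 4.8, 5.0, 5.2]
weights, means, variances = fit_gmm_1d(samples, initial_means=[1.0, 4.0])
```

The result depends on the number of components and on the starting means, both of which have to be chosen by the user in this sketch.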

In **kernel-based methods**, each point is spread out over a region determined
by the “kernel” function (usually flat or bell-shaped).

**Properties**:
Continuous normalized density estimates are obtained. The estimates have
good asymptotic properties. The estimation quality depends on wise selection of
the local spread. Too small a spread generates estimates which undulate greatly,
and too large a spread leads to oversmoothed estimates and a loss of shape.
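
A kernel estimate with a bell-shaped (Gaussian) kernel can be sketched as follows for one-dimensional data. The function name `gaussian_kde`, the single global `bandwidth` parameter (standing in for the "spread") and the sample values are hypothetical. A fixed bandwidth is assumed here for simplicity, although the spread could also vary locally.

```python
import math


def gaussian_kde(x, samples, bandwidth):
    """Gaussian kernel density estimate at x for 1-D samples (sketch)."""
    total = 0.0
    for s in samples:
        u = (x - s) / bandwidth
        # Each sample contributes a unit-area bell curve centred on itself.
        total += math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    # Dividing by n and the bandwidth keeps the total area equal to 1.
    return total / (len(samples) * bandwidth)


# Hypothetical sample values, chosen only for illustration.
samples = [0.1, 0.2, 0.25, 0.4, 0.45, 0.5, 0.55, 0.9]
narrow = [gaussian_kde(i / 10.0, samples, bandwidth=0.02) for i in range(11)]
wide = [gaussian_kde(i / 10.0, samples, bandwidth=0.5) for i in range(11)]
```

Comparing `narrow` and `wide` shows the effect of the spread: the small bandwidth produces a spiky curve that follows individual samples, while the large one flattens the estimate.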

Seventh Sense Software actively researches advanced density estimation
techniques, and some of our discoveries are summarized on the Algorithms
page.