Sample / Probe Clustering


Short Summary

In the heatmap, samples and probes can be clustered according to their similarity to one another.
Comparing two probes, if their sample-to-sample variation is similar, the probes are considered to be "close".  

Comparing two samples, if the pattern of highly expressed and lightly expressed probes is similar, the samples are considered to be "close".

Clusters of "similar" probes are progressively assembled by first putting every probe into its own cluster, and then repeatedly linking the two clusters at each stage which are most similar to each other. When the last two clusters are linked, the pattern of links forms a binary tree.
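This bottom-up (agglomerative) procedure can be sketched with SciPy's `linkage` function; the toy expression matrix below is invented for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy expression matrix: 4 probes (rows) x 3 samples (columns); values invented.
probes = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],   # nearly identical to probe 0
    [9.0, 8.0, 7.0],
    [9.2, 8.1, 7.2],   # nearly identical to probe 2
])

# linkage() starts with every probe in its own cluster and repeatedly
# merges the two closest clusters; the result encodes a binary tree.
tree = linkage(probes, method="average", metric="euclidean")
print(tree.shape)  # (3, 4): three merges join the four singleton clusters
```

Each row of the returned table records one merge (the two clusters joined, their distance, and the new cluster's size); the first merge here joins the two nearly identical probes 0 and 1.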

The branch lengths of the tree indicate how closely linked two nodes are; short branches are tight connections, as shown in the following comparison.
[Figure: dendrograms with long vs. short branches]
(a) Long branches: samples are not close to one another
(b) Short branches: samples are closely related to one another

Details

The ideas of similarity between probes or samples are captured mathematically as a "distance matrix", which records the distance from each probe to each other probe (an np x np matrix, for np probes) and from each sample to each other sample (an ns x ns matrix, for ns samples).
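As a sketch of the two matrices, assuming Euclidean distance and a made-up 5-probe x 4-sample data set:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
data = rng.normal(size=(5, 4))   # hypothetical: 5 probes x 4 samples

# Probe-to-probe distances: an np x np matrix (rows compared to rows).
probe_dist = squareform(pdist(data, metric="euclidean"))

# Sample-to-sample distances: an ns x ns matrix (columns compared to columns).
sample_dist = squareform(pdist(data.T, metric="euclidean"))

print(probe_dist.shape, sample_dist.shape)  # (5, 5) (4, 4)
```

Both matrices are symmetric with zeros on the diagonal (everything is at distance zero from itself); swapping `metric` for `"correlation"` or another option changes what "close" means.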
The distance between probes can be measured in a variety of ways.  

To make this concrete, the following compares a sample with a diluted sample of the same material.  The pattern of expression across the probes is very similar, so one would normally consider these samples to be "similar".  The RMS metric applied to linear data would treat them as very different, however, because each value in the undiluted sample is about 8X higher than the corresponding value in the diluted one.  A Pearson metric would treat them as similar.  If the data are first converted to fold changes, either metric would treat them as similar.

[Figure: expression values for the same material, undiluted and diluted]
(a) No dilution
(b) 8-fold dilution
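The effect of the metric can be illustrated numerically; the expression values below are hypothetical, and SciPy's `correlation` distance (1 minus the Pearson coefficient) stands in for the Pearson metric:

```python
import numpy as np
from scipy.spatial.distance import euclidean, correlation

# Hypothetical linear expression values for one sample and an 8-fold dilution.
full = np.array([800.0, 3200.0, 120.0, 6400.0, 400.0])
diluted = full / 8.0          # same pattern, uniformly lower level

# RMS/Euclidean on linear data: a large distance, so the samples look "different".
rms = euclidean(full, diluted)

# Pearson distance (1 - correlation): essentially zero, so they look "similar".
pearson = correlation(full, diluted)

# On log2 data the dilution becomes a constant offset of 3 per probe,
# so even a Euclidean-style metric reports a small distance.
log_rms = euclidean(np.log2(full), np.log2(diluted))

print(rms, pearson, log_rms)
```

The linear RMS distance runs into the thousands, while the Pearson distance is near zero and the log-scale distance is a modest constant, matching the comparison above.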


Once the distance between samples and probes has been defined, clusters are identified by progressively merging the two clusters which are closest to one another.  Initially, every probe is in its own cluster, and the distance between clusters is simply the distance between probes.  As clusters grow to include more points, there are a number of ways to define the "distance between clusters".  In the diagram below, the probes are represented as points in a two-dimensional space, and two clusters have been identified so far.  The distance between those two clusters can be defined as the shortest distance between any point in one cluster and any point in the other (the "single" method), the longest such distance (the "complete" method), the average distance from each point in cluster 1 to each point in cluster 2 (the "average" method), or the distance from the centroid of cluster 1 to the centroid of cluster 2 (the "centroid" method).  Other standard methods are the "median", "weighted" and "Ward" methods (see reference).

In general, if well-defined clusters exist, any of the cluster-joining methods will give similar results.   The display may appear to change considerably when the algorithm is changed, however, because in the tree presentation one cluster must be arbitrarily assigned to the "left" side and one to the "right" side of each branch.   In the absence of a strong user preference for one method over another, the default "average" method is recommended.
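As an illustration that the methods agree when clusters are well defined, the following sketch builds trees from synthetic, well-separated data with four linkage methods, cuts each tree into two flat clusters, and checks that the groupings match:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic data: two tight, well-separated blobs of 5 points each.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(10, 0.1, (5, 2))])

labelings = []
for method in ("single", "complete", "average", "centroid"):
    tree = linkage(data, method=method)
    # Cut the tree into exactly two flat clusters.
    labelings.append(fcluster(tree, t=2, criterion="maxclust"))

# Every method recovers the same two groups (label numbers 1/2 may be swapped).
first = labelings[0]
same = all((lab == first).all() or (lab == (3 - first)).all() for lab in labelings)
print(same)  # True
```

Only the arbitrary 1/2 numbering of the clusters can differ between methods here; the partition of the points is identical.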
[Figure: the distance between two clusters of points]
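The four definitions of cluster distance can be computed directly for two small hand-picked clusters of 2-D points:

```python
import numpy as np
from itertools import product

# Two invented clusters of points on a line, for easy mental arithmetic.
c1 = np.array([[0.0, 0.0], [1.0, 0.0]])
c2 = np.array([[4.0, 0.0], [5.0, 0.0]])

# All cross-cluster point-to-point distances: 4, 5, 3, 4.
pair_d = [np.linalg.norm(a - b) for a, b in product(c1, c2)]

single   = min(pair_d)                 # shortest point-to-point distance
complete = max(pair_d)                 # longest point-to-point distance
average  = sum(pair_d) / len(pair_d)   # mean over all cross-cluster pairs
centroid = np.linalg.norm(c1.mean(axis=0) - c2.mean(axis=0))  # centroid gap

print(single, complete, average, centroid)  # 3.0 5.0 4.0 4.0
```

Here "average" and "centroid" happen to coincide; with less symmetric clusters the four definitions generally give four different numbers.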




See Also

Hierarchical clustering (Wikipedia)

Daniel Müllner, "Modern hierarchical, agglomerative clustering algorithms" (2011)

Help Front Page