[Figure: (a) samples that are not close to one another; (b) samples that are closely related to one another]
To make this concrete, the figure below compares a sample with a diluted sample of the same material. The expression pattern across the probes is very similar, so one would normally consider these samples "similar". Applied to linear data, the RMS metric would treat them as very different, however, because each value on the left is about 10X higher than the corresponding value on the right. A Pearson metric would treat them as similar. Using fold change, either metric would treat them as similar.
[Figure: (a) no dilution; (b) 8-fold dilution]
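To see how the choice of metric and data scale plays out numerically, here is a minimal sketch comparing a vector of expression values with an 8-fold dilution of itself. The probe values, the dilution factor, and the mean-centred log transform used to stand in for "fold change" are illustrative assumptions, not values taken from the figure above.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical linear expression values for a few probes (illustrative only).
sample = np.array([1200.0, 300.0, 4500.0, 80.0, 950.0])
diluted = sample / 8.0                      # the same material after an assumed 8-fold dilution

def rms_distance(x, y):
    """Root-mean-square difference between two expression vectors."""
    return np.sqrt(np.mean((x - y) ** 2))

# Linear data: RMS sees a large distance, driven purely by the scale difference,
# while the Pearson-based distance is essentially zero because the profiles are proportional.
print(rms_distance(sample, diluted))        # large
print(1 - pearsonr(sample, diluted)[0])     # ~0

# Fold-change (log) data: the dilution becomes a constant additive offset, which
# mean-centring removes, so even the RMS distance now calls the samples similar.
log_s = np.log2(sample) - np.log2(sample).mean()
log_d = np.log2(diluted) - np.log2(diluted).mean()
print(rms_distance(log_s, log_d))           # ~0
```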
Once the distance between samples and probes has been defined, clusters are identified by progressively merging the two clusters that are closest to one another. Initially, every probe is in its own cluster, and the distance between clusters is the same as the distance between probes. As clusters grow to include more points, there are several ways to define the "distance between clusters". In the diagram below, the different probes are represented as points in a two-dimensional space, and two clusters have been identified so far. The distance between those two clusters could be defined as the shortest distance between any point in one cluster and any point in the other (the "single" method), the longest such distance (the "complete" method), the average distance from each point in cluster 1 to each point in cluster 2 (the "average" method), or the distance from the centroid of cluster 1 to the centroid of cluster 2 (the "centroid" method). Other standard methods are the "median", "weighted", and "Ward" methods (see the reference below).
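The linkage choices described above can be sketched with SciPy's generic hierarchical-clustering routines. The two-dimensional points below are made-up stand-ins for probes, and the `method` argument names correspond to the single, complete, average, centroid, median, weighted, and Ward methods discussed here; this is an illustration under those assumptions, not the implementation used by this software.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Made-up probes as points in a two-dimensional space: two obvious groups of three.
points = np.array([
    [1.0, 1.2], [1.3, 0.9], [0.8, 1.1],    # first tight group
    [5.0, 5.1], [5.4, 4.8], [4.9, 5.3],    # second tight group
])

# Agglomerative clustering: every point starts in its own cluster, and linkage()
# repeatedly merges the two closest clusters.  `method` controls how the distance
# between multi-point clusters is measured.
for method in ["single", "complete", "average", "centroid", "median", "weighted", "ward"]:
    Z = linkage(points, method=method)
    # Z[-1, 2] is the distance at which the last two clusters are merged; with
    # well-separated groups like these, every method recovers the same two clusters.
    print(f"{method:>9}: final merge distance = {Z[-1, 2]:.3f}")
```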
In general, if well-defined clusters exist, any of the cluster-joining methods will give similar results. The display may change considerably when the algorithm changes, however, because in the tree presentation one cluster must be arbitrarily assigned to the "left" side and one to the "right" side of each branch. In the absence of a strong preference for one method over another, the default "average" method is recommended.
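As an illustration of why the picture can change even when the clusters do not, the sketch below draws dendrograms for two linkage methods on the same made-up points as above; the left/right placement of branches may differ between the plots even though the same two groups are recovered.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Same made-up two-dimensional "probes" as in the previous sketch.
points = np.array([
    [1.0, 1.2], [1.3, 0.9], [0.8, 1.1],
    [5.0, 5.1], [5.4, 4.8], [4.9, 5.3],
])

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, method in zip(axes, ["average", "complete"]):
    # The grouping is the same here; which branch is drawn on the "left" or
    # "right" of each split is an arbitrary presentation choice.
    dendrogram(linkage(points, method=method), ax=ax)
    ax.set_title(f"method = {method}")
plt.tight_layout()
plt.show()
```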
Hierarchical clustering (Wikipedia)
Daniel Müllner, "Modern hierarchical, agglomerative clustering algorithms" (2011)