# Sample / Probe Clustering

### Short Summary

In the heatmap, samples and probes can be clustered according to their similarity to one another.
Two probes are considered "close" if their sample-to-sample variation is similar.

Two samples are considered "close" if their patterns of highly expressed and weakly expressed probes are similar.

Clusters of "similar" probes are assembled progressively. Every probe starts in its own cluster, and at each stage the two clusters that are most similar to each other are linked. Once the last two clusters have been linked, the pattern of links forms a binary tree.

The length of each branch indicates how closely the two nodes it joins are linked; short branches indicate tight connections, as shown in the following comparison.

Figure: (a) Samples are not close to one another; (b) Samples are closely related to one another

### Details

The notion of similarity between probes or samples is captured mathematically as a "distance matrix", which holds the distance from each probe to every other probe (an $n_p \times n_p$ matrix) or from each sample to every other sample (an $n_s \times n_s$ matrix).
The distance between probes can be measured in a variety of ways.
• The simplest distance metric is the root mean square (RMS) difference between the data in the different samples. If probe P has values P1, P2, P3, P4, ... in the different samples, and probe Q has values Q1, Q2, Q3, Q4, ... in the same samples, the RMS distance between probes P and Q is $\sqrt{(P_1-Q_1)^2 + (P_2-Q_2)^2 + \dots}$. Distances in the RMS metric start at 0 and are in the units of the data.
• The Pearson distance is based on the Pearson correlation coefficient $R$ between probes P and Q, and is defined as $1 - \sqrt{R^2}$ (that is, $1 - |R|$), so that probes that are well correlated have distance 0, and the maximum distance is 1.

The choice of metric depends on what data is being analyzed. If the data in the heatmap is expression level, Pearson may be the most suitable choice, because the RMS metric tends to treat all highly expressed probes as similar to one another, and all weakly expressed probes as similar to one another. If the data in the heatmap is fold-change, then either metric is suitable, since all data are relative. The default heatmap is fold-change and the default metric is RMS.
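The two metrics and the distance matrix described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the software's actual implementation; the function names `rms_distance`, `pearson_distance`, and `distance_matrix` are assumptions introduced here. The RMS function follows the formula as written in the text, which sums the squared differences without dividing by the number of samples; dividing would rescale every distance by the same constant and would not change which probes are closest.

```python
import math


def rms_distance(p, q):
    """Square root of the summed squared differences, as written in the text."""
    if len(p) != len(q):
        raise ValueError("both probes must be measured in the same samples")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pearson_r(p, q):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(p)
    mean_p = sum(p) / n
    mean_q = sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    if var_p == 0 or var_q == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_q)


def pearson_distance(p, q):
    """1 - sqrt(R^2), i.e. 1 - |R|: 0 for perfectly correlated data, at most 1."""
    return 1.0 - abs(pearson_r(p, q))


def distance_matrix(rows, metric):
    """Symmetric n x n matrix of pairwise distances between the given rows."""
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = metric(rows[i], rows[j])
    return matrix
```

Passing one row per probe gives the probe-to-probe matrix; passing one row per sample gives the sample-to-sample matrix.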

To make this concrete, the following compares a sample with a diluted sample of the same material. The pattern of probe expression is very similar, so one would normally consider these samples to be "similar". The RMS metric applied to linear data would nevertheless treat them as very different, because each value on the left is about 10X higher than the corresponding value on the right. A Pearson metric would treat them as similar. Using fold-change, either metric would treat them as similar.

Figure: (a) No dilution; (b) 8-fold dilution
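The dilution effect can be reproduced numerically. The sketch below uses made-up expression values (not data from the figure) and hypothetical helper names. It covers only the linear-data comparison, since the fold-change calculation is not specified here.

```python
import math


def rms_distance(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - |R|, where R is the Pearson correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - abs(cov / math.sqrt(var_x * var_y))


# Made-up linear expression values for five probes in one sample.
undiluted = [1200.0, 300.0, 4500.0, 80.0, 950.0]
# The same material diluted 8-fold: every probe drops by the same factor.
diluted = [value / 8.0 for value in undiluted]
# A made-up sample with a genuinely different expression pattern.
unrelated = [100.0, 4000.0, 250.0, 1500.0, 60.0]

print("RMS, undiluted vs diluted:      ", rms_distance(undiluted, diluted))
print("Pearson, undiluted vs diluted:  ", pearson_distance(undiluted, diluted))
print("Pearson, undiluted vs unrelated:", pearson_distance(undiluted, unrelated))
```

On these values the RMS distance between the undiluted and diluted samples is large, in the units of the data, while the Pearson distance between them is essentially zero. The unrelated sample stays well separated under the Pearson metric.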

Once the distance between samples and probes has been defined, clusters are identified by progressively merging the two clusters which are closest to one another. Initially all the probes are in their own clusters, and the distance between clusters is the same as the distance between probes. As clusters start to include more points, there are a number of ways to define the "distance between clusters".

In the diagram below, the different probes are represented as points in a two-dimensional space. Two clusters have been identified so far in the figure. The distance between those two clusters could be defined as:
• the shortest distance between any point in one cluster and any point in the other cluster (the "single" method);
• the longest distance between any point in one cluster and any point in the other cluster (the "complete" method);
• the average distance from each point in cluster 1 to each point in cluster 2 (the "average" method); or
• the distance from the centroid of cluster 1 to the centroid of cluster 2 (the "centroid" method).

Other standard methods are the "median", "weighted" and "Ward" methods (see reference).
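The merging procedure and the "single", "complete", and "average" definitions of cluster distance can be sketched as follows. This is a deliberately simple illustration under stated assumptions, not the software's implementation: points are two-dimensional, the point-to-point distance is Euclidean, and the names `agglomerate`, `cluster_distance`, and `leaves` are hypothetical. The "centroid", "median", "weighted", and "Ward" methods are omitted.

```python
import math
from itertools import combinations


def point_distance(a, b):
    """Euclidean distance between two points (assumed metric for this sketch)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# How the point-to-point distances between two clusters are combined.
LINKAGE = {
    "single": min,                               # shortest pairwise distance
    "complete": max,                             # longest pairwise distance
    "average": lambda ds: sum(ds) / len(ds),     # mean pairwise distance
}


def cluster_distance(members_a, members_b, points, method):
    """Distance between two clusters given as lists of point indices."""
    pairwise = [point_distance(points[i], points[j])
                for i in members_a for j in members_b]
    return LINKAGE[method](pairwise)


def agglomerate(points, method="average"):
    """Merge the two closest clusters until one remains.

    Returns (tree, merges): tree is a nested tuple of point indices, and
    merges lists (distance, subtree) for each merge in the order performed.
    """
    # Each cluster is (member indices, subtree); every point starts alone.
    clusters = [([i], i) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = cluster_distance(clusters[a][0], clusters[b][0], points, method)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merged = (clusters[a][0] + clusters[b][0],
                  (clusters[a][1], clusters[b][1]))
        merges.append((d, merged[1]))
        del clusters[b]  # b > a, so delete b first to keep index a valid
        del clusters[a]
        clusters.append(merged)
    return clusters[0][1], merges


def leaves(tree):
    """Point indices under a subtree, in left-to-right order."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


# Two well-separated groups of made-up points.
points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
for method in LINKAGE:
    tree, _ = agglomerate(points, method)
    print(method, sorted(leaves(tree[0])), sorted(leaves(tree[1])))
```

The search for the closest pair is repeated from scratch at every step, which keeps the sketch short but is far slower than a production implementation for large numbers of probes.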

In general, if well-defined clusters exist, any of the cluster-joining methods will give similar results. The display may nevertheless appear to change considerably with any change in algorithm, because in the tree presentation one cluster must be arbitrarily assigned to the "left" side and one to the "right" side of each branch. In the absence of a strong user preference for one method over another, the default "average" method is recommended.
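The arbitrary left/right assignment can be illustrated with a small sketch. Trees are represented here as nested tuples of leaf indices (a hypothetical representation chosen for this example), and `canonical` is an assumed helper that discards the left/right order. Two trees that look different when drawn can describe exactly the same clusters.

```python
def leaves(tree):
    """Leaf indices in the left-to-right order in which they would be drawn."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


def canonical(tree):
    """Order-independent form of a tree: each branch becomes a frozenset."""
    if isinstance(tree, tuple):
        return frozenset(canonical(child) for child in tree)
    return tree


# The same clustering drawn with every branch flipped.
tree_a = ((0, 1), (2, (3, 4)))
tree_b = (((4, 3), 2), (1, 0))

print(leaves(tree_a))  # display order of tree_a
print(leaves(tree_b))  # display order of tree_b
print(canonical(tree_a) == canonical(tree_b))  # same clusters?
```

Comparing the canonical forms, rather than the displayed leaf order, shows whether two methods actually produced different clusters.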