Getting Genetics Done: Hierarchical Clustering in R

Tuesday, June 16, 2009

Hierarchical Clustering in R

Hierarchical clustering is a technique for grouping samples/data points into categories and subcategories based on a similarity measure. Being the powerful statistical package it is, R has several routines for doing hierarchical clustering. The basic command for doing HC is

hclust(d, method = "complete", members=NULL)

Nearly all clustering approaches use a concept of distance. Data points can be conceptualized in a dimensional space where we can compute a distance measure between each pair of points. Points that are closer together get placed in groups or "cluster" together. The input for the hcluster method in R is a distance matrix, so we'll have to compute one of those somehow.

For this example, I'm using an Identity By State matrix, computed with SNPDUO software. This matrix essentially gives us the percentage of alleles two people share. It doesn't output a matrix per se, so I had to transform the output with a perl script. I get something like this out of SNPDUO:

Family1,Indiv1,Family2,Indiv2,MeanIBS
Family1,Indiv1,Family2,Indiv3,MeanIBS
Family1,Indiv1,Family2,Indiv4,MeanIBS
Family1,Indiv2,Family2,Indiv3,MeanIBS
...

For each pair of individuals in my ped file. I need it like this:

Indiv1 Indiv2 Indiv3 Indiv4 ....
Indiv1
Indiv2
Indiv3
Indiv4
....

Its a fairly simple scripting job... Now that I have a MATRIX, I load that into R as a dataset.

ibs<-read.delim("C:/Users/bushws1/Desktop/matrix.txt",Header=FALSE)

Next, we'll run hclust using this distance matrix

HC <- hclust(as.dist(ibs))

The as.dist(ibs) transforms the dataframe called "ibs" into a distance matrix in the R environment. Using this command, we are asking R to generate hierarchical groups of individuals from our data based on genetic similarity. We now have a hierarchical clustering object called "HC". We can plot the results of our cluster analysis using this command:

plot(HC)

If your dataset is small, this might work well for you, but for most genomics applications, you'll get a tree-shaped fuzzball like this:

The solution to this is to load a library from the Bioconductor package, called "ctc". This will let us export the cluster object in the Newick file format. It can then be imported into other more powerful graphing programs. This is done like so:

library(ctc)
write.table(hc2Newick(HC), file="C:/Users/bushws1/Desktop/ibs_new.txt",row.names=FALSE,col.names=FALSE)

You now have a file in Newick format, but R puts quotes around the output for some annoying reason. Open the file in notepad and remove the quotes and it should be ready to use.

To get a better, more readable plot, download "Dendroscope" from the University of Tubingen (nice place, by the way). Dendroscope will let you import the Newick file you created and gives you extensive plotting options. Check out this wicked Circular Cladogram...

In this example, there isn't an easy, intuitive explanation for the groupings, but for other examples, the groups might make more sense. There are lots of options for computing the clustering, and they may give very different results, so proceed with caution, but in general hierarchical clustering can be a useful tool for lots of data analysis situations.

13 comments:

EliJune 16, 2009 at 7:29 PM
If there are any changes that I can make to SNPduo to make getting this output from the SNPduo R code easier let me know.

Oh, and the annoying quotes are defaults in the "write.table" that really tick me off. If you do "write.table(..., quote=FALSE)" you won't get the quoted values.
ReplyDelete
Replies
AnonymousSeptember 16, 2009 at 12:05 PM
The ape package from cran is capable of plotting circular cladograms. It can convert hclust objects to it's own phylo object, or read a newick string.
ReplyDelete
Replies
AnonymousFebruary 22, 2010 at 1:26 PM
Add option quote=FALSE in write.table()
ReplyDelete
Replies
AnonymousFebruary 25, 2010 at 10:37 AM
run hclust with ward's method (method = "ward"), it wont be such a messy dendrogram
ReplyDelete
Replies
AnonymousSeptember 16, 2010 at 12:16 PM
Just use APE.

library(ape)
data(opsin.newick)
x <- read.tree(text=opsin.newick)
plot(x,type="fan")
ReplyDelete
Replies
ulverNovember 26, 2010 at 7:42 AM
hi, how could i get the elements from each cluster, for example if i want to get the centroids, thanks in advance
ReplyDelete
Replies
WillNovember 29, 2010 at 10:48 AM
Hi ulver -- I'm a bit confused with what you mean by "centroid". The easiest way to retrieve elements for a sub-tree is to use Dendroscope to select the sub-tree and it will list the elements for you.
ReplyDelete
Replies
joschDecember 5, 2010 at 1:59 AM
Thanks so much for the post! I'm a corpus linguist trying to visualize patterns in language usages and found the described method extremely helpful.
ReplyDelete
Replies
AnonymousApril 6, 2011 at 10:02 AM
If you have a Newick file you can use the interactive tree of life website to edit the tree and add colors and show additional datasets as concentric rings outside the tree
ReplyDelete
Replies
ZaneMay 2, 2011 at 3:35 AM
Hierarchical clustering? What is it? What is it used for? I need help to know more information about it. Thanks
ReplyDelete
Replies
TomAugust 5, 2011 at 9:08 PM
What does the meaning of hierarchical clustering? Can you give me more explanation about it? I really don’t understand about it. Does it have an effect to physiological condition or another? You can explain it through simple examples and ways. Thank for sharing.
ReplyDelete
Replies
zx8754August 17, 2011 at 4:28 AM
This comment has been removed by the author.
ReplyDelete
Replies
UnknownDecember 26, 2013 at 12:29 PM
Hello! I'm a bit confused. I'm assuming your "ibs" variable is equivalent to the "PI_HAT" variable that PLINK can compute. If that's so, shouldn't it be HC <- hclust(1 - as.dist(ibs)) instead of HC <- hclust(1 - as.dist(ibs))? Otherwise we would be feeding a similarity matrix to hclust when we want to feed it a dissimilarity one. Maybe I missed something?
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.

This blog has moved!

Tuesday, June 16, 2009

Hierarchical Clustering in R

13 comments: