Tuesday, June 16, 2009

Hierarchical Clustering in R

Hierarchical clustering is a technique for grouping samples/data points into categories and subcategories based on a similarity measure. Being the powerful statistical package it is, R has several routines for doing hierarchical clustering. The basic command for doing HC is

hclust(d, method = "complete", members=NULL)


Nearly all clustering approaches use a concept of distance. Data points can be conceptualized in a dimensional space where we can compute a distance measure between each pair of points. Points that are closer together get placed in groups or "cluster" together. The input for the hcluster method in R is a distance matrix, so we'll have to compute one of those somehow.

For this example, I'm using an Identity By State matrix, computed with SNPDUO software. This matrix essentially gives us the percentage of alleles two people share. It doesn't output a matrix per se, so I had to transform the output with a perl script. I get something like this out of SNPDUO:

Family1,Indiv1,Family2,Indiv2,MeanIBS
Family1,Indiv1,Family2,Indiv3,MeanIBS
Family1,Indiv1,Family2,Indiv4,MeanIBS
Family1,Indiv2,Family2,Indiv3,MeanIBS
...

For each pair of individuals in my ped file. I need it like this:

Indiv1 Indiv2 Indiv3 Indiv4 ....
Indiv1
Indiv2
Indiv3
Indiv4
....

Its a fairly simple scripting job... Now that I have a MATRIX, I load that into R as a dataset.

ibs<-read.delim("C:/Users/bushws1/Desktop/matrix.txt",Header=FALSE)


Next, we'll run hclust using this distance matrix
HC <- hclust(as.dist(ibs))


The as.dist(ibs) transforms the dataframe called "ibs" into a distance matrix in the R environment. Using this command, we are asking R to generate hierarchical groups of individuals from our data based on genetic similarity. We now have a hierarchical clustering object called "HC". We can plot the results of our cluster analysis using this command:
plot(HC)

If your dataset is small, this might work well for you, but for most genomics applications, you'll get a tree-shaped fuzzball like this:


The solution to this is to load a library from the Bioconductor package, called "ctc". This will let us export the cluster object in the Newick file format. It can then be imported into other more powerful graphing programs. This is done like so:

library(ctc)
write.table(hc2Newick(HC), file="C:/Users/bushws1/Desktop/ibs_new.txt",row.names=FALSE,col.names=FALSE)

You now have a file in Newick format, but R puts quotes around the output for some annoying reason. Open the file in notepad and remove the quotes and it should be ready to use.

To get a better, more readable plot, download "Dendroscope" from the University of Tubingen (nice place, by the way). Dendroscope will let you import the Newick file you created and gives you extensive plotting options. Check out this wicked Circular Cladogram...



In this example, there isn't an easy, intuitive explanation for the groupings, but for other examples, the groups might make more sense. There are lots of options for computing the clustering, and they may give very different results, so proceed with caution, but in general hierarchical clustering can be a useful tool for lots of data analysis situations.

13 comments:

  1. If there are any changes that I can make to SNPduo to make getting this output from the SNPduo R code easier let me know.

    Oh, and the annoying quotes are defaults in the "write.table" that really tick me off. If you do "write.table(..., quote=FALSE)" you won't get the quoted values.

    ReplyDelete
  2. The ape package from cran is capable of plotting circular cladograms. It can convert hclust objects to it's own phylo object, or read a newick string.

    ReplyDelete
  3. Add option quote=FALSE in write.table()

    ReplyDelete
  4. run hclust with ward's method (method = "ward"), it wont be such a messy dendrogram

    ReplyDelete
  5. Just use APE.

    library(ape)
    data(opsin.newick)
    x <- read.tree(text=opsin.newick)
    plot(x,type="fan")

    ReplyDelete
  6. hi, how could i get the elements from each cluster, for example if i want to get the centroids, thanks in advance

    ReplyDelete
  7. Hi ulver -- I'm a bit confused with what you mean by "centroid". The easiest way to retrieve elements for a sub-tree is to use Dendroscope to select the sub-tree and it will list the elements for you.

    ReplyDelete
  8. Thanks so much for the post! I'm a corpus linguist trying to visualize patterns in language usages and found the described method extremely helpful.

    ReplyDelete
  9. If you have a Newick file you can use the interactive tree of life website to edit the tree and add colors and show additional datasets as concentric rings outside the tree

    ReplyDelete
  10. Hierarchical clustering? What is it? What is it used for? I need help to know more information about it. Thanks

    ReplyDelete
  11. What does the meaning of hierarchical clustering? Can you give me more explanation about it? I really don’t understand about it. Does it have an effect to physiological condition or another? You can explain it through simple examples and ways. Thank for sharing.

    ReplyDelete
  12. This comment has been removed by the author.

    ReplyDelete
  13. Hello! I'm a bit confused. I'm assuming your "ibs" variable is equivalent to the "PI_HAT" variable that PLINK can compute. If that's so, shouldn't it be HC <- hclust(1 - as.dist(ibs)) instead of HC <- hclust(1 - as.dist(ibs))? Otherwise we would be feeding a similarity matrix to hclust when we want to feed it a dissimilarity one. Maybe I missed something?

    ReplyDelete

Note: Only a member of this blog may post a comment.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.