Getting Genetics Done

Tuesday, June 30, 2009

Tuesday Fun -- Epidemiological Dessert

Both useful and delicious, this two-by-two table pie by Sarah Lowry, Sarah McDougal (my 2nd cousin) and friends of the UW MPH program helps you get genetics done and satisfies the appetite! If only she could mail me the Odds Ratio...

Special Bioinformatics issue on next-gen sequencing

Bioinformatics is running a special issue on next generation sequencing, featuring lots of papers on alignment, assembly, Chip-Seq, RNA-Seq, and other related topics. This "virtual issue" will be continually updated as more developments are made. From the editors:

Next-generation sequencing technologies are revolutionising genomics and their effects are becoming increasingly widespread. Many tools and algorithms relevant to next-generation sequencing applications have been published in Bioinformatics, and so to celebrate this contribution we have gathered these together in this 'Bioinformatics for Next Generation Sequencing' virtual issue. This will be a living resource that we will continually update to include the very latest papers in this area to help researchers keep abreast of the latest developments.

Bioinformatics: Next generation sequencing issue

Monday, June 29, 2009

PCG Journal Club Articles, 6/24

Here are citations for the articles discussed at our most recent meeting (June 24). Our next meeting is scheduled for July 17.
~Julia

Choi Y, Wijsman EM, Weir BS. Case-control association testing in the presence of unknown relationships. Genet Epidemiol. 2009 Mar 30. [Epub ahead of print]

Gao X, Becker LC, Becker DM, Starmer JD, Province MA. Avoiding the high Bonferroni penalty in genome-wide association studies. Genet Epidemiol. 2009 May 11. [Epub ahead of print]

Hartge P. Genetics of reproductive lifespan. Nat Genet. 2009 Jun; 41(6): 637-8.

He C, Kraft P, Chen C, Buring JE, Paré G, Hankinson SE, Chanock SJ, Ridker PM, Hunter DJ, Chasman DI. Genome-wide association studies identify loci associated with age at menarche and age at natural menopause. Nat Genet. 2009 May 17. [Epub ahead of print]

Ong KK, et al. Genetic variation in LIN28B is associated with the timing of puberty. Nat Genet. 2009 May 17. [Epub ahead of print]

Perry JR, et al. Meta-analysis of genome-wide association data identifies two loci influencing age at menarche. Nat Genet. 2009 May 17. [Epub ahead of print]

Stolk L, et al. Loci at chromosomes 13, 19 and 20 influence age at natural menopause. Nat Genet. 2009 May 17. [Epub ahead of print]

Sulem P, et al. Genome-wide association study identifies sequence variants on 6q21 associated with age at menarche. Nat Genet. 2009 May 17. [Epub ahead of print]

Wednesday, June 24, 2009

Weekly R Clinic

For readers at Vanderbilt: At yesterday's R course I found out that Theresa Scott in the Biostatistics department holds a weekly R clinic and encourages new R users who want to learn more to bring any questions about R, or even your own code and data. The R clinic is held weekly on Thursday from 2:00-3:00 in MCN. You can find her email address at the link below - email her to get on the mailing list to know about cancellations or other changes.

Vanderbilt Biostatistics R Clinic

Tuesday, June 23, 2009

PDF tutorial from R course (Introduction to R)

Writing from the previously mentioned intro to R course at the Kennedy Center. If you couldn't make it you can download all the course materials from Theresa Scott's website, under the "Current Teaching Material" heading. Here is a direct link to the PDF for the overview materials that we're going over today, along with the R code from the examples in this guide. This guide is suitable for someone who's never used R before, and gets you up and running with the basics of the language, importing/manipulating data, generating basic descriptive statistics, and graphing. There's also a part II (which isn't being covered in the course today) that goes into more detail on the language, lower-level graphing commands, and has a very useful dictionary (Ch. 3) of commonly used functions broken down by topics and uses. Theresa also has tutorials on using LaTeX and R with Sweave on this website.

If you didn't get in the course off the waiting list keep an eye out on the Kennedy Center calendar of events to see when this course will be taught again.

Friday, June 19, 2009

No effect of the serotonin transporter on depression

One of the most famous examples of a gene-environment interaction is being scrutinized by evidence from one of the largest meta-analyses on the subject to date. There have been tons of reports claiming or refuting evidence for a statistically significant interaction between a variant in the serotonin transporter gene (5-HTT) and stressful life events and their effect on major depression. This article published this week in JAMA by several well-known geneticists (including Neil Risch and Kathleen Merikangas, authors of the highly influential 13-year old paper on why association may be more powerful than linkage) now provides evidence that no such interaction exists, and on top of that, there is no main effect of the serotonin transporter gene itself!

The authors performed a meta-analysis of 14 studies which published data on the association between 5-HTTLPR genotype (SS, SL, or LL), number of stressful life events (0, 1, 2, >3) or equivalent, and a categorical measure of depression. With a total sample size of 14,250 using a random-effects meta-analysis, no association was found for the interaction (OR, 1.01; 95% CI, 0.94-1.10), nor the transporter variant alone (OR, 1.05; 95% CI, 0.98-1.13).

On the other hand, the main effect association of stressful life events did hold up (OR, 1.41; 95% CI,1.25-1.57). So I suppose the bottom line here is if you want to avoid becoming depressed, then don't lose your job, gamble away your life savings, get a divorce, or brood over our inability to replicate a finding all weekend. Grab a cocktail, have a good weekend, and check back in a few years for the scoop on the genes.

JAMA: Interaction Between the Serotonin Transporter Gene (5-HTTLPR), Stressful Life Events, and Risk of Depression

Tuesday, June 16, 2009

Hierarchical Clustering in R

Hierarchical clustering is a technique for grouping samples/data points into categories and subcategories based on a similarity measure. Being the powerful statistical package it is, R has several routines for doing hierarchical clustering. The basic command for doing HC is

hclust(d, method = "complete", members=NULL)

Nearly all clustering approaches use a concept of distance. Data points can be conceptualized in a dimensional space where we can compute a distance measure between each pair of points. Points that are closer together get placed in groups or "cluster" together. The input for the hcluster method in R is a distance matrix, so we'll have to compute one of those somehow.

For this example, I'm using an Identity By State matrix, computed with SNPDUO software. This matrix essentially gives us the percentage of alleles two people share. It doesn't output a matrix per se, so I had to transform the output with a perl script. I get something like this out of SNPDUO:

Family1,Indiv1,Family2,Indiv2,MeanIBS
Family1,Indiv1,Family2,Indiv3,MeanIBS
Family1,Indiv1,Family2,Indiv4,MeanIBS
Family1,Indiv2,Family2,Indiv3,MeanIBS
...

For each pair of individuals in my ped file. I need it like this:

Indiv1 Indiv2 Indiv3 Indiv4 ....
Indiv1
Indiv2
Indiv3
Indiv4
....

Its a fairly simple scripting job... Now that I have a MATRIX, I load that into R as a dataset.

ibs<-read.delim("C:/Users/bushws1/Desktop/matrix.txt",Header=FALSE)

Next, we'll run hclust using this distance matrix

HC <- hclust(as.dist(ibs))

The as.dist(ibs) transforms the dataframe called "ibs" into a distance matrix in the R environment. Using this command, we are asking R to generate hierarchical groups of individuals from our data based on genetic similarity. We now have a hierarchical clustering object called "HC". We can plot the results of our cluster analysis using this command:

plot(HC)

If your dataset is small, this might work well for you, but for most genomics applications, you'll get a tree-shaped fuzzball like this:

The solution to this is to load a library from the Bioconductor package, called "ctc". This will let us export the cluster object in the Newick file format. It can then be imported into other more powerful graphing programs. This is done like so:

library(ctc)
write.table(hc2Newick(HC), file="C:/Users/bushws1/Desktop/ibs_new.txt",row.names=FALSE,col.names=FALSE)

You now have a file in Newick format, but R puts quotes around the output for some annoying reason. Open the file in notepad and remove the quotes and it should be ready to use.

To get a better, more readable plot, download "Dendroscope" from the University of Tubingen (nice place, by the way). Dendroscope will let you import the Newick file you created and gives you extensive plotting options. Check out this wicked Circular Cladogram...

In this example, there isn't an easy, intuitive explanation for the groupings, but for other examples, the groups might make more sense. There are lots of options for computing the clustering, and they may give very different results, so proceed with caution, but in general hierarchical clustering can be a useful tool for lots of data analysis situations.

NYT: In Simulation Work, the Demand Is Real

The New York Times published this interesting article on how the ability to design and perform computer simulations is a highly marketable skill for careers across many disciplines.

In methodology development we use simulation nearly every day. We've developed our own specialized genetic data simulation software, genomeSIMLA, that's freely available here by request for PC, Mac, and Linux.

But if you have R on your computer (get it free here), here's how to do a really simple Monte Carlo simulation to determine the power of a one-sample t-test.

First, fire up R and type this command:

rnorm(100)

That command generates 100 random numbers drawn from a standard normal distribution, mean=0, sd=1. Now type this:

rnorm(100,mean=2,sd=7)

That also draws 100 random numbers from a normal distribution, but this time the mean is 2 and the standard deviation is 7. You can also get the same results by just typing this:

rnorm(100,2,7)

Now, let's do a one sample t-test:

t.test(rnorm(100,2,7))

That command performs a one-sample t-test on the 100 samples drawn from a normal distribution with mean=2 and sd=7. Remember, the null hypothesis of a one-sample t-test is usually "the mean is not significantly different from zero". So if the p-value is less than .05, we would typically reject this null, and say that the mean is significantly different from zero.

Now, we knew that the mean was different from zero, because we said draw from a distribution with mean=2. But if this was the case and we only drew 100 samples, how likely is it that we would detect a difference? That's the power of the test - given that the null is false, how likely is it that we reject the null hypothesis?

One way we can answer this question is with a simulation.

First, let's type the same command, but just get ONLY the p-value from the t-test:

t.test(rnorm(100,2,7))$p.value

Was is less than .05? Try typing it again (you can hit the up arrow key to bring up the last command in R, just like on the Linux command line). It will be different because we have a different set of 100 observations. Type it in over and over again. Sometimes it will be less than .05, other times it wont be. Let's do this 1000 times, and see how often it is less than .o5. Let's use the replicate command:

replicate(1000,t.test(rnorm(100,2,7))$p.value)

That simulates doing the t-test 1000 times, and gives you the p-value from each one.

Now, let's do a logical test to see which of those are less than .05:

replicate(1000,t.test(rnorm(100,2,7))$p.value)<0.05

If you typed that in you'll see lots of TRUE's and FALSE's. TRUE means that the t-test on that particular sample was less than .05. Now, internally, R represents TRUE as 1, and FALSE as 0. So if we take the average of all 1000 of these, that will tell us the proportion of times out of 1000 trials that the p-value of the one-sample t-test was less than .o5:

mean(replicate(1000,t.test(rnorm(100,2,7))$p.value)<0.05)

When I did this the power was right around 80%. If you do this again it will be slightly different because remember we are sampling randomly so the results will vary slightly!

Congratulations, you just did your first simulation / power study! Of course because we know what the null distribution of t-statistics looks like under the null, we can mathematically determine the power of a t-test without doing simulation studies:

power.t.test(n=100,delta=2,sd=7,sig.level=.05,type="one.sample")

But if we had developed our own method or algorithm we probably wouldn't have a mathematical formula to calculate power, which is why we rely on simulation studies. Be sure to check out my other posts on power calculation software, choosing the correct analyses, and code to run analyses in R and other software.

This blog has moved!