Tuesday, June 30, 2009

Tuesday Fun -- Epidemiological Dessert



Both useful and delicious, this two-by-two table pie by Sarah Lowry, Sarah McDougal (my 2nd cousin), and friends from the UW MPH program helps you get epidemiology done and satisfies the appetite! If only she could mail me the odds ratio...

Special Bioinformatics issue on next-gen sequencing

Bioinformatics is running a special issue on next-generation sequencing, featuring lots of papers on alignment, assembly, ChIP-Seq, RNA-Seq, and other related topics. This "virtual issue" will be updated continually as new developments appear. From the editors:
Next-generation sequencing technologies are revolutionising genomics and their effects are becoming increasingly widespread. Many tools and algorithms relevant to next-generation sequencing applications have been published in Bioinformatics, and so to celebrate this contribution we have gathered these together in this 'Bioinformatics for Next Generation Sequencing' virtual issue. This will be a living resource that we will continually update to include the very latest papers in this area to help researchers keep abreast of the latest developments.
Bioinformatics: Next generation sequencing issue

Monday, June 29, 2009

PCG Journal Club Articles, 6/24

Here are citations for the articles discussed at our most recent meeting (June 24). Our next meeting is scheduled for July 17.
~Julia

Choi Y, Wijsman EM, Weir BS. Case-control association testing in the presence of unknown relationships. Genet Epidemiol. 2009 Mar 30. [Epub ahead of print]

Gao X, Becker LC, Becker DM, Starmer JD, Province MA. Avoiding the high Bonferroni penalty in genome-wide association studies. Genet Epidemiol. 2009 May 11. [Epub ahead of print]

Hartge P. Genetics of reproductive lifespan. Nat Genet. 2009 Jun; 41(6): 637-8.

He C, Kraft P, Chen C, Buring JE, Paré G, Hankinson SE, Chanock SJ, Ridker PM, Hunter DJ, Chasman DI. Genome-wide association studies identify loci associated with age at menarche and age at natural menopause. Nat Genet. 2009 May 17. [Epub ahead of print]

Ong KK, et al. Genetic variation in LIN28B is associated with the timing of puberty. Nat Genet. 2009 May 17. [Epub ahead of print]

Perry JR, et al. Meta-analysis of genome-wide association data identifies two loci influencing age at menarche. Nat Genet. 2009 May 17. [Epub ahead of print]

Stolk L, et al. Loci at chromosomes 13, 19 and 20 influence age at natural menopause. Nat Genet. 2009 May 17. [Epub ahead of print]

Sulem P, et al. Genome-wide association study identifies sequence variants on 6q21 associated with age at menarche. Nat Genet. 2009 May 17. [Epub ahead of print]

Wednesday, June 24, 2009

Weekly R Clinic

For readers at Vanderbilt: at yesterday's R course I found out that Theresa Scott in the Biostatistics department holds a weekly R clinic, and she encourages new R users who want to learn more to bring any questions about R, or even their own code and data. The clinic is held Thursdays from 2:00-3:00 in MCN. You can find her email address at the link below; email her to get on the mailing list so you'll hear about cancellations or other changes.

Vanderbilt Biostatistics R Clinic

Tuesday, June 23, 2009

PDF tutorial from R course (Introduction to R)

Writing from the previously mentioned intro to R course at the Kennedy Center. If you couldn't make it, you can download all the course materials from Theresa Scott's website, under the "Current Teaching Material" heading. Here is a direct link to the PDF of the overview materials we're going over today, along with the R code from the examples in the guide. The guide is suitable for someone who's never used R before, and gets you up and running with the basics of the language, importing and manipulating data, generating basic descriptive statistics, and graphing. There's also a part II (not covered in the course today) that goes into more detail on the language and lower-level graphing commands, and includes a very useful dictionary (Ch. 3) of commonly used functions broken down by topic and use. Theresa also has tutorials on using LaTeX and R with Sweave on the same website.

If you didn't get into the course off the waiting list, keep an eye on the Kennedy Center calendar of events to see when it will be taught again.

Friday, June 19, 2009

No effect of the serotonin transporter on depression

One of the most famous examples of a gene-environment interaction is being put to the test by one of the largest meta-analyses on the subject to date. There have been tons of reports claiming or refuting a statistically significant interaction between a variant in the serotonin transporter gene (5-HTT) and stressful life events in their effect on major depression. This article, published this week in JAMA by several well-known geneticists (including Neil Risch and Kathleen Merikangas, authors of the highly influential 13-year-old paper on why association may be more powerful than linkage), now provides evidence that no such interaction exists, and on top of that, that there is no main effect of the serotonin transporter gene itself!

The authors performed a meta-analysis of 14 studies that published data on the association between 5-HTTLPR genotype (SS, SL, or LL), number of stressful life events (0, 1, 2, ≥3) or an equivalent measure, and a categorical measure of depression. With a total sample size of 14,250, a random-effects meta-analysis found no association for the interaction (OR, 1.01; 95% CI, 0.94-1.10), nor for the transporter variant alone (OR, 1.05; 95% CI, 0.98-1.13).

On the other hand, the main effect of stressful life events did hold up (OR, 1.41; 95% CI, 1.25-1.57). So I suppose the bottom line here is that if you want to avoid becoming depressed, don't lose your job, gamble away your life savings, get a divorce, or brood all weekend over our inability to replicate a finding. Grab a cocktail, have a good weekend, and check back in a few years for the scoop on the genes.
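(For the curious: here's roughly what a random-effects meta-analysis like this looks like in R, using the metafor package. This is just a sketch; the study estimates below are made up for illustration and are not values from the JAMA paper.)

library(metafor)                   # install.packages("metafor") if needed
or <- c(1.10, 0.95, 1.03)          # hypothetical per-study odds ratios
se <- c(0.08, 0.12, 0.10)          # hypothetical standard errors of log(OR)
# fit a DerSimonian-Laird random-effects model on the log odds ratio scale
rma(yi=log(or), sei=se, method="DL")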

JAMA: Interaction Between the Serotonin Transporter Gene (5-HTTLPR), Stressful Life Events, and Risk of Depression

Tuesday, June 16, 2009

Hierarchical Clustering in R

Hierarchical clustering is a technique for grouping samples/data points into categories and subcategories based on a similarity measure. Being the powerful statistical package it is, R has several routines for doing hierarchical clustering. The basic command for doing HC is

hclust(d, method = "complete", members=NULL)


Nearly all clustering approaches use a concept of distance. Data points can be conceptualized in a dimensional space where we can compute a distance measure between each pair of points; points that are closer together get placed in groups, or "cluster" together. The input for the hclust function in R is a distance matrix, so we'll have to compute one of those somehow.
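For ordinary numeric data, R's built-in dist() function will compute a distance matrix for you. A minimal sketch using the built-in USArrests dataset:

# Euclidean distances between all pairs of rows, then cluster and plot
d <- dist(USArrests, method="euclidean")
plot(hclust(d))

In our case, though, the distances come from genetic data rather than from dist().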

For this example, I'm using an Identity By State (IBS) matrix, computed with the SNPDUO software. This matrix essentially gives us the percentage of alleles two people share. SNPDUO doesn't output a matrix per se, so I had to transform the output with a Perl script. I get something like this out of SNPDUO:

Family1,Indiv1,Family2,Indiv2,MeanIBS
Family1,Indiv1,Family2,Indiv3,MeanIBS
Family1,Indiv1,Family2,Indiv4,MeanIBS
Family1,Indiv2,Family2,Indiv3,MeanIBS
...

That's one row for each pair of individuals in my ped file. I need it in matrix form, like this:

Indiv1 Indiv2 Indiv3 Indiv4 ....
Indiv1
Indiv2
Indiv3
Indiv4
....

It's a fairly simple scripting job... Now that I have a matrix, I load it into R as a dataset:

# first row holds the column names; first column holds the row labels
ibs <- read.delim("C:/Users/bushws1/Desktop/matrix.txt", header=TRUE, row.names=1)
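(Incidentally, if you'd rather skip the Perl step, you could do the reshaping directly in R. A rough sketch, assuming the pairwise output was saved as pairs.csv with a header row naming the columns Indiv1, Indiv2, and MeanIBS; both the file name and the column names here are hypothetical, so adjust to your actual SNPDUO output.)

pw <- read.csv("C:/Users/bushws1/Desktop/pairs.csv")   # long-format pairwise data
ids <- unique(c(as.character(pw$Indiv1), as.character(pw$Indiv2)))
# empty square matrix with individual IDs as row/column names
ibs <- matrix(0, nrow=length(ids), ncol=length(ids), dimnames=list(ids, ids))
# fill both triangles so the matrix is symmetric
ibs[cbind(as.character(pw$Indiv1), as.character(pw$Indiv2))] <- pw$MeanIBS
ibs[cbind(as.character(pw$Indiv2), as.character(pw$Indiv1))] <- pw$MeanIBS
# note: if MeanIBS is a similarity (bigger = more alike), you may want
# 1 - MeanIBS so that larger values mean farther apart before clustering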


Next, we'll run hclust on this distance matrix:
HC <- hclust(as.dist(ibs))


The as.dist(ibs) transforms the dataframe called "ibs" into a distance matrix in the R environment. Using this command, we are asking R to generate hierarchical groups of individuals from our data based on genetic similarity. We now have a hierarchical clustering object called "HC". We can plot the results of our cluster analysis using this command:
plot(HC)

If your dataset is small, this might work well for you, but for most genomics applications, you'll get a tree-shaped fuzzball like this:


The solution is to load the "ctc" library from Bioconductor, which lets us export the cluster object in the Newick file format. The tree can then be imported into other, more powerful graphing programs. This is done like so:

library(ctc)
write.table(hc2Newick(HC), file="C:/Users/bushws1/Desktop/ibs_new.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)

You now have a file in Newick format. (By default write.table wraps character output in quotation marks; the quote=FALSE argument above turns that off, so the file should be ready to use as-is.)

To get a better, more readable plot, download Dendroscope from the University of Tübingen (nice place, by the way). Dendroscope lets you import the Newick file you created and gives you extensive plotting options. Check out this wicked circular cladogram...



In this example there isn't an easy, intuitive explanation for the groupings, but in other settings the groups might make more sense. There are lots of options for computing the clustering (the linkage method, the distance measure), and they can give very different results, so proceed with caution. In general, though, hierarchical clustering is a useful tool for lots of data analysis situations.
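If you're curious how sensitive your tree is to that choice, here's a quick sketch that plots dendrograms from three common linkage methods side by side, reusing the ibs matrix from above:

par(mfrow=c(1,3))   # three plots in one row
for (m in c("single", "complete", "average")) {
  plot(hclust(as.dist(ibs), method=m), main=m)
}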

NYT: In Simulation Work, the Demand Is Real

The New York Times published this interesting article on how the ability to design and perform computer simulations is a highly marketable skill for careers across many disciplines.

In methodology development we use simulation nearly every day. We've developed our own specialized genetic data simulation software, genomeSIMLA, that's freely available here by request for PC, Mac, and Linux.

But if you have R on your computer (get it free here), here's how to do a really simple Monte Carlo simulation to determine the power of a one-sample t-test.

First, fire up R and type this command:

rnorm(100)

That command generates 100 random numbers drawn from a standard normal distribution, mean=0, sd=1. Now type this:

rnorm(100,mean=2,sd=7)

That also draws 100 random numbers from a normal distribution, but this time the mean is 2 and the standard deviation is 7. You can write the same thing more compactly by passing the arguments positionally:

rnorm(100,2,7)

Now, let's do a one sample t-test:

t.test(rnorm(100,2,7))

That command performs a one-sample t-test on 100 samples drawn from a normal distribution with mean=2 and sd=7. Remember, the null hypothesis of a one-sample t-test is that the true mean is zero. So if the p-value is less than .05, we would reject the null and conclude that the mean is significantly different from zero.

Now, we knew the mean was different from zero, because we told R to draw from a distribution with mean=2. But with only 100 samples, how likely are we to actually detect that difference? That's the power of the test: given that the null hypothesis is false, how likely are we to reject it?

One way we can answer this question is with a simulation.

First, let's run the same command, but extract only the p-value from the t-test:

t.test(rnorm(100,2,7))$p.value

Was it less than .05? Try typing it again (you can hit the up arrow key to bring up the last command in R, just like on the Linux command line). You'll get a different answer because you've drawn a different set of 100 observations. Type it over and over: sometimes it will be less than .05, other times it won't. Let's do this 1000 times and see how often it's less than .05, using the replicate command:

replicate(1000,t.test(rnorm(100,2,7))$p.value)

That simulates doing the t-test 1000 times, and gives you the p-value from each one.

Now, let's do a logical test to see which of those are less than .05:

replicate(1000,t.test(rnorm(100,2,7))$p.value)<0.05

If you typed that in you'll see lots of TRUEs and FALSEs. TRUE means the t-test on that particular sample gave a p-value less than .05. Internally, R represents TRUE as 1 and FALSE as 0, so taking the mean of all 1000 of these gives the proportion of trials in which the p-value of the one-sample t-test was less than .05:

mean(replicate(1000,t.test(rnorm(100,2,7))$p.value)<0.05)

When I did this the power was right around 80%. Run it again and you'll get a slightly different answer: we're sampling randomly, so the estimate varies from run to run. (Use set.seed() first if you want reproducible results.)
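To see how power grows with sample size, you can wrap the whole thing in a function. A minimal sketch (power.at.n is a made-up helper name):

# estimate power by simulation for a one-sample t-test at sample size n
power.at.n <- function(n, nsim=1000, mu=2, sigma=7) {
  mean(replicate(nsim, t.test(rnorm(n, mu, sigma))$p.value) < 0.05)
}
ns <- c(25, 50, 75, 100, 150)
plot(ns, sapply(ns, power.at.n), type="b", xlab="sample size", ylab="estimated power")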

Congratulations, you just ran your first simulation / power study! Of course, because we know the sampling distribution of the t-statistic under the null, we can determine the power of a t-test mathematically without doing any simulation:

power.t.test(n=100,delta=2,sd=7,sig.level=.05,type="one.sample")

But if we had developed our own method or algorithm, we probably wouldn't have a formula for calculating power, which is why we rely on simulation studies. Be sure to check out my other posts on power calculation software, choosing the correct analyses, and code to run analyses in R and other software.

Monday, June 15, 2009

Side by side analyses in Stata, SPSS, SAS, and R

I've linked to UCLA's stat computing resources once before, in a previous post about choosing the right analysis for the questions you're asking and the data types you have. Here's another section of the same website with code to run an identical analysis in each of these statistical packages, along with examples to walk through (as they note there, just because code isn't listed for a particular package doesn't mean that package can't do the analysis). They also have examples of power calculations for a smattering of statistics using the previously mentioned G*Power, in addition to other tools.

UCLA Stat computing - data analysis examples with Stata, SPSS, SAS, R, and others

Thursday, June 11, 2009

Get your genome sequenced by Illumina for $48k

This week Illumina launched its own personal genome sequencing service. For $48,000 they'll send you the sequence of your entire genome, delivered on a Mac computer you get to keep. According to their website, all the sequencing is done in a CLIA-certified clinical lab. One thing that sets this apart from other consumer genetics services: you're required to consult your doctor before signing up and have them request the sequencing for you, like writing a prescription. The sequence then goes back to your doctor, who discusses the results with you.

Now, even as a geneticist I'm not sure what to tell a layperson to do with all 3 billion of their bases sequenced, so what is a general practitioner to do when their patient seeks medical advice based on a service like this? Share your thoughts in the comments.

http://www.everygenome.com/

Tuesday, June 9, 2009

Challenges of translating genetic tests into clinical and public health practice

Here's a good paper published online this morning in Nature Reviews Genetics. As the title suggests, it covers many of the challenges of translating molecular genetic tests into clinical practice, with tons of references on the ethical, legal, and social issues surrounding genetic testing in health management. The authors also introduce several methods from econometrics (value-of-information analysis, cost-benefit analysis, cost-effectiveness analysis, cost-utility analysis, etc.) that policy and decision makers frequently use to prioritize technology and research funding. As the authors point out, though, while such prioritization should ideally rest on rigorous appraisal of the existing scientific evidence, that evidence is often unavailable or prohibitively expensive to collect. They conclude that prioritizing funding for translational genetic research remains just as much of a challenge as the genomic research itself.

Challenges of translating genetic tests into clinical and public health practice (NRG AOP June 9)

Monday, June 8, 2009

Make Pretty Regression Tables in Stata

The estout package for Stata is useful for quickly turning regression results into nicely formatted tables for presentations or papers. To install it, fire up Stata and type in this command:

ssc install estout, replace

Stata will automatically download and install the package. Run the regression as you normally would, then use the esttab command (part of the estout package) to create a formatted table from those results. See their examples for how the command works.

Click the thumbnail below to check out the difference. The top half is Stata's default output. The bottom is estout's formatted output.


Friday, June 5, 2009

Introductory statistics with R

I know a lot of you are scrambling to spend your training grant money by next week. If you think you'll ever need to use R, I strongly recommend buying this book: Introductory Statistics with R by Peter Dalgaard ($48, Amazon). I picked it up a while back and read through most of it in a day or two. It's not the best book for learning basic statistics, but if you already have some background it's a great resource for learning how to do analyses in R, suitable for people who've never used R before. Incidentally, the author is an editor of the previously mentioned R Journal.

You can actually view most of the content from an older edition of this book online for free at Google Books.

Wednesday, June 3, 2009

Use meaningful color schemes in your figures

Some of the best figure design ideas come from cartographers; if you've ever read a Tufte book you've seen lots of examples. Let's talk about using color effectively. Penn State geography professor Cindy Brewer's ColorBrewer tool for selecting color schemes for figures has been conveniently packaged into an R library called RColorBrewer. You'll have to read the RColorBrewer documentation to see how it works, but here I just want to point out how to use color schemes in a meaningful way based on the type of data you're presenting.

In general, three useful color schemes are diverging, sequential, and qualitative. The example palettes shown here can be reproduced with the following R code:

install.packages("RColorBrewer")
library(RColorBrewer)
display.brewer.all(type="seq")
display.brewer.all(type="div")
display.brewer.all(type="qual")

A diverging color scheme is useful for de-emphasizing the mean value, or for drawing attention to departures from a critical midpoint in either direction, such as in a normal distribution. Here's an example color palette and an example of how you might use a diverging scheme:





A sequential color scheme is useful for de-emphasizing zero (or the lower bound of the data) while highlighting the importance of increasing values. Here's a palette and an example barchart.
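To actually use one of these palettes in a plot, brewer.pal() returns the colors themselves. A minimal sketch with a sequential palette:

library(RColorBrewer)
# color a simple barchart with 7 shades from the sequential "Blues" palette
barplot(1:7, col=brewer.pal(7, "Blues"))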





Finally, a qualitative color scheme provides high contrast between adjacent values and is useful for categorical or nominal data. Here's a palette; for an example, check out the figure from the previously mentioned genetic diversity in Africa paper:




See Cindy Brewer's explanation of this topic for more details and examples of each scheme.

Monday, June 1, 2009

A good GWAS review for the uninitiated

GWAS reviews are a dime a dozen these days, but this one in particular is a good, up-to-date review suitable for people relatively new to the field. While it focuses on psychiatric genetics, it includes short summaries of topics like positional methods, candidate gene studies, common variation, rare variation, CNVs, GWAS design, and issues concerning power and sample size. There's a table defining basic terminology, a timeline of methods from linkage to the 1000 Genomes Project, and a table summarizing potentially problematic GWAS design issues.

Genomewide Association Studies: History, Rationale, and Prospects for Psychiatric Disorders (Pubmed)

For other links to send new folks joining your group, check out posts tagged with Tutorials or Recommended Reading.
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.