Monday, November 23, 2009

NYT: SAS threatened by R

The New York Times had an interesting piece yesterday about how SAS is facing several business threats from companies like the recently IBM-acquired SPSS, and from burgeoning interest in open-source software like R.  The NYT ran an entire article about R earlier this year, and this article discusses how SAS has been revamping their technology to work seamlessly with R code in response to R's growing popularity in academia and other research labs.

NYT: At a Software Powerhouse, the Good Life Is Under Siege

NYT Slideshow: At SAS, Taking Care of Employees Is Good Business

Wednesday, November 18, 2009

Cancer Epidemiology, Biostatistics, and Bioinformatics Retreat

The 2009 Cancer Epidemiology, Biostatistics, and Bioinformatics Retreat will be held on Friday, December 4th, 2009, from 1:30 pm to 5:00 pm, on the eighth floor of the VICC building (898B PRB). The purpose of the retreat is to promote interactions among biostatisticians, bioinformaticians, epidemiologists, clinical investigators, and other translational researchers. The retreat will include several slide presentations that provide overviews of research direction and summaries of ongoing epidemiologic, biostatistics, and bioinformatics research at Vanderbilt, as well as over 30 poster presentations. These presentations will cover recent findings from epidemiologic studies; resources for epidemiologic and translational research; and statistical challenges and state-of-the-art methodologies in clinical trials, observational studies, and basic science, as well as bioinformatics and systems biology approaches to studying complex diseases.

Looks like the retreat is free to anyone who registers using the form on the website below.  I'm going, hope to see a few of you there.

Cancer Epidemiology, Biostatistics, and Bioinformatics Retreat

Monday, November 16, 2009

Seminar: Reproducible Research with R, LaTeX, & Sweave

Theresa Scott, instructor of the previously mentioned R workshop and weekly R clinic, is giving a lecture entitled "Reproducible Research with R, LaTeX, & Sweave" in MRB III, room 1220, this Wednesday 11/18 at 1:30.  You can see more details about the lecture here.

Looks like her slides as well as much more introductory material on R, Latex, and Sweave are on her website.

Reproducible Research with R, LaTeX, & Sweave

Monday, November 9, 2009

QQ plots of p-values in R using ggplot2

Way back will wrote on this topic.  See his previous post for Stata code for doing this.  Unfortunately the R package that was used to create QQ-plots here has been removed from CRAN, so I wrote my own using ggplot2 and some code I received from Daniel Shriner at NHGRI.

Of course you can use R's built-in qqplot() function, but I could never figure out a way to add the diagonal using base graphics.  To get the function I wrote to make qqplots, copy and paste this into your R session:

qq = function(pvector, title="Quantile-quantile plot of p-values", spartan=F) {
    # Thanks to Daniel Shriner at NHGRI for providing this code for creating expected and observed values
    o = -log10(sort(pvector,decreasing=F))
    e = -log10( 1:length(o)/length(o) )
    # you could use base graphics
    #plot(e,o,pch=19,cex=0.25, xlab=expression(Expected~~-log[10](italic(p))), ylab=expression(Observed~~-log[10](italic(p))),xlim=c(0,max(e)),ylim=c(0,max(e)))
    #You'll need ggplot2 installed to do the rest
    plot=qplot(e,o, xlim=c(0,max(e)), ylim=c(0,max(o))) + stat_abline(intercept=0,slope=1, col="red")
    if (spartan) plot=plot+opts(panel.background=theme_rect(col="grey50"), panel.grid.minor=theme_blank())

Also, make sure you have ggplot2 installed:


If you already have it installed, load it:


The function takes a vector of p-values, optionally a title, and a third option to make the plot black and white.  You can check the default arguments to the function by typing args(qq).

qq(data$Pvals, title="My Quantile-Quantile Plot")

You can also make the plot more "Spartan" by removing the title (setting it to NULL) and by making the color scheme black on white:

qq(data$Pvals, title=NULL, spartan=TRUE)

Stay tuned - I'm also putting together a way to import your PLINK output files and make better Manhattan plots using R and ggplot2 than you ever could with Haploview.

Update Tuesday, November 10, 2009: A big tip of the hat to the anonymous commenter who pointed out that this can easily be done in the base graphics package, as well as adding confidence intervals.  As always, we welcome your comments if you know of a better or easier way to do anything mentioned here!

Wednesday, November 4, 2009

Split, apply, and combine in R using PLYR

While flirting around with previously mentioned ggplot2 I came across an incredibly useful set of functions in the plyr package, made by Hadley Wickham, the same guy behind ggplot2.  If you've ever used MySQL before, think of "GROUP BY", but here you can arbitrarily apply any R function to splits of the data, or write one yourself.

Imagine you have two SNPs, and 2 different variables you want info about across all three genotypes at each locus.  You can generate some fake data using this command:

mydata=data.frame(X1=rnorm(30), X2=rnorm(30,5,2), SNP1=c(rep("AA",10), rep("Aa",10), rep("aa",10)), SNP2=c(rep("BB",10), rep("Bb",10), rep("bb",10)))

The data should look like this:

            X1          X2 SNP1 SNP2
1  -0.73671602  2.76733658   AA   BB
2  -1.83947190  5.70194282   AA   BB
3  -0.46963370  4.89393264   AA   BB
4   0.85105460  6.37610477   AA   BB
5  -1.69043368  8.11399122   AA   BB
6  -1.47995127  4.32866212   AA   BB
7   0.21026063  4.32073489   AA   BB
8   0.19038088  4.26631298   AA   BB
9  -0.29759084  4.07349852   AA   BB
10 -0.46843672  7.01048905   AA   BB
11  0.93804369  3.78855823   Aa   Bb
12 -1.46223800  5.08756968   Aa   Bb
13 -0.01338604 -0.01494702   Aa   Bb
14  0.13704332  4.53625220   Aa   Bb
15  0.59214151  3.75293124   Aa   Bb
16 -0.27658821  5.78701933   Aa   Bb
17  0.05527139  5.99460464   Aa   Bb
18  1.00077756  5.26838121   Aa   Bb
19  0.62871877  6.98314827   Aa   Bb
20  1.18925876  8.61457227   Aa   Bb
21 -0.40389888  3.36446128   aa   bb
22 -1.43420595  6.50801614   aa   bb
23  1.79086285  5.39616828   aa   bb
24  0.15886243  3.98010396   aa   bb
25  0.57746024  4.93605364   aa   bb
26 -1.11158987  6.40760662   aa   bb
27  1.93484117  3.33698750   aa   bb
28  0.10803302  4.56442931   aa   bb
29  0.56648591  4.48502267   aa   bb
30  0.03616814  5.77160936   aa   bb

Now let's say you want the mean and standard deviation of X1 and X2 across each of the multilocus genotypes at SNP1 and SNP2.  First, install and load the plyr package:


Now run the following command:

ddply(mydata, c("SNP1","SNP2"), function(df) data.frame(meanX1=mean(df$X1), sdX1=sd(df$X1), meanX2=mean(df$X2), sdX2=sd(df$X2)))

And you get exactly what you asked for:

  SNP1 SNP2     meanX1      sdX1   meanX2     sdX2
1   aa   bb  0.2223019 1.0855063 4.875046 1.144623
2   Aa   Bb  0.2789043 0.7817727 4.979809 2.286914
3   AA   BB -0.5730538 0.8834038 5.185301 1.601626

That command above may appear complicated at first, but it isn't really.  Cerebral mastication gives a very easy to follow, concise explanation of what that command is doing.  Also, check out the slides he presented at a recent R meetup discussion:

There's more to plyr than ddply, so check it out for yourself!

Monday, November 2, 2009

Common disorders are quantitative traits

There are no common disorders - only extremes of quantitative traits.


That's the argument made by Plomin, Haworth, and Davis in a great review paper just published online in Nature Reviews Genetics.  One of the main themes presented here is that as a disorder becomes more common, a case-control design becomes less and less powerful to identify associated genetic variants because cases become less extreme and the control group becomes increasingly contaminated by "near cases".

Some qualitative disorders have obvious relevant quantitative traits - BMI for obesity, blood pressure for hypertension,  mood for depression.  I recently authored a review with Dana and Marylyn making a similar argument in the context of pharmacogenomics research.  The authors admit, however, that quantitative measurements are not at all apparent for some disorders, such as alcoholism, arthritis, autism, dementia, diabetes, or heart disease.

The review also has a glossary for the uninitiated, encouraging the use of quantitative vocabulary, like linear regression or ANOVA instead of logistic regression, variance and mean differences rather than risk, and covariance rather than comorbidity.

The conclude with statement that the extremes of a distribution are important medically, but there is no scientific or statistical advantage in analyzing artificially constructed disease labels that evolved historically on the basis of symptoms rather than etiology.

Common disorders are quantitative traits (NRG AOP).
Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.