The New York Times had an interesting piece yesterday about how SAS is facing several business threats from companies like the recently IBM-acquired SPSS, and from burgeoning interest in open-source software like R. The NYT ran an entire article about R earlier this year, and this article discusses how SAS has been revamping their technology to work seamlessly with R code in response to R's growing popularity in academia and other research labs.
NYT: At a Software Powerhouse, the Good Life Is Under Siege
NYT Slideshow: At SAS, Taking Care of Employees Is Good Business
Monday, November 23, 2009
Wednesday, November 18, 2009
Cancer Epidemiology, Biostatistics, and Bioinformatics Retreat
The 2009 Cancer Epidemiology, Biostatistics, and Bioinformatics Retreat will be held on Friday, December 4th, 2009, from 1:30 pm to 5:00 pm, on the eighth floor of the VICC building (898B PRB). The purpose of the retreat is to promote interactions among biostatisticians, bioinformaticians, epidemiologists, clinical investigators, and other translational researchers. The retreat will include several slide presentations that provide overviews of research direction and summaries of ongoing epidemiologic, biostatistics, and bioinformatics research at Vanderbilt, as well as over 30 poster presentations. These presentations will cover recent findings from epidemiologic studies; resources for epidemiologic and translational research; and statistical challenges and state-of-the-art methodologies in clinical trials, observational studies, and basic science, as well as bioinformatics and systems biology approaches to studying complex diseases.
Looks like the retreat is free to anyone who registers using the form on the website below. I'm going, hope to see a few of you there.
Cancer Epidemiology, Biostatistics, and Bioinformatics Retreat
Looks like the retreat is free to anyone who registers using the form on the website below. I'm going, hope to see a few of you there.
Cancer Epidemiology, Biostatistics, and Bioinformatics Retreat
Tags:
Announcements
Monday, November 16, 2009
Seminar: Reproducible Research with R, LaTeX, & Sweave
Theresa Scott, instructor of the previously mentioned R workshop and weekly R clinic, is giving a lecture entitled "Reproducible Research with R, LaTeX, & Sweave" in MRB III, room 1220, this Wednesday 11/18 at 1:30. You can see more details about the lecture here.
Looks like her slides as well as much more introductory material on R, Latex, and Sweave are on her website.
Reproducible Research with R, LaTeX, & Sweave
Looks like her slides as well as much more introductory material on R, Latex, and Sweave are on her website.
Reproducible Research with R, LaTeX, & Sweave
Tags:
Announcements,
R
Monday, November 9, 2009
QQ plots of p-values in R using ggplot2
Way back will wrote on this topic. See his previous post for Stata code for doing this. Unfortunately the R package that was used to create QQ-plots here has been removed from CRAN, so I wrote my own using ggplot2 and some code I received from Daniel Shriner at NHGRI.
Of course you can use R's built-in qqplot() function, but I could never figure out a way to add the diagonal using base graphics. To get the function I wrote to make qqplots, copy and paste this into your R session:
Also, make sure you have ggplot2 installed:
If you already have it installed, load it:
library(ggplot2)
The function takes a vector of p-values, optionally a title, and a third option to make the plot black and white. You can check the default arguments to the function by typing args(qq).
qq(data$Pvals, title="My Quantile-Quantile Plot")
You can also make the plot more "Spartan" by removing the title (setting it to NULL) and by making the color scheme black on white:
qq(data$Pvals, title=NULL, spartan=TRUE)
Stay tuned - I'm also putting together a way to import your PLINK output files and make better Manhattan plots using R and ggplot2 than you ever could with Haploview.
Update Tuesday, November 10, 2009: A big tip of the hat to the anonymous commenter who pointed out that this can easily be done in the base graphics package, as well as adding confidence intervals. As always, we welcome your comments if you know of a better or easier way to do anything mentioned here!
Of course you can use R's built-in qqplot() function, but I could never figure out a way to add the diagonal using base graphics. To get the function I wrote to make qqplots, copy and paste this into your R session:
qq = function(pvector, title="Quantile-quantile plot of p-values", spartan=F) {
# Thanks to Daniel Shriner at NHGRI for providing this code for creating expected and observed values
o = -log10(sort(pvector,decreasing=F))
e = -log10( 1:length(o)/length(o) )
# you could use base graphics
#plot(e,o,pch=19,cex=0.25, xlab=expression(Expected~~-log[10](italic(p))), ylab=expression(Observed~~-log[10](italic(p))),xlim=c(0,max(e)),ylim=c(0,max(e)))
#lines(e,e,col="red")
#You'll need ggplot2 installed to do the rest
plot=qplot(e,o, xlim=c(0,max(e)), ylim=c(0,max(o))) + stat_abline(intercept=0,slope=1, col="red")
plot=plot+opts(title=title)
plot=plot+scale_x_continuous(name=expression(Expected~~-log[10](italic(p))))
plot=plot+scale_y_continuous(name=expression(Observed~~-log[10](italic(p))))
if (spartan) plot=plot+opts(panel.background=theme_rect(col="grey50"), panel.grid.minor=theme_blank())
plot
}
# Thanks to Daniel Shriner at NHGRI for providing this code for creating expected and observed values
o = -log10(sort(pvector,decreasing=F))
e = -log10( 1:length(o)/length(o) )
# you could use base graphics
#plot(e,o,pch=19,cex=0.25, xlab=expression(Expected~~-log[10](italic(p))), ylab=expression(Observed~~-log[10](italic(p))),xlim=c(0,max(e)),ylim=c(0,max(e)))
#lines(e,e,col="red")
#You'll need ggplot2 installed to do the rest
plot=qplot(e,o, xlim=c(0,max(e)), ylim=c(0,max(o))) + stat_abline(intercept=0,slope=1, col="red")
plot=plot+opts(title=title)
plot=plot+scale_x_continuous(name=expression(Expected~~-log[10](italic(p))))
plot=plot+scale_y_continuous(name=expression(Observed~~-log[10](italic(p))))
if (spartan) plot=plot+opts(panel.background=theme_rect(col="grey50"), panel.grid.minor=theme_blank())
plot
}
Also, make sure you have ggplot2 installed:
install.packages("ggplot2")
If you already have it installed, load it:
library(ggplot2)
The function takes a vector of p-values, optionally a title, and a third option to make the plot black and white. You can check the default arguments to the function by typing args(qq).
qq(data$Pvals, title="My Quantile-Quantile Plot")
You can also make the plot more "Spartan" by removing the title (setting it to NULL) and by making the color scheme black on white:
qq(data$Pvals, title=NULL, spartan=TRUE)
Stay tuned - I'm also putting together a way to import your PLINK output files and make better Manhattan plots using R and ggplot2 than you ever could with Haploview.
Update Tuesday, November 10, 2009: A big tip of the hat to the anonymous commenter who pointed out that this can easily be done in the base graphics package, as well as adding confidence intervals. As always, we welcome your comments if you know of a better or easier way to do anything mentioned here!
Tags:
ggplot2,
GWAS,
R,
Tutorials,
Visualization
Wednesday, November 4, 2009
Split, apply, and combine in R using PLYR
While flirting around with previously mentioned ggplot2 I came across an incredibly useful set of functions in the plyr package, made by Hadley Wickham, the same guy behind ggplot2. If you've ever used MySQL before, think of "GROUP BY", but here you can arbitrarily apply any R function to splits of the data, or write one yourself.
Imagine you have two SNPs, and 2 different variables you want info about across all three genotypes at each locus. You can generate some fake data using this command:
mydata=data.frame(X1=rnorm(30), X2=rnorm(30,5,2), SNP1=c(rep("AA",10), rep("Aa",10), rep("aa",10)), SNP2=c(rep("BB",10), rep("Bb",10), rep("bb",10)))
The data should look like this:
X1 X2 SNP1 SNP2 1 -0.73671602 2.76733658 AA BB 2 -1.83947190 5.70194282 AA BB 3 -0.46963370 4.89393264 AA BB 4 0.85105460 6.37610477 AA BB 5 -1.69043368 8.11399122 AA BB 6 -1.47995127 4.32866212 AA BB 7 0.21026063 4.32073489 AA BB 8 0.19038088 4.26631298 AA BB 9 -0.29759084 4.07349852 AA BB 10 -0.46843672 7.01048905 AA BB 11 0.93804369 3.78855823 Aa Bb 12 -1.46223800 5.08756968 Aa Bb 13 -0.01338604 -0.01494702 Aa Bb 14 0.13704332 4.53625220 Aa Bb 15 0.59214151 3.75293124 Aa Bb 16 -0.27658821 5.78701933 Aa Bb 17 0.05527139 5.99460464 Aa Bb 18 1.00077756 5.26838121 Aa Bb 19 0.62871877 6.98314827 Aa Bb 20 1.18925876 8.61457227 Aa Bb 21 -0.40389888 3.36446128 aa bb 22 -1.43420595 6.50801614 aa bb 23 1.79086285 5.39616828 aa bb 24 0.15886243 3.98010396 aa bb 25 0.57746024 4.93605364 aa bb 26 -1.11158987 6.40760662 aa bb 27 1.93484117 3.33698750 aa bb 28 0.10803302 4.56442931 aa bb 29 0.56648591 4.48502267 aa bb 30 0.03616814 5.77160936 aa bb
Now let's say you want the mean and standard deviation of X1 and X2 across each of the multilocus genotypes at SNP1 and SNP2. First, install and load the plyr package:
install.packages("plyr")
library(plyr)
Now run the following command:
ddply(mydata, c("SNP1","SNP2"), function(df) data.frame(meanX1=mean(df$X1), sdX1=sd(df$X1), meanX2=mean(df$X2), sdX2=sd(df$X2)))
And you get exactly what you asked for:
SNP1 SNP2 meanX1 sdX1 meanX2 sdX2 1 aa bb 0.2223019 1.0855063 4.875046 1.144623 2 Aa Bb 0.2789043 0.7817727 4.979809 2.286914 3 AA BB -0.5730538 0.8834038 5.185301 1.601626
That command above may appear complicated at first, but it isn't really. Cerebral mastication gives a very easy to follow, concise explanation of what that command is doing. Also, check out the slides he presented at a recent R meetup discussion:
There's more to plyr than ddply, so check it out for yourself!
Monday, November 2, 2009
Common disorders are quantitative traits
There are no common disorders - only extremes of quantitative traits.
---
That's the argument made by Plomin, Haworth, and Davis in a great review paper just published online in Nature Reviews Genetics. One of the main themes presented here is that as a disorder becomes more common, a case-control design becomes less and less powerful to identify associated genetic variants because cases become less extreme and the control group becomes increasingly contaminated by "near cases".
Some qualitative disorders have obvious relevant quantitative traits - BMI for obesity, blood pressure for hypertension, mood for depression. I recently authored a review with Dana and Marylyn making a similar argument in the context of pharmacogenomics research. The authors admit, however, that quantitative measurements are not at all apparent for some disorders, such as alcoholism, arthritis, autism, dementia, diabetes, or heart disease.
The review also has a glossary for the uninitiated, encouraging the use of quantitative vocabulary, like linear regression or ANOVA instead of logistic regression, variance and mean differences rather than risk, and covariance rather than comorbidity.
The conclude with statement that the extremes of a distribution are important medically, but there is no scientific or statistical advantage in analyzing artificially constructed disease labels that evolved historically on the basis of symptoms rather than etiology.
Common disorders are quantitative traits (NRG AOP).
---
That's the argument made by Plomin, Haworth, and Davis in a great review paper just published online in Nature Reviews Genetics. One of the main themes presented here is that as a disorder becomes more common, a case-control design becomes less and less powerful to identify associated genetic variants because cases become less extreme and the control group becomes increasingly contaminated by "near cases".
Some qualitative disorders have obvious relevant quantitative traits - BMI for obesity, blood pressure for hypertension, mood for depression. I recently authored a review with Dana and Marylyn making a similar argument in the context of pharmacogenomics research. The authors admit, however, that quantitative measurements are not at all apparent for some disorders, such as alcoholism, arthritis, autism, dementia, diabetes, or heart disease.
The review also has a glossary for the uninitiated, encouraging the use of quantitative vocabulary, like linear regression or ANOVA instead of logistic regression, variance and mean differences rather than risk, and covariance rather than comorbidity.
The conclude with statement that the extremes of a distribution are important medically, but there is no scientific or statistical advantage in analyzing artificially constructed disease labels that evolved historically on the basis of symptoms rather than etiology.
Common disorders are quantitative traits (NRG AOP).
Tags:
Recommended Reading
Subscribe to:
Posts (Atom)