A while back Will showed you how to create QQ plots of p-values in Stata and in R using the now-deprecated sma package. A bit later on I showed you how to do the same thing in R using ggplot2. As much as we (and our readers) love ggplot2 around here, it can be quite a bit slower than using the built in base graphics. This was only recently a problem for me when I tried creating a quantile-quantile plot of over 12-million p-values. I wrote the code to do this in base graphics, which is substantially faster than using the ggplot2 code I posted a while back. The code an an example are below.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# http://GettingGeneticsDone.blogspot.com/ | |
# See http://gettinggeneticsdone.blogspot.com/p/copyright.html | |
# Define the function | |
ggd.qqplot = function(pvector, main=NULL, ...) { | |
o = -log10(sort(pvector,decreasing=F)) | |
e = -log10( 1:length(o)/length(o) ) | |
plot(e,o,pch=19,cex=1, main=main, ..., | |
xlab=expression(Expected~~-log[10](italic(p))), | |
ylab=expression(Observed~~-log[10](italic(p))), | |
xlim=c(0,max(e)), ylim=c(0,max(o))) | |
lines(e,e,col="red") | |
} | |
#Generate some fake data that deviates from the null | |
set.seed(42) | |
pvalues=runif(10000) | |
pvalues[sample(10000,10)]=pvalues[sample(10000,10)]/5000 | |
# pvalues is a numeric vector | |
pvalues[1:10] | |
# Using the ggd.qqplot() function | |
ggd.qqplot(pvalues) | |
# Add a title | |
ggd.qqplot(pvalues, "QQ-plot of p-values using ggd.qqplot") |
Here's what the resulting QQ-plot will look like:
Actually in base graphics, the easiest way to do the equangular line is abline(0,1, col='red')
ReplyDeleteAh, thanks Abhijit. That'll might speed things up just slightly I imagine.
ReplyDeleteAlso, for the qqplot, you really aren't plotting 1/n,2/n,...,1 against the p-values. There is a R function ppoints which gives the values against which the qqplot should be plotted. For large data this doesn't really matter, but technically....
ReplyDeleteActually, you could just have done:
e = -log10(ppoints(10000))
o = -log10(pvalues)
qqplot(e, o, pch=19, cex=1, main=main, ...,
xlab=expression(Expected~~-log[10](italic(p))),
ylab=expression(Observed~~-log[10](italic(p))),
xlim=c(0,max(e)), ylim=c(0,max(e)))
abline(0,1, col='red')
If you wanted something quick and dirty you could just have done qqplot(ppoints(length(pvalues)), pvalues, log='xy') which would have done things in terms of log10(p) as opposed to -log10(p).
Shouldn't you be setting your y-limit to
ReplyDeleteylim=c(0,max(o)) ?
Otherwise wouldn't it be cutting off everyone's "good news" p-values ??
Good catch Mike. This is now fixed. Thanks.
ReplyDeleteStupid question, but why do you take the -log 10?
ReplyDelete