Getting Genetics Done: GWAS Manhattan plots and QQ plots using ggplot2 in R

Wednesday, January 20, 2010

GWAS Manhattan plots and QQ plots using ggplot2 in R

**** UPDATE, May 15 2014 *****
The functions described here have now been wrapped into an R package. View the updated blog post or see the online package vignette for how to install and use. If you'd still like to use the old code described here, you can access this at version 0.0.0 on GitHub. The code below likely won't work.
*****************************

Will posted earlier this week about how to produce manhattan plots of GWAS results using Stata, so I thought I'd share how I do this in R using ggplot2.

First, if you've never used ggplot2, you'll need to add it to your R installation by typing:

install.packages("ggplot2")

Once you've done that, copy and paste this command to download the functions I wrote necessary to produce these plots. If you'd like to see the source code yourself, copy the URL into your web browser.

source("http://dl.dropbox.com/u/66281/0_Permanent/qqman.r")

Next, read in the PLINK results file to a data frame. Substitute plink.qassoc with your own results filename.

mydata=read.table("plink.qassoc", header=TRUE)

Finally, run this function which will produce and save in the current directory both a QQ-plot and a manhattan plot of your results:

qqman(mydata)

QQ-Plot (these are simulated GWAS results):

Manhattan plot (click to enlarge):

A few notes: First, if you're doing this on a linux machine from your Windows computer, you'll need to be running the previously mentioned XMing server on your Windows computer for the plot to save correctly. Second, the qqman() function calls the manhattan() function, which is extremely slow and memory-intensive. It may take about 3 minutes to run for each dataset. The memory issue isn't a problem on 64-bit machines, but you may run out of memory on 32-bit machines if you're doing this with GWAS data. I'm going to try to improve this in the future. Finally, using that source command you also downloaded a function I wrote called qqmanall(), which does just what it sounds like - if you run it on a linux machine with no arguments it reads in ALL of the plink GWAS results stored in the current directory, and creates QQ and manhattan plots for all of them with a common upper limit for the y-axis corresponding to the most significant result. Enjoy.

...

Update Thursday, January 21, 2010: I neglected to mention yesterday the format of the plink.assoc or plink.qassoc files, in case you want to produce the same plots using results from another software other than plink. When you load your .assoc files in a data frame, the relevant columns are named "CHR", "BP", and "P". You can use this plotting function as long as you have these three columns in your data frame, regardless of whether you use PLINK or not.

The latest version of the code is reproduced below:

**** UPDATE, May 15 2014 *****
The functions described here have now been wrapped into an R package. View the updated blog post or see the online package vignette for how to install and use. If you'd still like to use the old code described here, you can access this at version 0.0.0 on GitHub. The code below likely won't work.
*****************************

...

**** UPDATE, May 15 2014 *****
The functions described here have now been wrapped into an R package. View the updated blog post or see the online package vignette for how to install and use. If you'd still like to use the old code described here, you can access this at version 0.0.0 on GitHub. The code below likely won't work.
*****************************

39 comments:

AnonymousJanuary 20, 2010 at 4:07 PM
Hi guys, I am sure if you are aware of R analytic flow (http://www.ef-prime.com/products/ranalyticflow_en/support.html#contact). I am using it now. I think it deserves to be popularized. Have a look please.
ReplyDelete
Replies
AnonymousJanuary 30, 2010 at 12:19 PM
Thank you for the code for the manhattan plot.
However, I think, it does not sort on the chromosome, which is a natural method to do it.
Also, its gray scale. I would like to see something similar to the SVS Suite.
Great work guys! I often learn something new from your site.. keep up the good work and may the best sequence win!!
http://www.goldenhelix.com/SNP_Variation/solutions/gwas/index.html
ReplyDelete
Replies
AnonymousFebruary 17, 2010 at 10:58 AM
Hi,
Has there been any changes to the script. I tried running today and it gave a warning message
"You requested annotation but provided no SNPlist!"
ReplyDelete
Replies
Stephen TurnerFebruary 17, 2010 at 4:17 PM
Should be fixed now. I'm working on a way to annotate the plot by highlighting SNPs in a given list of SNPs. I had turned on an experimental option by default. It's off now.
ReplyDelete
Replies
AnonymousFebruary 22, 2010 at 2:59 PM
Dear all,

Hi, I have been doing gwas analysis using SAS. It's not optimized - the logistic regression generally churns out about ~1000 regressions per minute depending on the dataset (Nsamples, covariates, etc)

I was just wondering if R is much faster is this regard - Can it run 10000 or 50000 logistic regressions per minute?

thanks,
chris
ReplyDelete
Replies
Stephen TurnerFebruary 22, 2010 at 3:03 PM
Chris,

GWAS can very quickly be done in PLINK. I can run regression on 550,000 SNPs in 4000 samples in <10 minutes. I use R to then plot the results of the analysis in PLINK.

http://pngu.mgh.harvard.edu/~purcell/plink/

Stephen
ReplyDelete
Replies
UnknownMarch 4, 2010 at 7:51 PM
Hi, It's only can show 23 chromosome?
I research focus on dog, there are 39 chromosomes, so, how can I do ? Many thanks.
Jack.
ReplyDelete
Replies
Stephen TurnerMarch 4, 2010 at 7:52 PM
Jack,

Copy the code from the URL that you sourced, and comment out line 29 which limits to chromosomes 1-23.

d=d[d$CHR %in% 1:23, ]

Stephen
ReplyDelete
Replies
AnonymousMarch 22, 2010 at 12:13 PM
Hello,

Thanks for sharing this great code. It runs very well and is helpful. I am now trying to overlay a second set of p-values (after adjustment/post-test stuff) and would like to add points of the same colour scheme but a different shape. I have been working with your source code, but haven't had any luck. Any suggestions/help would be much appreciated.

Thanks again,

Darren
ReplyDelete
Replies
Stephen TurnerMarch 22, 2010 at 12:36 PM
Darren,

Try adding another column to your results file, a factor variable that would be something like "raw" and "adjusted" for the different sets of p-values. Then in the qplot function, add the option shape=yourNewVariable.

Hopefully that helps,

Stephen
ReplyDelete
Replies
AnonymousMarch 23, 2010 at 12:13 PM
Hey Stephen,

Your suggestion worked perfectly!

Thanks for your help,

Darren
ReplyDelete
Replies
lpriebeApril 19, 2010 at 10:44 AM
Hi Stephen,

Great code! I find it very useful for genome-wide plots. I tried to make sepearte plots for each chromosome using the qqmanall() function. However, when using that function, I receive an error:

Error in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
replacement has 0 rows, data has 138906

And R returns to the prompt. I guess there might be something wrong with the input files. Could you provide a hint what the problem could be?

Kind regrads,

Lutz
ReplyDelete
Replies
AnonymousJuly 6, 2010 at 6:04 AM
Hi!
I'm beginner in R and I tested this code. Input file was PLINK .assoc file. Other steps went ok, but the last one qqman(mydata) gives the following error message:

Error in .Primitive("[")(dots[[1L]][[1L]], dots[[2L]][[1L]]) :
invalid subscript type 'list'

Help needed how to make this work. Thanks in advance !

Päivi
ReplyDelete
Replies
AnonymousJuly 6, 2010 at 6:32 AM
Sorry! My problem was caused by the memory running out. With a more powerful machine, everything worked ok! Great code!

Päivi
ReplyDelete
Replies
AnonymousJuly 22, 2010 at 6:03 PM
Hi,
I would like to use your function, but my R crashed running qqman ;( Ich have a dataset with 500 000 PValues. What could be the reason??

Cheers
J
ReplyDelete
Replies
AnonymousSeptember 3, 2010 at 1:01 AM
I'm new to R but when I input the line:

source("http://dl.dropbox.com/u/66281/0_Permanent/qqman.r")

in R, an error comes saying the website is forbidden to be accessed (403 error). Any ideas? Also, whats the best way to run this using a Linux machine?
ReplyDelete
Replies
Stephen TurnerSeptember 3, 2010 at 8:48 AM
It looks like you have a problem connecting to the web. I update the source code in that file on the web (in the source command), but you can also copy and paste the source code here into your R terminal and it should work, provided you have ggplot2 installed.
ReplyDelete
Replies
AnonymousSeptember 7, 2010 at 10:21 AM
thanks for the great functions. I'm trying to add extra data and noted your previous post

'Try adding another column to your results file, a factor variable that would be something like "raw" and "adjusted" for the different sets of p-values. Then in the qplot function, add the option shape=yourNewVariable.'

you couldn't spell this out a little for me could you? I can't see where to add this
ReplyDelete
Replies
AnonymousOctober 3, 2010 at 10:39 AM
hi,

Can anyone tell me in layman terms what do we try to interpret using Q-Q plot in GWAS
ReplyDelete
Replies
byaspanJanuary 12, 2011 at 11:35 AM
Thanks for the great code! I ran into an error, and figured out that if you have PLINK output with chr23 where the pvalue is "NA" the script will give this error:

Error in data.frame(x = c(5.99577642700764, 5.51865517228798, 5.29680642267162, :
arguments imply differing number of rows: 495161, 484419

If you take out the chr23 probes from the file, the script runs fine.
ReplyDelete
Replies
CFinnoJanuary 21, 2011 at 2:40 PM
The code is great but I had the same error as above and I would prefer not to have to hand-remove every SNP across the genome since I'm running a Q-Q genome-wide. Why won't the script accept NA for missing data as I thought R preferred that? I don't want to change it to 0 as that isn't accurate. Thanks for the help!
ReplyDelete
Replies
Stephen TurnerJanuary 23, 2011 at 11:53 AM
CFinno - I'm between jobs at the moment - about to start my postdoc. Once I'm there I'll get back to this and put in some checking and remove these lines so you won't have to do it before plotting. One simple way to remove these lines from your plink results is using grep.

grep -v NA plinkresult.txt > cleanresults.txt
ReplyDelete
Replies
UnknownJanuary 31, 2011 at 11:26 AM
Stephen- No problem I created an Excel macro to get rid of the lines.
After you are settled into your post-doc (good luck!), can you let me know if there is a way to obtain the Manhattan plots using this script for other species (i.e. with more than 23 chromosomes)? I can input my data, which goes up to 32 chromosomes, but the plot only extends to 23. I've been able to plot it using PLINK so this is more of a comparison than a necessity. Thanks!
ReplyDelete
Replies
UnknownMarch 16, 2011 at 7:39 PM
hi Stephen:
thank you for the codes, i got the same error like Lutz.

Error in `$<-.data.frame`(`*tmp*`, "pos", value = NA) :
replacement has 1 rows, data has 0
if you please could help i would be great.
ibrahim
ReplyDelete
Replies
Stephen TurnerMarch 16, 2011 at 8:26 PM
Please forgive me for my sloth in responding. In reality I can't promise to find time to investigate some of these issues at the moment – I recently started a new job, and the code I post here doesn’t do much for my CV. I hope you understand – I know it's a bit unsatisfactory – I write some software for myself for a quick one-off plot or scripting job, and I put the code on here touting it as useful, but it's not always clear how maintaining it fits into my job description. If you can fix it, or at least work out exactly what's going wrong, please leave a comment. If there are any useful changes I can make to the source code I will be happy to push a new version to GitHub and put the link here.

I will eventually have to get back to this code because it needs serious optimizing, but it may be a while. My apologies, and thanks for understanding.
ReplyDelete
Replies
Stephen TurnerApril 19, 2011 at 4:23 PM
I recently made a few bugfixes and rewrote the manhattan plot function using base graphics, which is way faster and more memory efficient. You can grab the code here: https://gist.github.com/929702
ReplyDelete
Replies
Stephen TurnerApril 25, 2011 at 3:27 PM
*** Update April 25, 2011: This code has gone through a major revision. Please see the updated code and tutorial here. ***
ReplyDelete
Replies
AnonymousJune 4, 2011 at 3:59 AM
Hi
I have a manhattan plot with 2 abline standing suggestiveline and genomewide thresholds respectively. I wanted to insert a legend for nominated FDR each line using legend() function, however it returns error.below is my legend()code

legend('topright',c("FDR =2%","FDR =1%"),lty=1,col=c("blue","red"))

Thanks,
Ali
ReplyDelete
Replies
AnonymousSeptember 24, 2011 at 10:09 AM
any way to get a chromosome zero in there?
ReplyDelete
Replies
AnonymousDecember 15, 2011 at 4:25 PM
Hi,

thanks for these nice plots. Good work, well commented, so it's easy to change in a little personalised. :-) And I hope you fell comfortable with your new job?!

Regards,
Jessi
ReplyDelete
Replies
MQNovember 7, 2012 at 7:28 PM
Im having this error.

I have a txt file with this headers and content:
CHR BP SNP CHI P O Allele
7 11564000 rs18 5.383549204225 0.020272838021356 97.14 A
7 11564000 rs18 4.3746157847449 0.03646503291877 97.14 G
7 11563681 rs19 6.1913037310257 0.012775209303653 98.57 A
7 11563681 rs19 7.7 0.0054704058252163 98.57 G
7 11563459 rs21 5.3638095238095 0.020504591038222 90 A
7 11563027 rs22 5.3638095238095 0.020504591038222 90 G
7 11564744 rs24 6.8433179723502 0.0088369383198263 90 C

I thought the R code could extract the CHR and BP and P value. because it worked before, but now it does not. The only thing i did now was to ouput out that file an extra column name V .

But anyway the code would work as before (thats what i expect), what wrong?

THe error is this:
Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 2L, 3L, 4L, 4L, 5L, 5L, :
max not meaningful for factors
ReplyDelete
Replies
Stephen TurnerNovember 8, 2012 at 4:49 AM
I've made a few changes since this post. Please see the updated code and tutorial.
ReplyDelete
Replies
UnknownNovember 20, 2012 at 4:08 AM
This comment has been removed by the author.
ReplyDelete
Replies
EinatJanuary 8, 2013 at 7:11 AM
Hello,
I am trying to create a simple Manhattan plot for a small list of 200 SNPs
spread out in the genome in different genes.
I have tried using your function but it does not work smoothly.
Can you suggest a simple way to create the plot (not for all 22
chromosomes)- with the x axis showing the genes name and not the chromosomal
location.
Thanks a lot,
Einat
ReplyDelete
Replies
UnknownOctober 11, 2013 at 12:53 PM
Hi Stephan,
How I can fix the following error. I want just to plot X & Y chromosome.

Error in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
replacement has 0 rows, data has 28687
ReplyDelete
Replies
UnknownJune 16, 2014 at 11:35 AM
Hi Stephan,

I have 36 chromosome but your limitchromosme limit from1 to 23, how to change the scripts to make the plots for 36 chrs?Thanks.
Lily
ReplyDelete
Replies
anonMarch 12, 2016 at 9:14 PM
Is it possible to add a confidence band to QQ plots created using qqman? If so, what would be the code? Thanks
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.

This blog has moved!

Wednesday, January 20, 2010

GWAS Manhattan plots and QQ plots using ggplot2 in R

39 comments: