Wednesday, January 20, 2010

GWAS Manhattan plots and QQ plots using ggplot2 in R

*** Update April 25, 2011: This code has gone through a major revision. Please see the updated code and tutorial here. ***

Will posted earlier this week about how to produce manhattan plots of GWAS results using Stata, so I thought I'd share how I do this in R using ggplot2.

First, if you've never used ggplot2, you'll need to add it to your R installation by typing:

install.packages("ggplot2")

Once you've done that, copy and paste this command to download the functions I wrote necessary to produce these plots.  If you'd like to see the source code yourself, copy the URL into your web browser.

source("http://dl.dropbox.com/u/66281/0_Permanent/qqman.r")

Next, read in the PLINK results file to a data frame. Substitute plink.qassoc with your own results filename.

mydata=read.table("plink.qassoc", header=TRUE)

Finally, run this function which will produce and save in the current directory both a QQ-plot and a manhattan plot of your results:

qqman(mydata)

QQ-Plot (these are simulated GWAS results):


Manhattan plot (click to enlarge):

A few notes:  First, if you're doing this on a linux machine from your Windows computer, you'll need to be running the previously mentioned XMing server on your Windows computer for the plot to save correctly.  Second, the qqman() function calls the manhattan() function, which is extremely slow and memory-intensive. It may take about 3 minutes to run for each dataset. The memory issue isn't a problem on 64-bit machines, but you may run out of memory on 32-bit machines if you're doing this with GWAS data. I'm going to try to improve this in the future. Finally, using that source command you also downloaded a function I wrote called qqmanall(), which does just what it sounds like - if you run it on a linux machine with no arguments it reads in ALL of the plink GWAS results stored in the current directory, and creates QQ and manhattan plots for all of them with a common upper limit for the y-axis corresponding to the most significant result. Enjoy.

...

Update Thursday, January 21, 2010: I neglected to mention yesterday the format of the plink.assoc or plink.qassoc files, in case you want to produce the same plots using results from another software other than plink. When you load your .assoc files in a data frame, the relevant columns are named "CHR", "BP", and "P". You can use this plotting function as long as you have these three columns in your data frame, regardless of whether you use PLINK or not.

The latest version of the code is reproduced below:

*** Update April 25, 2011: This code has gone through a major revision. Please see the updated code and tutorial here. ***



...

*** Update April 25, 2011: This code has gone through a major revision. Please see the updated code and tutorial here. ***

36 comments:

  1. Hi guys, I am sure if you are aware of R analytic flow (http://www.ef-prime.com/products/ranalyticflow_en/support.html#contact). I am using it now. I think it deserves to be popularized. Have a look please.

    ReplyDelete
  2. Thank you for the code for the manhattan plot.
    However, I think, it does not sort on the chromosome, which is a natural method to do it.
    Also, its gray scale. I would like to see something similar to the SVS Suite.
    Great work guys! I often learn something new from your site.. keep up the good work and may the best sequence win!!
    http://www.goldenhelix.com/SNP_Variation/solutions/gwas/index.html

    ReplyDelete
  3. Hi,
    Has there been any changes to the script. I tried running today and it gave a warning message
    "You requested annotation but provided no SNPlist!"

    ReplyDelete
  4. Should be fixed now. I'm working on a way to annotate the plot by highlighting SNPs in a given list of SNPs. I had turned on an experimental option by default. It's off now.

    ReplyDelete
  5. Dear all,

    Hi, I have been doing gwas analysis using SAS. It's not optimized - the logistic regression generally churns out about ~1000 regressions per minute depending on the dataset (Nsamples, covariates, etc)

    I was just wondering if R is much faster is this regard - Can it run 10000 or 50000 logistic regressions per minute?

    thanks,
    chris

    ReplyDelete
  6. Chris,

    GWAS can very quickly be done in PLINK. I can run regression on 550,000 SNPs in 4000 samples in <10 minutes. I use R to then plot the results of the analysis in PLINK.

    http://pngu.mgh.harvard.edu/~purcell/plink/

    Stephen

    ReplyDelete
  7. Hi, It's only can show 23 chromosome?
    I research focus on dog, there are 39 chromosomes, so, how can I do ? Many thanks.
    Jack.

    ReplyDelete
  8. Jack,

    Copy the code from the URL that you sourced, and comment out line 29 which limits to chromosomes 1-23.

    d=d[d$CHR %in% 1:23, ]

    Stephen

    ReplyDelete
  9. Hello,

    Thanks for sharing this great code. It runs very well and is helpful. I am now trying to overlay a second set of p-values (after adjustment/post-test stuff) and would like to add points of the same colour scheme but a different shape. I have been working with your source code, but haven't had any luck. Any suggestions/help would be much appreciated.

    Thanks again,

    Darren

    ReplyDelete
  10. Darren,

    Try adding another column to your results file, a factor variable that would be something like "raw" and "adjusted" for the different sets of p-values. Then in the qplot function, add the option shape=yourNewVariable.

    Hopefully that helps,

    Stephen

    ReplyDelete
  11. Hey Stephen,

    Your suggestion worked perfectly!

    Thanks for your help,

    Darren

    ReplyDelete
  12. Hi Stephen,

    Great code! I find it very useful for genome-wide plots. I tried to make sepearte plots for each chromosome using the qqmanall() function. However, when using that function, I receive an error:

    Error in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
    replacement has 0 rows, data has 138906

    And R returns to the prompt. I guess there might be something wrong with the input files. Could you provide a hint what the problem could be?

    Kind regrads,

    Lutz

    ReplyDelete
    Replies
    1. Hi Stephen,

      I get this same error message and would love to get to the bottom of this.

      My data is generated by SNPtest, and your program worked fine for the text files in my other outcomes, except one. I've tried reading text and .csv and I get the same error.

      Best,

      Lester Arguelles
      Northwestern University

      Delete
  13. Hi!
    I'm beginner in R and I tested this code. Input file was PLINK .assoc file. Other steps went ok, but the last one qqman(mydata) gives the following error message:

    Error in .Primitive("[")(dots[[1L]][[1L]], dots[[2L]][[1L]]) :
    invalid subscript type 'list'

    Help needed how to make this work. Thanks in advance !

    Päivi

    ReplyDelete
  14. Sorry! My problem was caused by the memory running out. With a more powerful machine, everything worked ok! Great code!

    Päivi

    ReplyDelete
  15. Hi,
    I would like to use your function, but my R crashed running qqman ;( Ich have a dataset with 500 000 PValues. What could be the reason??

    Cheers
    J

    ReplyDelete
  16. I'm new to R but when I input the line:

    source("http://dl.dropbox.com/u/66281/0_Permanent/qqman.r")

    in R, an error comes saying the website is forbidden to be accessed (403 error). Any ideas? Also, whats the best way to run this using a Linux machine?

    ReplyDelete
  17. It looks like you have a problem connecting to the web. I update the source code in that file on the web (in the source command), but you can also copy and paste the source code here into your R terminal and it should work, provided you have ggplot2 installed.

    ReplyDelete
  18. thanks for the great functions. I'm trying to add extra data and noted your previous post

    'Try adding another column to your results file, a factor variable that would be something like "raw" and "adjusted" for the different sets of p-values. Then in the qplot function, add the option shape=yourNewVariable.'

    you couldn't spell this out a little for me could you? I can't see where to add this

    ReplyDelete
  19. hi,

    Can anyone tell me in layman terms what do we try to interpret using Q-Q plot in GWAS

    ReplyDelete
  20. Thanks for the great code! I ran into an error, and figured out that if you have PLINK output with chr23 where the pvalue is "NA" the script will give this error:

    Error in data.frame(x = c(5.99577642700764, 5.51865517228798, 5.29680642267162, :
    arguments imply differing number of rows: 495161, 484419

    If you take out the chr23 probes from the file, the script runs fine.

    ReplyDelete
  21. The code is great but I had the same error as above and I would prefer not to have to hand-remove every SNP across the genome since I'm running a Q-Q genome-wide. Why won't the script accept NA for missing data as I thought R preferred that? I don't want to change it to 0 as that isn't accurate. Thanks for the help!

    ReplyDelete
  22. CFinno - I'm between jobs at the moment - about to start my postdoc. Once I'm there I'll get back to this and put in some checking and remove these lines so you won't have to do it before plotting. One simple way to remove these lines from your plink results is using grep.

    grep -v NA plinkresult.txt > cleanresults.txt

    ReplyDelete
  23. Stephen- No problem I created an Excel macro to get rid of the lines.
    After you are settled into your post-doc (good luck!), can you let me know if there is a way to obtain the Manhattan plots using this script for other species (i.e. with more than 23 chromosomes)? I can input my data, which goes up to 32 chromosomes, but the plot only extends to 23. I've been able to plot it using PLINK so this is more of a comparison than a necessity. Thanks!

    ReplyDelete
  24. hi Stephen:
    thank you for the codes, i got the same error like Lutz.

    Error in `$<-.data.frame`(`*tmp*`, "pos", value = NA) :
    replacement has 1 rows, data has 0
    if you please could help i would be great.
    ibrahim

    ReplyDelete
  25. Please forgive me for my sloth in responding. In reality I can't promise to find time to investigate some of these issues at the moment – I recently started a new job, and the code I post here doesn’t do much for my CV. I hope you understand – I know it's a bit unsatisfactory – I write some software for myself for a quick one-off plot or scripting job, and I put the code on here touting it as useful, but it's not always clear how maintaining it fits into my job description. If you can fix it, or at least work out exactly what's going wrong, please leave a comment. If there are any useful changes I can make to the source code I will be happy to push a new version to GitHub and put the link here.

    I will eventually have to get back to this code because it needs serious optimizing, but it may be a while. My apologies, and thanks for understanding.

    ReplyDelete
  26. I recently made a few bugfixes and rewrote the manhattan plot function using base graphics, which is way faster and more memory efficient. You can grab the code here: https://gist.github.com/929702

    ReplyDelete
  27. *** Update April 25, 2011: This code has gone through a major revision. Please see the updated code and tutorial here. ***

    ReplyDelete
  28. Hi
    I have a manhattan plot with 2 abline standing suggestiveline and genomewide thresholds respectively. I wanted to insert a legend for nominated FDR each line using legend() function, however it returns error.below is my legend()code

    legend('topright',c("FDR =2%","FDR =1%"),lty=1,col=c("blue","red"))

    Thanks,
    Ali

    ReplyDelete
  29. any way to get a chromosome zero in there?

    ReplyDelete
  30. Hi,

    thanks for these nice plots. Good work, well commented, so it's easy to change in a little personalised. :-) And I hope you fell comfortable with your new job?!

    Regards,
    Jessi

    ReplyDelete
  31. Im having this error.

    I have a txt file with this headers and content:
    CHR BP SNP CHI P O Allele
    7 11564000 rs18 5.383549204225 0.020272838021356 97.14 A
    7 11564000 rs18 4.3746157847449 0.03646503291877 97.14 G
    7 11563681 rs19 6.1913037310257 0.012775209303653 98.57 A
    7 11563681 rs19 7.7 0.0054704058252163 98.57 G
    7 11563459 rs21 5.3638095238095 0.020504591038222 90 A
    7 11563027 rs22 5.3638095238095 0.020504591038222 90 G
    7 11564744 rs24 6.8433179723502 0.0088369383198263 90 C


    I thought the R code could extract the CHR and BP and P value. because it worked before, but now it does not. The only thing i did now was to ouput out that file an extra column name V .

    But anyway the code would work as before (thats what i expect), what wrong?

    THe error is this:
    Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 2L, 3L, 4L, 4L, 5L, 5L, :
    max not meaningful for factors

    ReplyDelete
  32. I've made a few changes since this post. Please see the updated code and tutorial.

    ReplyDelete
  33. This comment has been removed by the author.

    ReplyDelete
  34. Hello,
    I am trying to create a simple Manhattan plot for a small list of 200 SNPs
    spread out in the genome in different genes.
    I have tried using your function but it does not work smoothly.
    Can you suggest a simple way to create the plot (not for all 22
    chromosomes)- with the x axis showing the genes name and not the chromosomal
    location.
    Thanks a lot,
    Einat

    ReplyDelete
  35. Hi Stephan,
    How I can fix the following error. I want just to plot X & Y chromosome.

    Error in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
    replacement has 0 rows, data has 28687

    ReplyDelete

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.