Thursday, May 15, 2014

qqman: an R package for creating Q-Q and manhattan plots from GWAS results

Three years ago I wrote a blog post on how to create manhattan plots in R. After hundreds of comments pointing out bugs and other issues, I've finally cleaned up this code and turned it into an R package.

The qqman R package is on CRAN: http://cran.r-project.org/web/packages/qqman/

The source code is on GitHub: https://github.com/stephenturner/qqman

If you'd like to cite the qqman package (appreciated but not required), please cite this pre-print: Turner, S.D. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. biorXiv DOI: 10.1101/005165 (2014).

Something wrong? Please file bug reports, feature requests, or anything else related to the code as an issue on GitHub rather than commenting here. Also, please post the code you're using and some example data causing a failure in a publicly accessible place, such as a GitHub gist (no registration required). It's difficult to troubleshoot if I can't see the data where the code is failing. Want to contribute? Awesome! Send me a pull request.

Note: This release is substantially simplified for the sake of maintainability and creating an R package. The old code that allows confidence intervals on the Q-Q plot and allows more flexible annotation and highlighting is still available at the version 0.0.0 release on GitHub.

Here's a shout-out to all the blog commenters on the previous post for pointing out bugs and other issues, and a special thanks to Dan Capurso and Tim Knutsen for useful contributions and bugfixes.


qqman package tutorial

First things first. Install the package (do this only once), then load the package (every time you start a new R session)
# only once:
install.packages("qqman")

# each time:
library(qqman)
You can access this help any time from within R by accessing the vignette:
vignette("qqman")
The qqman package includes functions for creating manhattan plots and q-q plots from GWAS results. The gwasResults data.frame included with the package has simulated results for 16,470 SNPs on 22 chromosomes. Take a look at the data:
str(gwasResults)
'data.frame':   16470 obs. of  4 variables:
 $ SNP: chr  "rs1" "rs2" "rs3" "rs4" ...
 $ CHR: int  1 1 1 1 1 1 1 1 1 1 ...
 $ BP : int  1 2 3 4 5 6 7 8 9 10 ...
 $ P  : num  0.915 0.937 0.286 0.83 0.642 ...
head(gwasResults)
  SNP CHR BP      P
1 rs1   1  1 0.9148
2 rs2   1  2 0.9371
3 rs3   1  3 0.2861
4 rs4   1  4 0.8304
5 rs5   1  5 0.6417
6 rs6   1  6 0.5191
tail(gwasResults)
          SNP CHR  BP      P
16465 rs16465  22 530 0.5644
16466 rs16466  22 531 0.1383
16467 rs16467  22 532 0.3937
16468 rs16468  22 533 0.1779
16469 rs16469  22 534 0.2393
16470 rs16470  22 535 0.2630
How many SNPs on each chromosome?
as.data.frame(table(gwasResults$CHR))
   Var1 Freq
1     1 1500
2     2 1191
3     3 1040
4     4  945
5     5  877
6     6  825
7     7  784
8     8  750
9     9  721
10   10  696
11   11  674
12   12  655
13   13  638
14   14  622
15   15  608
16   16  595
17   17  583
18   18  572
19   19  562
20   20  553
21   21  544
22   22  535

Creating manhattan plots

Now, let's make a basic manhattan plot.
manhattan(gwasResults)

We can also pass in other graphical parameters. Let's add a title (main=), reduce the point size to 50%(cex=), and reduce the font size of the axis labels to 80% (cex.axis=):
manhattan(gwasResults, main = "Manhattan Plot", cex = 0.5, cex.axis = 0.8)

Let's change the colors and increase the maximum y-axis:
manhattan(gwasResults, col = c("blue4", "orange3"), ymax = 12)

Let's remove the suggestive and genome-wide significance lines:
manhattan(gwasResults, suggestiveline = F, genomewideline = F)

Let's look at a single chromosome:
manhattan(subset(gwasResults, CHR == 1))

Let's highlight some SNPs of interest on chromosome 3. The 100 SNPs we're highlighting here are in a character vector called snpsOfInterest. You'll get a warning if you try to highlight SNPs that don't exist.
str(snpsOfInterest)
 chr [1:100] "rs3001" "rs3002" "rs3003" "rs3004" "rs3005" ...
manhattan(gwasResults, highlight = snpsOfInterest)

We can combine highlighting and limiting to a single chromosome:
manhattan(subset(gwasResults, CHR == 3), highlight = snpsOfInterest, main = "Chr 3")

A few notes on creating manhattan plots:
  • Run str(gwasResults). Notice that the gwasResults data.frame has SNP, chromosome, position, and p-value columns named SNP, CHR, BP, and P. If you're creating a manhattan plot and your column names are different, you'll have to pass the column names to the chr=, bp=, p=, and snp= arguments. See help(manhattan) for details.
  • The chromosome column must be numeric. If you have “X,” “Y,” or “MT” chromosomes, you'll need to rename these 23, 24, 25, etc.
  • If you'd like to change the color of the highlight or the suggestive/genomewide lines, you'll need to modify the source code. Search for col="blue", col="red", or col="green3" to modify the suggestive line, genomewide line, and highlight colors, respectively.

Creating Q-Q plots

Creating Q-Q plots is straightforward - simply supply a vector of p-values to the qq() function. You can optionally provide a title.
qq(gwasResults$P, main = "Q-Q plot of GWAS p-values")


28 comments:

  1. Great work Stephen!
    Really like the updated source code with friendly comments.
    For my own work I'll try to implement colouring of the SNPs based on LD with the leading SNP.

    Tim

    ReplyDelete
    Replies
    1. Great - and don't hesitate to send a PR if you do. And I fully intend on re-integrating your and Dan's contributions - I just had to simplify code to the point where I felt comfortable documenting and maintaining before I could release as a package.

      Delete
  2. This is a really useful form of your previous work. One thing that many might like tweaked and is good for some low level users is an argument to not -logten transform the data. There are many types of genome wide results that this function is great for. it was pretty easy to tweek the function myself, but not everyone is comfortable doing that. Thanks!

    ReplyDelete
    Replies
    1. Good call. Looking at raw p-values in a manhattan plot isn't very useful, but looking at other types of scores, e.g., test statistics, peak heights, bayes factors, etc. might be useful when not -log10 transformed. Would you mind submitting this as an issue on GitHub? Thanks.

      Delete
    2. This issue is now resolved on the dev branch on github.

      Delete
  3. Hi Stephen,
    Thanks a lot for sharing this. Is there any information that you know about the default resolution of the plots?

    ReplyDelete
    Replies
    1. That's controlled with the options used for saving the image

      png("file.png", width=2000, height=1000, pointsize=18, ...)
      manhattan(...)
      dev.off()

      Delete
  4. Hi Stephen,

    I am trying to bastardise your fine work above to plot differences in alt alleles in SNPs between phenotypes (rightly or wrongly...). I had a look at the code but it wasn't immediately apparent where you plot and colour CHR so as to be able to visually discriminate between each in turn. I am a little short on time so thought I would ask if you could point out to me? Sorry for not RTFMing=P

    Bruce.

    ReplyDelete
    Replies
    1. Bruce, take a look at the source code. I believe you'll need to mod lines 149 and 156-160. Basically, I'm creating a vector of color choices longer than I'll need on line 149, then alternating through them with the for loop. Inelegant, but it works.

      Delete
    2. Great, thanks for that Stephen.

      Delete
  5. Hi Stephen,

    Thanks for the useful code and package. Sorry for the simple question - but how do I go about calculating the lambda and adding to the plot ?

    Thanks in advance,

    Luanna

    ReplyDelete
  6. Brilliant, just in time to cite it in a paper :D

    ReplyDelete
  7. Hi Stephen,

    Is there an option to modify the width of the points?, I looked into the code and it says pch= 20 but I don't know how to change it, I tried to enter "pch = 10 " in the function but an error appeared saying that "pch" matched by multiple actual arguments.

    Berna

    ReplyDelete
  8. Hi Stephen

    Some chromosome may have blank gaps as the SNPs within some region are not genotyped, you need to address that in your script...

    ReplyDelete
    Replies
    1. Yuan: how about posting some example data as a github gist on the qqman issues page? Thanks.

      Delete
  9. Hi Stephen,

    Thanks for such a beautiful script, and for such clear commenting! I am wondering if there is a way to alter the source code so that the ticks demarcating chromosomes fall at the outer boundaries of the chromosomes, rather than the middle, and the chromosome numbers fall in the middle? In the plots I am preparing I am not alternating colours of the chromosomes, and it is difficult to discern the boundaries between chromosomes with the ticks falling in the centre. I have tried fiddling around with the source code, but I am still fairly new to R and have not had any success.

    Jennie

    ReplyDelete
  10. Hi Stephen,

    Thanks for the useful package - I have been hacking the GenABEL plot function on and off over the years and yours is really nice and clean.

    In one of my OCD moments I noticed that the x axes tick markers weren't centered on the chromosomes, so I went digging through the source. It seems that you've made an assumption that SNPs are evenly distributed across chromosomes, which is probably correct 99.9% of the time. However a student in our lab was interested in a subset of ~16K SNPs in specific protein coding genes, which don't follow an even distribution (obviously a very specific use case).

    I believe that a small change to line 127;
    from:

    ticks=c(ticks, d[d$index==i, ]$pos[floor(length(d[d$index==i, ]$pos)/2)+1])
    # this takes the middle value of the number of observed positions, which isn't always going to be the 'mean' chromosomal position

    to:

    ticks = c(ticks, (min(d[d$CHR == i,]$pos) + max(d[d$CHR == i,]$pos))/2 + 1)
    # this modification should identify the 'true' mean chromosomal position

    This fixes the issue with various data I've run through. I don't imagine the modification would cause issues down the track... but don't quote me on that!

    Please disregard if I'm talking nonsense!

    - Miles

    ReplyDelete
    Replies
    1. Miles, thanks so much for the fix. I haven't tested it out yet. Would you mind uploading some sample data to gist.github.com so I can test this out to see what you're dealing with? Or you can just email me if it's not too large. I'd love to include this in the next update. Thanks.

      Delete
    2. No problem Stephen. I have sent through some example data via email, I hope it is useful. Please let me know if you would like me to test anything.

      Delete
  11. Hi Stephen,

    thanks for the great package!
    I'm using it to plot my GWAS data and I was wondering if there is a command to add the rs numbers to SNPs that are above a certain threshold. (in my case the p=10^-5 line)
    I have a basic understanding of R, but I cannot do any programming myself.
    Could you help my with this?
    Thanks in advance,

    Roos

    ReplyDelete
    Replies
    1. Roos - this is literally next on my to-do list with this code. I'll post here when it's updated. Thanks.

      Delete
    2. That would be great!!
      Looking forward to it.
      Thanks!

      Delete
  12. Hi Stephan,
    Very nice. I found the single chromosome results came out smashed on a vertical line.
    This seems to be because "d$pos=d$BP/1e6" in manhattan.R (the x axis is in megabytes, even if x limits have been specified)
    I know you have highlighted ggplot2 in the past. Any reason for not using it here?
    Thanks,
    Aaron

    ReplyDelete
    Replies
    1. Hm, didn't consider that when merging this PR. Will revisit.

      Delete
  13. Hi Stephan,

    For some reason, when I try to change the colors for my plot, I get the following error. Do you know why?

    manhattan(data_test, ymax=8, col=c("blue4","orange3"))


    Error in localWindow(xlim, ylim, log, asp, ...) :
    formal argument "col" matched by multiple actual arguments

    ReplyDelete
    Replies
    1. hi @jmtoung, please post this as an issue on the issue tracker on GitHub. Thanks.

      Delete

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.