Thursday, May 15, 2014

qqman: an R package for creating Q-Q and manhattan plots from GWAS results

Three years ago I wrote a blog post on how to create manhattan plots in R. After hundreds of comments pointing out bugs and other issues, I've finally cleaned up this code and turned it into an R package.

The qqman R package is on CRAN: http://cran.r-project.org/web/packages/qqman/

The source code is on GitHub: https://github.com/stephenturner/qqman

If you'd like to cite the qqman package (appreciated but not required), please cite this pre-print: Turner, S.D. qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. biorXiv DOI: 10.1101/005165 (2014).

Something wrong? Please file bug reports, feature requests, or anything else related to the code as an issue on GitHub rather than commenting here. Also, please post the code you're using and some example data causing a failure in a publicly accessible place, such as a GitHub gist (no registration required). It's difficult to troubleshoot if I can't see the data where the code is failing. Want to contribute? Awesome! Send me a pull request.

Note: This release is substantially simplified for the sake of maintainability and creating an R package. The old code that allows confidence intervals on the Q-Q plot and allows more flexible annotation and highlighting is still available at the version 0.0.0 release on GitHub.

Here's a shout-out to all the blog commenters on the previous post for pointing out bugs and other issues, and a special thanks to Dan Capurso and Tim Knutsen for useful contributions and bugfixes.


qqman package tutorial

First things first. Install the package (do this only once), then load the package (every time you start a new R session)
# only once:
install.packages("qqman")

# each time:
library(qqman)
You can access this help any time from within R by accessing the vignette:
vignette("qqman")
The qqman package includes functions for creating manhattan plots and q-q plots from GWAS results. The gwasResults data.frame included with the package has simulated results for 16,470 SNPs on 22 chromosomes. Take a look at the data:
str(gwasResults)
'data.frame':   16470 obs. of  4 variables:
 $ SNP: chr  "rs1" "rs2" "rs3" "rs4" ...
 $ CHR: int  1 1 1 1 1 1 1 1 1 1 ...
 $ BP : int  1 2 3 4 5 6 7 8 9 10 ...
 $ P  : num  0.915 0.937 0.286 0.83 0.642 ...
head(gwasResults)
  SNP CHR BP      P
1 rs1   1  1 0.9148
2 rs2   1  2 0.9371
3 rs3   1  3 0.2861
4 rs4   1  4 0.8304
5 rs5   1  5 0.6417
6 rs6   1  6 0.5191
tail(gwasResults)
          SNP CHR  BP      P
16465 rs16465  22 530 0.5644
16466 rs16466  22 531 0.1383
16467 rs16467  22 532 0.3937
16468 rs16468  22 533 0.1779
16469 rs16469  22 534 0.2393
16470 rs16470  22 535 0.2630
How many SNPs on each chromosome?
as.data.frame(table(gwasResults$CHR))
   Var1 Freq
1     1 1500
2     2 1191
3     3 1040
4     4  945
5     5  877
6     6  825
7     7  784
8     8  750
9     9  721
10   10  696
11   11  674
12   12  655
13   13  638
14   14  622
15   15  608
16   16  595
17   17  583
18   18  572
19   19  562
20   20  553
21   21  544
22   22  535

Creating manhattan plots

Now, let's make a basic manhattan plot.
manhattan(gwasResults)

We can also pass in other graphical parameters. Let's add a title (main=), reduce the point size to 50%(cex=), and reduce the font size of the axis labels to 80% (cex.axis=):
manhattan(gwasResults, main = "Manhattan Plot", cex = 0.5, cex.axis = 0.8)

Let's change the colors and increase the maximum y-axis:
manhattan(gwasResults, col = c("blue4", "orange3"), ymax = 12)

Let's remove the suggestive and genome-wide significance lines:
manhattan(gwasResults, suggestiveline = F, genomewideline = F)

Let's look at a single chromosome:
manhattan(subset(gwasResults, CHR == 1))

Let's highlight some SNPs of interest on chromosome 3. The 100 SNPs we're highlighting here are in a character vector called snpsOfInterest. You'll get a warning if you try to highlight SNPs that don't exist.
str(snpsOfInterest)
 chr [1:100] "rs3001" "rs3002" "rs3003" "rs3004" "rs3005" ...
manhattan(gwasResults, highlight = snpsOfInterest)

We can combine highlighting and limiting to a single chromosome:
manhattan(subset(gwasResults, CHR == 3), highlight = snpsOfInterest, main = "Chr 3")

A few notes on creating manhattan plots:
  • Run str(gwasResults). Notice that the gwasResults data.frame has SNP, chromosome, position, and p-value columns named SNP, CHR, BP, and P. If you're creating a manhattan plot and your column names are different, you'll have to pass the column names to the chr=, bp=, p=, and snp= arguments. See help(manhattan) for details.
  • The chromosome column must be numeric. If you have “X,” “Y,” or “MT” chromosomes, you'll need to rename these 23, 24, 25, etc.
  • If you'd like to change the color of the highlight or the suggestive/genomewide lines, you'll need to modify the source code. Search for col="blue", col="red", or col="green3" to modify the suggestive line, genomewide line, and highlight colors, respectively.

Creating Q-Q plots

Creating Q-Q plots is straightforward - simply supply a vector of p-values to the qq() function. You can optionally provide a title.
qq(gwasResults$P, main = "Q-Q plot of GWAS p-values")


59 comments:

  1. Great work Stephen!
    Really like the updated source code with friendly comments.
    For my own work I'll try to implement colouring of the SNPs based on LD with the leading SNP.

    Tim

    ReplyDelete
    Replies
    1. Great - and don't hesitate to send a PR if you do. And I fully intend on re-integrating your and Dan's contributions - I just had to simplify code to the point where I felt comfortable documenting and maintaining before I could release as a package.

      Delete
  2. This is a really useful form of your previous work. One thing that many might like tweaked and is good for some low level users is an argument to not -logten transform the data. There are many types of genome wide results that this function is great for. it was pretty easy to tweek the function myself, but not everyone is comfortable doing that. Thanks!

    ReplyDelete
    Replies
    1. Good call. Looking at raw p-values in a manhattan plot isn't very useful, but looking at other types of scores, e.g., test statistics, peak heights, bayes factors, etc. might be useful when not -log10 transformed. Would you mind submitting this as an issue on GitHub? Thanks.

      Delete
    2. This issue is now resolved on the dev branch on github.

      Delete
  3. Hi Stephen,
    Thanks a lot for sharing this. Is there any information that you know about the default resolution of the plots?

    ReplyDelete
    Replies
    1. That's controlled with the options used for saving the image

      png("file.png", width=2000, height=1000, pointsize=18, ...)
      manhattan(...)
      dev.off()

      Delete
  4. Hi Stephen,

    I am trying to bastardise your fine work above to plot differences in alt alleles in SNPs between phenotypes (rightly or wrongly...). I had a look at the code but it wasn't immediately apparent where you plot and colour CHR so as to be able to visually discriminate between each in turn. I am a little short on time so thought I would ask if you could point out to me? Sorry for not RTFMing=P

    Bruce.

    ReplyDelete
    Replies
    1. Bruce, take a look at the source code. I believe you'll need to mod lines 149 and 156-160. Basically, I'm creating a vector of color choices longer than I'll need on line 149, then alternating through them with the for loop. Inelegant, but it works.

      Delete
    2. Great, thanks for that Stephen.

      Delete
  5. Hi Stephen,

    Thanks for the useful code and package. Sorry for the simple question - but how do I go about calculating the lambda and adding to the plot ?

    Thanks in advance,

    Luanna

    ReplyDelete
  6. Brilliant, just in time to cite it in a paper :D

    ReplyDelete
  7. Hi Stephen,

    Is there an option to modify the width of the points?, I looked into the code and it says pch= 20 but I don't know how to change it, I tried to enter "pch = 10 " in the function but an error appeared saying that "pch" matched by multiple actual arguments.

    Berna

    ReplyDelete
  8. Hi Stephen

    Some chromosome may have blank gaps as the SNPs within some region are not genotyped, you need to address that in your script...

    ReplyDelete
    Replies
    1. Yuan: how about posting some example data as a github gist on the qqman issues page? Thanks.

      Delete
  9. Hi Stephen,

    Thanks for such a beautiful script, and for such clear commenting! I am wondering if there is a way to alter the source code so that the ticks demarcating chromosomes fall at the outer boundaries of the chromosomes, rather than the middle, and the chromosome numbers fall in the middle? In the plots I am preparing I am not alternating colours of the chromosomes, and it is difficult to discern the boundaries between chromosomes with the ticks falling in the centre. I have tried fiddling around with the source code, but I am still fairly new to R and have not had any success.

    Jennie

    ReplyDelete
  10. Hi Stephen,

    Thanks for the useful package - I have been hacking the GenABEL plot function on and off over the years and yours is really nice and clean.

    In one of my OCD moments I noticed that the x axes tick markers weren't centered on the chromosomes, so I went digging through the source. It seems that you've made an assumption that SNPs are evenly distributed across chromosomes, which is probably correct 99.9% of the time. However a student in our lab was interested in a subset of ~16K SNPs in specific protein coding genes, which don't follow an even distribution (obviously a very specific use case).

    I believe that a small change to line 127;
    from:

    ticks=c(ticks, d[d$index==i, ]$pos[floor(length(d[d$index==i, ]$pos)/2)+1])
    # this takes the middle value of the number of observed positions, which isn't always going to be the 'mean' chromosomal position

    to:

    ticks = c(ticks, (min(d[d$CHR == i,]$pos) + max(d[d$CHR == i,]$pos))/2 + 1)
    # this modification should identify the 'true' mean chromosomal position

    This fixes the issue with various data I've run through. I don't imagine the modification would cause issues down the track... but don't quote me on that!

    Please disregard if I'm talking nonsense!

    - Miles

    ReplyDelete
    Replies
    1. Miles, thanks so much for the fix. I haven't tested it out yet. Would you mind uploading some sample data to gist.github.com so I can test this out to see what you're dealing with? Or you can just email me if it's not too large. I'd love to include this in the next update. Thanks.

      Delete
    2. No problem Stephen. I have sent through some example data via email, I hope it is useful. Please let me know if you would like me to test anything.

      Delete
  11. Hi Stephen,

    thanks for the great package!
    I'm using it to plot my GWAS data and I was wondering if there is a command to add the rs numbers to SNPs that are above a certain threshold. (in my case the p=10^-5 line)
    I have a basic understanding of R, but I cannot do any programming myself.
    Could you help my with this?
    Thanks in advance,

    Roos

    ReplyDelete
    Replies
    1. Roos - this is literally next on my to-do list with this code. I'll post here when it's updated. Thanks.

      Delete
    2. That would be great!!
      Looking forward to it.
      Thanks!

      Delete
    3. Version 0.1.3 now has the ability to highlight SNPs below a particular threshold or annotate top SNPs on each chromosome. The release is already live on GitHub, and should propagate to CRAN within a day or two.

      Delete
    4. Is there a way to increase the point size of just the highlighted SNPs? They're slightly obscured when plotting millions of SNPs.

      Delete
  12. Hi Stephan,
    Very nice. I found the single chromosome results came out smashed on a vertical line.
    This seems to be because "d$pos=d$BP/1e6" in manhattan.R (the x axis is in megabytes, even if x limits have been specified)
    I know you have highlighted ggplot2 in the past. Any reason for not using it here?
    Thanks,
    Aaron

    ReplyDelete
    Replies
    1. Hm, didn't consider that when merging this PR. Will revisit.

      Delete
    2. Any update on this? It still displays single chromosomes in a vertical line due to the position on the x axis being in Mb.

      Delete
  13. Hi Stephan,

    For some reason, when I try to change the colors for my plot, I get the following error. Do you know why?

    manhattan(data_test, ymax=8, col=c("blue4","orange3"))


    Error in localWindow(xlim, ylim, log, asp, ...) :
    formal argument "col" matched by multiple actual arguments

    ReplyDelete
  14. Hi Stephen,

    Thank you for creating such useful code! I am trying to plot Likelihood values from a non-human species. I am finding that the option "chrlabs=c(1:38)" is not changing the chromosome labels on my plot. It also does not work if I use the option "limitchromosomes=1:38". Also, I am using the logp=FALSE flag since I want to plot just the likelihood values. There is no difference in my plot if I use logp=FALSE or logp=TRUE and the log10 values are being plotted in both cases.

    Rena

    ReplyDelete
  15. Hi Stephen,

    Thank you for creating usefull tool. I am new to R and manhattan plot is my first assignment. I am trying to adjust y axis for plot and its giving me warning " In plot.xy(xy.coords(x, y), type = type, ...) :
    "ymin" is not a graphical parameter"

    How do i adjust Y axis ?
    Thank you,
    Lata

    ReplyDelete
    Replies
    1. Dear Stephen, many thanks for a useful and practical package, but the ylim=c(a,b) seems to be unresponsive to any values I feed it?

      many thanks, Isi

      Delete
    2. Please post your example data and code as an issue on the github issue tracker. Thanks.

      Delete
  16. Hi Stephen,

    Useful package! Maybe you can add an option to the manhattan plot to show the rsid's of the highlighted SNPs (at least if there not too much SNPs). For example add:

    with(d.highlight, text(pos, logp, SNP, col = "green3",
    ...))

    Or use an additional argument to the manhattan function.

    Cheers,
    Maarten

    ReplyDelete
  17. Thanks a lot!!!!

    This it's going to be very usefull to me (and easier that using the entire code)

    ReplyDelete
  18. Hi Stephen,
    Thanks for this useful package.
    I have a persistent error "Error in plot.window(...) : need finite 'ylim' values" which I cant seem to eliminate.
    Some of my previous examples worked fine.

    ReplyDelete
    Replies
    1. open an issue on github and leave some example code and data as a github gist.

      Delete
    2. I am having the same issue with my dataset. My dataset is an imputed list of snps and I consistently get the Error in plot.window(...) : need finite 'ylim' values -- error message
      also using ymax = 10 I get the following response from R In plot.window(...) : "ymax" is not a graphical parameter

      I can't see where the person has placed the problem on github


      Delete
    3. Please open an issue on github and leave a reproducible example as a github gist, with data.

      Delete
    4. Fredrick and "Unknown"
      I had similar issues which I figured out.

      The Error in plot window.... is because you probably have NAs in your data.
      Solution: Get rid of them with >results <- na.omit(results)

      "ymax" error

      Solution: use "ylim" not "ymax"

      Delete
  19. Thanks for a very useful package. I have a naive question in that I am not entirely certain what the convention is for a suggestive threshold (maybe because I work with plants?). Any comments or direction to literature would be appreciated.

    ReplyDelete
  20. This comment has been removed by the author.

    ReplyDelete
  21. Hi. while trying qqman I'm running into the following error "Error in plot.window(...) : need finite 'ylim' values". I don't have any NAs in my data. Is there something i'm doing wrong here?

    ReplyDelete
    Replies
    1. Please open an issue on github and leave a reproducible example as a github gist, with data.

      Delete
  22. Hello Stephen,
    Thank you for this useful package (I'm really happy with mine manhattan plots). =)
    I want to know if is possible to create a qq plot with multiples pvalue vectors? I mean, I'm trying to make a plot to compare different models for association analysis similar to the one in: http://www.sciencedirect.com/science/article/pii/S0168945212001628#fig0020
    Thanks!!!


    ReplyDelete
    Replies
    1. Pamela, certainly possible, but would require some retooling. The easiest way to do this now would be to use ggplot2.

      Delete
  23. It works fine with this small data. But I can't see any highlighted SNPs with a bigger data file. Does it have something to do with the P-value or do i need to scale the graph? In the attached file "Highlight.txt" I can only see the top 5 SNPs in the manhattan plot and not the bottom 5.
    https://gist.github.com/4cea1b9ba9bf3081a065.git

    ReplyDelete
  24. This is a great package, thank you so much. I'm just wondering if there is a way to annotate the genome-wide significant probes/SNPs with specific data from another column in my dataframe (for example, closest gene or rsID)?
    Many thanks

    ReplyDelete
    Replies
    1. I think you could just pass in the other variable as a vector to highlight. e.g., qqman(..., highlight=mydata$mycol)

      Delete
    2. Many thanks for your quick reply. I've realised that I didn't actually explain my question very well. What I mean by 'annotate' is that I would like text on the manhattan plot. For example, for the highly significant probes/SNPs I would like text stating the gene or rsID. Thanks for your help.

      Delete
    3. Ah, I see what you're saying. Unfortunately, no, that isn't implemented yet. But I'd recommend taking a look at textxy in the calibrate package - you could probably hack this into the manhattan source code somewhere that would work... http://www.inside-r.org/packages/cran/calibrate/docs/textxy

      Delete
  25. I'd recommend everyone go over and look at manhattanly - an R package based on qqman that allows for interactive manhattan plots. http://moderndata.plot.ly/manhattanly-r-package-for-interactive-manhattan-plots/

    ReplyDelete
  26. Nice little package Stephen! Can't believe this is my first time doing these plots after so many bioinformatics years, but this made the intro easy.

    ReplyDelete
  27. Hello Stephen I liked your package so much, but I found a problem when I use the manhattan(subset(gwasResults, CHR == 1)). I would like to now if when I am working with the subset of a CHR, I cant use the option to use another variable in the place of P-Value. Like in the
    manhattan(gwasResults, p = "zscore", logp = FALSE, ylab = "Z-score", genomewideline = FALSE, suggestiveline = FALSE, main = "Manhattan plot of Z-scores").
    Thanks in advance!

    ReplyDelete

Note: Only a member of this blog may post a comment.

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.