Monday, April 25, 2011

Annotated Manhattan plots and QQ plots for GWAS using R, Revisited

Last year I showed you how to create manhattan plots, and later how to highlight regions of interest, using ggplot2 in R. The code was slow, required a lot of memory, and was difficult to maintain and modify.

I finally found time to rewrite the code using base graphics rather than ggplot2. The code is now much faster, and if you're familiar with base R's plot options and graphical parameters, most of these can now be passed to the functions to tweak the plots' appearance. The code also behaves differently depending on whether you have results for one or more than one chromosome.

Here's a quick demo.

First, either copy and paste the code from GitHub, or run the following command in R to source the function from the web:

source("http://www.StephenTurner.us/qqman.r")

Next, load some GWAS results, and take a look at the relevant columns. This is standard output from PLINK's --assoc option. This may take a minute or so (~ 20MB file):

results <- read.table("http://www.StephenTurner.us/plinkresults.assoc",T)

head(subset(results, select=c(SNP, CHR, BP, P)))

The manhattan function assumes you have columns named SNP, CHR, BP, and P, corresponding to the SNP name (rs number), chromosome number, genomic coordinate, and p-value. Missing values (where regression failed to converge, for instance) should be recoded NA before reading into R. Do this with a quick sed command. Here's what the data looks like:

         SNP CHR        BP       P
1 rs10495434   1 235800006 0.62220
2  rs6689417   1  46100028 0.06195
3  rs3897197   1 143700035 0.10700
4  rs2282450   1 202300047 0.47280
5   rs567279   1  66400050      NA
6 rs11208515   1  64900051 0.53430


Now, create a basic manhattan plot (click the image to enlarge):

manhattan(results)



If you type args(manhattan) you can see the options you can set. Here are a few:

colors: this is a character vector specifying the colors to cycle through for coloring each point. Here's a PDF chart of R's color names.

ymax: this is the y-axis limit. If ymax="max" (default), the y-axis will always be a little bit higher than the most significant -log10(p-value). Otherwise you can set this value yourself.

cex.x.axis: this can be used to shrink the x-axis labels by setting this value less than 1. This is handy if some of the tick labels aren't showing up because the plot region is too small.

*** Update March 9 2012*** cex.x.axis is deprecated. To change the x-axis size, use the default base graphics argument cex.axis.

limitchromosomes: you can limit which chromosomes you want to display. By default this restricts the plot to chromosomes 1-23(x).

suggestiveline and genomewideline: set these to FALSE if you don't want threshold lines, or change the thresholds yourself.

annotate: by default this is undefined. If you supply a character vector of SNP names (e.g. rs numbers), any SNPs in the results data frame that also show up here will be highlighted in green by default. example below.

... : The dot-dot-dot means you can pass most other plot or graphical parameters to these functions (e.g. main, cex, pch, etc).

Make a better looking manhattan plot. Change the plot colors, point shape, and remove the threshold lines:

manhattan(results, colors=c("black","#666666","#CC6600"), pch=20, genomewideline=F, suggestiveline=F)


Now, read in a text file with SNP names that you want to highlight, then make a manhattan plot highlighting those SNPs, and give the plot a title:

snps_to_highlight <- scan("http://www.StephenTurner.us/snps.txt", character())

manhattan(results, annotate=snps_to_highlight, pch=20, main="Manhattan Plot")



Finally, zoom in and plot only the results for chromosome 11, still highlighting those results. Notice that the x-axis changes from chromosome to genomic coordinate.

manhattan(subset(results, CHR==11), pch=20, annotate=snps_to_highlight, main="Chromosome 11")



Finally, make a quantile-quantile plot of the p-values. To make a basic qq-plot of the p-values, pass the qq() function a vector of p-values:

qq(results$P)


Perhaps we should have made the qq-plot first, as it looks like we might have some unaccounted-for population stratification or other bias.

The code should run much faster and use less memory than before. All the old functions that use ggplot2 are still available, now prefixed with "gg." Please feel free to use, modify, and redistribute, but kindly link back to this post. Here's the code:

67 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. This is fantastic! Just what I was looking for. I stumbled across your earlier version this afternoon, and was having difficulty getting it to work. The new version worked beautifully and gave me the plot I was dreading having to code up myself from scratch. Could not have come a at a better time! Thanks!

    ReplyDelete
  3. Just wanted to add: great! Really, thanks a bunch. It works like a charm...
    I've added some code so it'll automatically outputs .EPS-files.

    Only thing I'm looking for now, is to annotate the top hits (-log10(P)>=8) with the gene name... That would be great!

    Thanks!

    ReplyDelete
  4. I have run it and this code is very cooool
    Also, a quick question, when I creat my own results and followed the format of “SNP CHR BP P”
    However, R returns
    “Error in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
    replacement has 0 rows, data has 2”

    Anybody knows why?
    Thanks

    ReplyDelete
  5. I'm guessing the error might be because you have something other than a number in the p-value column (maybe a NaN?). This column can only be a number between 0 and 1 or NA for missing. Next time I update the code I'll add in some checks for this.

    ReplyDelete
  6. Thank you!

    I started learning how to run GWAS two hours ago, and thanks to this I'm already looking at some preliminary results.

    Science has never been simpler! ;)

    ReplyDelete
  7. Thanks for the nice script! While using it I found out, that it gives an error when there are no snps for a specific Chromosome.

    ReplyDelete
  8. Many thanks! This is great!
    One thing though: when I try to use the "ymax" in order to limit the Y axis values, it doesn't really work. maybe I'm using it the wrong way.
    Here's what I typed:

    manhattan(results, ymax=5, limitchromosomes=F, colors=c("black","#666666","#CC6600"), pch=20, genomewideline=F, suggestiveline=F)

    Thanks!

    ReplyDelete
  9. Sagi, I think I see your problem here. These two lines gave it the behavior I desired:

    if (ymax=="max") ymax<-ceiling(max(d$logp))
    if (ymax<8) ymax <- 8

    ... which is to say, if ymax="max", make the maximum value on the y axis at least one point higher on the -log10p scale than the most significant SNP. But if the max y was less than 8, set it at 8 (because genome-wide significance is conventionally 5e-8).

    If you want this code to behave the way you want, remove the second line that resets ymax to 8 if it's less than that.

    ReplyDelete
  10. Hi,

    As many wrote: thanks a bunch! I'm having difficulty getting the genome-wide significance at 5e-8. It just won't get to 8, it's always a little off... Any clues?
    I changed your code a little, but only in making the line for instance dotted. Here's the code:

    # manhattan plot using base graphics
    manhattan = function(dataframe, colors=c("black", "gray50"), ymax="max", cex.x.axis=0.8, limitchromosomes=1:23, suggestiveline=-log10(1e-4), genomewideline=-log10(5e-8), annotate=NULL, ...) {

    d=dataframe
    if (!("CHR" %in% names(d) & "BP" %in% names(d) & "P" %in% names(d))) stop("Make sure your data frame contains columns CHR, BP, and P")

    if (any(limitchromosomes)) d=d[d$CHR %in% limitchromosomes, ]
    d=subset(na.omit(d[order(d$CHR, d$BP), ]), (P>0 & P<=1)) # remove na's, sort, and keep only 0<P<=1
    d$logp = -log10(d$P)
    d$pos=NA
    ticks=NULL
    lastbase=0
    colors <- rep(colors,max(d$CHR))[1:max(d$CHR)]
    if (ymax=="max") ymax<-ceiling(max(d$logp))
    if (ymax<9) ymax<-9

    numchroms=length(unique(d$CHR))
    if (numchroms==1) {
    d$pos=d$BP
    ticks=floor(length(d$pos))/2+1
    } else {
    for (i in unique(d$CHR)) {
    if (i==1) {
    d[d$CHR==i, ]$pos=d[d$CHR==i, ]$BP
    } else {
    lastbase=lastbase+tail(subset(d,CHR==i-1)$BP, 1)
    d[d$CHR==i, ]$pos=d[d$CHR==i, ]$BP+lastbase
    }
    ticks=c(ticks, d[d$CHR==i, ]$pos[floor(length(d[d$CHR==i, ]$pos)/2)+1])
    }
    }

    if (numchroms==1) {
    with(d, plot(pos, logp, ylim=c(0,ymax), ylab=expression(-log[10](italic(P))), xlab=paste("Chromosome",unique(d$CHR),"position"), ...))
    } else {
    with(d, plot(pos, logp, ylim=c(0,ymax), ylab=expression(-log[10](italic(P))), xlab="Chromosome", xaxt="n", type="n", ...))
    axis(1, at=ticks, lab=unique(d$CHR), cex.axis=cex.x.axis)
    icol=1
    for (i in unique(d$CHR)) {
    with(d[d$CHR==i, ],points(pos, logp, col=colors[icol], ...))
    icol=icol+1
    }
    }

    if (!is.null(annotate)) {
    d.annotate=d[which(d$SNP %in% annotate), ]
    with(d.annotate, points(pos, logp, col="darkorange", ...))
    }

    if (suggestiveline) abline(h=suggestiveline, col="royalblue4", lty=5)
    if (genomewideline) abline(h=genomewideline, col="red",lty=5)
    }

    Thanks!

    ReplyDelete
  11. @swvanderlaan - the reason the line won't get to 8 on the y-axis is because genomewideline is set to -log10(5e-8), which is about 7.30 on the -log10 scale. The p=5e-8 number comes from the notion that there are ~1 million independent tests in the genome, so by a Bonferroni correction, 0.05/1e6=5e-8, the widely accepted holy grail p-value for "genome-wide significance."

    If you want the line to show up at 8 on the y-axis, set genomewideline=-log10(1e-8), or simply, genomewideline=8.

    Hope this helps.

    ReplyDelete
  12. Hey Stephen,

    So, you stare for minutes, hours at the code, and you just don't see it... Haha, thanks!

    Now, what I wrote earlier: I want names with my top hits. And I'll figure that one out, I'm sure. And when I do, I'll be sure to get back to this post, to share it!

    Thanks.

    Sander

    ReplyDelete
  13. Try using the text() function to add text to the plot. It probably wouldn't be too hard to work this into the function to annotate SNPs with gene names using the text() function. You'll probably want to limit it to SNPs having annotation information AND having p-value less than a certain threshold, so that it's readable and looks nice.

    ReplyDelete
  14. Does Dr. swvanderlaan or Dr. Stephen Turner can help to solve how to make top 10 SNPs name on the plots with R?

    ReplyDelete
  15. Can anyone tell me how to adjust the size of the points on the plot?

    ReplyDelete
  16. Use the cex= option. See here: http://www.inside-r.org/r-doc/graphics/par

    ReplyDelete
  17. hmm...Read through that but didn't understand a whole lot. par(pch=20, cex=0.5) kind of does what I want but it also resizes the axis label fonts. One would think there would be something like pch.cex to do this... Can you give me any clearer guidance please?

    ReplyDelete
  18. Oh, I see what's going on here now. I included a graphical parameter that I named cex.x.axis to resize x-axis labels. When using the cex= argument, it conflicts with the argument I supply. Source this version, and you should be able to change the point size with the cex option.

    manhattan(results) #normal size
    manhattan(results, cex=.5) #half size

    https://gist.github.com/1094160

    ReplyDelete
  19. Thanks Stephen that worked.

    ReplyDelete
  20. Hi Stephen,

    I ran this script and got the following error message :
    "Error in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
    replacement has 0 rows, data has 36720"

    a similiar problem somebody had.

    I double-checked the P values, all fall in (0,1] and no missing values.

    Do you know how this would happen? anything is wrong in my dataset?

    Many thanks,

    ReplyDelete
  21. Hmm, not sure without seeing your data. Are you sure that all the other relevant columns contain numeric values only? How many chromosomes do you have?

    ReplyDelete
  22. Hi Stephen,

    I'm trying to run this script on some results from mice, and am getting this error: Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, :max not meaningful for factors.
    I thought this may be due to the choromosome number being different, but I have now changed the script to chromosome 1:19 and it still has the same error. Do you know what might be causing this problem? Thanks.

    ReplyDelete
  23. Hi Stephen,

    I posted a problem a few days ago when I ran this script and the error message was:
    :"Error in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
    replacement has 0 rows, data has 36720"

    I double-checked the varaibles in the dataset and all the other relevant columns contain numeric value. However, there was only 21 chromosomes (chromomsome 3 was missing). I am wondering if this is the reason which caused the problem?

    Thanks,

    ReplyDelete
  24. Anonymous 1 and anonymous 2: I'll try to take a look at the code sometime soon but I've got a major grant deadline coming up that's sucking much of my time away. Anonymous 2: That could easily be the problem - for a dirty fix, try putting in a single line in your results file between chr 2 and 4, where the chr is 3, position is 1, and p-value is 1, see if it works and let me know. Anonymous 1 - it's difficult for me to troubleshoot without some example data. Could you use something like http://jotonce.com/ to copy a few lines out of your results file, and email me the password to access it?

    ReplyDelete
  25. Stephen,

    after adding a redundant line for chr 3, the function works well! thank you so much!

    ReplyDelete
  26. Thanks for helping me figure that one out. I'm soon going to be updating the code to fix a few other things and add things like automatic annotation based on gene region, SNP features, etc. I'll add this to the list of things to fix.

    ReplyDelete
  27. Hey Stephen,

    I'm running into the following error when trying to create a manhattan plot:
    Error in Math.factor(d$P) : log10 not meaningful for factors In addition: Warning messages: 1: In Ops.factor(P, 0) : > not meaningful for factors 2: In Ops.factor(P, 1) : <= not meaningful for factors

    Any insight into what might be the source of this error would be greatly appreciated as I'm stumped!

    ReplyDelete
  28. Do you have something other than numbers between 0 and 1 (not including 0)?

    ReplyDelete
  29. Thank you for sharing the code! I need to prepare one of these plots quickly and your programming is pretty straightforward. Nice work!

    ReplyDelete
  30. Is there a way to get a chromosome 0 for scaffolds not incorporated into CHR?

    ReplyDelete
  31. I'll add it to the growing list of changes for the next update. For now, what happens if you simply have a chromosome 0, or chromosome 24? Make sure you appropriately change the limitchromosomes argument.

    ReplyDelete
  32. If I change the limit chromosome argument to 0:8, and include CHR 0 in my source file I get this error:

    Error in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
    replacement has 0 rows, data has 213

    I get the same error if I set limitchromosomes=F

    I have not tried it with 24 as my organism only has a few chromosomes.

    ReplyDelete
  33. Also, if I leave it as 1:8 it plots perfectly, but just excludes the CHR 0 data.

    ReplyDelete
  34. Changing this i==1 to i ==0 allows for CHR 0,but for some reason the last chromosome has not points.

    + } else {
    + for (i in unique(d$CHR)) {
    + if (i==0) {

    ReplyDelete
  35. and changing icol=icol+0 restores those points for the last CHR, but the colors by CHR are missing

    + for (i in unique(d$CHR)) {
    + with(d[d$CHR==i, ],points(pos, logp, col=colors[icol], ...))
    + icol=icol+0

    ReplyDelete
  36. Finally got it with changing from i==1 to i==0 like so:

    + } else {
    + for (i in unique(d$CHR)) {
    + if (i==0) {

    and then changing type= "n" to type = "p"


    + with(d, plot(pos, logp, ylim=c(0,ymax), ylab=expression(-log[10](italic(p))), xlab="Chromosome", xaxt="n", type="p", ...))

    ReplyDelete
  37. Excellent! Thanks for hammering away at this. I'll try to make the code more flexible on my next round of updates.

    ReplyDelete
  38. Hi Stephen,

    Thanks. This is great !

    I have tried to add the SNP name in the plot using text. Using the BP for that SNP, how can I find the x-coordinate in the plot? For example, what is x-coordinate in the plot for CHR=2, BP=53606599)?

    Thanks in advance.

    ReplyDelete
  39. If you look at the code here, in lines 45-53 I'm creating a new variable in the data frame called d$pos. On chromosome 1 d$pos is set to d$BP. For every remaining chromosome, d$pos is set to the BP position on that chromosome, added to the last base position on the previous chromosome. For instance, if I have 3 SNPs on chromosome 1 at positions 10, 20, and 30, and 3 SNPs on chromosome 2 at positions 5, 15, 25, the X coordinates for the six SNPs would be 10, 20, 30, 35, 45, 55. The x-axis uses the d$pos variable, and you can mod the code to use d$pos as the x coordinate for your text. I'd love to see what you do with this, please post the code to https://gist.github.com/. Thanks!

    ReplyDelete
  40. We were just using plink the other day and I wanted to create a Manhattan plot for our data. I am running into this error:
    Error in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
    replacement has 0 rows, data has 23

    My plink.assoc file is from a small dataset so it only contains two CHR. I've searched the P column for non-0 or 1 but didn't find any.

    Thanks for writing this code! I appreciate the time and help!

    I can e-mail the the plink.assoc file if you need it.

    ara

    ReplyDelete
  41. I actually figured it out using the args(manhattan). When you have less then 23 CHR you need to use the limitchromosomes=#,# . Where #s are your CHRs that are present in the file.

    ReplyDelete
  42. Well I thought I had it solved but...

    So I have two CHR 11 and 17. I use the limitchromosomes to get it to plot 11 but how to let R know I want just 11 and 17?

    ReplyDelete
  43. Thanks for figuring this out. As with the many other little inconsistencies, I'll try to iron this out and make the code a little more robust in the next version.

    ReplyDelete
  44. Just another quick note. I tried to use the limitchromosomes=11:12 (basically turned my 17 into 12 so it's consecutive) the same error pops up. It looks like you can do a single CHR using limitchromosomes but not consecutive CHRs unless you have all 1:23 present.

    ReplyDelete
  45. Is there any way to change the colour of the annotation from green to red? I don't quite understand how the dot-dot-dot function is supposed to work?

    ReplyDelete
  46. Hi Stephen,

    Just a quick question: how do you usually select you annotation SNPs? Do you say (using PLINK): pick a window of 100 SNPs around my hit? Or do make it broader?

    Thanks!

    Sander

    ReplyDelete
  47. Where I've needed this in the past is to highlight known hits based on RS number from the GWAS catalog or something. If you wanted to highlight "skyscrapers" in the manhattan plot, you could just select all the SNPs within a certain window like you said.

    ReplyDelete
  48. Exactly, but what range would you take 100-250kb, or would rather take 500-1500kb?

    ReplyDelete
  49. It's hard to say really. You might base it on how big the haplotype blocks are in that region. But if you're just coloring neighboring SNPs to highlight the peak on the plot, you probably just want to eyeball it, and pick whatever size looks good.

    ReplyDelete
  50. It works great! Thanks a lot!!

    I also encountered that error when there are no snps for a specific Chromosome. I solved it with a minor adjustment to the loop on line 45. Below is the code.

    CHRs <- unique(d$CHR)
    numchroms=length(CHRs)
    if (numchroms==1) {
    d$pos=d$BP
    ticks=floor(length(d$pos))/2+1
    } else {
    for (i in 1:numchroms) {
    if (i==1) {
    d[d$CHR==CHRs[i], ]$pos=d[d$CHR==CHRs[i], ]$BP
    } else {
    lastbase=lastbase+tail(subset(d,CHR==CHRs[i-1])$BP, 1)
    d[d$CHR==CHRs[i], ]$pos=d[d$CHR==CHRs[i], ]$BP+lastbase
    }
    ticks=c(ticks, d[d$CHR==CHRs[i], ]$pos[floor(length(d[d$CHR==CHRs[i], ]$pos)/2)+1])
    }
    }

    ReplyDelete
  51. Bahar - thanks for sharing your solution. I think others have had this problem before but I never knew what was causing it.

    ReplyDelete
  52. This is great. Thanks a lot Dr. Turner.
    I was trying to figure it how to Zoom in a specific location of Chromosome. For example, if I want to have Manhattan plot only for BP "1e+07 to 2e+07" of CHR1. Any suggestions?

    ReplyDelete
  53. Hi Stephen,

    Thanks for your code. It's awesome. I did a MH plot using your code. In the x-axis, all the chr # didn't show up. How do I resolve this issue?
    All the #s came up between 1 and 9 and then I have 11,13, 15, 19 etc.

    Thanks,

    -Joey

    ReplyDelete
  54. Joey - make a bigger plog. Save the plot to a file and make it wider.

    ReplyDelete
  55. Hi Stephen,

    This function is really helpful. I love reading your blog, really very informative, for people like me who have to learn statistics to be in field of human genetics.

    I used the function to plot Manhattan but, I am having trouble with pch, dots are very big and when I plot 6million SNP's with imputed data big dots merge with each other and it looks messy. I tried to change size with cex, but it changed the size of x axis and dots remained of same size. Can you suggest something to deal with this?

    Thanks,

    ReplyDelete
  56. Ah, I see what's going on here. I named the argument "cex.x.axis" so I could control this at line 60 in the code. Copy the function, and rename the "cex.x.axis" to something like xcex, and when you pass in the cec= argument, it should work properly to resize the points, not the x-axis labels.

    ReplyDelete
  57. Great codes! Very useful. Thank you Stephen

    ReplyDelete
  58. Hi Stephen,

    This code is so useful - thank you for sharing it!

    I have been adapting your script, but I'm having some problems with adding a title to the Manhattan plot (although I can successfully change the colours):

    # manhattan plot using base graphics
    manhattan = function(dataframe, colors=c("#8A2BE2", "#5F9EA0", "#483D8B" ), ymax="max", cex.x.axis=1, limitchromosomes=1:23, suggestiveline=-log10(1e-5), genomewideline=-log10(5e-8), annotate=NULL,
    main="Viral load in cell lines", ...) {

    If you could suggest what I should do to add a title, that would be much appreciated!

    ReplyDelete
  59. I'm also having trouble changing the size of the points using pch:

    # manhattan plot using base graphics
    manhattan = function(dataframe, colors=c("#8A2BE2", "#5F9EA0", "#483D8B" ), ymax="max", cex.x.axis=1, limitchromosomes=1:23, suggestiveline=-log10(1e-5), genomewideline=-log10(5e-8), annotate=NULL,
    main="Viral load in cell lines", pch=20, ...) {

    ReplyDelete
  60. Hello Anonymous from July 19 - I believe I solved your cex problem.

    ReplyDelete
  61. Hi Stephen,

    Awesome work! I am using R studio to run the code. I am having trouble getting the last command to execute. manhattan(results) gives an Error: could not find function "manhattan".

    Any ideas?

    Thanks

    ReplyDelete
  62. It looks like you didn't source in the function. Copy and paste the function yourself, or source it from the web directly using source("http://www.StephenTurner.us/qqman.r")

    ReplyDelete
  63. Hello Stephen,
    Its a very nice piece of code you have compiled. I am using R to plot my GWAS data. I was wondering if we could plot lines instead of dots with colour intensity proportional to LOD scores. In that way, it'll be easy to draw region specific conclusion about association.

    Regards,
    Amol

    Best Wishes

    ReplyDelete
  64. Amol - that would be pretty simple with the old ggplot2 code - just use geom="line" and colour=LOD or the like. I'm not sure how to build this into the current implementation, but contributions are welcome.

    ReplyDelete

Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at GettingGeneticsDone.blogspot.com.