I finally found time to rewrite the code using base graphics rather than ggplot2. The code is now much faster, and if you're familiar with base R's plot options and graphical parameters, most of these can now be passed to the functions to tweak the plots' appearance. The code also behaves differently depending on whether you have results for one or more than one chromosome.
Here's a quick demo.
First, either copy and paste the code from GitHub, or run the following command in R to source the function from the web:
source("http://www.StephenTurner.us/qqman.r")
Next, load some GWAS results, and take a look at the relevant columns. This is standard output from PLINK's --assoc option. This may take a minute or so (~ 20MB file):
results <- read.table("http://www.StephenTurner.us/plinkresults.assoc",T)
head(subset(results, select=c(SNP, CHR, BP, P)))
The manhattan function assumes you have columns named SNP, CHR, BP, and P, corresponding to the SNP name (rs number), chromosome number, genomic coordinate, and p-value. Missing values (where regression failed to converge, for instance) should be recoded NA before reading into R. Do this with a quick sed command. Here's what the data looks like:
SNP CHR BP P 1 rs10495434 1 235800006 0.62220 2 rs6689417 1 46100028 0.06195 3 rs3897197 1 143700035 0.10700 4 rs2282450 1 202300047 0.47280 5 rs567279 1 66400050 NA 6 rs11208515 1 64900051 0.53430
Now, create a basic manhattan plot (click the image to enlarge):
manhattan(results)
If you type args(manhattan) you can see the options you can set. Here are a few:
colors: this is a character vector specifying the colors to cycle through for coloring each point. Here's a PDF chart of R's color names.
ymax: this is the y-axis limit. If ymax="max" (default), the y-axis will always be a little bit higher than the most significant -log10(p-value). Otherwise you can set this value yourself.
*** Update March 9 2012*** cex.x.axis is deprecated. To change the x-axis size, use the default base graphics argument cex.axis.
limitchromosomes: you can limit which chromosomes you want to display. By default this restricts the plot to chromosomes 1-23(x).
suggestiveline and genomewideline: set these to FALSE if you don't want threshold lines, or change the thresholds yourself.
annotate: by default this is undefined. If you supply a character vector of SNP names (e.g. rs numbers), any SNPs in the results data frame that also show up here will be highlighted in green by default. example below.
... : The dot-dot-dot means you can pass most other plot or graphical parameters to these functions (e.g. main, cex, pch, etc).
Make a better looking manhattan plot. Change the plot colors, point shape, and remove the threshold lines:
manhattan(results, colors=c("black","#666666","#CC6600"), pch=20, genomewideline=F, suggestiveline=F)
Now, read in a text file with SNP names that you want to highlight, then make a manhattan plot highlighting those SNPs, and give the plot a title:
snps_to_highlight <- scan("http://www.StephenTurner.us/snps.txt", character())
manhattan(results, annotate=snps_to_highlight, pch=20, main="Manhattan Plot")
Finally, zoom in and plot only the results for chromosome 11, still highlighting those results. Notice that the x-axis changes from chromosome to genomic coordinate.
manhattan(subset(results, CHR==11), pch=20, annotate=snps_to_highlight, main="Chromosome 11")
Finally, make a quantile-quantile plot of the p-values. To make a basic qq-plot of the p-values, pass the qq() function a vector of p-values:
qq(results$P)
Perhaps we should have made the qq-plot first, as it looks like we might have some unaccounted-for population stratification or other bias.
The code should run much faster and use less memory than before. All the old functions that use ggplot2 are still available, now prefixed with "gg." Please feel free to use, modify, and redistribute, but kindly link back to this post. Here's the code:






This comment has been removed by the author.
ReplyDeleteThis is fantastic! Just what I was looking for. I stumbled across your earlier version this afternoon, and was having difficulty getting it to work. The new version worked beautifully and gave me the plot I was dreading having to code up myself from scratch. Could not have come a at a better time! Thanks!
ReplyDeleteJust wanted to add: great! Really, thanks a bunch. It works like a charm...
ReplyDeleteI've added some code so it'll automatically outputs .EPS-files.
Only thing I'm looking for now, is to annotate the top hits (-log10(P)>=8) with the gene name... That would be great!
Thanks!
I have run it and this code is very cooool
ReplyDeleteAlso, a quick question, when I creat my own results and followed the format of “SNP CHR BP P”
However, R returns
“Error in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
replacement has 0 rows, data has 2”
Anybody knows why?
Thanks
I'm guessing the error might be because you have something other than a number in the p-value column (maybe a NaN?). This column can only be a number between 0 and 1 or NA for missing. Next time I update the code I'll add in some checks for this.
ReplyDeleteThank you!
ReplyDeleteI started learning how to run GWAS two hours ago, and thanks to this I'm already looking at some preliminary results.
Science has never been simpler! ;)
Thanks for the nice script! While using it I found out, that it gives an error when there are no snps for a specific Chromosome.
ReplyDeleteMany thanks! This is great!
ReplyDeleteOne thing though: when I try to use the "ymax" in order to limit the Y axis values, it doesn't really work. maybe I'm using it the wrong way.
Here's what I typed:
manhattan(results, ymax=5, limitchromosomes=F, colors=c("black","#666666","#CC6600"), pch=20, genomewideline=F, suggestiveline=F)
Thanks!
Sagi, I think I see your problem here. These two lines gave it the behavior I desired:
ReplyDeleteif (ymax=="max") ymax<-ceiling(max(d$logp))
if (ymax<8) ymax <- 8
... which is to say, if ymax="max", make the maximum value on the y axis at least one point higher on the -log10p scale than the most significant SNP. But if the max y was less than 8, set it at 8 (because genome-wide significance is conventionally 5e-8).
If you want this code to behave the way you want, remove the second line that resets ymax to 8 if it's less than that.
Hi,
ReplyDeleteAs many wrote: thanks a bunch! I'm having difficulty getting the genome-wide significance at 5e-8. It just won't get to 8, it's always a little off... Any clues?
I changed your code a little, but only in making the line for instance dotted. Here's the code:
# manhattan plot using base graphics
manhattan = function(dataframe, colors=c("black", "gray50"), ymax="max", cex.x.axis=0.8, limitchromosomes=1:23, suggestiveline=-log10(1e-4), genomewideline=-log10(5e-8), annotate=NULL, ...) {
d=dataframe
if (!("CHR" %in% names(d) & "BP" %in% names(d) & "P" %in% names(d))) stop("Make sure your data frame contains columns CHR, BP, and P")
if (any(limitchromosomes)) d=d[d$CHR %in% limitchromosomes, ]
d=subset(na.omit(d[order(d$CHR, d$BP), ]), (P>0 & P<=1)) # remove na's, sort, and keep only 0<P<=1
d$logp = -log10(d$P)
d$pos=NA
ticks=NULL
lastbase=0
colors <- rep(colors,max(d$CHR))[1:max(d$CHR)]
if (ymax=="max") ymax<-ceiling(max(d$logp))
if (ymax<9) ymax<-9
numchroms=length(unique(d$CHR))
if (numchroms==1) {
d$pos=d$BP
ticks=floor(length(d$pos))/2+1
} else {
for (i in unique(d$CHR)) {
if (i==1) {
d[d$CHR==i, ]$pos=d[d$CHR==i, ]$BP
} else {
lastbase=lastbase+tail(subset(d,CHR==i-1)$BP, 1)
d[d$CHR==i, ]$pos=d[d$CHR==i, ]$BP+lastbase
}
ticks=c(ticks, d[d$CHR==i, ]$pos[floor(length(d[d$CHR==i, ]$pos)/2)+1])
}
}
if (numchroms==1) {
with(d, plot(pos, logp, ylim=c(0,ymax), ylab=expression(-log[10](italic(P))), xlab=paste("Chromosome",unique(d$CHR),"position"), ...))
} else {
with(d, plot(pos, logp, ylim=c(0,ymax), ylab=expression(-log[10](italic(P))), xlab="Chromosome", xaxt="n", type="n", ...))
axis(1, at=ticks, lab=unique(d$CHR), cex.axis=cex.x.axis)
icol=1
for (i in unique(d$CHR)) {
with(d[d$CHR==i, ],points(pos, logp, col=colors[icol], ...))
icol=icol+1
}
}
if (!is.null(annotate)) {
d.annotate=d[which(d$SNP %in% annotate), ]
with(d.annotate, points(pos, logp, col="darkorange", ...))
}
if (suggestiveline) abline(h=suggestiveline, col="royalblue4", lty=5)
if (genomewideline) abline(h=genomewideline, col="red",lty=5)
}
Thanks!
@swvanderlaan - the reason the line won't get to 8 on the y-axis is because genomewideline is set to -log10(5e-8), which is about 7.30 on the -log10 scale. The p=5e-8 number comes from the notion that there are ~1 million independent tests in the genome, so by a Bonferroni correction, 0.05/1e6=5e-8, the widely accepted holy grail p-value for "genome-wide significance."
ReplyDeleteIf you want the line to show up at 8 on the y-axis, set genomewideline=-log10(1e-8), or simply, genomewideline=8.
Hope this helps.
Hey Stephen,
ReplyDeleteSo, you stare for minutes, hours at the code, and you just don't see it... Haha, thanks!
Now, what I wrote earlier: I want names with my top hits. And I'll figure that one out, I'm sure. And when I do, I'll be sure to get back to this post, to share it!
Thanks.
Sander
Try using the text() function to add text to the plot. It probably wouldn't be too hard to work this into the function to annotate SNPs with gene names using the text() function. You'll probably want to limit it to SNPs having annotation information AND having p-value less than a certain threshold, so that it's readable and looks nice.
ReplyDeleteDoes Dr. swvanderlaan or Dr. Stephen Turner can help to solve how to make top 10 SNPs name on the plots with R?
ReplyDeleteCan anyone tell me how to adjust the size of the points on the plot?
ReplyDeleteUse the cex= option. See here: http://www.inside-r.org/r-doc/graphics/par
ReplyDeletehmm...Read through that but didn't understand a whole lot. par(pch=20, cex=0.5) kind of does what I want but it also resizes the axis label fonts. One would think there would be something like pch.cex to do this... Can you give me any clearer guidance please?
ReplyDeleteOh, I see what's going on here now. I included a graphical parameter that I named cex.x.axis to resize x-axis labels. When using the cex= argument, it conflicts with the argument I supply. Source this version, and you should be able to change the point size with the cex option.
ReplyDeletemanhattan(results) #normal size
manhattan(results, cex=.5) #half size
https://gist.github.com/1094160
Thanks Stephen that worked.
ReplyDeleteHi Stephen,
ReplyDeleteI ran this script and got the following error message :
"Error in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
replacement has 0 rows, data has 36720"
a similiar problem somebody had.
I double-checked the P values, all fall in (0,1] and no missing values.
Do you know how this would happen? anything is wrong in my dataset?
Many thanks,
Hmm, not sure without seeing your data. Are you sure that all the other relevant columns contain numeric values only? How many chromosomes do you have?
ReplyDeleteHi Stephen,
ReplyDeleteI'm trying to run this script on some results from mice, and am getting this error: Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, :max not meaningful for factors.
I thought this may be due to the choromosome number being different, but I have now changed the script to chromosome 1:19 and it still has the same error. Do you know what might be causing this problem? Thanks.
Hi Stephen,
ReplyDeleteI posted a problem a few days ago when I ran this script and the error message was:
:"Error in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
replacement has 0 rows, data has 36720"
I double-checked the varaibles in the dataset and all the other relevant columns contain numeric value. However, there was only 21 chromosomes (chromomsome 3 was missing). I am wondering if this is the reason which caused the problem?
Thanks,
Anonymous 1 and anonymous 2: I'll try to take a look at the code sometime soon but I've got a major grant deadline coming up that's sucking much of my time away. Anonymous 2: That could easily be the problem - for a dirty fix, try putting in a single line in your results file between chr 2 and 4, where the chr is 3, position is 1, and p-value is 1, see if it works and let me know. Anonymous 1 - it's difficult for me to troubleshoot without some example data. Could you use something like http://jotonce.com/ to copy a few lines out of your results file, and email me the password to access it?
ReplyDeleteStephen,
ReplyDeleteafter adding a redundant line for chr 3, the function works well! thank you so much!
Thanks for helping me figure that one out. I'm soon going to be updating the code to fix a few other things and add things like automatic annotation based on gene region, SNP features, etc. I'll add this to the list of things to fix.
ReplyDeleteHey Stephen,
ReplyDeleteI'm running into the following error when trying to create a manhattan plot:
Error in Math.factor(d$P) : log10 not meaningful for factors In addition: Warning messages: 1: In Ops.factor(P, 0) : > not meaningful for factors 2: In Ops.factor(P, 1) : <= not meaningful for factors
Any insight into what might be the source of this error would be greatly appreciated as I'm stumped!
Do you have something other than numbers between 0 and 1 (not including 0)?
ReplyDeleteThank you for sharing the code! I need to prepare one of these plots quickly and your programming is pretty straightforward. Nice work!
ReplyDeleteIs there a way to get a chromosome 0 for scaffolds not incorporated into CHR?
ReplyDeleteI'll add it to the growing list of changes for the next update. For now, what happens if you simply have a chromosome 0, or chromosome 24? Make sure you appropriately change the limitchromosomes argument.
ReplyDeleteIf I change the limit chromosome argument to 0:8, and include CHR 0 in my source file I get this error:
ReplyDeleteError in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
replacement has 0 rows, data has 213
I get the same error if I set limitchromosomes=F
I have not tried it with 24 as my organism only has a few chromosomes.
Also, if I leave it as 1:8 it plots perfectly, but just excludes the CHR 0 data.
ReplyDeleteChanging this i==1 to i ==0 allows for CHR 0,but for some reason the last chromosome has not points.
ReplyDelete+ } else {
+ for (i in unique(d$CHR)) {
+ if (i==0) {
and changing icol=icol+0 restores those points for the last CHR, but the colors by CHR are missing
ReplyDelete+ for (i in unique(d$CHR)) {
+ with(d[d$CHR==i, ],points(pos, logp, col=colors[icol], ...))
+ icol=icol+0
Finally got it with changing from i==1 to i==0 like so:
ReplyDelete+ } else {
+ for (i in unique(d$CHR)) {
+ if (i==0) {
and then changing type= "n" to type = "p"
+ with(d, plot(pos, logp, ylim=c(0,ymax), ylab=expression(-log[10](italic(p))), xlab="Chromosome", xaxt="n", type="p", ...))
Excellent! Thanks for hammering away at this. I'll try to make the code more flexible on my next round of updates.
ReplyDeleteHi Stephen,
ReplyDeleteThanks. This is great !
I have tried to add the SNP name in the plot using text. Using the BP for that SNP, how can I find the x-coordinate in the plot? For example, what is x-coordinate in the plot for CHR=2, BP=53606599)?
Thanks in advance.
If you look at the code here, in lines 45-53 I'm creating a new variable in the data frame called d$pos. On chromosome 1 d$pos is set to d$BP. For every remaining chromosome, d$pos is set to the BP position on that chromosome, added to the last base position on the previous chromosome. For instance, if I have 3 SNPs on chromosome 1 at positions 10, 20, and 30, and 3 SNPs on chromosome 2 at positions 5, 15, 25, the X coordinates for the six SNPs would be 10, 20, 30, 35, 45, 55. The x-axis uses the d$pos variable, and you can mod the code to use d$pos as the x coordinate for your text. I'd love to see what you do with this, please post the code to https://gist.github.com/. Thanks!
ReplyDeleteWe were just using plink the other day and I wanted to create a Manhattan plot for our data. I am running into this error:
ReplyDeleteError in `$<-.data.frame`(`*tmp*`, "pos", value = numeric(0)) :
replacement has 0 rows, data has 23
My plink.assoc file is from a small dataset so it only contains two CHR. I've searched the P column for non-0 or 1 but didn't find any.
Thanks for writing this code! I appreciate the time and help!
I can e-mail the the plink.assoc file if you need it.
ara
I actually figured it out using the args(manhattan). When you have less then 23 CHR you need to use the limitchromosomes=#,# . Where #s are your CHRs that are present in the file.
ReplyDeleteWell I thought I had it solved but...
ReplyDeleteSo I have two CHR 11 and 17. I use the limitchromosomes to get it to plot 11 but how to let R know I want just 11 and 17?
Thanks for figuring this out. As with the many other little inconsistencies, I'll try to iron this out and make the code a little more robust in the next version.
ReplyDeleteJust another quick note. I tried to use the limitchromosomes=11:12 (basically turned my 17 into 12 so it's consecutive) the same error pops up. It looks like you can do a single CHR using limitchromosomes but not consecutive CHRs unless you have all 1:23 present.
ReplyDeleteThanks for the note!
ReplyDeleteIs there any way to change the colour of the annotation from green to red? I don't quite understand how the dot-dot-dot function is supposed to work?
ReplyDeleteChange the color on line 70.
ReplyDeleteHi Stephen,
ReplyDeleteJust a quick question: how do you usually select you annotation SNPs? Do you say (using PLINK): pick a window of 100 SNPs around my hit? Or do make it broader?
Thanks!
Sander
Where I've needed this in the past is to highlight known hits based on RS number from the GWAS catalog or something. If you wanted to highlight "skyscrapers" in the manhattan plot, you could just select all the SNPs within a certain window like you said.
ReplyDeleteExactly, but what range would you take 100-250kb, or would rather take 500-1500kb?
ReplyDeleteIt's hard to say really. You might base it on how big the haplotype blocks are in that region. But if you're just coloring neighboring SNPs to highlight the peak on the plot, you probably just want to eyeball it, and pick whatever size looks good.
ReplyDeleteIt works great! Thanks a lot!!
ReplyDeleteI also encountered that error when there are no snps for a specific Chromosome. I solved it with a minor adjustment to the loop on line 45. Below is the code.
CHRs <- unique(d$CHR)
numchroms=length(CHRs)
if (numchroms==1) {
d$pos=d$BP
ticks=floor(length(d$pos))/2+1
} else {
for (i in 1:numchroms) {
if (i==1) {
d[d$CHR==CHRs[i], ]$pos=d[d$CHR==CHRs[i], ]$BP
} else {
lastbase=lastbase+tail(subset(d,CHR==CHRs[i-1])$BP, 1)
d[d$CHR==CHRs[i], ]$pos=d[d$CHR==CHRs[i], ]$BP+lastbase
}
ticks=c(ticks, d[d$CHR==CHRs[i], ]$pos[floor(length(d[d$CHR==CHRs[i], ]$pos)/2)+1])
}
}
Bahar - thanks for sharing your solution. I think others have had this problem before but I never knew what was causing it.
ReplyDeleteThis is great. Thanks a lot Dr. Turner.
ReplyDeleteI was trying to figure it how to Zoom in a specific location of Chromosome. For example, if I want to have Manhattan plot only for BP "1e+07 to 2e+07" of CHR1. Any suggestions?
Use LocusZoom.
ReplyDeleteHi Stephen,
ReplyDeleteThanks for your code. It's awesome. I did a MH plot using your code. In the x-axis, all the chr # didn't show up. How do I resolve this issue?
All the #s came up between 1 and 9 and then I have 11,13, 15, 19 etc.
Thanks,
-Joey
Joey - make a bigger plog. Save the plot to a file and make it wider.
ReplyDeleteHi Stephen,
ReplyDeleteThis function is really helpful. I love reading your blog, really very informative, for people like me who have to learn statistics to be in field of human genetics.
I used the function to plot Manhattan but, I am having trouble with pch, dots are very big and when I plot 6million SNP's with imputed data big dots merge with each other and it looks messy. I tried to change size with cex, but it changed the size of x axis and dots remained of same size. Can you suggest something to deal with this?
Thanks,
Ah, I see what's going on here. I named the argument "cex.x.axis" so I could control this at line 60 in the code. Copy the function, and rename the "cex.x.axis" to something like xcex, and when you pass in the cec= argument, it should work properly to resize the points, not the x-axis labels.
ReplyDeleteGreat codes! Very useful. Thank you Stephen
ReplyDeleteHi Stephen,
ReplyDeleteThis code is so useful - thank you for sharing it!
I have been adapting your script, but I'm having some problems with adding a title to the Manhattan plot (although I can successfully change the colours):
# manhattan plot using base graphics
manhattan = function(dataframe, colors=c("#8A2BE2", "#5F9EA0", "#483D8B" ), ymax="max", cex.x.axis=1, limitchromosomes=1:23, suggestiveline=-log10(1e-5), genomewideline=-log10(5e-8), annotate=NULL,
main="Viral load in cell lines", ...) {
If you could suggest what I should do to add a title, that would be much appreciated!
I'm also having trouble changing the size of the points using pch:
ReplyDelete# manhattan plot using base graphics
manhattan = function(dataframe, colors=c("#8A2BE2", "#5F9EA0", "#483D8B" ), ymax="max", cex.x.axis=1, limitchromosomes=1:23, suggestiveline=-log10(1e-5), genomewideline=-log10(5e-8), annotate=NULL,
main="Viral load in cell lines", pch=20, ...) {
Hello Anonymous from July 19 - I believe I solved your cex problem.
ReplyDeleteHi Stephen,
ReplyDeleteAwesome work! I am using R studio to run the code. I am having trouble getting the last command to execute. manhattan(results) gives an Error: could not find function "manhattan".
Any ideas?
Thanks
It looks like you didn't source in the function. Copy and paste the function yourself, or source it from the web directly using source("http://www.StephenTurner.us/qqman.r")
ReplyDeleteHello Stephen,
ReplyDeleteIts a very nice piece of code you have compiled. I am using R to plot my GWAS data. I was wondering if we could plot lines instead of dots with colour intensity proportional to LOD scores. In that way, it'll be easy to draw region specific conclusion about association.
Regards,
Amol
Best Wishes
Amol - that would be pretty simple with the old ggplot2 code - just use geom="line" and colour=LOD or the like. I'm not sure how to build this into the current implementation, but contributions are welcome.
ReplyDelete