I'm calling variants from exome sequencing data and I need to evaluate the efficiency of the capture and the coverage along the target regions.
This sounds like a great use case for bedtools, your swiss-army knife for genomic arithmetic and interval manipulation. I'm lucky enough to be able to walk down the hall and bug Aaron Quinlan, creator of bedtools, whenever I have a "how do I do X with bedtools?" question (which happens frequently).
As open-source bioinformatics software documentation goes, bedtools' documentation is top-notch. In addition, Aaron recently pointed out a work-in-progress bedtools cookbook that he's putting together, giving code examples for both typical and clever uses of bedtools.
Getting back to my exome data, one way to visualize this is to plot the cumulative distribution describing the fraction of targeted bases that were covered by >10 reads, >20 reads, >80 reads, etc. For example, covering 90% of the target region at 20X coverage may be one metric to assess your ability to reliably detect heterozygotes. Luckily for me, there's a bedtools protocol for that.
The basic idea is that for each sample, you're using bedtools coverage to read in both a bam file containing your read alignments and a bed file containing your target capture regions (for example, you can download NimbleGen's V3 exome capture regions here). The -hist option outputs a histogram of coverage for each feature in the BED file as well as a summary histogram for all of the features in the BED file. Below I'm using GNU parallel to run this on all 6 of my samples, piping the output of bedtools to grep out only the lines starting with "all."
Now that I have text files with coverage histograms for all the regions in the capture target, I can now plot this using R.
You can see that sample #2 had some problems. Relative to the rest of the samples you see it take a quick nose-dive on the left side of the plot relative to the others. Running this through Picard showed that 86% of the reads from this sample were duplicates.
Thanks, Aaron, for the tips.
Bedtools protocols: https://github.com/arq5x/bedtools-protocols
Hi Stephen,
ReplyDeleteThanks for this blog post! This worked really well for me.
Just wondering if you had any thoughts on how to generate labels for specific samples only. For example i've plotted over 300 exomes and I want to quickly know which ones have Depth ≤10 for 90% of target bases.
Secondly, this works well for exomes but I've found it doesn't work so well for genomes unless you're looking at target regions (eg. the exome capture regions). What do you think is the best way to apply this to genomes? A bedfile with 1MB windows?
I've used textxy in the calibrate package to do this, but you'd have to manually place the label using x and y coordinates. I'm not sure how to automatically produce these in a sensible location.
DeleteRE whole genome, I'm not sure why you couldn't use a similar approach, but I suppose breaking up into fraction of 1kb or 1MB windows covered at X depth could work too. You'd have to play around a bit to see what makes sense.
Hi Septhen, this also worked nicely for me. Thanks for sharing. Just a question (which might be more suited to Aaron than you): do your bam filles contain paired-end reads? If so, is there some double counting in coverageBed? I tried to look for an answer but no dice.
ReplyDeleteHi Stephen
ReplyDeleteI tried to run the above command
find /Dir/*bam | parallel -j+0 'bedtools coverage -hist -abam {} -b /Dir/target_regions.bed | grep ^all > {}.hist.all.txt
which works fine, but when i add nohup, it does not work
nohup find /Dir/*bam | parallel -j+0 'bedtools coverage -hist -abam {} -b /Dir/target_regions.bed | grep ^all > {}.hist.all.txt' &
IT shows error.
I also created a bashscript.sh with above command and tried
nohup ./bashscript.sh & and it does not work.
Any suggestions how to run, not sure why it is not working when parallel is there.
Thank you !
cheers
Chirag
not sure. try gnu screen instead of nohup ... &
DeleteHi Turner,
ReplyDeleteIs there a way to calculate coverage for WGS?
Regards,
Waqasuddin Khan.
Thanks, this was really useful. Here are the commands I used to plot it with ggplot.
ReplyDeletelibrary(ggplot2)
library(RColorBrewer)
# get file names
print(files <- list.files(pattern="all.txt$"))
# load and parse the files
cov<-NULL
for (f in files){
thisDF=read.table(f, sep="\t")[,2:5]
colnames(thisDF)[c(1,4)]=c("covThreshold", "diffFracBelowThreshold")
thisDF$fracAboveThreshold=1-cumsum(thisDF$diffFracBelowThreshold)
cov =rbind(cov, cbind(thisDF, sample= gsub("^(SAMPLE[1-3]).*$", "\\1", f)))
}
# plot the data
maxCov=1000
p<-ggplot(subset(cov, covThreshold<maxCov), aes(x= covThreshold, y=100* fracAboveThreshold, color=sample)) + ylim(0, 100) + scale_color_brewer(palette="Set1")
ggsave("coverageByThreshold.png")
Hey Stephen, I love the blog as it saves me from having to reinvent the wheel -- you might want to update this oldie but goodie -- the command line options have changed for bedtools (-abam should be -b and -b should be -a)...
ReplyDeleteThanks for posting this update -- saved me some time!
DeleteHi Stephen, thank you for this post, it really helped me a lot and I think that this dataset of coverages and fractions is very powerful for a lot of QC analyses. I want to second Brandi's comment about changing -abam to -b and -b to -a.
ReplyDeleteFurthermore I have a comment on the calculation of "cov_cumul". I believe this vector should be shifted down by 1 per sample/BED entry and the first element should be 1. You can already see this in your graph. At 0 depth, the "fraction of capture target bases >= depth" should be 1! While this may not matter much on the exome level, it can make big differences on smaller target enrichments and definitely when you use this script to look at the separate BED elements (so not just at the "all" lines). Below I have a piece of adapted script that takes this comment into account and creates a data.frame with data from the "all"-files:
cov <- list()
for (i in 1:length(files)) {
cov[[i]] <- read.table(files[i])[,c(2,5)]
cov_cumul=1-cumsum(cov[[i]][,2])
cov[[i]]$cov_cumul <- c(1,cov_cumul[-length(cov_cumul)])
cov[[i]]$sample=labs[i]
}
cov_df=do.call("rbind",cov)
names(cov_df)[1:2]=c("depth","fraction")
And if you want to look at the separate BED entries (in my case exons) this is my code:
cov_exon <- list()
for (i in 1:length(files)) {
cov_exon[[i]] <- read.table(files[i])[,c(4,5,8)]
cov_cumul=sapply(unique(cov_exon[[i]]$V4),function(x) 1-cumsum(cov_exon[[i]][cov_exon[[i]]$V4==x,"V8"]))
cov_exon[[i]]$cov_cumul=unlist(sapply(cov_cumul,function(x) c(1,x[-length(x)])))
cov_exon[[i]]$sample=labs[i]
}
cov_exon_df=do.call("rbind",cov_exon)
names(cov_exon_df)[1:3]=c("exon","depth","fraction")
I hope this helps.
Cheers,
Arne