Getting Genetics Done: Visualize coverage for targeted NGS (exome) experiments

Thursday, March 20, 2014

Visualize coverage for targeted NGS (exome) experiments

I'm calling variants from exome sequencing data and I need to evaluate the efficiency of the capture and the coverage along the target regions.

This sounds like a great use case for bedtools, your swiss-army knife for genomic arithmetic and interval manipulation. I'm lucky enough to be able to walk down the hall and bug Aaron Quinlan, creator of bedtools, whenever I have a "how do I do X with bedtools?" question (which happens frequently).

As open-source bioinformatics software documentation goes, bedtools' documentation is top-notch. In addition, Aaron recently pointed out a work-in-progress bedtools cookbook that he's putting together, giving code examples for both typical and clever uses of bedtools.

Getting back to my exome data, one way to visualize this is to plot the cumulative distribution describing the fraction of targeted bases that were covered by >10 reads, >20 reads, >80 reads, etc. For example, covering 90% of the target region at 20X coverage may be one metric to assess your ability to reliably detect heterozygotes. Luckily for me, there's a bedtools protocol for that.

The basic idea is that for each sample, you're using bedtools coverage to read in both a bam file containing your read alignments and a bed file containing your target capture regions (for example, you can download NimbleGen's V3 exome capture regions here). The -hist option outputs a histogram of coverage for each feature in the BED file as well as a summary histogram for all of the features in the BED file. Below I'm using GNU parallel to run this on all 6 of my samples, piping the output of bedtools to grep out only the lines starting with "all."

Now that I have text files with coverage histograms for all the regions in the capture target, I can now plot this using R.

You can see that sample #2 had some problems. Relative to the rest of the samples you see it take a quick nose-dive on the left side of the plot relative to the others. Running this through Picard showed that 86% of the reads from this sample were duplicates.

Thanks, Aaron, for the tips.

Bedtools protocols: https://github.com/arq5x/bedtools-protocols

10 comments:

UnknownMay 10, 2014 at 2:47 AM
Hi Stephen,

Thanks for this blog post! This worked really well for me.

Just wondering if you had any thoughts on how to generate labels for specific samples only. For example i've plotted over 300 exomes and I want to quickly know which ones have Depth ≤10 for 90% of target bases.

Secondly, this works well for exomes but I've found it doesn't work so well for genomes unless you're looking at target regions (eg. the exome capture regions). What do you think is the best way to apply this to genomes? A bedfile with 1MB windows?
ReplyDelete
Replies
krespimMay 29, 2014 at 7:12 AM
Hi Septhen, this also worked nicely for me. Thanks for sharing. Just a question (which might be more suited to Aaron than you): do your bam filles contain paired-end reads? If so, is there some double counting in coverageBed? I tried to look for an answer but no dice.
ReplyDelete
Replies
CNJuly 8, 2014 at 2:10 PM
Hi Stephen

I tried to run the above command
find /Dir/*bam | parallel -j+0 'bedtools coverage -hist -abam {} -b /Dir/target_regions.bed | grep ^all > {}.hist.all.txt
which works fine, but when i add nohup, it does not work
nohup find /Dir/*bam | parallel -j+0 'bedtools coverage -hist -abam {} -b /Dir/target_regions.bed | grep ^all > {}.hist.all.txt' &
IT shows error.
I also created a bashscript.sh with above command and tried
nohup ./bashscript.sh & and it does not work.
Any suggestions how to run, not sure why it is not working when parallel is there.

Thank you !
cheers
Chirag
ReplyDelete
Replies
UnknownMay 10, 2015 at 9:01 AM
Hi Turner,

Is there a way to calculate coverage for WGS?

Regards,

Waqasuddin Khan.
ReplyDelete
Replies
ManducaNovember 20, 2015 at 6:03 PM
Thanks, this was really useful. Here are the commands I used to plot it with ggplot.

library(ggplot2)
library(RColorBrewer)

# get file names
print(files <- list.files(pattern="all.txt$"))

# load and parse the files
cov<-NULL
for (f in files){
thisDF=read.table(f, sep="\t")[,2:5]
colnames(thisDF)[c(1,4)]=c("covThreshold", "diffFracBelowThreshold")
thisDF$fracAboveThreshold=1-cumsum(thisDF$diffFracBelowThreshold)
cov =rbind(cov, cbind(thisDF, sample= gsub("^(SAMPLE[1-3]).*$", "\\1", f)))
}

# plot the data
maxCov=1000
p<-ggplot(subset(cov, covThreshold<maxCov), aes(x= covThreshold, y=100* fracAboveThreshold, color=sample)) + ylim(0, 100) + scale_color_brewer(palette="Set1")

ggsave("coverageByThreshold.png")
ReplyDelete
Replies
BrandiJanuary 26, 2016 at 2:41 PM
Hey Stephen, I love the blog as it saves me from having to reinvent the wheel -- you might want to update this oldie but goodie -- the command line options have changed for bedtools (-abam should be -b and -b should be -a)...
ReplyDelete
Replies
ArneMarch 11, 2016 at 3:43 AM
Hi Stephen, thank you for this post, it really helped me a lot and I think that this dataset of coverages and fractions is very powerful for a lot of QC analyses. I want to second Brandi's comment about changing -abam to -b and -b to -a.
Furthermore I have a comment on the calculation of "cov_cumul". I believe this vector should be shifted down by 1 per sample/BED entry and the first element should be 1. You can already see this in your graph. At 0 depth, the "fraction of capture target bases >= depth" should be 1! While this may not matter much on the exome level, it can make big differences on smaller target enrichments and definitely when you use this script to look at the separate BED elements (so not just at the "all" lines). Below I have a piece of adapted script that takes this comment into account and creates a data.frame with data from the "all"-files:

cov <- list()
for (i in 1:length(files)) {
cov[[i]] <- read.table(files[i])[,c(2,5)]
cov_cumul=1-cumsum(cov[[i]][,2])
cov[[i]]$cov_cumul <- c(1,cov_cumul[-length(cov_cumul)])
cov[[i]]$sample=labs[i]
}
cov_df=do.call("rbind",cov)
names(cov_df)[1:2]=c("depth","fraction")

And if you want to look at the separate BED entries (in my case exons) this is my code:

cov_exon <- list()
for (i in 1:length(files)) {
cov_exon[[i]] <- read.table(files[i])[,c(4,5,8)]
cov_cumul=sapply(unique(cov_exon[[i]]$V4),function(x) 1-cumsum(cov_exon[[i]][cov_exon[[i]]$V4==x,"V8"]))
cov_exon[[i]]$cov_cumul=unlist(sapply(cov_cumul,function(x) c(1,x[-length(x)])))
cov_exon[[i]]$sample=labs[i]
}
cov_exon_df=do.call("rbind",cov_exon)
names(cov_exon_df)[1:3]=c("exon","depth","fraction")

I hope this helps.

Cheers,

Arne
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.

This blog has moved!

Thursday, March 20, 2014

Visualize coverage for targeted NGS (exome) experiments

10 comments: