Getting Genetics Done: Identifying Pathogens in Sequencing Data

Thursday, June 21, 2012

Identifying Pathogens in Sequencing Data

I just read an interesting paper on pathogen discovery using next-generation sequencing data, recommended to me by Nick Loman.

A previously described algorithm (PathSeq, Kostic et al.) for discovering microbes by deep-sequencing human tissue uses computational subtraction: the initial collection of reads is depleted of human DNA by consecutive alignment to the human reference using MAQ and BLAST.

The PathSeq method: computational subtraction by depleting the complete read set of all reads mapping to the human reference.
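
To make the subtraction idea concrete, here is a toy sketch in Python. It is not PathSeq itself: exact k-mer sharing stands in for the MAQ and BLAST alignment steps, and the function names, the value of `k`, and the sequences are all made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def subtract_host_reads(reads, host_reference, k=8):
    """Keep only reads that share no k-mer with the host reference.

    Assumption: exact k-mer sharing is a crude stand-in for a real aligner.
    """
    host_kmers = kmers(host_reference, k)
    return [r for r in reads if not (kmers(r, k) & host_kmers)]


# Hypothetical toy data: a short "host" sequence and two reads.
host = "ACGTACGTTAGCCGATAGGCTTAACCGGTT"
reads = [
    "ACGTTAGCCGAT",      # taken from the host sequence, so it is depleted
    "TTTTGGGGCCCCAAAA",  # shares nothing with the host, so it survives
]

remaining = subtract_host_reads(reads, host)
print(remaining)
```

The cost of this approach is visible even in the toy: every read has to be compared against the whole host reference before anything interesting is left over.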

The method described here, Rapid Identification of Nonhuman Sequences (RINS), uses an intersection-based workflow rather than computational subtraction. RINS first maps reads to a user-supplied custom reference (e.g. a collection of all known viruses and bacteria), thereby drastically lowering computational requirements and increasing sensitivity. Contigs are then assembled de novo, and the original reads are mapped back onto the assembled contigs, which increases specificity.

The RINS method: intersection rather than subtraction to identify nonhuman reads.
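
The three steps described above (intersect with a custom reference, assemble contigs, map reads back) can be sketched in a few lines of Python. This is only a toy under loud assumptions: k-mer sharing replaces the real mapper, a naive greedy overlap merge replaces the real de novo assembler, and exact substring matching replaces the map-back step. All function names, parameters, and sequences are hypothetical and are not taken from the RINS software.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def intersect_with_reference(reads, references, k=8):
    """Step 1: keep reads sharing at least one k-mer with any reference."""
    ref_kmers = set()
    for ref in references:
        ref_kmers |= kmers(ref, k)
    return [r for r in reads if kmers(r, k) & ref_kmers]


def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_overlap=5):
    """Step 2: toy assembly, repeatedly merging the best-overlapping pair."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)


def map_back(reads, contigs):
    """Step 3: for each contig, list the original reads found in it exactly."""
    return {c: [r for r in reads if r in c] for c in contigs}


# Hypothetical toy data: one "viral" reference and three reads.
virus = "ATGGCGTACCTGAAGCTTGGATCCGTAAGC"
reads = [
    "ATGGCGTACCTGAAGC",  # from the reference
    "CCTGAAGCTTGGATCC",  # from the reference, overlapping the first read
    "GGGGGGGGTTTTTTTT",  # unrelated
]

candidates = intersect_with_reference(reads, [virus])
contigs = greedy_assemble(candidates)
support = map_back(reads, contigs)
print(contigs)
print(support)
```

Note the difference from subtraction: the first step only has to look at a comparatively small custom reference, so most of the data set is discarded early and cheaply.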

The authors of the RINS (intersection) paper noted similar sensitivity and specificity to the PathSeq (subtraction) method in a fraction of the time (2 hours on a desktop machine versus 13 hours on the cloud).

Rapid Identification of Nonhuman Sequences in High Throughput Sequencing Data Sets

3 comments:

Andreas Klostermann, June 21, 2012 at 11:15 PM
An even quicker way might be to use Chaos Game Representation, which not only provides pretty pictures, but can also distinguish vertebrate and non-vertebrate DNA quite handily. Short reads might compromise the quality though.
Anonymous, January 1, 2013 at 3:58 PM
I think there is a level of uncertainty about what sequence is human and what's not. How much of this is variability between individuals is not clear.
Chad Brewbaker, January 28, 2013 at 11:07 AM
I never understood the alignment disease. Just sort the kmers and focus on unexpected missing or present kmers. A few terabytes of kmer sequence from GenBank should be enough to bootstrap what is or isn't expected.