Thursday, June 21, 2012

Identifying Pathogens in Sequencing Data

I just read an interesting paper on pathogen discovery using next-generation sequencing data, recommended to me by Nick Loman.

A previously described algorithm (PathSeq, Kostic et al) for discovering microbes by deep-sequencing human tissue uses computational subtraction, whereby the initial collection of reads is depleted of human DNA by consecutive alignment to the human reference using MAQ and BLAST.
The PathSeq method: computational subtraction by depleting complete read set of all reads mapping to the human reference.

The method described here, Rapid Identification of Nonhuman Sequences (RINS), uses and intersection-based workflow rather than computational subtraction. RINS first maps to a user-supplied custom reference (e.g. a collection of all known viruses and bacteria), thereby drastically lowering computational requirements and increasing sensitivity. Contigs are then assembled de novo, and the original reads are then mapped back onto assembled contigs, which increases specificity.
The RINS method - uses intersection rather than subtraction to identify non-human reads.

The authors of the RINS (intersection) paper noted similar sensitivity and specificity to the PathSeq (subtraction) method in a fraction of the time (2 hours on a desktop machine versus 13 hours on the cloud).

Rapid Identification of Nonhuman Sequences in High Throughput Sequencing Data Sets


  1. An even quicker way might be to use Chaos Game Representation, which not only provides pretty pictures, but can also distinguish vertebrate and non-vertebrate DNA quite handily. Short reads might compromise the quality though.

  2. I think there is level of uncertainty about what sequence is human and whats not. How much of this variability between individuals is not clear.

  3. I never understood the alignment disease. Just sort the kmers and focus on unexpected missing or present kmers. A few terabytes of kmer sequence from GenBank should be enough to bootstrap what is or isn't expected.


Creative Commons License
Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.