Metagenomics is the study of DNA collected from environmental samples (e.g., seawater, soil, acid mine drainage, the human gut, sputum, pus, etc.). While traditional microbial genomics typically means sequencing a pure cultured isolate, metagenomics involves taking a culture-free environmental sample and sequencing a single gene (e.g. the 16S rRNA gene), multiple marker genes, or shotgun sequencing everything in the sample in order to determine what's there.
A challenge in shotgun metagenomics analysis is the sequence classification problem: given a sequence read, what is its origin? That is, did this read come from E. coli or some other enteric bacterium? Note that sequence classification does not involve genome assembly; classification is done on unassembled reads. If you could perfectly classify the origin of every sequence read in your sample, you would know exactly which organisms are in your environmental sample and how abundant each one is.
The solution to this problem isn't simply BLASTing every sequence read that comes off your HiSeq 2500 against NCBI nt/nr; the computational cost of such a BLAST search would be many times that of the sequencing itself. There are many algorithms for sequence classification. This paper examines a wide range of the available algorithms and software implementations for sequence classification as applied to metagenomic data:
Bazinet, Adam L., and Michael P. Cummings. "A comparative evaluation of sequence classification programs." BMC Bioinformatics 13.1 (2012): 92.
In this paper, the authors comprehensively evaluated the performance of over 25 programs that fall into three categories: alignment-based, composition-based, and phylogeny-based. For illustrative purposes, the authors constructed a "phylogenetic tree" showing how the evaluated methods relate to one another:
Figure 1: Program clustering. A neighbor-joining tree that clusters the classification programs based on their similar attributes.
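To make the composition-based category concrete, here is a minimal sketch of the core idea behind such classifiers: represent each sequence as a k-mer frequency profile and assign a read to the reference whose profile it most resembles. This is an illustrative toy, not any specific program from the paper; the function names and the tiny "reference" sequences are invented for the example.

```python
from collections import Counter
from math import sqrt

def kmer_profile(seq, k=4):
    """Normalized k-mer frequency vector for a DNA sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse k-mer profiles."""
    dot = sum(p[k] * q.get(k, 0.0) for k in p)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def classify(read, references, k=4):
    """Assign a read to the reference with the most similar k-mer profile."""
    read_profile = kmer_profile(read, k)
    return max(references, key=lambda name: cosine(read_profile, references[name]))

# Toy reference profiles (hypothetical sequences, not real genomes)
refs = {
    "taxon_A": kmer_profile("ATATATATATATATATATATAT"),
    "taxon_B": kmer_profile("GCGCGCGCGCGCGCGCGCGCGC"),
}
print(classify("ATATATATATAT", refs))  # → taxon_A
```

Real composition-based tools use much larger k-mer spaces and trained statistical models rather than a nearest-profile lookup, but the underlying signal is the same.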
The performance evaluation was done on several different datasets of known composition, using a common set of evaluation criteria (sensitivity = number of correct assignments / number of sequences in the data; precision = number of correct assignments / number of assignments made). The authors concluded that the performance of particular methods varied widely between datasets, for reasons including highly variable taxonomic composition and diversity, level of sequence representation in the underlying databases, read lengths, and read quality. They specifically point out that even methods with low sensitivity (as defined here) can still be useful because they have high precision. For example, marker-based approaches (like Metaphyler) might classify only a small fraction of reads, but they do so with high precision, which may still be enough to accurately recapitulate organismal distribution and abundance.
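The sensitivity/precision distinction above is easy to see with a worked example. The sketch below uses the paper's definitions on invented read labels: a conservative classifier that assigns only half the reads, but assigns them correctly, scores low on sensitivity and perfectly on precision.

```python
def evaluate(truth, assignments):
    """Sensitivity and precision per the paper's definitions.

    truth: dict of read_id -> true taxon (every read in the data)
    assignments: dict of read_id -> predicted taxon (unassigned reads omitted)
    """
    correct = sum(1 for read, taxon in assignments.items() if truth.get(read) == taxon)
    sensitivity = correct / len(truth)                              # correct / sequences in the data
    precision = correct / len(assignments) if assignments else 0.0  # correct / assignments made
    return sensitivity, precision

# Hypothetical reads and taxa for illustration
truth = {"r1": "E. coli", "r2": "E. coli", "r3": "B. subtilis", "r4": "B. subtilis"}
# A marker-based-style classifier: assigns only two of four reads, both correctly
assignments = {"r1": "E. coli", "r3": "B. subtilis"}
print(evaluate(truth, assignments))  # → (0.5, 1.0)
```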
Importantly, the authors note that computational requirements cannot be ignored: they varied by orders of magnitude between methods. Selecting the right method depends on your goals (is sensitivity or precision more important?) and your available resources (time and compute power are never infinite; these are tangible limitations imposed in the real world).
This paper was first received at BMC Bioinformatics a year ago, and since then many new methods for sequence classification have been published. Further, this paper only evaluates methods for classification of unassembled reads, and does not evaluate methods that rely on metagenome assembly (that's the subject of another much longer post, but check out
Titus Brown's blog for lots more on this topic).
Overall, this paper was a great demonstration of how one might attempt to evaluate many different tools ostensibly aimed at solving the same problem but functioning in completely different ways.