Metagenomic Data Handling and Analysis Challenges
Source: Janet Jansson
Better algorithms, new bioinformatics tools, and “terabytes” of computer storage are required to accommodate metagenomic sequence data from analysis of soil samples.
Although DNA sequencing itself is no longer a bottleneck, the sheer volume of sequence data generated from highly diverse soil communities is proving difficult to accommodate. The problem is exacerbated by the need to cope with short reads, for example 75–125 bp, produced by Illumina instruments; hence the call for better algorithms, new bioinformatics tools, and terabytes of computer storage.
Increased access to supercomputers, such as those at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory, can help. For instance, Jansson's team used NERSC to run BLASTX on the permafrost metagenome data they had gathered (190 million reads, approximately 50 gigabases of Illumina 113 bp × 2 paired-end sequence). The analysis consumed ~800,000 core hours, the equivalent of more than 85 computer years, yet finished in 2 weeks on the NERSC supercomputer and nodes at JGI. Cloud computing will further help to reduce this bottleneck.
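As a back-of-the-envelope check on those figures, the short sketch below converts the quoted core-hour cost into core-years and estimates how many concurrent cores a two-week wall-clock run would imply; the core count is an inference, not a figure from the text.

```python
# Back-of-the-envelope check of the compute cost quoted above.
CORE_HOURS = 800_000          # total BLASTX cost reported for the run
HOURS_PER_YEAR = 24 * 365     # 8,760 hours in one core-year

core_years = CORE_HOURS / HOURS_PER_YEAR
print(f"{CORE_HOURS:,} core hours = {core_years:.1f} core-years")  # ~91.3

# To finish in ~2 weeks of wall-clock time, the job would need roughly
# this many cores running continuously (an inferred figure):
WEEKS = 2
cores_needed = CORE_HOURS / (WEEKS * 7 * 24)
print(f"~{cores_needed:,.0f} concurrent cores for {WEEKS} weeks")  # ~2,381
```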
Another challenge is the large number of errors that different sequencing platforms generate: how can sequencing errors be distinguished from genuine microheterogeneity within DNA from soil microbial communities? Sample processing introduces difficulties of its own. Each DNA extraction procedure, for example, carries its own bias with respect to sample loss or preferential lysis of some members of the microbial community over others. Ideally, different laboratories would all use the same extraction protocol, but despite the availability of commercial kits, laboratories typically follow their own DNA extraction methods.
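One common way to frame the error question is statistical: a minor variant at a given position is credible only if its read count exceeds what the platform's per-base error rate alone would be likely to produce. The sketch below illustrates this with a simple binomial test; the depth, read counts, and 1% error rate are illustrative assumptions, not values from the text.

```python
# Minimal sketch: separate sequencing error from real microheterogeneity
# by testing whether a minor variant's read count is plausible under a
# binomial error model. All numeric values are illustrative assumptions.
from scipy.stats import binom

def variant_is_credible(alt_reads: int, depth: int,
                        error_rate: float = 0.01,
                        alpha: float = 1e-3) -> bool:
    """True if seeing >= alt_reads mismatches at this depth is too
    unlikely to be explained by sequencing error alone."""
    # P(X >= alt_reads) when every mismatch is assumed to be an error
    p_value = binom.sf(alt_reads - 1, depth, error_rate)
    return p_value < alpha

# 12 variant reads at 100x depth with a 1% error rate: unlikely to be noise
print(variant_is_credible(12, 100))  # True
# 2 variant reads at 100x depth: easily explained by error alone
print(variant_is_credible(2, 100))   # False
```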
Another problem lies with soil samples that have low biomass or high levels of contaminants, such as humic acids, that result in low DNA yields. For example, permafrost soils yield relatively little DNA, but amplifying the DNA before preparing a library could compensate for this shortfall. Two DNA-amplification methods are used: multiple displacement amplification (MDA) and emulsion PCR (emPCR). Of the two, MDA is subject to considerable bias, whereas emPCR should be less biased because each template is amplified separately. However, no studies appear to have directly compared the two methods.
Sometimes the volume of data falls short of what a metagenome analysis requires. For instance, when Susannah Tringe and coworkers at JGI first attempted to assemble soil metagenome data, their efforts failed because the 100 Mbp of sequence they had collected proved insufficient. They estimated that 2–5 Gbp would be needed to obtain draft genome assemblies of the most dominant organisms in soil, and current estimates from analysis of the Great Prairie metagenome data suggest that closer to 2 Tbp may be required. Even so, a relatively low level of coverage was sufficient for some initial comparisons of the soil metagenome from a Minnesota farm with other available metagenome sequence datasets.
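The data-volume estimates above follow from simple coverage arithmetic: the expected fold-coverage of one community member's genome is the total bases sequenced times its relative abundance, divided by its genome size. The sketch below works through this relation; the 5 Mbp genome size and the abundance values are illustrative assumptions.

```python
# Coverage arithmetic behind the data-volume estimates above.
# Expected fold-coverage = total bp sequenced * relative abundance / genome size.
GENOME_SIZE = 5e6  # bp; a typical bacterial genome size (assumption)

def coverage(total_bp: float, abundance: float,
             genome_size: float = GENOME_SIZE) -> float:
    """Expected fold-coverage of one community member's genome."""
    return total_bp * abundance / genome_size

# A dominant member at 1% relative abundance:
print(coverage(100e6, 0.01))  # 0.2x -- too sparse to assemble
print(coverage(2e9, 0.01))    # 4.0x -- enough for a draft assembly
# A rare member at 0.001% relative abundance needs far more data:
print(coverage(2e12, 1e-5))   # 4.0x -- only reached at the Tbp scale
```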
Recently, Etienne Yergeau and colleagues at the National Research Council of Canada produced roughly 1 Gbp of sequence data (853 Mb raw, 533 Mb after filtering; Yergeau et al. 2010) from permafrost soil after amplifying their sample via MDA, which introduced considerable bias. Nevertheless, when these data were compared with other metagenome datasets, the Minnesota farm soil metagenome, rather than data from marine or other habitats, proved most closely related to the permafrost sample.
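The text does not specify which metric was used for that comparison; as one plausible illustration, the sketch below ranks candidate habitats by Bray-Curtis dissimilarity between taxonomic abundance profiles. The profiles are invented placeholders, not the actual study data.

```python
# Illustrative sketch: rank habitats by Bray-Curtis dissimilarity of
# taxonomic profiles (0 = identical, 1 = no taxa shared). The profiles
# below are invented placeholders, not data from the studies cited.

def bray_curtis(a: dict, b: dict) -> float:
    """Bray-Curtis dissimilarity between two abundance profiles."""
    keys = set(a) | set(b)
    num = sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)
    den = sum(a.get(k, 0) + b.get(k, 0) for k in keys)
    return num / den

permafrost = {"Actinobacteria": 30, "Proteobacteria": 40, "Acidobacteria": 20}
farm_soil  = {"Actinobacteria": 25, "Proteobacteria": 45, "Acidobacteria": 22}
marine     = {"Proteobacteria": 70, "Cyanobacteria": 25}

for name, profile in [("farm soil", farm_soil), ("marine", marine)]:
    print(name, round(bray_curtis(permafrost, profile), 3))
# The lowest dissimilarity marks the most closely related habitat.
```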