Earth Sciences Division (ESD) Department of Energy (DOE) Lawrence Berkeley National Laboratory (LBNL)

ESD News and Events


06/23/2011

Simrank: A Rapid and Sensitive General-Purpose k-mer Search Tool

Source:  Todd DeSantis, Ulas Karaoz, Dan Hawkes

Molecular ecology methods often require the collection of thousands of biological sequences (DNA, RNA, or proteins) extracted from microbial specimens (individuals or communities). Processing these raw data typically involves a time-consuming similarity search against one or more reference databases. The resulting matches make it possible to deduce community composition or to infer functional capacity within organisms or across microbial communities. The most popular method for sequence comparison has been to find local alignment pairings, but other, faster software has emerged that bypasses the time-consuming alignment step by simply counting the number of short subsequences shared between two sequences. These subsequences are referred to as k-mers: the set of all fragments of a given length k (2-mers, 3-mers, 4-mers, etc.) that occur in a biological sequence.
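To make the k-mer idea concrete, here is a minimal Python sketch of extracting k-mers from a sequence and counting how many two sequences share. It illustrates the general technique only; the function names `kmers` and `shared_kmer_count` and the example sequences are hypothetical and are not taken from any published tool.

```python
def kmers(sequence, k):
    """Return the set of all overlapping substrings of length k."""
    # A sequence shorter than k yields an empty set.
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmer_count(seq_a, seq_b, k):
    """Count the distinct k-mers that occur in both sequences."""
    return len(kmers(seq_a, k) & kmers(seq_b, k))


# "GATTACA" contains five 3-mers: GAT, ATT, TTA, TAC, ACA.
example_kmers = kmers("GATTACA", 3)

# "ATTACCA" shares ATT, TTA and TAC with "GATTACA".
example_shared = shared_kmer_count("GATTACA", "ATTACCA", 3)
```

No alignment is computed at any point; the comparison reduces to a set intersection, which is why k-mer counting can be much faster than alignment-based matching.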

Current alignment-based matching strategies, in which investigators compare DNA to DNA, RNA to RNA, or protein to protein within and across projects, can cause massive processing bottlenecks. While software applications for sequence database partitioning, guide-tree estimation, molecular classification, and alignment acceleration have benefited from k-mer searches, a rapid, flexible, open-source, stand-alone, general-purpose k-mer tool has not been available until now.

A team of researchers led by ESD’s Todd DeSantis, and including ESD’s Ulas Karaoz, Navjeet Singh, Eoin Brodie, and Gary Andersen, has recently developed a solution called Simrank, a stand-alone, rapid, and powerful general-purpose k-mer search tool. Given a query sequence, it allows users to rapidly identify the most similar database sequences. This information can then be used to attach biological annotation to the query sequence. In performance tests against popular existing tools, Simrank was 10 to 928 times faster, whether the data were DNA, RNA, or protein sequences, or even human-readable sentences. As reported in DeSantis et al. (2011), Simrank provides molecular ecologists with an unprecedented high-throughput, open-source choice for comparing large sequence sets to find similarity.
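The query-against-database workflow described above can be sketched in a few lines of Python. This is an illustration of k-mer-based ranking in general, not Simrank's actual implementation or scoring formula; the function `rank_by_shared_kmers`, the scoring rule (fraction of the query's distinct k-mers found in each database sequence), and the toy reference sequences are all assumptions made for the example.

```python
def kmers(sequence, k):
    """Return the set of all overlapping substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def rank_by_shared_kmers(query, database, k=7, top_n=3):
    """Rank database entries by the fraction of query k-mers they contain.

    `database` maps a sequence name to its sequence string.
    Returns up to `top_n` (name, score) pairs, best match first.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        return []  # query is shorter than k, so nothing can be scored
    scores = []
    for name, seq in database.items():
        shared = len(query_kmers & kmers(seq, k))
        scores.append((name, shared / len(query_kmers)))
    # Highest similarity first.
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]


# Hypothetical toy reference set: ref_A is identical to the query,
# ref_B differs by one base, ref_C is unrelated.
reference = {
    "ref_A": "ACGTACGTGGCATTACGA",
    "ref_B": "ACGTACGTGGCATTTCGA",
    "ref_C": "TTTTGGGGCCCCAAAATT",
}
hits = rank_by_shared_kmers("ACGTACGTGGCATTACGA", reference, k=4)
for name, score in hits:
    print(f"{name}\t{score:.2f}")
```

The annotation attached to the top-ranked reference sequence can then be carried over to the query. A production tool would typically precompute and index the database k-mers rather than rebuilding them for every query, as this sketch does.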

When DNA from bacteria is found in clinical or environmental samples, some of the first questions asked are “What kind of bacteria is it?”, “Is it a dangerous bacterium?”, and “Has this bacterial strain ever been seen before?” Simrank helps answer these questions. In conjunction with the 16S reference database hosted at greengenes.lbl.gov, Simrank can compare DNA from any sample against well-characterized reference genes. Finding close matches allows inference of the type (family) of bacteria, and the name of the match allows the researcher to infer the pathogenic potential. Conversely, the absence of any close Simrank match in the greengenes database signals a potentially novel organism that the researcher can investigate further and perhaps isolate.

Simrank is also being used to rapidly find bacteria with exact and near-exact matches to 16S DNA probes on the Berkeley Lab PhyloChip. Second Genome, Inc., licensed the PhyloChip from LBNL in 2010 and will use Simrank to update the probe annotations regularly against the ever-growing reference set.


Large, collaborative, multinational consortium projects, such as the Human Microbiome Project (http://nihroadmap.nih.gov/hmp), are expected to build terabyte-scale collections of sequence data soon. The scale of such projects is unprecedented, largely because sequencing technology has become cheaper, more practical, and more accessible. Simrank gives researchers a software tool that can rapidly search this massive amount of data.

Figure: Simrank can be used as an alternative to popular but slower approaches such as DNAML-F84.

Citation

DeSantis, T.Z., K. Keller, U. Karaoz, A.V. Alekseyenko, N.N.S. Singh, E.L. Brodie, Z. Pei, G.L. Andersen, and N. Larsen (2011), Simrank: Rapid and sensitive general-purpose k-mer search tool. BMC Ecology, 11 (11), published online, DOI:10.1186/1472-6785-11-11. LBNL-4596E.
Additional Information