The riboFrame (under construction) Page

- Authors: Matteo Ramazzotti*, Luisa Berná, Caludio Donati and Duccio Cavalieri
- Contact: matteo_dot_ramazzotti_at_unifi_dot_it
- Dataset: download
- Program: riboFrame @ GitHub
- Citation: Ramazzotti M et al. Front. Genet., 17 November 2015

* contact for technical details
riboFrame: an improved method for microbial taxonomy profiling from non targeted metagenomics

Introduction

Non targeted metagenomics offers the unprecedented possibility of simultaneously investigate the microbial profile and the genetic capabilities of a sample by a direct analysis of its entire DNA content. The assessment of the microbial taxonomic composition is currently obtained by mapping reads to genomic databases that, although growing, are still limited and biased.
Here we present riboFrame, a novel method for microbial profiling based on the identification and classification of 16S rRNA sequences in non targeted metagenomics datasets. Reads overlapping the 16S rRNA genes are identified using Hidden Markov Models and a taxonomic assignment is obtained by naïve Bayesian classification. All reads identified as ribosomal are coherently positioned in the 16S rRNA gene, allowing the use of the topology of the gene (i.e. the secondary structure and the location of variable regions) to guide the abundance analysis.
We tested and verified the efficacy of our method on simulated ribosomal data, on simulated metagenomes and on a real dataset.
riboFrame exploits the taxonomic potentialities of the 16S rRNA gene in the context of non-targeted metagenomics, giving an accurate perspective on the microbial profile in metagenomic samples.

Program installation

The package riboFrame is composed by two perl scripts, riboTrap.pl and riboMap.pl that can be downloaded from the link above.
A minimal Perl installation is sufficient and needed for the scripts to work.
Two other programs are required, that constitute the working ground of the riboFrame scripts:
  • Eddy's HMMER3 (hmmer.janelia.org), read more here
  • RDP's RDPclassifier (rdp.cme.msu.edu), read more here
HMMER3 should be in your distro's repositories and can be easily installed with your package manager. The 2.6 version of RDPclassifier (minimal installation) is shipped with the riboFrame program (download it at the link above) in a zipped format. If you unzip the file, the following files should be created
d: riboFrame
d:   lib
f:      RDPclassifier libraries
d:   hmms
f:      files with bacterial and archeal hmms
f:   riboTrap.pl
f:   riboMap.pl
f:   RDPclassifier.jar
d:   samples

If your setup is ok (and you have HMMER3 installed) we can start learning the basic steps of riboFrame, using the following demo riboFrame analysis.

Demo riboFrame analysis

In order to run the following example, open a linux terminal and go the to the newly created riboFrame folder.

The example that follows runs a full riboFrame analysis on a linux machine and on quality checked Illumina data. The sample come from HMP and can be downloaded and prepared following the commands listed hereinafter. All files but the original .fastq and .fasta generated by the following commands are contained in the dataset file that can be downloaded fom the link above, but if you want to do everything from scratch the following steps must be run:

1. Download QC Illumina reads from HMP with axel (warning, the file is a 6 Gb tar.gz archive!) and extract the tarball
cd sample
axel -n 10 http://downloads.hmpdacc.org/data/Illumina/stool/SRS011061.tar.bz2
tar -xjf SRS011061.tar.bz2
cd ..

2. Extact fasta from fastq. We'll use grep, but feel free to change this step according to your preferences. A third file is present containing singletons. We'll skip it for brevity...
grep -A1 -E "^@.+\/1$" samples/SRS011061.denovo_duplicates_marked.trimmed.1.fastq | grep -v -e "--" | sed 's/^@/>/' > samples/SRS011061.1.fasta
grep -A1 -E "^@.+\/2$" samples/SRS011061.denovo_duplicates_marked.trimmed.2.fastq | grep -v -e "--" | sed 's/^@/>/' > samples/SRS011061.2.fasta
#grep -A1 -E "^@.+\/1$" samples/SRS011061.denovo_duplicates_marked.trimmed.singleton.fastq | grep -v -e "--" | sed 's/^@/>/' >> samples/SRS011061.1.fasta
#grep -A1 -E "^@.+\/2$" samples/SRS011061.denovo_duplicates_marked.trimmed.singleton.fastq | grep -v -e "--" | sed 's/^@/>/' >> samples/SRS011061.2.fasta

3. Catch ribosomal reads with hmmsearch using the HMMs in the provided hmm folder (feel free to change the --cpu option or remove it)
hmmsearch -E 0.00001 --domtblout samples/SRS011061.1.fwd.bact.ribosomal.table --noali --cpu 4 -o /dev/null hmms/16s_bact_for3.hmm samples/SRS011061.1.fasta
hmmsearch -E 0.00001 --domtblout samples/SRS011061.1.rev.bact.ribosomal.table --noali --cpu 4 -o /dev/null hmms/16s_bact_rev3.hmm samples/SRS011061.1.fasta
hmmsearch -E 0.00001 --domtblout samples/SRS011061.1.fwd.arch.ribosomal.table --noali --cpu 4 -o /dev/null hmms/16s_arch_for3.hmm samples/SRS011061.1.fasta
hmmsearch -E 0.00001 --domtblout samples/SRS011061.1.rev.arch.ribosomal.table --noali --cpu 4 -o /dev/null hmms/16s_arch_rev3.hmm samples/SRS011061.1.fasta
hmmsearch -E 0.00001 --domtblout samples/SRS011061.2.fwd.bact.ribosomal.table --noali --cpu 4 -o /dev/null hmms/16s_bact_for3.hmm samples/SRS011061.2.fasta
hmmsearch -E 0.00001 --domtblout samples/SRS011061.2.rev.bact.ribosomal.table --noali --cpu 4 -o /dev/null hmms/16s_bact_rev3.hmm samples/SRS011061.2.fasta
hmmsearch -E 0.00001 --domtblout samples/SRS011061.2.fwd.arch.ribosomal.table --noali --cpu 4 -o /dev/null hmms/16s_arch_for3.hmm samples/SRS011061.2.fasta
hmmsearch -E 0.00001 --domtblout samples/SRS011061.2.rev.arch.ribosomal.table --noali --cpu 4 -o /dev/null hmms/16s_arch_rev3.hmm samples/SRS011061.2.fasta

RiboTrap just use the tabular output of hmmsearch (--domtblout) , so the output "-o" is directed to the null and this output is suppressed.

WARNING: hmmsearch (as BLAST does) stops fasta headers after the first whitespace! riboFrame is compliant with this behavious, so be sure that your fasta headers do not contain whitespaces or at least that reads will stay uniquely identified even after deleting from their headers all characters following the whitespace (included)!!!

4. Use riboTrap to apply 16S rRNA topology to the extracted reads
perl riboTrap.pl samples/SRS011061

The main output of riboTrap is SRS011061.16S.fasta. We'll move on from this. But please note that the option covplot=1 specifies to riboTrap to emit the coverage plot, that is fundamental to understand wether sufficient sampling of the metagenomic dataset is done (i.e. if a sufficient number of ribosomal reads have been extacted) to ensure that abundance will not be biased. To have the covplot plotted, the R software must be installed on your system. The covplot for out data looks like this:



5. Classify 16S reads using RDPclassifier

java -Xmx1g -jar rdp_classifier.jar -q samples/SRS011061.16S.fasta -o samples/SRS011061.16S.rdp

RDPclassifier output SRS011061.16S.rdp can eventually be used for abundance analysis with riboMap.

If you have downlaoded the dataset file and uncompressed it, you can start from here.

Let's impose a taxonomy confidence threshold of 0.8 and accumulate counts. The examples below show four different topology-assisted classifications using
  1. all the available reads
  2. perl riboMap.pl file=samples/SRS011061.16S.rdp var=full conf=0.8 cross=over percmin=0.5 covplot=1 outplot=1 out=samples/SRS011061.ab_full
  3. reads falling in all variable regions
  4. perl riboMap.pl file=samples/SRS011061.16S.rdp var=all conf=0.8 cross=over percmin=0.5 tol=10% covplot=1 outplot=1 out=samples/SRS011061.ab_all
  5. reads contained in the V2 and V3 regions with a 10% tolerance
  6. perl riboMap.pl file=samples/SRS011061.16S.rdp var=V2,V3 conf=0.8 cross=over percmin=0.5 tol=10% covplot=1 outplot=1 out=samples/SRS011061.ab_V2,V3
  7. reads mapping from he beginning of the V5 region to the end of the V6 region
  8. perl riboMap.pl file=samples/SRS011061.16S.rdp var=V5-V6 conf=0.8 cross=over percmin=0.5 tol=10% covplot=1 outplot=1 out=samples/SRS011061.ab_V5-V6
In all cases we set riboMap to avoid reporting elements whose abundance percent was lower than 1%.
We also asked the coverage plots and the abundance plots (works only if you have R installed)

Let's first look at the coverage plots for the four examples, that indicate how many reads were selected for the abundance analysis (try to guess which covplot comes from which example...):



It shoul be clear how the recruitment system works. The black line represents all available reads, the red line represents the selected reads.

Now let's investigate how abundance plots vary in the different analyses:

Example 1. Plots of ranks > 1% with no region selection for reads (var=full)

domain

phylum

class

order

family

genus

Example 2. Plots of ranks > 1% with variable region selection for reads (var=all)

domain

phylum

class

order

family

genus

Example 3. Plots of ranks > 1% using only reads mapping in the V2 OR V3 regions, with 10% length tolerance (var=V2,V3 tol=10%)

domain

phylum

class

order

family

genus

Example 4. Plots of ranks > 1% using only reads mapping from the beginning of the V5 region to the end of the V6 region, with 10% length tolerance (var=V5-V6 tol=10%)

domain

phylum

class

order

family

genus

In the end, ask riboMap how to investigate specific regions
perl riboMap.pl var

A bash script have been created to run the example above. Feel free to download it here

riboTrap outputs explained

riboTrap script emits two types of output files
  • .fasta: for each run of a sample XXX, three fasta files are written, namely XXX.16S.fasta, XXX.arch.16S.fasta and XXX.bact.16S.fasta.
    The .bact.16S and the .arch.16S files contain reads specifically assigned to bacteria and archea, respectively.
    The .16S file is a combination of the two and it shoud be used for taxonomic classification.
    The header of the fasta format is taken from source reads and the read length and the mapping position on the 16S gene is added. Something like this:
    >61NLYAAXX100508:6:109:10872:17888:101:225.325
    ---------------------------------- LLL SSS.EEE
    
    -: original read name
    L: read length
    SSS: start position
    EEE: end position
    
  • .cov: for each run of a sample XXX, three coverage files are written detailing the coverage on the 16S genes of the archeal, bacterial and combined reads. These files contain the 16S position in the first column (pos) and the coverage in next columns (cnt). It can be used to evaluate how reads of your dataset covers the 16S gene, e.g by producing plots like this:

Appendix
  • Obtaining raw data from Human Microbiome Project (HMP)
    Browse the page http://www.hmpdacc.org/HMASM/ searching for a sample of interest in the data table. Download the reads data from the Reads column (they are > 5Gb files, so we suggest to use the Mozilla DownThemall plugin (that support pause/resume) or the axel downloader for parallel download).