Microbiome Taxonomic Profiling Pipeline used in BIOiPLUG

The primary objective of analyzing 16S amplicon sequencing data is to profile sequencing reads into a known taxonomic structure. The figure below illustrates the current bioinformatics pipeline used in the BIOiPLUG cloud service.

Raw data that is uploaded to the pipeline can be any one of the following:

  • Single-end reads generated by Roche 454
  • Single- or paired-end reads generated by Illumina platforms (MiSeq, HiSeq, etc.)
  • CCS (circular consensus sequencing) reads generated by Pacific Biosciences (PacBio) platforms. Please consult PacBio’s manual for how to generate CCS reads.
  • Reads from any other NGS platform that can generate FASTQ or FASTA output.

 

[Figure: MTP pipeline. Pipeline for Microbiome Taxonomic Profiling used in the BIOiPLUG cloud.]

 

Merging paired-end reads:
 In the case of paired-end sequencing (typically MiSeq 250 bp x 2), the two reads representing each end of the same PCR amplicon are merged using their overlapping sequence. This step is not required for single-end or CCS sequencing. Read pairs that cannot be merged are omitted from the subsequent steps. Where applicable, PANDAseq (Masella et al., 2012) is used for merging.
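As an illustration of the overlap idea only, the following minimal Python sketch merges a read pair on the longest exact overlap. It is not PANDAseq's algorithm, which scores overlaps probabilistically using base qualities; the function names and the `min_overlap` parameter are assumptions made for this example.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T/N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def merge_pair(forward: str, reverse: str, min_overlap: int = 10):
    """Merge a read pair on the longest exact overlap, or return None.

    The reverse read is reverse-complemented first so that both reads are
    on the same strand. The suffix of the forward read is then compared
    with the prefix of the reoriented reverse read, longest overlap first.
    Exact matching and min_overlap are simplifying assumptions.
    """
    forward = forward.upper()
    reoriented = reverse_complement(reverse)
    longest = min(len(forward), len(reoriented))
    for size in range(longest, min_overlap - 1, -1):
        if forward[-size:] == reoriented[:size]:
            return forward + reoriented[size:]
    return None  # no sufficient overlap: the pair is omitted downstream
```

A pair whose reads share at least `min_overlap` matching bases yields the reconstructed amplicon; any other pair yields `None`, mirroring the rule that unmerged reads are dropped.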
Trimming primers:
 When PCR amplicons are sequenced, the primer regions are not considered “sequenced”: the bases in these regions come from the primers that annealed during PCR rather than from direct sequencing of the template. Therefore, our pipeline removes the primer sequences that were used for PCR of 16S. In-house code is used for this step.
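The in-house trimming code is not published here, so the following is only a sketch of how a forward primer could be removed from the start of a read. It assumes the primer may contain IUPAC degenerate bases and tolerates a small number of mismatches; `trim_forward_primer`, `max_mismatches`, and the primer used in the usage note are hypothetical.

```python
# IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(read_prefix: str, primer: str) -> int:
    """Count positions where the read base is not allowed by the primer code."""
    return sum(1 for base, code in zip(read_prefix, primer)
               if base not in IUPAC[code])


def trim_forward_primer(read: str, primer: str, max_mismatches: int = 1):
    """Return the read without its leading primer, or None if no primer is found.

    Assumption: the primer sits at the very start of the read, with no
    leading adapter or spacer bases.
    """
    read, primer = read.upper(), primer.upper()
    if len(read) < len(primer):
        return None
    if count_mismatches(read[:len(primer)], primer) <= max_mismatches:
        return read[len(primer):]
    return None
```

For example, with the made-up primer `ACGYNTG`, `trim_forward_primer("ACGCATGTTTT", "ACGYNTG")` returns `"TTTT"`, because `Y` accepts `C` and `N` accepts any base.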
Filtering by quality:
 Even though present-day NGS machines produce high-quality sequences, low-quality sequences are also generated. We apply several measures to detect and filter out low-quality sequences [Learn more].
Denoising and extracting non-redundant reads:
In general, NGS raw data contains ~0.5% sequencing errors, which occur randomly. Since the same gene is sequenced many times over in microbiome sequencing, these errors can be corrected with adequate error modeling. This process is called “denoising”, and we use new software called DUDE-Seq for it. Identical sequences are also dereplicated in this step to reduce computational time.
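The dereplication part of this step can be sketched in a few lines of Python. This example covers only the collapsing of identical reads, not the DUDE-Seq denoising itself; the function name `dereplicate` is an assumption for illustration.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into (sequence, abundance) pairs.

    Reads are compared case-insensitively. The result is sorted by
    decreasing abundance, so the most frequent sequences come first.
    """
    counts = Counter(read.upper() for read in reads)
    return counts.most_common()
```

Later steps then need to process each unique sequence only once, carrying its abundance along, which is where the saving in computational time comes from.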
Taxonomic assignment:
Denoised and dereplicated sequences are then subjected to taxonomic assignment. We use the USEARCH program to search the EzBioCloud 16S database and calculate the sequence similarity of each query NGS read to the reference sequences. A 16S similarity of 97% is used as the cutoff for species-level identification; other similarity cutoffs are used for genus and higher taxonomic ranks.

  • x = sequence similarity to reference sequences; species (x ≥ 97%), genus (97% > x ≥ 94.5%), family (94.5% > x ≥ 86.5%), order (86.5% > x ≥ 82%), class (82% > x ≥ 78.5%), and phylum (78.5% > x ≥ 75%). Cutoff values are taken from Yarza et al. (2014).
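The cutoffs above amount to a simple lookup from similarity to the most specific rank that can be assigned. The following Python sketch shows that mapping; the names `RANK_CUTOFFS` and `lowest_assignable_rank` are assumptions for illustration, and the similarity value is assumed to be the best-hit percentage reported by the search.

```python
# (rank, minimum percent similarity), from most to least specific.
# Values are the cutoffs listed above (Yarza et al., 2014).
RANK_CUTOFFS = [
    ("species", 97.0),
    ("genus", 94.5),
    ("family", 86.5),
    ("order", 82.0),
    ("class", 78.5),
    ("phylum", 75.0),
]


def lowest_assignable_rank(similarity_percent: float):
    """Return the most specific rank whose cutoff the similarity meets.

    Returns None when the similarity is below the phylum cutoff (75%).
    """
    for rank, cutoff in RANK_CUTOFFS:
        if similarity_percent >= cutoff:
            return rank
    return None
```

For example, a read whose best hit has 96.2% similarity would be identified at the genus level, since it falls in the range 97% > x ≥ 94.5%.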

To reduce computation and improve accuracy, we built different versions of the reference 16S database that match various regions of the 16S sequence. For example, the full-length version (V1-V9) is used for PacBio CCS data, whereas the V3-V4 version is used for MiSeq 250 bp paired-end sequencing data.

Detecting chimeras:
We assume that NGS reads that match reference sequences in the EzBioCloud database are not chimeric. Only the remaining reads are checked for chimeras using the UCHIME program [Learn more].
Picking OTUs:
OTU (operational taxonomic unit) is a widely used term in microbiome research and can be regarded as a “species” [Learn more]. All sequences from a sample can be clustered into many OTUs using different algorithms and software tools. Rideout et al. (2014) evaluated three approaches (de novo, closed-reference and open-reference). The BIOiPLUG pipeline adopts the “open-reference” method with the following steps:

  1. All quality-controlled query sequences are matched against the EzBioCloud 16S database to achieve species-level identification (97% cutoff).
  2. The sequences that do not match at 97% or higher are then clustered using the UCLUST tool with a 97% similarity threshold. Each resulting cluster is defined as an OTU.
  3. The species identified in step 1 and the OTUs obtained in step 2 are combined to form the final set of OTUs. This information is later used for calculating alpha diversity indices.
  4. Any remaining singletons are ignored in the OTU picking process. This is particularly important for Illumina short reads, which may over-estimate the number of OTUs [Learn more].
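The steps above can be sketched as a small Python function. This is a toy illustration under stated assumptions, not the pipeline's implementation: `percent_identity` is a naive position-by-position comparison standing in for a USEARCH alignment, the greedy centroid clustering is a simplified stand-in for UCLUST, and only singletons among the de novo clusters are dropped. All names, including the `denovo_` label prefix, are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Toy similarity: matching positions as a percentage of the longer length."""
    if not a or not b:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / max(len(a), len(b))


def pick_otus(reads, references, cutoff=97.0):
    """Open-reference OTU picking sketch.

    reads: list of sequences; references: dict of species name -> sequence.
    Returns a dict of OTU label -> list of member reads.
    """
    otus = {}
    unmatched = []
    # Step 1: assign reads to reference species at the cutoff.
    for read in reads:
        best_name, best_score = None, -1.0
        for name, ref_seq in references.items():
            score = percent_identity(read, ref_seq)
            if score > best_score:
                best_name, best_score = name, score
        if best_score >= cutoff:
            otus.setdefault(best_name, []).append(read)
        else:
            unmatched.append(read)
    # Step 2: greedily cluster the unmatched reads around centroids.
    centroids = {}  # de novo OTU label -> centroid sequence
    for read in unmatched:
        for label, centroid in centroids.items():
            if percent_identity(read, centroid) >= cutoff:
                otus[label].append(read)
                break
        else:
            label = f"denovo_{len(centroids) + 1}"
            centroids[label] = read
            otus[label] = [read]
    # Steps 3 and 4: combine both sets, ignoring de novo singletons.
    return {label: members for label, members in otus.items()
            if label not in centroids or len(members) > 1}
```

The returned dictionary holds both reference-based species and de novo clusters, so the number of keys and the size of each member list give the OTU counts used by the alpha diversity step.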
Estimating alpha diversity indices:
Using OTU information (number of OTUs and sequences in each OTU), various alpha diversity indices can be calculated. These include species richness, Shannon and Simpson diversity indices, and many more.
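As a minimal sketch of how such indices follow from the OTU table, the Python functions below take a list of per-OTU sequence counts. The choice of formulas is an assumption: Shannon is computed with the natural logarithm, and Simpson is given in its 1 − Σp² form (other variants exist); the exact variants used by the pipeline are not stated here.

```python
import math


def richness(counts):
    """Observed species richness: the number of OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over non-empty OTUs."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson_index(counts):
    """Simpson diversity in the form 1 - sum(p_i ** 2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)
```

For four OTUs with ten reads each, richness is 4, the Shannon index is ln 4 (about 1.386), and the Simpson index is 0.75; a sample dominated by one OTU scores lower on both diversity indices.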
Secondary analysis using BIOiPLUG:
Once all calculations are carried out for a single microbiome sample in the BIOiPLUG pipeline, all the information about that sample is saved as an object named Microbiome Taxonomic Profile (MTP). BIOiPLUG is installed on the Amazon Cloud, and you use the BIOiPLUG web-based user interface to run comparative analyses and data mining on sets of MTPs of your choice. This process is called “secondary analysis”. Typical secondary analyses require only a few mouse clicks, and the results are available in seconds.

 

 


The BIOiPLUG team / Last edited on Feb. 19, 2018