How to use EzBioCloud 16S database with MOTHUR

MOTHUR is a widely used, open-source bioinformatics pipeline for microbiome analysis (https://www.mothur.org/). In this document, we provide a guide for using EzBioCloud’s 16S database with MOTHUR’s pipeline (https://www.mothur.org/wiki/MiSeq_SOP).

Unlike other public databases, EzBioCloud’s 16S database can be used for species-level identification of OTUs and is freely available for academic, not-for-profit purposes. To request, please visit our form here.

0. Preparation

To begin microbiome analysis using EzBioCloud’s 16S database with MOTHUR, a few preparatory steps are required:

  • an installation of mothur
  • a FASTA file
    • ensure that sequence barcodes are removed
    • merge paired-end sequences beforehand; do not use unmerged FASTQ files
    • ensure that sequences have been filtered by quality, length, etc.
  • EzBioCloud’s 16S database
    • aligned FASTA file
    • MAPPING file

Figure 1. Overview of mothur pipeline based on MiSeq SOP

1. Unique sequences

Before the rest of mothur’s pipeline can run, duplicate sequences must be removed from the quality-controlled fasta file. Use the unique.seqs command for this.

  • Command line

unique.seqs(fasta=example.fasta)

  • Option description
    fasta=path to the input fasta file
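
To make the idea of this step concrete, here is a minimal Python sketch of dereplication. It is not mothur's implementation; the function name unique_seqs and the read names are hypothetical. It keeps the first read seen for each distinct sequence and records which reads that representative stands for, which is the role the name file plays in the later steps.

```python
def unique_seqs(records):
    """Collapse identical sequences (illustrative only, not mothur's code).

    records: list of (name, sequence) tuples.
    Returns (unique, names):
      unique maps representative name -> sequence
      names  maps representative name -> list of all read names it represents
    """
    representative_for = {}  # sequence -> name of its first occurrence
    unique = {}
    names = {}
    for name, seq in records:
        rep = representative_for.get(seq)
        if rep is None:
            # First time this sequence is seen: it becomes the representative.
            representative_for[seq] = name
            unique[name] = seq
            names[name] = [name]
        else:
            # Duplicate: only remember that it is represented by `rep`.
            names[rep].append(name)
    return unique, names


# Hypothetical reads for illustration.
records = [("read1", "ACGT"), ("read2", "ACGT"), ("read3", "ACGA")]
unique, names = unique_seqs(records)
```

Here unique holds two sequences (read1 and read3), and names records that read1 also represents read2.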

2. Align sequences

The align.seqs command aligns the sequences in your fasta file against EzBioCloud’s aligned 16S database file. If the alignment is slow, try running it on multiple processors by adding the processors parameter.

  • Command line
align.seqs(fasta=example.unique.fasta, reference=eztaxon_full.align)
  • Option description
    fasta=path to the input fasta file
    reference=path to the reference fasta file
    processors=integer : number of processors to use

3. Clean alignment

3.1. screen sequences

The screen.seqs command enables you to keep sequences that fulfill certain user defined criteria.

  • Command line
screen.seqs(fasta=example.unique.align,name=example.names,optimize=start-end-maxhomop,criteria=95)
  • Option description
    fasta=path to the input fasta file
    name=The name file
    optimize, criteria=The optimize and criteria parameters allow you to set the start, end, maxambig, maxhomop, minlength, maxlength, minoverlap, ostart, oend, mismatches, maxn, minscore, maxinsert and minsim parameters relative to your set of sequences.
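
The following Python sketch illustrates one way to read optimize=start-end-maxhomop with a criteria percentage: pick the start, end and homopolymer thresholds that the given percentage of sequences satisfy. The function names, the toy alignment and the exact percentile rounding are assumptions made for illustration; mothur's own calculation may differ in detail.

```python
import math

GAPS = "-."


def alignment_stats(aligned):
    """Return (start, end, maxhomop) for one aligned sequence, 1-based columns."""
    positions = [i + 1 for i, base in enumerate(aligned) if base not in GAPS]
    longest = run = 0
    previous = None
    for i in positions:
        base = aligned[i - 1]
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return positions[0], positions[-1], longest


def optimized_thresholds(alignments, criteria=95):
    """Thresholds that `criteria` percent of the sequences satisfy (assumed rounding)."""
    stats = [alignment_stats(a) for a in alignments]
    n = len(stats)
    idx = min(n - 1, math.ceil(n * criteria / 100) - 1)
    starts = sorted(s for s, _, _ in stats)                 # want start <= threshold
    ends = sorted((e for _, e, _ in stats), reverse=True)   # want end >= threshold
    homops = sorted(h for _, _, h in stats)                 # want maxhomop <= threshold
    return {"start": starts[idx], "end": ends[idx], "maxhomop": homops[idx]}


# Hypothetical aligned reads; criteria=75 is used because the example is tiny.
alignments = ["..ACGT..", ".AACGT..", "..ACGTT.", "...CGT.."]
thresholds = optimized_thresholds(alignments, criteria=75)
kept = [
    a for a in alignments
    if alignment_stats(a)[0] <= thresholds["start"]
    and alignment_stats(a)[1] >= thresholds["end"]
    and alignment_stats(a)[2] <= thresholds["maxhomop"]
]
```

With these toy reads, the read that starts latest ("...CGT..") is the one screened out.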

3.2. filter sequences

The filter.seqs command removes columns from alignments based on criteria defined by the user.

  • Command line
filter.seqs(fasta=example.unique.good.align,vertical=T,trump=.)
  • Option description
    fasta=path to the input fasta file
    vertical=any column that only contains gap characters (i.e. ‘-’ or ‘.’) is ignored.
    trump=The trump option will remove a column if the trump character is found at that position in any sequence of the alignment.
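
As an illustration of the two options described above, the Python sketch below drops columns from a small alignment. It is a simplified model rather than mothur's code; the function name filter_columns and the example sequences are hypothetical.

```python
def filter_columns(alignment, vertical=True, trump=None):
    """Remove alignment columns (illustrative model of the vertical and trump options).

    alignment: dict name -> aligned sequence, all of the same length.
    vertical:  drop columns that contain only gap characters ('-' or '.').
    trump:     drop columns where any sequence has this character.
    """
    seqs = list(alignment.values())
    keep = []
    for col in range(len(seqs[0])):
        column = [s[col] for s in seqs]
        if vertical and all(c in "-." for c in column):
            continue  # column holds gaps only
        if trump is not None and any(c == trump for c in column):
            continue  # at least one sequence has the trump character here
        keep.append(col)
    return {name: "".join(seq[c] for c in keep) for name, seq in alignment.items()}


# Hypothetical alignment: column 1 contains a '.', column 4 is all gaps.
alignment = {"a": ".AC-GT", "b": "TAC-GT", "c": "TAC-G-"}
filtered = filter_columns(alignment, vertical=True, trump=".")
```

The last column survives because only one sequence has a ‘-’ there, which neither option treats as grounds for removal.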

3.3. unique sequences filtered

Screening and filtering can make previously distinct sequences identical. This step removes the duplicate sequences produced by screen.seqs and filter.seqs.

  • Command line
unique.seqs(fasta=example.unique.good.filter.fasta,name=example.good.names)
  • Option description
    fasta=path to the input fasta file
    name=The name file is used to show the relationship between a representative sequence and the sequences it represents.

4. Pre-cluster sequences

The pre.cluster command implements a pseudo-single linkage algorithm with the goal of removing sequences that are likely due to pyrosequencing errors.

  • Command line
pre.cluster(fasta=example.unique.good.filter.unique.fasta,name=example.unique.good.filter.names,diffs=2)
  • Option description
    fasta=path to the input fasta file
    name=The name file
    diffs=2 : the pre.cluster command will look for sequences that are within 2 mismatches of the sequence being considered.
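
The Python sketch below shows the general idea behind this kind of pre-clustering under simplifying assumptions: sequences are ranked by abundance, and each rarer sequence within diffs mismatches of a more abundant one is merged into it. It is not mothur's implementation; the function names and counts are hypothetical.

```python
def mismatches(a, b):
    """Number of differing positions between two equal-length aligned sequences."""
    return sum(1 for x, y in zip(a, b) if x != y)


def pre_cluster(abundance, diffs=2):
    """Merge rare sequences into abundant ones (simplified illustration).

    abundance: dict aligned sequence -> read count.
    Returns a dict of surviving sequences -> merged read count.
    """
    ordered = sorted(abundance.items(), key=lambda kv: kv[1], reverse=True)
    merged = {}
    absorbed = set()
    for i, (seq, count) in enumerate(ordered):
        if seq in absorbed:
            continue
        total = count
        # Only rarer sequences (later in the ranking) can be absorbed.
        for other, other_count in ordered[i + 1:]:
            if other not in absorbed and mismatches(seq, other) <= diffs:
                total += other_count
                absorbed.add(other)
        merged[seq] = total
    return merged


# Hypothetical abundances for illustration.
abundance = {"ACGTACGT": 100, "ACGTACGA": 3, "ACGTTCGA": 1, "TTTTACGT": 50}
merged = pre_cluster(abundance, diffs=2)
```

In this example the two rare variants are folded into ACGTACGT, while TTTTACGT differs at three positions and stays separate.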

5. Detect chimeric sequences

The chimera.uchime command reads a fasta file together with either a reference file or, as in this example, a name file, and outputs potentially chimeric sequences.

  • Command line
chimera.uchime(fasta=example.unique.good.filter.unique.precluster.fasta,name=example.unique.good.filter.unique.precluster.names)
  • Option description
    fasta=path to the input fasta file
    name=The name file

6. Remove chimeric sequences

The remove.seqs command takes a list of sequence names and a fasta file to generate a new file that does not contain sequences on the list.

  • Command line
remove.seqs(fasta=example.unique.good.filter.unique.precluster.fasta,name=example.unique.good.filter.unique.precluster.names,accnos=example.unique.good.filter.unique.precluster.uchime.accnos)
  • Option description
    fasta=path to the input fasta file
    name=The name file
    accnos=the file listing the names of the chimeric sequences to remove
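
In outline, this step is a set difference on sequence names. The Python sketch below shows that idea on in-memory dictionaries; the function name remove_seqs and the data are hypothetical, and mothur's handling of name files has options that are not modeled here.

```python
def remove_seqs(fasta, names, accnos):
    """Drop the listed representatives from both the fasta and the name mapping.

    fasta:  dict representative name -> sequence
    names:  dict representative name -> list of read names it represents
    accnos: iterable of sequence names to remove
    """
    drop = set(accnos)
    kept_fasta = {n: s for n, s in fasta.items() if n not in drop}
    kept_names = {n: members for n, members in names.items() if n not in drop}
    return kept_fasta, kept_names


# Hypothetical inputs: read3 was flagged as chimeric.
fasta = {"read1": "ACGT", "read3": "ACGA"}
names = {"read1": ["read1", "read2"], "read3": ["read3"]}
kept_fasta, kept_names = remove_seqs(fasta, names, ["read3"])
```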

7. Classify sequences

The classify.seqs command allows the user to use several different methods to assign their sequences to the taxonomy outline of their choice.

  • Command line
classify.seqs(fasta=example.unique.good.filter.unique.precluster.pick.fasta,name=example.unique.good.filter.unique.precluster.pick.names,template=eztaxon_full.align,taxonomy=eztaxon_id_taxonomy.tax,method=knn,numwanted=1)
  • Option description
    fasta=path to the input fasta file
    name=The name file
    template=DB fasta file
    taxonomy=The taxonomy id file
    method=knn : k-Nearest Neighbor algorithm
    numwanted=1 : the number of nearest neighbors (k) used by the knn method; here only the single closest reference sequence is used
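
To illustrate what a nearest-neighbor classification with a single neighbor amounts to, the Python sketch below assigns a query the taxonomy of the reference that shares the most k-mers with it. This is a simplified model with an assumed similarity measure, not mothur's search; the function names, k-mer size, reference IDs and taxonomy strings are all hypothetical.

```python
def kmers(seq, k=4):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_nearest(query, references, taxonomy, k=4):
    """Return the taxonomy of the single reference sharing the most k-mers with query."""
    query_kmers = kmers(query, k)
    best_id = max(
        references,
        key=lambda ref_id: len(query_kmers & kmers(references[ref_id], k)),
    )
    return taxonomy[best_id]


# Hypothetical reference sequences and taxonomy strings.
references = {"ref1": "ACGTACGTAC", "ref2": "TTTTGGGGCC"}
taxonomy = {"ref1": "Bacteria;Firmicutes;", "ref2": "Bacteria;Proteobacteria;"}
assigned = classify_nearest("ACGTACGTTC", references, taxonomy)
```

Because only one neighbor is consulted, the query simply inherits that reference's full taxonomy string, which is what makes species-level assignment possible when the reference database is annotated to species.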

8. Remove non-bacteria sequences

The remove.lineage command reads a taxonomy file and one or more taxon names, and generates new files that contain only the sequences that do not belong to the given taxa.

  • Command line
remove.lineage(fasta=example.unique.good.filter.unique.precluster.pick.fasta,name=example.unique.good.filter.unique.precluster.pick.names,taxonomy=example.unique.good.filter.unique.precluster.pick..taxonomy,taxon=Mitochondria-Chloroplast-Archaea-Eukaryota-unknown)
  • Option description
    fasta=path to the input fasta file
    name=The name file
    taxonomy=The taxonomy id file
    taxon=The taxon parameter specifies the taxa you would like to remove, separated by ‘-’; it is required
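
The Python sketch below models this filtering step on a small in-memory taxonomy table. It uses plain case-sensitive substring matching as a simplifying assumption, which may not match mothur's matching rules exactly; the function name and the taxonomy strings are hypothetical.

```python
def remove_lineage(taxonomy, taxon):
    """Keep only sequences whose taxonomy string mentions none of the unwanted taxa.

    taxonomy: dict sequence name -> taxonomy string such as 'Bacteria;Firmicutes;'
    taxon:    unwanted taxa joined by '-', as in the command line above
    """
    unwanted = taxon.split("-")
    return {
        name: tax
        for name, tax in taxonomy.items()
        if not any(u in tax for u in unwanted)
    }


# Hypothetical classifications for illustration.
taxonomy = {
    "seq1": "Bacteria;Firmicutes;Bacilli;",
    "seq2": "Bacteria;Cyanobacteria;Chloroplast;",
    "seq3": "Archaea;Euryarchaeota;",
    "seq4": "unknown;",
}
kept = remove_lineage(taxonomy, "Mitochondria-Chloroplast-Archaea-Eukaryota-unknown")
```

The names that remain in kept are the ones that would also be retained in the fasta and name files.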

9. Calculate uncorrected pairwise distances

The dist.seqs command will calculate uncorrected pairwise distances between aligned DNA sequences.

  • Command line
dist.seqs(fasta=example.unique.good.filter.unique.precluster.pick.pick.fasta,cutoff=0.15)
  • Option description
    fasta=path to the input fasta file
    cutoff=If you know that you are not going to form OTUs with distances larger than 0.15, you can tell mothur to not save any distances larger than 0.15.
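
For reference, an uncorrected pairwise distance is the fraction of compared positions at which two aligned sequences differ. The Python sketch below computes it and applies a cutoff. As a simplifying assumption it skips any position where either sequence has a gap, which is not necessarily how mothur treats gaps; the function names and sequences are hypothetical.

```python
def uncorrected_distance(a, b):
    """Fraction of mismatching positions, ignoring columns with a gap in either sequence."""
    compared = mismatched = 0
    for x, y in zip(a, b):
        if x in "-." or y in "-.":
            continue
        compared += 1
        if x != y:
            mismatched += 1
    return mismatched / compared if compared else 1.0


def pairwise_distances(alignment, cutoff=0.15):
    """Return (name1, name2, distance) rows, keeping only distances <= cutoff."""
    names = sorted(alignment)
    rows = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            d = uncorrected_distance(alignment[first], alignment[second])
            if d <= cutoff:
                rows.append((first, second, d))
    return rows


# Hypothetical aligned sequences for illustration.
alignment = {"a": "ACGTACGTAC", "b": "ACGTACGTAA", "c": "TTTTACGTAC"}
rows = pairwise_distances(alignment, cutoff=0.15)
```

Only the a-b pair is kept here; the pairs involving c exceed the cutoff and are not stored, which is how the cutoff keeps the distance file small.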

10. Assign sequences to OTUs

Once a distance matrix gets read into mothur, the cluster command can be used to assign sequences to OTUs.

  • Command line
cluster(column=example.unique.good.filter.unique.precluster.pick.pick.dist,name=example.unique.good.filter.unique.precluster.pick.pick.names)
  • Option description
    column=path to the column-formatted distance matrix; when reading a column-formatted matrix you must also provide a file for the name option.
    name=The name file

11. Classify OTUs

The classify.otu command is used to get a consensus taxonomy for an OTU.

  • Command line
classify.otu(taxonomy=example.unique.good.filter.unique.precluster.pick..pick.taxonomy,list=example.unique.good.filter.unique.precluster.pick.pick.an.list,name=example.unique.good.filter.unique.precluster.pick.pick.names)
  • Option description
    taxonomy=taxonomy file
    list=The list file result from cluster
    name=The name file
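
A consensus taxonomy can be thought of as the deepest lineage that a majority of the sequences in an OTU agree on. The Python sketch below implements that idea level by level. The 51% threshold, the function name and the taxonomy strings are assumptions for illustration, and the output format is simpler than what mothur writes.

```python
from collections import Counter


def consensus_taxonomy(taxonomies, cutoff=51):
    """Deepest lineage supported by at least `cutoff` percent of the sequences.

    taxonomies: list of 'Rank1;Rank2;...;' strings for the sequences in one OTU.
    """
    lineages = [t.strip(";").split(";") for t in taxonomies]
    consensus = []
    for level in range(max(len(lineage) for lineage in lineages)):
        # Count full lineage prefixes so that agreement is checked along one path.
        prefixes = Counter(
            tuple(lineage[:level + 1]) for lineage in lineages if len(lineage) > level
        )
        prefix, count = prefixes.most_common(1)[0]
        if 100 * count / len(lineages) < cutoff:
            break  # not enough support at this level
        if list(prefix[:level]) != consensus:
            break  # best prefix leaves the path chosen so far
        consensus.append(prefix[level])
    return ";".join(consensus) + ";" if consensus else "unclassified;"


# Hypothetical taxonomy strings for the sequences in one OTU.
otu = [
    "Bacteria;Firmicutes;Bacilli;",
    "Bacteria;Firmicutes;Clostridia;",
    "Bacteria;Firmicutes;Bacilli;",
    "Bacteria;Proteobacteria;Gammaproteobacteria;",
]
otu_taxonomy = consensus_taxonomy(otu)
```

In this example three of four sequences agree down to Firmicutes, but only half agree on Bacilli, so the consensus stops at the phylum.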

Written by Jimmy Lim (Dec 2016); Edited by Mikael Hwang (Jan 2017)