How to use EzBioCloud 16S database with QIIME


QIIME is the most widely used open-source bioinformatics pipeline for microbiome analysis (http://qiime.org/). In this document, we provide a guide for using EzBioCloud’s 16S database with QIIME’s pipeline.

Unlike other public databases, EzBioCloud’s 16S database supports species-level identification of OTUs and is freely available for academic, not-for-profit purposes. To request access, please fill out our request form.

Version: 1.5
Update: 2017.01

0. Preparation

To begin microbiome analysis using EzBioCloud’s 16S database with QIIME, a few preparatory steps are required:

  • Install QIIME and USEARCH
  • Obtain the FASTQ file(s):
    • ensure that sequence barcodes have been removed
    • merge paired-end sequences and discard unpaired reads
    • ensure that sequences have been processed (quality filtering, length filtering, etc.)
  • Obtain EzBioCloud’s 16S database, which contains:
    • a FASTA file
    • a MAPPING file

1. Indexing sequences

All sequences in the FASTQ file must be indexed before the analysis can proceed. For example, all sequences in the FASTQ file for a sample named “A” must be indexed as “A_1”, “A_2”, …, “A_n”.

In this way, “[sample name]_[number]” should appear at the beginning of each sequence name. You can apply such indexing with other programs, but we recommend using QIIME’s “split_libraries_fastq.py” script.

  • Command line
    $ split_libraries_fastq.py -i example.fastq -o example_index --barcode_type='not-barcoded' --sample_ids=example --phred_offset=33 -q 0 -p 0.00001
  • Option description
    -i : the sequence read FASTQ files (comma-separated if more than one)
    -o : directory to store output files
    --barcode_type : Type of barcode used (this can be an integer)
    --sample_ids : comma-separated list of samples ids to be applied to all sequences (must be one per input file path)
    --phred_offset : the ASCII offset to use when decoding Phred scores
    -q : the maximum unacceptable Phred quality score
    -p : the minimum number of consecutive high-quality base calls to include a read, as a fraction of the input read length
  • Related QIIME site
    http://qiime.org/scripts/split_libraries_fastq.html
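To make the indexing convention concrete, here is a minimal Python sketch of the renaming that “split_libraries_fastq.py” performs (this is a toy illustration, not QIIME’s implementation; the function name `index_fastq` and the sample records are made up):

```python
# Rename every read in a FASTQ record list to "[sample name]_[number]",
# emitting FASTA-style output as the QIIME script does.
def index_fastq(lines, sample_id):
    """lines: flat list of FASTQ lines (4 per record); returns FASTA lines."""
    out = []
    for n, i in enumerate(range(0, len(lines), 4), start=1):
        seq = lines[i + 1].strip()          # second line of each record is the sequence
        out.append(">%s_%d" % (sample_id, n))
        out.append(seq)
    return out

records = [
    "@readA", "ACGTACGT", "+", "IIIIIIII",
    "@readB", "TTGGCCAA", "+", "IIIIIIII",
]
print("\n".join(index_fastq(records, "example")))
```

After indexing, the two reads are named “example_1” and “example_2”, regardless of their original FASTQ identifiers.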

2. Identifying and filtering out chimeric sequences

This step is to detect and filter out chimeric sequences from the indexed FASTQ file using a chimera detection program such as usearch61.

  • Command line
    $ identify_chimeric_seqs.py -m usearch61 -i example_index/seqs.fna --suppress_usearch61_ref -o chimera
    $ filter_fasta.py -f example_index/seqs.fna -s chimera/chimeras.txt -n -o example.non_chimera.fasta
  • Option description
    • identify_chimeric_seqs.py
       -i : path to the input fasta file
       -m : chimera detection method. Choices: blast_fragments or ChimeraSlayer or usearch61
       --suppress_usearch61_ref : use to suppress reference based chimera detection with usearch61
       -o : path to store output, output filepath in the case of blast_fragments and ChimeraSlayer, or directory in case of usearch61
    • filter_fasta.py
       -f : path to the input fasta file
       -s : a list of sequence identifiers (or tab-delimited lines with a seq identifier in the first field)
       -n : discard passed seq ids rather than keep passed seq ids
       -o : The output fasta filepath
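Conceptually, the second command keeps every record whose identifier is not listed in “chimeras.txt”. A minimal sketch of that filtering step, assuming simple two-line FASTA records (the names `filter_fasta` and the example IDs are illustrative, not QIIME code):

```python
# Drop every FASTA record whose ID appears in a set of chimeric IDs,
# mirroring what filter_fasta.py does with the -n flag.
def filter_fasta(fasta_lines, bad_ids):
    """fasta_lines: flat list alternating header/sequence; returns kept pairs."""
    kept = []
    for i in range(0, len(fasta_lines), 2):
        header, seq = fasta_lines[i], fasta_lines[i + 1]
        seq_id = header[1:].split()[0]      # ">example_3 extra" -> "example_3"
        if seq_id not in bad_ids:
            kept.append((header, seq))
    return kept

fasta = [">example_1", "ACGT", ">example_2", "TTAA", ">example_3", "GGCC"]
chimeras = {"example_2"}                    # e.g. IDs read from chimeras.txt
print(filter_fasta(fasta, chimeras))
```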

3. Perform open-reference OTU picking process

QIIME’s tutorial describes an open-reference OTU picking process as:
“In an open-reference OTU picking process, reads are clustered against a reference sequence collection and any reads which do not hit the reference sequence collection are subsequently clustered de novo.”

In this guide, we’ll use EzBioCloud’s 16S database as the reference for the open-reference protocol. However, because EzBioCloud’s 16S database is not QIIME’s default (QIIME defaults to the Greengenes database), we must manually set the path to EzBioCloud’s taxonomy ID file and re-map the analysis results, overwriting the default Greengenes IDs. To do so:

  • Command line
    $ pick_open_reference_otus.py -i example.non_chimera.fasta -r db_files/eztaxon_qiime_full.fasta -o results -a -O 4
  • Option description

    -i : the input sequences filepath or comma-separated list of filepaths
    -r : the reference sequences
    -o : the output directory
    -a : run in parallel where available
    -O : number of jobs to start
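The quoted definition above can be illustrated with a toy Python sketch (this is not USEARCH or QIIME’s clustering; the sequences, the `difflib` similarity measure, and the 97% threshold are all illustrative assumptions):

```python
# Toy open-reference OTU picking: reads within the identity threshold of a
# reference sequence join that reference OTU; the remainder are clustered
# de novo among themselves.
import difflib

def identity(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def pick_open_reference(reads, refs, threshold=0.97):
    otus = {}                               # OTU id -> list of read ids
    unmatched = []
    for rid, seq in reads:
        hit = next((ref_id for ref_id, ref_seq in refs
                    if identity(seq, ref_seq) >= threshold), None)
        if hit:
            otus.setdefault(hit, []).append(rid)
        else:
            unmatched.append((rid, seq))
    # naive de novo step: the first unmatched read seeds a new OTU,
    # later reads join a seed if they are similar enough to it
    denovo = []
    for rid, seq in unmatched:
        for seed_id, seed_seq, members in denovo:
            if identity(seq, seed_seq) >= threshold:
                members.append(rid)
                break
        else:
            denovo.append((rid, seq, [rid]))
    for seed_id, _seq, members in denovo:
        otus["denovo_" + seed_id] = members
    return otus

reads = [("r1", "ACGTACGTACGT"), ("r2", "ACGTACGTACGT"), ("r3", "TTTTGGGGCCCC")]
refs = [("ref1", "ACGTACGTACGT")]
print(pick_open_reference(reads, refs))
```

Here “r1” and “r2” hit the reference and join OTU “ref1”, while “r3” hits nothing and seeds a de novo OTU.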

4. Overwriting OTU table

Because of the issue described in section 3, “parallel_assign_taxonomy_uclust.py” must be run again with EzBioCloud’s 16S database set as the reference. Then, you should overwrite the previously generated BIOM table.

In this process, we also describe the options that enable species-level identification of OTUs (“--min_consensus_fraction” and “--uclust_max_accepts” in the option description below). Once you add these options to the “parallel_assign_taxonomy_uclust.py” command line, you can obtain species-level identification results for OTUs.

  • Command line

    $ parallel_assign_taxonomy_uclust.py -i results/rep_set.fna -r db_files/eztaxon_qiime_full.fasta -t db_files/eztaxon_id_taxonomy.txt -o results/uclust_assigned_taxonomy -T -O 4
    $ biom add-metadata -i results/otu_table_mc2.biom --observation-metadata-fp results/uclust_assigned_taxonomy/rep_set_tax_assignments.txt  -o results/otu_table_mc2_w_tax.biom  --sc-separated taxonomy   --observation-header OTUID,taxonomy
  • Option description
    • parallel_assign_taxonomy_uclust.py
      -i : full path to fasta file containing query sequences
      -r : ref seqs to search against
      -t : full path to id_to_taxonomy mapping file
      -o : path to store output files
      -T : poll directly for job completion rather than running poller as a separate job
      -O : Number of jobs to start
      --min_consensus_fraction=1 : Minimum fraction of database hits that must have a specific taxonomic assignment to assign that taxonomy to a query [default: 0.51]
      --uclust_max_accepts=1 : number of database hits to consider when making an assignment [default: 3]
    • biom add-metadata
      -i : the input BIOM table
      -o : the output BIOM table
      --observation-metadata-fp : the observation metadata mapping file
      --sc-separated : comma-separated list of the metadata fields whose values should be split on semicolons (e.g. taxonomy strings)
      --observation-header : comma-separated list of the observation metadata field names
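The effect of “--min_consensus_fraction” can be sketched as a consensus rule over the taxonomies of the database hits: each rank is kept only while the agreeing fraction of hits meets the threshold (a conceptual illustration only, not uclust’s implementation; the function name and example taxonomies are made up). With “--uclust_max_accepts=1” and “--min_consensus_fraction=1”, a single best hit decides every rank, which is what permits species-level calls.

```python
# Keep ranks from the root downward while enough database hits agree.
def consensus_taxonomy(hit_taxonomies, min_fraction=0.51):
    """hit_taxonomies: list of rank lists, e.g. [["Bacteria", "Firmicutes"], ...]."""
    consensus = []
    depth = min(len(t) for t in hit_taxonomies)
    for rank in range(depth):
        names = [t[rank] for t in hit_taxonomies]
        top = max(set(names), key=names.count)
        if names.count(top) / len(names) >= min_fraction:
            consensus.append(top)
        else:
            break                           # disagreement: stop at this rank
    return consensus

hits = [
    ["Bacteria", "Firmicutes", "Bacilli"],
    ["Bacteria", "Firmicutes", "Clostridia"],
    ["Bacteria", "Firmicutes", "Bacilli"],
]
print(consensus_taxonomy(hits))             # majority carries the third rank
print(consensus_taxonomy(hits, min_fraction=1))  # strict consensus stops earlier
```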

Edited by Mikael Hwang (Jan 2017)