[UBCG] User’s manual

What is the UBCG?

UBCG stands for up-to-date bacterial core gene. UBCG is a method and software tool for inferring phylogenetic relationships using a bacterial core gene set that is defined from an up-to-date bacterial genome database.

This document is for version 1.  If you have an older version, please download and install the latest version.

How to cite the UBCG pipeline

If you use this tool, please cite the following:

Na, S. I., Kim, Y. O., Yoon, S. H., Ha, S. M., Baek, I. & Chun, J. (2018). UBCG: Up-to-date bacterial core gene set and pipeline for phylogenomic tree reconstruction. J Microbiol 56, (in press).

Gene set used in the UBCG pipeline

The most widely employed method for phylogenetic tree reconstruction from genome sequences relies on a core gene set. A core gene set can be defined as:

  • Genes that are present in the majority of genomes, if not all
  • Genes that are present in a single copy (orthologous but not paralogous)

The number of genes in a core gene set can vary depending on the scope of the target taxon. If you generate a phylogenetic tree for a single species, the core gene set may consist of up to thousands of genes. However, to cover any taxon in the domain Bacteria, the core gene set should be restricted to highly conserved genes (Bacterial Core Gene [BCG]).

Because the number and taxonomic coverage of complete genome sequences in public databases have been limited, the number of genes in the BCG set has varied over time.

Here, we compiled the latest bacterial core gene set, named UBCG, using the largest dataset to date (1,429 complete genome sequences, one genome per species, covering 28 phyla). The current UBCG set consists of 92 genes whose details are given here.

Concept of the UBCG pipeline

We designed the pipeline so that users can handle hundreds, if not thousands, of genomes. The concept behind our design is detailed here to help you understand the pipeline and get the most out of it.

  • All UBCG sequences extracted from each genome sequence are stored in a single file (*.bcg). This file also contains a label with full information about the strain (e.g. Escherichia coli K12 MG1655) and other details (e.g. database accession). Once a bcg file is generated, it can be used for different analyses. This allows users to change the labels in the phylogenetic trees.
  • A run is carried out using a set of bcg files of the user’s choice. The selected bcg files are saved in a single directory; the UBCG pipeline then aligns each of the core genes, concatenates them, filters aligned positions, and calculates phylogenetic trees and gene support indices (GSIs).
  • To run the pipeline on another set of bcg files, store the desired bcg files in the bcg directory and re-run the pipeline. In other words, the set of bcg files to be analyzed together is controlled by the content of the directory holding the bcg files.
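
Because a run is defined by the content of a directory, selecting genomes for a run amounts to copying files. The following is a minimal sketch of that idea and is not part of UBCG; the helper name stage_bcg_files and all directory and file names in the usage note are hypothetical.

```python
import shutil
from pathlib import Path


def stage_bcg_files(source_dir, run_dir, wanted_names):
    """Copy the selected *.bcg files from source_dir into run_dir.

    wanted_names holds file names without the .bcg extension.
    Returns the sorted list of names for which no bcg file was found.
    """
    source = Path(source_dir)
    target = Path(run_dir)
    target.mkdir(parents=True, exist_ok=True)

    missing = []
    for name in wanted_names:
        bcg_file = source / f"{name}.bcg"
        if bcg_file.is_file():
            # copy2 keeps timestamps; the original bcg file stays in place
            shutil.copy2(bcg_file, target / bcg_file.name)
        else:
            missing.append(name)
    return sorted(missing)
```

For example, stage_bcg_files("all_bcg", "run1_bcg", ["genomeA", "genomeB"]) would fill a fresh directory that can then be given to the pipeline as the directory of bcg files.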

Installation

  • The latest version is available at https://www.ezbiocloud.net/tools/ubcg.
  • It has been tested on Linux and Mac OS X 10 or higher. MS Windows is not supported due to the external programs used.
  • Unzip the UBCG.zip file in the desired directory.

File formats used in UBCG tool

  • *.bcg: Files with the *.bcg extension are in JSON format and contain all extracted UBCG gene sequences with metadata (data about data). The file is plain text and readable by any text editor, so you can extract sequence information and edit metadata if necessary. The bcg format is designed to hold all necessary information about the genome and strain.
  • *.fasta: FASTA is a standard format for holding genome sequences. In the UBCG tool, all fasta files containing genome sequences should be converted to bcg files before generating multiple alignments and inferring phylogenetic trees. The results of multiple alignments are also written as fasta format files.
  • *.nwk: Newick is a standard format for phylogenetic trees.
  • *.trm: A JSON format file containing Newick-format trees and metadata of individual core gene trees and a UBCG tree.
  • *.log: A log file is a text format file that contains detailed information about the pipeline run.

Typical structure of directories

  • The program’s root directory should contain the UBCG.jar file and the programPath file, which contains the locations of the external software tools.
  • The “fasta” directory contains the FASTA format files holding genome/contig sequences.
  • The “bcg” directory contains JSON format files holding UBCG gene sequences with metadata.
  • The “output” directory contains all output files generated by the UBCG tool. Within output, the results of each run are stored in a separate directory.
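
The layout above can be checked before a run. This is a small convenience sketch, not a UBCG feature: the function name missing_layout_entries is hypothetical, and UBCG may well create some of these directories itself.

```python
from pathlib import Path

# Names taken from the directory layout described above.
REQUIRED_FILES = ("UBCG.jar", "programPath")
REQUIRED_DIRS = ("fasta", "bcg", "output")


def missing_layout_entries(root):
    """Return the expected files and directories that are absent under root.

    Directories are reported with a trailing slash.
    """
    root = Path(root)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (root / name).is_dir()]
    return missing
```

An empty list means that everything listed above was found under the given root directory.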

Installing external programs

  • The following programs should be installed. The locations of programs should be written in “programPath” file.
    • PRODIGAL
    • HMMER3
    • FastTree
  • You may also install and use other tools for phylogenetic inferences. Since we provide multiple alignment files, any program can be used to draw phylogenetic trees.
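
To find the locations to write into the programPath file, you can ask the system where each program is installed. The sketch below only looks the programs up on PATH; the executable names used here (prodigal, hmmsearch, FastTree) are assumptions that may differ on your system, and the format of the programPath file itself is not covered.

```python
import shutil

# Executable names are assumptions; adjust them to your installation.
DEFAULT_TOOLS = ("prodigal", "hmmsearch", "FastTree")


def locate_tools(names=DEFAULT_TOOLS):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}
```

Any entry that maps to None is either not installed or not on PATH, in which case its full location has to be found by hand.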

Running UBCG pipeline

Step 1: Converting contigs (fasta) to bcg files

  • Command: java -jar UBCG.jar extract
  • This command converts a fasta file to a bcg file using the prodigal and hmmsearch tools.
  • You are required to designate the following parameters:
    • -i                 : name of the input fasta file containing contigs.
    • -bcg_dir    : directory for all bcg files. The name of the bcg file will be the same as that of the fasta file.
    • -label          : full label of the strain/genome. It should be enclosed in single quotes, e.g. 'Escherichia coli O157 876'
  • The following are optional but useful metadata:
    • -taxon   : name of species
    • -strain   : name of the strain
    • -type      : add this if a strain is the type strain of species or subspecies
    • -acc        : accession of a genome sequence. Usually, NCBI’s assembly accession is used for public domain data.
    • -uid        : a unique integer id. If you do not designate one, it will be generated automatically. Omit this option if you are not sure.
  • For a test run
    • Please download fasta files from here and uncompress in “fasta” directory. This file contains genomes in the order Corynebacteriales.
    • You need to convert each fasta to a bcg file using UBCG program. A text file containing the necessary commands is given here for your download.
    • Go to “bcg” directory where you should be able to find 6 *.bcg files that contain UBCG gene sequences with metadata.
  • A bcg file contains all UBCG gene sequences and necessary information. The goal of the next step is to generate a multiple-alignment from multiple bcg files.
  • The content of bcg files (for example, gene sequences) can be viewed (as CSV format that is readable by Microsoft Excel or Google spreadsheet) by using the following command:
    • java -jar UBCG.jar view -i  <a bcg file name>
    • java -jar UBCG.jar view -d <directory containing bcg files>
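
When many genomes have to be converted, the extract commands of Step 1 can be generated rather than typed. The following sketch only builds argument lists from the parameters documented above (-i, -bcg_dir, -label, -taxon, -strain, -acc); the function names and the idea of a dictionary mapping fasta file names to labels are hypothetical, and nothing is executed.

```python
from pathlib import Path


def build_extract_command(fasta_file, bcg_dir, label,
                          taxon=None, strain=None, acc=None, jar="UBCG.jar"):
    """Return the argument list for one 'java -jar UBCG.jar extract' run."""
    cmd = ["java", "-jar", jar, "extract",
           "-i", str(fasta_file),
           "-bcg_dir", str(bcg_dir),
           "-label", label]
    # Optional metadata is appended only when a value was given
    for flag, value in (("-taxon", taxon), ("-strain", strain), ("-acc", acc)):
        if value is not None:
            cmd += [flag, value]
    return cmd


def build_extract_commands(fasta_dir, bcg_dir, labels):
    """Build one command per *.fasta file in fasta_dir.

    labels maps a fasta file name to its full label; files without a
    label are skipped.
    """
    commands = []
    for fasta_file in sorted(Path(fasta_dir).glob("*.fasta")):
        label = labels.get(fasta_file.name)
        if label is not None:
            commands.append(build_extract_command(fasta_file, bcg_dir, label))
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True). Because the label is a separate list element, it reaches the program as one argument without any shell quoting.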

Step 2: Generating multiple alignments from bcg files

  • Place all bcg files that you want to include in the analysis into a single directory by copying desired bcg files.
  • Command: java -jar UBCG.jar align
  • You are required to designate the following parameters:
    • -bcg_dir    : directory for bcg files that you want to include in the alignment.
  • Optional parameters:
    • -out_dir    : directory where all output files will be saved
    • -a <string>:  alignment method (default : codon).
      • nt             : nucleotide sequence alignment
      • aa             : amino acid sequence alignment
      • codon      : codon-based alignment (output is nucleotide sequences, but alignment is carried out using amino acid sequences).
      • codon12  : same as “codon” option but only 1st and 2nd nucleotides of a codon are selected. The 3rd position is usually of high variability.
    • -t <integer>      : number of threads to be used (default : 1)
    • -f <integer>      : set a filtering cutoff for gap-containing positions (default: 50)
      • Enter 0~100
      • 0 to select all alignment positions
      • 100 to select positions that are present in all genomes
      • 50 to select positions that are present in half of the genomes
    • -prefix <string>: a prefix to be added to the names of all output files so that different runs can be told apart. If you do not designate one, it will be generated automatically.
      • e.g. john_115, mycoplasma_1
    • -gsi_threshold: Threshold for Gene Support Index (GSI). 95 means 95%. (default = 95)
    • -raxml : Use RAxML for phylogeny reconstruction (Default: FastTree)
    • -zZ : Make zZ-formatted files. This additionally creates fasta/nwk files with zZ+uid+zZ format for the names of each genome
  • Examples of typical runs
    • java -jar UBCG.jar align -bcg_dir bcg -prefix mytest1   (aligns and draws trees with the bcg files in the “bcg” directory and saves all results in the “output/mytest1” directory)
  • Output files are generated in the output directory (by default) or in the directory that you designated. nwk files can be viewed with MEGA, FigTree and other tree viewers. MEGA has been tested for displaying the Gene Support Index (GSI) on the branches of phylogenetic trees.
  • If the prefix is mytest1, the output files are named as follows:
    • mytest1.log = a text file containing logs (what happened during execution of program)
    • mytest1.UBCG_concat.codon.label.nwk = A Newick file based on UBCG gene set, codon alignment, 50% filtered, labeled with full label
    • mytest1.UBCG_concat.codon.zZ.nwk = A Newick file based on UBCG gene set, codon alignment, 50% filtered, labeled with zZ+Unique id+zZ
    • mytest1.UBCG_concat_gsi.codon.label.nwk = A Newick file based on UBCG + Gene Support Index (GSI) values with full label
    • mytest1.UBCG_concat_gsi.codon.zZ.nwk = A Newick file based on UBCG + Gene Support Index (GSI) values with zZ+Unique id+zZ
    • mytest1.concat.codon.50.label.fasta = A FASTA file containing multiple alignments of UBCG genes, codon aligned, 50% filtered with full label
    • mytest1.concat.codon.50.zZ.fasta = A FASTA file containing multiple alignments of UBCG genes, codon aligned, 50% filtered with zZ+Unique id+zZ
    • mytest1.secY.codon.50.label.nwk = A newick file based on a single gene (secY), codon aligned, 50% filtered with full label
    • mytest1.secY.codon.50.zZ.nwk = A newick file based on a single gene (secY), codon aligned, 50% filtered with zZ+Unique id+zZ
    • mytest1.align.secY.codon.50.label.fasta = A FASTA file containing multiple alignment of a single gene (secY), codon aligned, 50% filtered with full label
    • mytest1.align.secY.codon.50.zZ.fasta = A FASTA file containing multiple alignment of a single gene (secY), codon aligned, 50% filtered with zZ+Unique id+zZ
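
The “50% filtered” in the file descriptions refers to the -f cutoff for gap-containing positions. The sketch below illustrates one reading of that cutoff, assuming it is the minimum percentage of genomes that must have a non-gap character at an alignment position for the position to be kept. It is an illustration only, not UBCG’s own code; the function name filter_gap_columns is hypothetical, and the actual implementation may treat boundary cases differently.

```python
def filter_gap_columns(alignment, cutoff=50, gap_chars="-"):
    """Keep alignment columns whose share of non-gap characters is >= cutoff.

    alignment maps a genome name to its aligned sequence (all equal length).
    cutoff is a percentage from 0 to 100, as for the -f option.
    """
    if not 0 <= cutoff <= 100:
        raise ValueError("cutoff must be between 0 and 100")
    sequences = list(alignment.values())
    if not sequences:
        return {}
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for col in range(length):
        present = sum(1 for seq in sequences if seq[col] not in gap_chars)
        # Integer comparison avoids floating-point rounding at the boundary
        if present * 100 >= cutoff * len(sequences):
            keep.append(col)
    return {name: "".join(seq[c] for c in keep) for name, seq in alignment.items()}
```

Under this reading, a cutoff of 0 keeps every position, including those that are gaps in all genomes, and a cutoff of 100 keeps only positions without any gap.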

Frequently asked questions

  • How can I access the UBCG gene sequences?

    • All UBCG gene sequences, as both nucleotide and amino acid sequences, are stored in a bcg file, which is in JSON format. Any JSON viewer can be used to explore the content of bcg files. For example, https://codebeautify.org/jsonviewer can be used. Alternatively, a Chrome extension can be installed from here, if you use the Chrome web browser.
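
A bcg file can also be explored with a few lines of code instead of a viewer. The sketch below prints an outline of any JSON file using Python’s standard json module. It makes no assumption about the field names inside a bcg file, since they are not described in this manual; the function name summarize_json and the file name in the usage note are hypothetical.

```python
import json


def summarize_json(path, max_depth=2):
    """Return lines outlining the nested structure of a JSON file."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)

    lines = []

    def walk(node, name, depth):
        indent = "  " * depth
        if isinstance(node, dict):
            lines.append(f"{indent}{name}: object, size {len(node)}")
            if depth < max_depth:
                for key, value in node.items():
                    walk(value, key, depth + 1)
        elif isinstance(node, list):
            lines.append(f"{indent}{name}: array, length {len(node)}")
            # Only the first element is outlined, as a representative
            if node and depth < max_depth:
                walk(node[0], "[0]", depth + 1)
        else:
            lines.append(f"{indent}{name}: {type(node).__name__}")

    walk(data, "root", 0)
    return lines
```

For example, print("\n".join(summarize_json("bcg/my_genome.bcg"))) lists the top levels of the file, after which individual entries can be read from the loaded data.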

 

Last updated Jan 20, 2018, by JC