What is a chimera?
According to Greek mythology, the Chimera is a monstrous fire-breathing hybrid creature of Lycia in Asia Minor, composed of parts from more than one animal. Here, we define a chimera as an artifactual PCR product (amplicon) generated erroneously from more than one DNA template. Chimeras are practically unavoidable when amplicon sequencing libraries are prepared for NGS, so it is important to detect and filter them out before any microbiome analysis.
Mechanism of chimera formation
PCR involves multiple cycles of (i) heat denaturation of the DNA templates to generate single-stranded templates, (ii) annealing of the primers to each template and (iii) extension/elongation by DNA polymerase. The major cause of chimera formation is an aborted extension product from an earlier PCR cycle, which can act as a primer in a subsequent cycle. If this aborted extension product anneals to a different (heterologous) template and primes DNA synthesis from it, a chimeric PCR product is formed (see the figure below).
The proportion of chimeras in a PCR reaction varies depending on the DNA polymerase used, the PCR conditions, the product size and the diversity of the DNA templates. Haas et al. (2011) reported that 15 to 20% of reads were detected as chimeras in 454 sequencing of 16S amplicons.
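The mechanism can be illustrated with a toy simulation. This is only a sketch: the two strain sequences are made up, the function names are hypothetical, and strand orientation is ignored by treating each template as a single string of equal, pre-aligned length.

```python
def aborted_extension(template, stop):
    """Return the incomplete strand produced when extension stops early."""
    return template[:stop]


def extend_on_template(fragment, template):
    """Let an aborted fragment act as a primer on a template.

    The fragment is kept unchanged and the rest of the product is copied
    from the template, starting where the fragment ends. This assumes the
    fragment and template are aligned and homologous.
    """
    return fragment + template[len(fragment):]


# Two toy strains with conserved ends and a variable middle (made-up data).
strain_a = "ACGTACGTAAAAAAAACCGGTTAA"
strain_b = "ACGTACGTGGGGGGGGCCGGTTAA"

# Cycle n: extension on strain A aborts after 12 bases.
fragment = aborted_extension(strain_a, stop=12)

# Cycle n+1: the aborted product anneals to strain B and is extended.
chimera = extend_on_template(fragment, strain_b)

print(chimera)  # ACGTACGTAAAAGGGGCCGGTTAA
```

The product carries the left part of strain A and the right part of strain B, and it is identical to neither template.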
How to detect chimeras
There are two major approaches to detecting chimeras in NGS-based amplicon data.
(1) Reference-dependent detection: As shown in the above example, the two ends of the chimeric PCR product match strains A and B, respectively. Taken as a whole, however, the sequence does not match either strain A or strain B with high similarity. If we know the exact sequences of strains A and B, and the two differ substantially, we can infer that this product did not come from a single strain but from both. Using this principle, a large number of NGS reads can be screened for chimeras against a well-established, trusted, non-chimeric reference database. The quality of this chimera-free reference database is the key to success. Both UCHIME and ChimeraSlayer implement this approach.
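The principle behind reference-dependent detection can be sketched in a few lines. This is not the UCHIME or ChimeraSlayer algorithm: it is a simplified illustration that assumes pre-aligned sequences of equal length, a fixed breakpoint at the midpoint, and a hypothetical `min_gain` threshold. All sequences and function names are invented for the example.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_hit(query, references, start, end):
    """Return (name, identity) of the closest reference over [start, end)."""
    hits = [(name, identity(query[start:end], ref[start:end]))
            for name, ref in references.items()]
    return max(hits, key=lambda hit: hit[1])


def looks_chimeric(query, references, min_gain=0.05):
    """Flag a read whose halves fit two different references better than
    the whole read fits any single reference.

    min_gain is a hypothetical threshold chosen for this toy example.
    """
    mid = len(query) // 2
    whole = best_hit(query, references, 0, len(query))
    left = best_hit(query, references, 0, mid)
    right = best_hit(query, references, mid, len(query))
    # Length-weighted identity of the best two-parent model.
    two_parent = (left[1] * mid + right[1] * (len(query) - mid)) / len(query)
    return left[0] != right[0] and (two_parent - whole[1]) >= min_gain


# Hypothetical, pre-aligned reference sequences for strains A and B.
references = {
    "strain_A": "ACGTACGTAAAAAAAACCGGTTAA",
    "strain_B": "ACGTACGTGGGGGGGGCCGGTTAA",
}
chimeric_read = "ACGTACGTAAAAGGGGCCGGTTAA"  # left from A, right from B
genuine_read = references["strain_A"]

print(looks_chimeric(chimeric_read, references))  # True
print(looks_chimeric(genuine_read, references))   # False
```

Real tools search for the breakpoint and score the alignment rather than cutting at the midpoint, but the decision rests on the same comparison between a one-parent and a two-parent explanation.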
(2) De novo detection: In this approach, a chimera-free reference database is built automatically for each NGS dataset. Initially, the reference database is empty. The NGS reads are then considered in order of decreasing abundance. If a sequence is classified as chimeric, it is discarded; otherwise, it is added to the reference database, which therefore grows as the reads are processed. Candidate parents (the PCR templates, strains A and B in the previous figure) are required to be more abundant than the query sequence, on the assumption that a chimera has undergone fewer rounds of amplification and will therefore be less abundant than its parents (Edgar et al., 2011). UCHIME implements this approach.
Because a large number of full-length 16S sequences are available, reference-dependent detection has been used in most recent studies, particularly for the human microbiome. UCHIME, as implemented in the QIIME and MOTHUR packages, is the most widely used tool and is widely cited.
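The de novo procedure described above can be sketched as a loop over sequences sorted by abundance. This is an illustration, not UCHIME itself: the chimera test is a naive exact two-parent check, `min_skew` is a hypothetical parameter (1.0 means parents only need to be more abundant than the query, as stated in the text), and the read counts are invented.

```python
def is_two_parent_chimera(query, parents):
    """Naive check: is the query exactly the left part of one parent
    joined to the right part of another, at any breakpoint?"""
    for a in parents:
        for b in parents:
            if a == b or len(a) != len(query) or len(b) != len(query):
                continue
            for cut in range(1, len(query)):
                if query == a[:cut] + b[cut:] and query not in (a, b):
                    return True
    return False


def denovo_filter(abundances, is_chimera, min_skew=1.0):
    """Build a reference set from the data, most abundant sequence first.

    abundances: dict mapping sequence -> read count.
    Returns (reference, chimeras).
    """
    reference = {}  # starts empty and grows as sequences pass the check
    chimeras = []
    ordered = sorted(abundances.items(), key=lambda kv: kv[1], reverse=True)
    for seq, count in ordered:
        # Only sequences more abundant than the query may act as parents.
        parents = [s for s, c in reference.items() if c > min_skew * count]
        if len(parents) >= 2 and is_chimera(seq, parents):
            chimeras.append(seq)
        else:
            reference[seq] = count
    return reference, chimeras


# Made-up read counts: two genuine strains and one rare chimera.
counts = {
    "ACGTACGTAAAAAAAACCGGTTAA": 1000,  # strain A
    "ACGTACGTGGGGGGGGCCGGTTAA": 800,   # strain B
    "ACGTACGTAAAAGGGGCCGGTTAA": 15,    # A-left + B-right
}
reference, chimeras = denovo_filter(counts, is_two_parent_chimera)
print(len(reference), chimeras)  # 2 ['ACGTACGTAAAAGGGGCCGGTTAA']
```

Because the two abundant strains are processed first, they are already in the reference set when the rare chimeric sequence is examined.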
Example of a chimera
The following sequence from a human skin microbiome sample was generated by a Roche 454 instrument.
This sequence is identified as a chimera by the UCHIME algorithm because:
- The left part of the sequence matches Staphylococcus epidermidis with 99.7% similarity (Firmicutes; Bacilli; Bacillales; Staphylococcaceae; Staphylococcus)
- The right part of the sequence matches Propionibacterium acnes with 100% similarity (Actinobacteria; Actinobacteria; Propionibacteriales; Propionibacteriaceae; Propionibacterium)
Please note that the two parents belong to different phyla (Firmicutes and Actinobacteria). You can try this yourself using EzBioCloud’s [Identify] service at https://www.ezbiocloud.net/identify: copy the left and right halves of the above sequence and submit them separately to identify each part of this 454 sequence.
Chimera detection in BIOiPLUG
BIOiPLUG uses UCHIME with a manually curated chimera-free reference database. This database contains:
- Sequences from pure cultures.
- Full-length sequences of uncultured organisms that are confirmed to be genuine. This includes >2,000 sequences generated by PacBio CCS technology that were recovered repeatedly, either in replicate PCR reactions of the same sample or from different samples. Chimera formation is thought to be random, so repeated recovery of the same sequence from different PCR reactions is a fair indication that the read is not chimeric. The quality of the sequences in the chimera-free reference database was further checked manually using secondary structure modeling of 16S rRNA molecules.
The following figure illustrates the workflow for chimera detection in the BIOiPLUG pipeline. The query NGS reads are first subjected to taxonomic assignment against the EzBioCloud 16S database. If a read matches a reference sequence with >97% similarity, it is assigned to that species and is not labeled as a chimera, because the EzBioCloud 16S database itself passes a rigorous quality control process that includes chimera detection. The remaining reads are checked with UCHIME. Because our chimera-free reference database has high coverage of human and mouse microbiomes, we believe that few chimeras escape this process, particularly for human and mouse microbiome samples.
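The two-stage workflow can be summarized as a short routine. This is a sketch of the logic described above, not the BIOiPLUG implementation: `assign_species` and `uchime_is_chimera` are hypothetical callables standing in for the EzBioCloud taxonomic assignment and the UCHIME check, the label for reads that pass UCHIME without a species assignment is an assumption, and the hit values are invented.

```python
def screen_reads(reads, assign_species, uchime_is_chimera, threshold=0.97):
    """Classify reads using taxonomic assignment first, then a chimera check.

    assign_species(seq) -> (species_name, similarity between 0 and 1)
    uchime_is_chimera(seq) -> bool
    """
    results = {}
    for read_id, seq in reads.items():
        species, similarity = assign_species(seq)
        if similarity > threshold:
            # Close match to a curated reference: assigned, not a chimera.
            results[read_id] = ("assigned", species)
        elif uchime_is_chimera(seq):
            results[read_id] = ("chimera", None)
        else:
            # Assumed label for reads that pass the chimera check.
            results[read_id] = ("non_chimeric_unassigned", None)
    return results


# Stand-in lookups with made-up values, for illustration only.
fake_hits = {
    "SEQ1": ("Staphylococcus epidermidis", 0.997),
    "SEQ2": ("Staphylococcus epidermidis", 0.91),
    "SEQ3": ("Propionibacterium acnes", 0.93),
}
fake_chimeras = {"SEQ2"}

results = screen_reads(
    {"r1": "SEQ1", "r2": "SEQ2", "r3": "SEQ3"},
    assign_species=lambda seq: fake_hits[seq],
    uchime_is_chimera=lambda seq: seq in fake_chimeras,
)
for read_id, outcome in results.items():
    print(read_id, outcome)
```

Only reads that fail the similarity threshold reach the chimera check, which is why the coverage of the reference database matters so much for this design.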
Notes on chimera detection
- Chimera detection is a very important step in microbiome analysis because unchecked chimeras will be counted as novel species. Together with erroneous sequences, chimeras falsely increase the number of species/OTUs detected, which in turn leads to overestimated alpha-diversity indices.
- There is no way to detect all chimeras. However, the effectiveness of chimera removal can be greatly improved by the quality and coverage of the chimera-free reference database.
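The effect of undetected chimeras on alpha diversity can be shown with a small calculation. The OTU tables below are invented for illustration, and observed richness and the Shannon index (natural logarithm) are used as two common examples of alpha-diversity measures.

```python
import math


def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts.values() if c > 0)


def shannon(counts):
    """Shannon diversity index using the natural logarithm."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)


# Made-up OTU table: two genuine OTUs of equal abundance.
genuine = {"OTU_A": 500, "OTU_B": 500}

# Same sample with three undetected chimeras counted as extra OTUs.
with_chimeras = dict(genuine, chimera_1=10, chimera_2=10, chimera_3=10)

print(richness(genuine), richness(with_chimeras))
print(round(shannon(genuine), 3), round(shannon(with_chimeras), 3))
```

In this toy table, the chimeric OTUs raise both the observed richness and the Shannon index even though no additional organism is present in the sample.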
The BIOiPLUG team / Last edited on Feb. 19, 2018