Parathaa (Preserving and Assimilating Region-specific Ambiguities in Taxonomic Hierarchical Assignments for Amplicons) is a tool for the taxonomic assignment of amplicon sequences that takes into account the uncertainty associated with using specific variable regions/primers. Parathaa does this by generating new primer-trimmed phylogenetic trees from reference amplicon sequence datasets and then determining the optimal phylogenetic distances within each tree for taxonomic labeling. Parathaa can then use this tree to assign taxonomy to query amplicon sequences by aligning those sequences and placing them into the new primer-trimmed reference database.
anpan
Genetic and genomic variation among microbial strains can dramatically influence their phenotypes and environmental impact, including on human health. However, inferential methods for quantifying these differences have been lacking. Strain-level metagenomic profiling data has several features that make traditional statistical methods challenging to use, including high dimensionality, extreme variation among samples, and complex phylogenetic relatedness. We present Anpan, a set of quantitative methods addressing three key challenges in microbiome strain epidemiology. First, adaptive filtering designed to interrogate microbial strain gene carriage is combined with linear models to identify strain-specific genetic elements associated with host health outcomes and other phenotypes. Second, phylogenetic generalized linear mixed models are used to characterize the association of sub-species lineages with such phenotypes. Finally, random effects models are used to identify pathways more likely to be retained or lost by outcome-associated strains. We validated our methods by simulation, showing that we achieve more accurate effect size estimation and a lower false positive rate compared to alternative methodologies. We then applied our methods to a dataset of 1,262 colorectal cancer patients, identifying functionally adaptive genes and strong phylogenetic effects associated with CRC status, sometimes complementing and sometimes extending known species-level microbiome CRC biomarkers. Anpan’s methods have been implemented as a publicly available R library to support microbial community strain and genetic epidemiology in a variety of contexts, environments, and phenotypes.
FUGAsseM
FUGAsseM (Function predictor of Uncharacterized Gene products by Assessing high-dimensional community data in Microbiomes) is a computational tool based on a “guilt by association” approach to predict functions of novel gene products in the context of microbial communities. It uses machine learning methods to predict functions of microbial proteins by integrating multiple types of community-wide data.
MACARRoN
MACARRoN (Metabolome Analysis and Combined Annotation Ranks to pRioritize Novel bioactives) is a computational workflow for the systematic analysis of microbial community-associated metabolomes, aimed at prioritizing novel, potentially bioactive metabolites. Here, bioactivity indicates the likelihood of causal or consequential involvement in a phenotype or environment of interest. MACARRoN is based on the principle of guilt-by-association and uses correlated abundances (covariance) between metabolic features to transfer biologically meaningful annotations from annotated metabolites to unknown metabolic features in large-scale, untargeted metabolomes. For each metabolic feature, MACARRoN evaluates quantitative indicators of bioactivity: its abundance relative to a covarying, abundant known metabolite, called the anchor (abundance versus anchor, AVA), and the q-value and effect size of its differential abundance in a phenotype of interest. Finally, it prioritizes metabolic features by bioactivity using the meta-rank of these indicators. MACARRoN is available as a Bioconductor package and can be run from both the command line and R.
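The meta-rank idea described above can be illustrated with a minimal Python sketch. This is not MACARRoN's implementation or API: the function names, the dictionary keys (`ava`, `q_value`, `effect_size`), and the choice of a plain mean of per-indicator ranks as the combination rule are all assumptions made for illustration, and the package may combine ranks differently.

```python
from statistics import mean


def rank_descending(values):
    """Return 1-based ranks where the largest value gets rank 1.

    Tied values share the average of the ranks they span.
    """
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def meta_rank(features):
    """Order feature names from most to least prioritized.

    `features` maps a feature name to hypothetical indicator values:
    'ava' (higher is better), 'q_value' (lower is better), and
    'effect_size' (larger magnitude is better).
    ASSUMPTION: ranks are combined by an unweighted mean.
    """
    names = list(features)
    ava = rank_descending([features[n]["ava"] for n in names])
    effect = rank_descending([abs(features[n]["effect_size"]) for n in names])
    # Negate q-values so that the smallest q-value receives rank 1.
    signif = rank_descending([-features[n]["q_value"] for n in names])
    combined = {
        n: mean([ava[i], effect[i], signif[i]]) for i, n in enumerate(names)
    }
    return sorted(names, key=lambda n: combined[n])


# Made-up indicator values for three hypothetical metabolic features.
example = {
    "F1": {"ava": 0.9, "q_value": 0.01, "effect_size": 2.0},
    "F2": {"ava": 0.2, "q_value": 0.50, "effect_size": 0.1},
    "F3": {"ava": 0.5, "q_value": 0.04, "effect_size": -1.0},
}
print(meta_rank(example))  # ['F1', 'F3', 'F2']
```

Ranking each indicator separately before combining them keeps indicators on very different scales (a q-value near zero, an effect size of arbitrary magnitude) from dominating one another, which is the usual motivation for rank aggregation.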
KneadData
KneadData is a tool designed to perform quality control on metagenomic sequencing data, especially data from microbiome experiments. In these experiments, samples are typically taken from a host in hopes of learning something about the microbial community on the host. However, metagenomic sequencing data from such experiments will often contain a high ratio of host to bacterial reads. This tool aims to perform principled in silico separation of bacterial reads from these “contaminant” reads, be they from cat, dog, or human hosts, from bacterial 16S or shotgun sequences, or other user-defined sources.
The bioBakery
bioBakery workflows is a collection of workflows and tasks for executing common microbial community analyses using standardized, validated tools and parameters. Quality control and statistical summary reports are automatically generated for most data types, which include 16S amplicons, metagenomes, and metatranscriptomes. Workflows are run directly from the command line, and tasks can be imported to create your own custom workflows. The workflows and tasks are built with AnADAMA2, which allows for parallel task execution both locally and in a grid compute environment.