ISSN: 2167-0587
Gregory B Gloor
Statement of the Problem: Commonly used methods of analyzing microbiome or RNA-seq datasets can be misleading and do not use all of the available information in a consistent manner. This results in many analyses being dominated by either the most abundant or the rarest features: in fact, it is often the case that the most abundant taxa dominate multivariate outputs and the rarest taxa dominate univariate outputs in the same dataset. Furthermore, these datasets have extraordinary properties that make the use of correlation and network analysis problematic. Methodology and Theoretical Orientation: Data collected using high-throughput sequencing (HTS) methods are sequence reads mapped to genomic intervals, and are commonly analyzed as either normalized count data or relative abundance data. One reason for these normalizations is to attempt to compensate for the fact that the sequencing instrument imposes an upper bound on the number of sequence reads. Positive data with an arbitrary bound on their total are compositional data and are subject to the problem of spurious correlation. Thus, ordination, clustering and network analysis become unreliable. A second problem is that the data are sparse, i.e., they contain many 0 values. A third problem is that the largest measurement error is at the low-count margins of these datasets. Conclusion & Significance: We use microbiome datasets to show how Bayesian estimation combined with compositional data approaches that examine the ratios between taxa gives robust insights into the structure and function of microbial communities. I will present example datasets drawn from the human and ecological domains and show that ordination, differential abundance and correlation can be interpreted in an internally consistent manner that provides reproducible insights.
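The ratio-based idea described above can be illustrated with a minimal sketch. This is not the author's implementation: it assumes a Dirichlet posterior over the read counts as the Bayesian estimate and the centred log-ratio (CLR) as the compositional transform. The function names, the prior of 0.5, the 128 draws and the read counts are all hypothetical choices made for illustration.

```python
import math
import random


def clr(parts):
    """Centred log-ratio: log of each part minus the mean log of all parts."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def dirichlet_sample(counts, prior, rng):
    """Draw one set of proportions from Dirichlet(counts + prior).

    Adding the prior keeps every parameter positive, so taxa with a
    count of 0 still receive a small, uncertain, non-zero proportion.
    """
    draws = [rng.gammavariate(c + prior, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]


def posterior_clr(counts, prior=0.5, n_draws=128, seed=0):
    """Average CLR values over Monte Carlo draws from the posterior."""
    rng = random.Random(seed)
    sums = [0.0] * len(counts)
    for _ in range(n_draws):
        values = clr(dirichlet_sample(counts, prior, rng))
        sums = [s + v for s, v in zip(sums, values)]
    return [s / n_draws for s in sums]


# Hypothetical read counts for four taxa in one sample, including a zero.
sample = [500, 120, 3, 0]
clr_sample = posterior_clr(sample)

# CLR depends only on ratios between parts, so rescaling the total
# (for example, a different sequencing depth) leaves it unchanged.
same_ratios_a = clr([1, 2, 4])
same_ratios_b = clr([10, 20, 40])
```

Because the CLR values depend only on the ratios between taxa, the arbitrary total imposed by the instrument drops out, while the spread across posterior draws is widest for low-count taxa, where the abstract locates the largest measurement error.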