Motivation and Aim: Transcriptomic analysis of the various small RNA (sRNA) biotypes is a new and rapidly developing field. However, bioinformatic analysis of sRNA NGS data poses many challenges and is not yet well established. Here, we attempt to identify the optimal pipeline configuration for each step of sRNA analysis of human data: read trimming, filtering, mapping, transcript abundance quantification, and differential expression (DE) analysis. We also assessed the quality of the resulting DE signatures with a robust and efficient rank-statistics-based approach to estimate how robust each gene signature is.

Methods and Algorithms: We used raw small RNA sequencing data from 7 published human studies [1-7]. All read adapters were removed following the lab kit protocols. To tune the pipeline, we tested several options for the upper bound of read length during trimming. The trimmed reads were processed with various alignment (bowtie, hisat2, STAR, rsem) and pseudoalignment (salmon, kallisto) methods. Three filtering strategies (by mean count, by median count, and the default filtering of the DE analysis methods) were then applied with several thresholds to reduce noise. DE analysis was conducted using the DESeq2, edgeR, limma, EBSeq and NOISeq packages. The number of DE transcripts cannot serve as a robust metric of DE analysis quality, because it does not account for false-positive results. To evaluate expression signature quality, we applied the previously published Hobotnica approach [8]. We also ran the analysis with permuted group labels to detect false-positive signatures and to calculate the false-positive rate (FPR) and the Jaccard index.

Results: The most popular and widely used methods of sRNA analysis were evaluated. Based on our analysis, we suggest a pipeline that produces robust DE analysis results for sRNA transcripts, at least for categorical factors and two-group comparisons of biosamples (Table 1).
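The count-based filtering and the permuted-label checks described in the methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy counts, and the thresholds are hypothetical. The FPR definition used here (the share of tested transcripts that are called DE when group labels are permuted) is an assumption, since the abstract does not spell out the formula.

```python
from statistics import mean, median


def filter_by_count(counts, threshold, stat="mean"):
    """Keep transcripts whose mean (or median) count across samples
    is at least `threshold`. `counts` maps transcript id -> list of counts."""
    summarize = mean if stat == "mean" else median
    return {t: row for t, row in counts.items() if summarize(row) >= threshold}


def jaccard_index(signature_a, signature_b):
    """Jaccard index of two DE signatures: |A & B| / |A | B|."""
    a, b = set(signature_a), set(signature_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def false_positive_rate(permuted_signature, tested):
    """Assumed definition: with permuted labels every DE call is treated
    as a false positive, so FPR = (DE calls) / (transcripts tested)."""
    tested = set(tested)
    if not tested:
        return 0.0
    return len(set(permuted_signature) & tested) / len(tested)


# Hypothetical toy data: three transcripts, four samples each.
toy_counts = {
    "sRNA_a": [10, 12, 8, 11],
    "sRNA_b": [0, 1, 0, 40],   # one outlier sample inflates the mean
    "sRNA_c": [0, 0, 1, 0],
}

kept_by_mean = filter_by_count(toy_counts, threshold=5, stat="mean")
kept_by_median = filter_by_count(toy_counts, threshold=5, stat="median")
```

In this toy case the mean filter keeps `sRNA_b` because of a single high-count sample, while the median filter drops it, which is one reason the two strategies can give different noise levels downstream.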
We used existing tools to construct an optimal pipeline for high-quality analysis of sequencing data, regardless of differences in the input data. To account for variation among the original datasets, flexible trimming thresholds were applied. We suggest lower and upper bounds of read length for trimming, as well as thresholds for filtering expression data in two-group analyses. Based on the results of the DE analyses, we suggest the DGE method for data with strong and well-detected signals.

Conclusion: The effects of various factors on the expression analysis of human sRNA at different stages of data processing were investigated. Optimal pipeline parameters (trimming, aligning, assigning, filtering, DE analysis) are suggested in Table 1, and an optimized pipeline for setting up and running sRNA expression analysis is proposed. Assessing the resulting expression signatures with rank-statistics-based inference offers a way to estimate both the quality of the signatures and the performance of the bioinformatic analysis on particular biological data. The results of this work were published [9]; new approaches (filtering thresholds and DE methods) have since been added to improve our suggestions.

Acknowledgements: The study was supported by the Russian Science Foundation, grant number 18-15-00202, https://rscf.ru/project/18-15-00202.