Motivation and Aim: Transcriptomic analysis of various small RNA (sRNA) biotypes is a new and rapidly developing field. Annotations for microRNAs, tRNAs, tRNA-derived small RNAs (tsRNAs), piRNAs and rRNAs contain information on transcript sequences and loci that is vital for downstream analyses. Several databases have been established to provide this type of data for specific RNA biotypes. However, these sources often provide data in different formats, which makes bulk analysis of several sRNA biotypes in a single pipeline challenging. In addition, information on some transcripts may be incomplete or may conflict with other entries. To overcome these challenges, we introduce ITAS (Integrated Transcript Annotation for Small RNA), a filtered, corrected and integrated transcript annotation covering several types of small RNAs, including tsRNAs, for five species (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Caenorhabditis elegans). ITAS was tested against existing alternative databases in several case studies on human-derived data. Methods and Algorithms: Files with transcript sequences in FASTA format (referred to as fasta sequences) and transcript annotations in BED format (loci) were obtained from the corresponding databases: GtRNAdb Data Release 19 (June 2021) for mature tRNA sequences, piRNAdb v1.7.6 for piRNAs, miRBase Release 22.1 for microRNAs, UCSC for rRNAs, and MINTbase v2.0 (Human) and tRFdb for tsRNAs. For some loci, an additional UCSC liftOver step was applied to convert the coordinates to the most recent genome version. To address incomplete or conflicting database entries, sequences corresponding to the annotation BED file were extracted from the reference genome in FASTA format (getfasta sequences) using bedtools getfasta version 2.27.1, and the fasta sequences were mapped to the current version of the reference genome with hisat2.
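The comparison between database-provided fasta sequences and genome-derived getfasta sequences can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`read_fasta`, `normalize`, `classify`), the status labels and the toy records are hypothetical, and the sketch assumes that both files use the same record identifiers and that an exact match after converting U to T is the criterion.

```python
from io import StringIO


def read_fasta(handle):
    """Parse FASTA text into a {record_id: sequence} dictionary."""
    records, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # Use the first whitespace-delimited token of the header as the ID.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def normalize(seq):
    """Upper-case and map U to T so RNA and DNA spellings are comparable."""
    return seq.upper().replace("U", "T")


def classify(db_records, genome_records):
    """Label each database transcript (hypothetical labels):
    'match', 'mismatch', or 'missing_locus' if no genome-derived record exists."""
    status = {}
    for name, seq in db_records.items():
        if name not in genome_records:
            status[name] = "missing_locus"
        elif normalize(seq) == normalize(genome_records[name]):
            status[name] = "match"
        else:
            status[name] = "mismatch"
    return status


# Toy, made-up records standing in for a database FASTA file and for the
# output of a genome sequence extraction over the matching BED annotation.
db_fasta = StringIO(">tx1\nUGAGGUAGUAGG\n>tx2\nACGUACGU\n>tx3\nGGGCCC\n")
genome_fasta = StringIO(">tx1\ntgaggtagtagg\n>tx2\nACGTTCGT\n")

result = classify(read_fasta(db_fasta), read_fasta(genome_fasta))
print(result)
```

In this sketch, entries labeled `mismatch` or `missing_locus` are only flagged. Whether they are corrected or filtered out is a separate decision.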
A consensus database was then built for each sRNA type, and transcripts were corrected using R tools. Transcripts with conflicting data caused by in-database and inter-database (mature microRNA, piRNA, rRNA, tRNA) loci conflicts were detected with bedtools intersect. Human sperm RNA-seq data from three publicly available datasets [1-3] were used for the case studies. Differential expression analysis was performed with the DESeq2 package with respect to the factors analyzed in the original studies. Results: The most popular sRNA databases were analyzed. The following steps were taken to address the outlined problems while generating ITAS: (1) missing data for incomplete entries were retrieved and filled in; (2) the problem of multiple loci per transcript was addressed; (3) transcripts with conflicting fasta-derived and locus-derived data were identified, corrected if possible, and filtered out otherwise; (4) in-database loci-wise intersecting entries were identified and filtered out; (5) inter-database loci-wise intersecting entries were identified and filtered out. The identified problems were corrected where possible, and entries with severe conflicts were filtered out. Statistics on the correction of the human annotations for the different small RNA biotypes are provided in Table 1. The case studies using human sperm RNA-seq data [1-3] demonstrated the advantages of ITAS. Mapping reads to ITAS, which unifies transcripts from all source databases in a single GTF-format database, allowed the identification of more significant transcripts than the “map and remove” approach. In particular, for sRNA expression analysis, in all cases the ITAS-based genome alignment approach identified more significant transcripts than the “map and remove” pipeline with default databases (11 vs 66 for Donkin et al., 43 vs 212 for Ingerslev et al., 26 vs 242 for Hua et al.).
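The loci-wise intersection checks in steps (4) and (5) can be sketched in plain Python as a sweep over sorted intervals. This is an illustrative stand-in for an interval-overlap tool, not the authors' implementation: the `Locus` record, the `find_conflicts` function, the conflict labels and the toy entries are hypothetical. The sketch assumes BED-style half-open coordinates and ignores strand.

```python
from collections import namedtuple

# Hypothetical record type: one annotated locus and the database it came from.
Locus = namedtuple("Locus", "chrom start end name source")


def find_conflicts(loci):
    """Return (name_a, name_b, kind) for every pair of overlapping loci.

    Coordinates are treated as half-open intervals [start, end). The kind is
    'in-database' when both entries share a source, else 'inter-database'.
    """
    conflicts = []
    ordered = sorted(loci, key=lambda x: (x.chrom, x.start, x.end))
    active = []  # earlier loci that may still overlap the current one
    for cur in ordered:
        # Keep only loci on the same chromosome that end after cur starts.
        active = [a for a in active
                  if a.chrom == cur.chrom and a.end > cur.start]
        for a in active:
            kind = "in-database" if a.source == cur.source else "inter-database"
            conflicts.append((a.name, cur.name, kind))
        active.append(cur)
    return conflicts


# Toy, made-up entries; the names and coordinates are not real annotations.
loci = [
    Locus("chr1", 100, 122, "miR-a", "miRBase"),
    Locus("chr1", 110, 140, "piR-x", "piRNAdb"),
    Locus("chr1", 122, 150, "miR-b", "miRBase"),
    Locus("chr2", 100, 130, "piR-y", "piRNAdb"),
    Locus("chr2", 120, 150, "piR-z", "piRNAdb"),
]

conflicts = find_conflicts(loci)
flagged = sorted({name for a, b, _ in conflicts for name in (a, b)})
for a, b, kind in conflicts:
    print(a, b, kind)
```

Note that `miR-a` (ending at 122) and `miR-b` (starting at 122) are not reported as overlapping, because half-open intervals that merely touch share no base.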
Conclusion: We identified several issues while examining existing databases containing information on sRNA transcript sequences and annotations for five species. Some transcripts lacked information on their sequences or loci; for others, the sequence retrieved from the genome locus did not match the sequence provided by the database. Some transcripts also had loci intersecting with other entries, both within a database and across databases. To address these drawbacks, we established ITAS, a filtered, corrected and integrated database for five species in the widely used GTF format, which was tested against existing alternative databases in several case studies on human-derived data. Acknowledgements: The study was supported by the Russian Science Foundation, grant number 18-15-00202, https://rscf.ru/project/18-15-00202.