Evolution of plant miRNA annotation methods and criteria

Category and comparison of plant miRNA annotation tools

As experimental methods have evolved and the characteristics of plant miRNAs have become progressively clearer, annotation tools have been continuously developed. Apart from tools based purely on sequence homology searches, the tools in broad use today fall into two main categories. Category I tools discover plant miRNAs mainly from the signature of reads mapped along precursors, whereas category II tools employ various machine learning methods to learn the characteristics of different parts of miRNAs and then make predictions. Some tools combine both strategies.

Category I was triggered by the introduction of miRDeep and similar tools more than a decade ago, when sRNA-seq was becoming a popular method and the major features of miRNAs had been described. Friedländer et al. first observed that reads corresponding to mature and star miRNAs, especially mature miRNAs, were enriched in deeply sequenced sRNA libraries, potentially because miRNAs loaded into the RISC complex are protected, whereas the other parts of the precursor are quickly degraded after entering the miRNA biogenesis process. They also found that some reads from other parts of precursors could still be detected, which could serve as additional evidence when predicting a putative miRNA candidate, although subsequent research showed that such reads are not required. In addition, they evaluated candidates in terms of other features of miRNA precursors, such as secondary structure and the overhangs between mature miRNAs and miRNAs*. In general, these tools combine knowledge of miRNA biogenesis, including secondary structure and the profile of small reads along precursors, weight these characteristics in a statistical scoring system, and finally set a cut-off score for miRNA candidates. Inspired by miRDeep, many similar tools were developed specifically for plants, such as miRDeep-P and miRDeep-P2, miRPlant, miR-PREFeR, and miRA. These tools have achieved notable success: many new and species-specific miRNAs have been annotated and subsequently validated by other experimental methods.
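The scoring idea described above can be illustrated with a minimal sketch. The feature names, weights, and cut-off below are hypothetical placeholders chosen for illustration only; they are not the values or the statistical model used by miRDeep or any of the tools named above.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Evidence collected for one putative precursor (hypothetical fields)."""
    mature_reads: int        # reads matching the putative mature miRNA
    star_reads: int          # reads matching the putative miRNA*
    other_reads: int         # reads mapping elsewhere on the precursor
    has_hairpin: bool        # precursor folds into a stem-loop
    has_2nt_overhang: bool   # duplex shows 2-nt 3' overhangs


# Illustrative weights and cut-off only (assumptions, not published values).
WEIGHTS = {"duplex_fraction": 4.0, "star": 2.0, "hairpin": 3.0, "overhang": 1.0}
CUTOFF = 6.0


def score(c: Candidate) -> float:
    """Combine read-profile and structural evidence into one weighted score."""
    total = c.mature_reads + c.star_reads + c.other_reads
    if total == 0:
        return 0.0  # no read support at all
    # Fraction of reads that fall on the mature/star duplex
    duplex_fraction = (c.mature_reads + c.star_reads) / total
    s = WEIGHTS["duplex_fraction"] * duplex_fraction
    if c.star_reads > 0:
        s += WEIGHTS["star"]
    if c.has_hairpin:
        s += WEIGHTS["hairpin"]
    if c.has_2nt_overhang:
        s += WEIGHTS["overhang"]
    return s


def is_mirna(c: Candidate) -> bool:
    """Accept a candidate when its score reaches the cut-off."""
    return score(c) >= CUTOFF
```

In this sketch, a candidate whose reads concentrate on the duplex and whose precursor passes the structural checks clears the cut-off, while a locus with reads scattered along a non-hairpin region does not.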

However, several intrinsic design features of these tools limit further improvement in their efficiency and accuracy. First, as large plant genomes are sequenced and the depth of sRNA-seq libraries increases, the prediction process becomes very time-consuming. Specifically, excising putative precursors from the huge number of loci to which sRNA reads map requires an immense amount of memory, sometimes crashing the computing system. Another time-consuming step is calculating the secondary structures of the large number of putative precursors excised from the reference genome. In sum, the classic workflow of this type of method becomes infeasible for miRNA prediction with a large reference genome.

To meet these challenges, miRDeep-P2 adopted several new strategies. By preselecting conserved miRNAs, setting a read-count cut-off (for potential mature miRNAs), and segmenting large genomes into small pieces, which together dramatically decrease the number of putative miRNA precursor loci, miRDeep-P2 saves considerable computational resources and time. miRDeep-P2 also computes the secondary structures of putative precursors in parallel. Taken together, these strategies make miRNA prediction much more efficient for huge genomes. Second, the newly updated plant miRNA criteria, especially those for separating the large number of siRNAs from miRNAs, improve accuracy and sensitivity, and miRDeep-P2 accommodates these new criteria. Nevertheless, because the current criteria are based on many specific cases and are not fully comprehensive or objective, there is still much room to improve the prediction accuracy of this type of tool.
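Two of the strategies above, a read-count cut-off and genome segmentation, can be sketched as follows. This is a minimal illustration under stated assumptions, not miRDeep-P2's actual code: the function names, the threshold, and the use of overlapping windows (so that a precursor spanning a boundary is not split) are all hypothetical choices made here.

```python
from typing import Dict, Iterator, Tuple


def filter_reads(read_counts: Dict[str, int], min_count: int) -> Dict[str, int]:
    """Keep only read sequences observed at least `min_count` times."""
    return {seq: n for seq, n in read_counts.items() if n >= min_count}


def segment(sequence: str, size: int, overlap: int) -> Iterator[Tuple[int, str]]:
    """Yield (offset, piece) windows of up to `size` bases.

    Consecutive windows share `overlap` bases (an assumption of this sketch).
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    for start in range(0, len(sequence), step):
        yield start, sequence[start:start + size]
        if start + size >= len(sequence):
            break  # the last window already reaches the end of the sequence
```

Each piece can then be processed independently, which bounds memory use per piece and allows later steps such as structure prediction to be distributed across workers.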

The first category II tools were introduced before category I, when miRNA features had been only roughly described. A pioneer is triplet-SVM, which segments a miRNA into different parts, captures the characteristics of each, such as sequence and local structure, and learns these patterns for subsequent miRNA identification. Besides support vector machines (SVM), other models such as random forest (RF) and hierarchical hidden Markov models (HHMM) were used later. Owing to the uniformity of animal miRNAs in length and structure, this type of tool has been very successful in animal miRNA annotation. Although many of these tools claimed to be applicable to plant miRNA annotation as well, their performance in plants was usually not as good as in animals. Because prediction depends on what is learned from a training dataset, a complete and reliable training dataset is extremely important and, given the noise and limitations of the available training data, is potentially the biggest obstacle for these tools. Taking triplet-SVM as an example, the choice of training dataset has been clearly shown to have a remarkable influence. First, the training dataset usually includes all known or annotated miRNAs of a given species from public databases, which typically contain considerable noise. For instance, even for the most widely studied model plants, Arabidopsis and rice, the miRNA collections still contain many false positives. Second, the more variable precursors in plants, as mentioned above, hinder accurate extraction of plant miRNA features, resulting in the low prediction accuracy observed for many tools. Third, it has been proposed that features learned from miRNAs in one or two species are of limited use for predicting miRNAs in other species. In addition, some tools, such as triplet-SVM, face the same time-consuming challenge if they use the same strategy to extract putative precursors. In short, the shortage of reliable training datasets, along with other deficiencies, leaves much room for improvement in category II tools.
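To make the idea of learning from local sequence and structure concrete, the following is a simplified sketch of a triplet-style feature encoding. It is an illustration under stated assumptions rather than the exact feature set of triplet-SVM: the function name is hypothetical, the input structure is assumed to be in dot-bracket notation, and every internal position of the hairpin is counted.

```python
from collections import Counter
from itertools import product
from typing import List

NUCLEOTIDES = "ACGU"
# 4 middle nucleotides x 8 pairing patterns over 3 positions = 32 features
FEATURES = [n + "".join(p) for n in NUCLEOTIDES for p in product("(.", repeat=3)]


def triplet_features(seq: str, structure: str) -> List[float]:
    """Return normalized counts of structure-sequence triplets.

    `structure` is dot-bracket notation; '(' and ')' are both treated as
    paired. Positions whose nucleotide is not A, C, G, or U are skipped.
    """
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq = seq.upper().replace("T", "U")
    paired = structure.replace(")", "(")
    counts: Counter = Counter()
    for i in range(1, len(seq) - 1):
        if seq[i] in NUCLEOTIDES:
            # Middle nucleotide plus pairing state of positions i-1, i, i+1
            counts[seq[i] + paired[i - 1:i + 2]] += 1
    total = sum(counts.values())
    return [counts[f] / total if total else 0.0 for f in FEATURES]
```

The resulting fixed-length vectors could be passed to any classifier, and the quality of the labelled examples used for training then determines what the model learns, which is why noisy training datasets are such an obstacle.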

Summary of plant miRNA databases

Besides the various tools developed to annotate plant miRNAs, databases containing plant miRNA entries have also greatly facilitated the study of plant miRNAs. The pioneers, miRBase and Rfam, were the first miRNA databases to collect both plant and animal miRNAs reported in publications. Their content has been constantly enriched, making them the central repositories of miRNA entries. ASRP was the first database dedicated to the model plant Arabidopsis after NGS was introduced, and its unique sRNA-seq datasets from both wild type and mutants (e.g. dicer mutants) greatly accelerated the functional understanding of plant miRNAs. Since the main function of miRNAs is to regulate target mRNAs, many databases focus on the interactions between miRNAs and target genes, such as TarBase, miRTarBase, and starBase, which generally integrate data from a variety of methods for studying miRNA-target regulation, including microarray, PARE-seq (parallel analysis of RNA ends sequencing), and CLIP-seq (cross-linking immunoprecipitation sequencing). At the same time, with the popularity of NGS technology, the number of uncovered miRNAs and related studies is increasing exponentially. Under these circumstances, database development has taken two directions. On the one hand, comprehensive databases, such as miRBase, Rfam, and the more recent sRNAanno and PmiREN, strive to collect miRNA entries and related information as completely as possible. On the other hand, some databases focus on annotating miRNAs in specific species or a specific field. For example, MiRPub focuses on the collection of miRNA-related literature, mirtronDB is a collection of mirtrons, and MepmiRDB is a database of miRNAs in medicinal plants.

Based on their content, miRNA databases can be divided into four categories. The first category comprises comprehensive databases, which include various types of miRNA information for multiple species, such as miRBase, PMRD/PNRD, PmiREN, and sRNAanno. The second category comprises miRNA-target databases, which emphasize the collection of miRNA-target interactions and the supporting experimental evidence, such as TarBase, miRTarBase, and starBase. The third category comprises species-specific databases, which cover a limited number of species, such as PmiRKB and MepmiRDB. The last category includes databases that cannot be classified into the above three categories. They generally focus on a very specific direction and collect a certain type of miRNA or a certain type of miRNA-related information. Representatives include MirPub, PmiRExAt, PASmiR, and miSolRNA.

More critical than the development of these databases is the quality of the annotated miRNA entries. Taking miRBase as an example, it collects and deposits miRNA entries annotated by researchers around the world and is currently the most widely used miRNA database. However, researchers have criticized its incompleteness and noise. Taylor et al. pointed out a few years ago that an estimated two thirds of the plant miRNA families in miRBase (release 21) were questionable. These problems also exist in other databases and are caused by a combination of factors, including the intrinsic defects of NGS methods, the limitations of current plant miRNA annotation criteria, and the noise and false positives introduced by different methods and standards. Moreover, a large portion of databases are not updated with the latest information in a timely manner, which is another cause of the incompleteness and low signal-to-noise ratio (SNR) of their contents, especially in the current big data era.

To overcome these challenges, we developed PmiREN last year and proposed a potential standard for comprehensive plant miRNA databases. First, a unified pipeline and newly updated annotation criteria were employed; that is, PmiREN uses a pipeline centred on miRDeep-P2 together with the plant miRNA annotation criteria published in 2018. In addition, for each plant, a relatively complete reference genome and sRNA-seq datasets, along with PARE-seq and other data where available, were required as inputs. Second, we also tried to explore a standard for what kinds of information should be included in an annotated miRNA entry. It should be pointed out that this is only an attempt based on publicly available data, and the items and contents of an annotated miRNA entry can certainly be further expanded and updated.