Contents
1. OVERVIEW
   1.1 BACKGROUND
   1.2 SUMMARY OF MIRDP2 FUNCTION
   1.3 IMPLEMENTATION AND ALGORITHM
2. INSTALLATION
   2.1 DEPENDENCIES
   2.2 DOWNLOAD
   2.3 TEST
3. DETECTING NEW MIRNAS
   3.1 FORMATTING READS
   3.2 BUILD INDEX
   3.3 RUN MIRDP2
   3.4 MIRDP2 OUTPUT
4. THE CONTENTS OF MIRDP2 SOFTWARE PACKAGE
5. ISSUES USING MIRDP2
   5.1 PARAMETERS
   5.2 REDUNDANCY AND MIRNA*
   5.3 LICENSE AND AVAILABILITY
6. APPENDIX - MAJOR UPDATES OF MIRDP2
7. REFERENCES
1. OVERVIEW
1.1 BACKGROUND
MicroRNAs (miRNAs) are ~21-nucleotide endogenous small RNAs (sRNAs) with potent roles in regulating gene expression (Bartel, 2009). In the past two decades, extensive research efforts have been devoted to identifying miRNAs and studying their functions, especially after NGS methods became available. Based on unique features of miRNAs, such as the stem-loop structure and the preferential accumulation of sequence reads corresponding to mature and star miRNAs, computational tools capturing these characteristics have been highly successful in identifying miRNAs in diverse species. The public miRNA repository miRBase currently hosts over 38,000 miRNA entries (version 22), whereas only ~500 were stored in 2008 (version 2.0; Kozomara et al., 2014).
Previously, we developed miRDeep-P for miRNA prediction in plant species (Yang and Li, 2011). However, miRDeep-P has shown two major drawbacks when facing complicated input datasets, which could limit its usefulness in plant miRNA prediction. One is the long running time when working on complex genomes or libraries with high sequencing depth. The other is the relatively large number of false positives mingled with true miRNAs, which may severely impact subsequent analysis.
To cope with these shortcomings, we have incorporated new plant miRNA annotation criteria (Axtell and Meyers, 2018) and overhauled the strategies and algorithm of miRDeep-P, which led to a significantly improved version, designated miRDeep-P2 (miRDP2). Compared to other miRNA prediction tools, including MIReNA (Mathelier and Carbone, 2010), miRPlant (An et al., 2014), miR-PREFeR (Lei and Sun, 2014), and miRA (Evers et al., 2015), miRDP2 compares favorably in running time, sensitivity, and accuracy (details in the manuscript and supplementary materials).
1.2 SUMMARY OF MIRDP2 FUNCTION
Based on ultra-deep sampling of small RNA libraries by next-generation sequencing, miRDP2 is able to identify miRNA genes in plant species, even those without detailed annotation, with extremely high speed and reliable performance.
1.3 IMPLEMENTATION AND ALGORITHM
MiRDP2 is written in Perl (Perl 5.8 or later) and uses only fundamental packages from the Perl library. All the scripts have been tested on two platforms, CentOS release 6.5 on a cluster server and Cygwin 2.6.0 on a Windows PC, and should work on similar systems that support Perl.
The basic algorithmic framework of miRDP2 is inherited from miRDeep-P (Yang and Li, 2011), with several critical modifications and new auxiliary scripts added to the original tool.
2. INSTALLATION
2.1 DEPENDENCIES
To run miRDP2, several dependencies are required. First, Bowtie or Bowtie2 should be downloaded from the site:
2.2 DOWNLOAD
2.3 TEST
The test data contains one formatted GSM sequencing file and one Arabidopsis thaliana genome file. To test miRDP2, please follow this guide (here we use bowtie for demonstration):
- mv miRDP2-v*.tar.gz TestData.tar.gz ncRNA_rfam.tar.gz <user_selected_folder>
- cd <user_selected_folder>
- tar -xvzf miRDP2-v*.tar.gz
- tar -xvzf TestData.tar.gz
- tar -xvzf ncRNA_rfam.tar.gz
- bowtie-build -f ./TestData/TAIR10_genome.fa ./TestData/TAIR10_genome
- bowtie-build -f ./ncRNA_rfam.fa ./1.1.4/script/index/rfam_index
- (use bowtie2-build instead if you prefer to use bowtie2 in the later analysis)
- bash ./1.1.4/miRDP2-v1.1.4_pipeline.bash -g ./TestData/TAIR10_genome.fa -x ./TestData/TAIR10_genome -f -i ./TestData/GSM2094927.fa -o .
- (add option '-T' or '--bowtie2' if you prefer to use bowtie2 for reads alignment)
The bowtie-build commands may take a while, and the miRDP2 pipeline should finish within several minutes. A folder named 'GSM2094927-15-0-10' should be automatically generated in <user_selected_folder>, containing all intermediate files and results. GSM2094927-15-0-10_filter_P_prediction is the final output of predicted miRNAs. This tab-delimited file contains columns that indicate chromosome id, strand direction, representative reads id, precursor id, mature miRNA location, precursor location, mature sequence, and precursor sequence. An additional bed file is derived from this file to facilitate further analysis. The progress_log file provides information about finished steps. The script_log and script_err files record messages and warnings from the bash script. The parameters are explained in parts 3.3 and 5.1.
3. DETECTING NEW MIRNAS
3.1 FORMATTING READS
Before running the pipeline, the input reads must be preprocessed into the proper format. First, adapters should be removed from the 5' and 3' ends of the deep sequencing reads (if present). Second, the reads must be converted into FASTA format. Third, redundancy should be removed so that reads with identical sequences are represented by a single FASTA entry. To preserve the counts, each sequence identifier must end with '_x' followed by an integer indicating the number of times the exact sequence was retrieved in the deep sequencing dataset. Finally, all FASTA ids must be unique. One way to ensure this is to include a running number in the id. For reference, see the file GSM2094927.fa in the test data (https://sourceforge.net/projects/mirdp2/files/TestData/). The following are several examples:
>read0_x29909
TTTGGATTGAAGGGAGCTCTA
>read1_x36974
TTCCACAGCTTTCTTGAACTG
>read2_x32635
TTCCACAGCTTTCTTGAACTT
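The formatting rules above can be illustrated with a short Python sketch. It is not part of the miRDP2 package. The function names (collapse_reads, to_fasta, check_identifiers) and the 'read' id prefix are assumptions made for illustration, and the input is assumed to be adapter-trimmed sequences already loaded as strings.

```python
import re
from collections import Counter

# An identifier must end with "_x" followed by an integer copy number.
ID_PATTERN = re.compile(r"_x(\d+)$")


def collapse_reads(sequences):
    """Collapse identical reads into (identifier, sequence) pairs.

    The running number in each id keeps ids unique; the integer after
    "_x" is the number of times the exact sequence was seen.
    """
    counts = Counter(seq.strip() for seq in sequences if seq.strip())
    entries = []
    for number, (seq, count) in enumerate(counts.most_common()):
        entries.append((f"read{number}_x{count}", seq))
    return entries


def to_fasta(entries):
    """Render (identifier, sequence) pairs as FASTA text."""
    return "".join(f">{identifier}\n{seq}\n" for identifier, seq in entries)


def check_identifiers(identifiers):
    """Return True if all ids are unique and end with _x<integer>."""
    ids = list(identifiers)
    unique = len(set(ids)) == len(ids)
    return unique and all(ID_PATTERN.search(i) for i in ids)


if __name__ == "__main__":
    reads = [
        "TTTGGATTGAAGGGAGCTCTA",
        "TTCCACAGCTTTCTTGAACTG",
        "TTTGGATTGAAGGGAGCTCTA",
    ]
    print(to_fasta(collapse_reads(reads)), end="")
```

Sorting by copy number is only a convenience here; the manual requires unique ids and the '_x' count suffix, not a particular order.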
3.2 BUILD INDEX
Besides the genome index, a non-miRNA ncRNA index is also needed to filter out noisy sequences from ncRNA fragments. The source file is a collection of the main ncRNA sequences from Rfam, including rRNA, tRNA, snRNA, and snoRNA. To build this index, please refer to part 2.3, as the index must be placed and named correctly, i.e. <miRDP2_version>/script/index/rfam_index.
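Because the pipeline expects the Rfam index at a fixed location, it can help to confirm that the index files exist before starting a run. The following Python sketch is not part of miRDP2, and the helper name missing_index_files is hypothetical. It assumes the standard (small) index layout written by bowtie-build and bowtie2-build, i.e. six files per prefix; large indexes use different suffixes and are not covered.

```python
import os
import tempfile
from pathlib import Path

# File suffixes written for a standard (small) index.
BOWTIE_SUFFIXES = [".1.ebwt", ".2.ebwt", ".3.ebwt", ".4.ebwt",
                   ".rev.1.ebwt", ".rev.2.ebwt"]
BOWTIE2_SUFFIXES = [".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2",
                    ".rev.1.bt2", ".rev.2.bt2"]


def missing_index_files(prefix, bowtie2=False):
    """Return the expected index files that do not exist for this prefix."""
    suffixes = BOWTIE2_SUFFIXES if bowtie2 else BOWTIE_SUFFIXES
    return [prefix + s for s in suffixes if not Path(prefix + s).is_file()]


if __name__ == "__main__":
    # Demonstration in a temporary folder standing in for
    # <miRDP2_version>/script/index/.
    folder = tempfile.mkdtemp()
    prefix = os.path.join(folder, "rfam_index")
    print("missing before build:", len(missing_index_files(prefix)))
```

In real use, the prefix would be the rfam_index path under the installed miRDP2 version folder, or the genome index prefix passed to -x.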
3.3 RUN MIRDP2
To detect new miRNAs from deep sequencing data with miRDP2, run the bash script in the package to start the analysis pipeline (an example can be found in part 2.3):
<path_to_miRDP2_folder>/miRDP2-vx.x_pipeline.bash -g <genome_file> -x <path_to_index/index_prefix> -f -i <seq_file> -o <output_folder>
Please note the version of the pipeline bash script. Four parameters can be adjusted: the number of different locations a read may map to (-L), the allowed number of mismatches for bowtie (-M), the length of excised precursors (-N), and the RPM threshold for reads (-R). A detailed explanation is in part 5.1. The -T/--bowtie2 option can be used to switch to bowtie2 for aligning reads. The --large-index option should be
3.4 MIRDP2 OUTPUT
The output folder is automatically generated under <output_folder> and named as '<seq_file_name>' (in the test of part 2.3, the folder is 'GSM2094927-15-0-10'). The file <seq_file_name>_filter_P_prediction contains information on the final predicted miRNAs. The tab-delimited columns in this file are, in order, chromosome id, strand direction, representative reads id, precursor id, mature miRNA location, precursor location, mature sequence, and precursor sequence. A bed file is also provided for subsequent analysis.
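For downstream analysis, the eight tab-delimited columns listed above can be loaded with a few lines of Python. This is a minimal sketch that is not shipped with miRDP2. The field names of the Prediction tuple are our own labels for the documented columns, the two location columns are kept as plain strings because their exact notation is not specified here, and the sample line is invented purely for demonstration.

```python
import csv
import io
from typing import NamedTuple


class Prediction(NamedTuple):
    """One row of <seq_file_name>_filter_P_prediction (labels are ours)."""
    chromosome: str
    strand: str
    read_id: str
    precursor_id: str
    mature_location: str
    precursor_location: str
    mature_sequence: str
    precursor_sequence: str


def read_predictions(handle):
    """Parse tab-delimited prediction rows from an open text handle."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(Prediction._fields):
            raise ValueError(f"expected 8 columns, got {len(fields)}")
        rows.append(Prediction(*fields))
    return rows


if __name__ == "__main__":
    # Invented example row; real values come from the pipeline output.
    sample = "\t".join([
        "Chr1", "+", "read0_x29909", "Chr1_1",
        "10..30", "1..120", "TTTGGATTGAAGGGAGCTCTA", "ACGTACGTACGT",
    ]) + "\n"
    for row in read_predictions(io.StringIO(sample)):
        print(row.precursor_id, row.mature_sequence)
```

With a real file, pass an opened handle, for example open(path, newline=""), instead of the io.StringIO sample.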
4. THE CONTENTS OF MIRDP2 SOFTWARE PACKAGE
The miRDP2 package consists of six documented Perl scripts that are run sequentially by the prepared bash script. Of the six scripts, three (convert_bowtie_to_blast.pl, filter_alignments.pl, and excise_candidate.pl) are inherited from miRDeep-P (Yang and Li, 2011). The other scripts are modified from the original version. The functions of the six scripts are described in the following:
a. preprocess_reads.pl & preprocess_reads-SAM.pl filter the input reads, removing reads that are too long or too short (<19 nt or >24 nt), reads that match Rfam ncRNA sequences, and reads with Reads Per Million reads (RPM) less than 5. The script then retrieves reads that match known miRNA mature sequences. The input files are the original reads file in FASTA format and the bowtie output of reads mapped to miRNA and ncRNA sequences. RPM is the copy number of a read scaled to one million reads in the library.
b. convert_bowtie_to_blast.pl & convert_SAM_to_blast.pl convert the bowtie format or SAM format into blast-parsed format. Blast-parsed format is a custom tab-separated format derived from the standard NCBI blast output format.
c. filter_alignments.pl filters the alignments of deep sequencing reads to a genome. It removes partial alignments as well as multi-aligned reads (user-specified frequency cutoff). The basic input is a file in blast-parsed format.
d. excise_candidate.pl cuts out potential precursor sequences from a reference sequence using aligned reads as guidelines. The basic input is a file in blast-parsed format and a FASTA file. The output is all potential precursor sequences in FASTA format.
e. mod-miRDP.pl is modified from the core miRDeep-P algorithm by changing the scoring system to plant-specific parameters. It needs two input files: a dot-bracket precursor structure file and a reads distribution signature file.
f. mod-rm_redundant_meet_plant.pl needs three input files: chromosome_length, precursors, and the original_prediction generated by mod-miRDP.pl. It generates two output files: a non-redundant prediction file and a prediction file filtered by plant criteria. The two tab-delimited output files contain columns that indicate chromosome id, strand direction, representative reads id, precursor id, mature miRNA location, precursor location, mature sequence, and precursor sequence.
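The length and RPM filters described for the preprocessing step can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of preprocess_reads.pl. It assumes that RPM is computed against the total copy number of all reads in the input file, takes the copy number from the '_x' suffix of each id, and leaves out the Rfam filter and the retrieval of known miRNA reads. The function names and the min_rpm default (set to the documented default of the -R option) are choices made for this example.

```python
import re

COUNT_PATTERN = re.compile(r"_x(\d+)$")


def read_count(identifier):
    """Extract the copy number from an id such as 'read0_x29909'."""
    match = COUNT_PATTERN.search(identifier)
    if match is None:
        raise ValueError(f"id does not end with _x<integer>: {identifier}")
    return int(match.group(1))


def rpm(count, total_reads):
    """Reads per million: copy number scaled to one million reads."""
    return count * 1_000_000 / total_reads


def filter_reads(entries, min_len=19, max_len=24, min_rpm=10.0):
    """Keep (id, sequence) pairs that pass the length and RPM filters.

    Assumption: total_reads is the summed copy number of all entries,
    counted before any filtering.
    """
    total_reads = sum(read_count(identifier) for identifier, _ in entries)
    kept = []
    for identifier, seq in entries:
        if not (min_len <= len(seq) <= max_len):
            continue  # too short or too long for a plant miRNA
        if rpm(read_count(identifier), total_reads) < min_rpm:
            continue  # likely background noise
        kept.append((identifier, seq))
    return kept


if __name__ == "__main__":
    demo = [
        ("read0_x90", "TTTGGATTGAAGGGAGCTCTA"),
        ("read1_x9", "TTCCACAGCTTTCTTGAACTG"),
        ("read2_x1", "A" * 30),
    ]
    # The demo library is tiny, so an artificially high threshold is used.
    print(filter_reads(demo, min_rpm=100_000))
```

The threshold is left as a parameter because this manual quotes both 5 and 10 RPM in different places.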
5. ISSUES USING MIRDP2
5.1 PARAMETERS
There are several parameters that can be customized:
a. The first one is the maximum number of locations a read may map to (-L/--locate option). Reads that map to too many sites are possibly associated with repeat sequences and are unlikely to be related to miRNAs. The default setting is 15. For species that have miRNA families with many members, this parameter may be increased manually to adapt to the genome landscape.
b. The second one is the length of the putative miRNA precursors that the program excises (-N/--length option). The default setting is 300 nt.
c. The third one is the number of mismatches allowed for bowtie/bowtie2 (-M/--mismatch option). The default setting is 0.
d. The fourth one is the RPM threshold for reads (-R/--rpm option). To reduce running time and false positives, reads are filtered by RPM. Only reads exceeding a certain RPM threshold are likely to represent mature miRNA sequences rather than background noise, and are kept for further analysis. The default setting is 10 (RPM).
e. The fifth one is the number of threads allowed for RNAfold (-p/--thread option). The default setting is 1.
Please be aware that changing these parameters may affect performance and running time. In general, increasing parameters a and c and decreasing parameter d produces a less stringent result and a longer running time, and vice versa.
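When many libraries are processed with different settings, the command line can be assembled programmatically. The Python sketch below only builds the argument list from the options documented in this manual (-g, -x, -f, -i, -o, -L, -N, -M, -R, -p, --bowtie2) with their stated defaults. The function name build_mirdp2_command and its keyword names are hypothetical, and the paths in the demonstration are the ones from the test in part 2.3.

```python
import shlex


def build_mirdp2_command(pipeline_script, genome, index_prefix, reads,
                         output_dir, locate=15, length=300, mismatch=0,
                         rpm=10, threads=1, bowtie2=False):
    """Return the pipeline call as an argument list with explicit options."""
    command = [
        "bash", pipeline_script,
        "-g", genome,
        "-x", index_prefix,
        "-f",
        "-i", reads,
        "-o", output_dir,
        "-L", str(locate),    # max. number of mapping locations per read
        "-N", str(length),    # length of excised precursors (nt)
        "-M", str(mismatch),  # mismatches allowed for bowtie/bowtie2
        "-R", str(rpm),       # RPM threshold for reads
        "-p", str(threads),   # threads for RNAfold
    ]
    if bowtie2:
        command.append("--bowtie2")
    return command


if __name__ == "__main__":
    command = build_mirdp2_command(
        "./1.1.4/miRDP2-v1.1.4_pipeline.bash",
        "./TestData/TAIR10_genome.fa",
        "./TestData/TAIR10_genome",
        "./TestData/GSM2094927.fa",
        ".",
    )
    # Print a shell-quoted command for inspection before running it.
    print(shlex.join(command))
```

The list can be passed to subprocess.run(command, check=True) once the printed command looks right.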
5.2 REDUNDANCY AND MIRNA*
In some cases, the output miRNAs from miRDP2 may differ from the known miRNAs. We found that this is mainly due to one of two reasons: heterogeneity of the mature miRNAs or the relative abundance of miRNA and miRNA*. We found that this does not impact the optimal length selection of precursors and the profiling of known miRNA genes.
5.3 LICENSE AND AVAILABILITY
MiRDP2 is freely available under a GNU Public License (Version 3) at: The miRDP2 scripts, demos, and user manual can be obtained from the website.
6. APPENDIX - MAJOR UPDATES OF MIRDP2
Our modifications include filtering of input reads, incorporating the latest miRNA annotation criteria, and removing the restriction on bifurcation in the secondary structure of miRNA precursors.
Firstly, we filtered out improper reads in the original small RNA libraries and employed new strategies to excise the precursors of miRNA candidates. Excising miRNA precursors is one of the most time-consuming steps. With the new strategy, the processing time of this step is dramatically reduced. In addition, the new strategy improves prediction accuracy by removing false positives. In detail, we first filtered out reads of inappropriate length (<19 nt or >24 nt), since no plant miRNAs are shorter than 19 nt or longer than 24 nt, as suggested by Axtell and Meyers (2018). In general, these reads account for around one third of the total reads in a typical small RNA library. Second, we only used reads that are either similar to known miRNAs (allowing 1 mismatch) or have a high copy number (>=5 RPM) to excise miRNA precursor candidates. These reads make up less than 5% of the unique reads in a typical small RNA library. Taken together, these improvements mean that far fewer and more focused reads are used to excise precursors, which dramatically reduces computational time (by tens to 100 times). At the same time, noise caused by improper reads is filtered out, which improves prediction accuracy.
Secondly, we have introduced the most up-to-date miRNA annotation criteria (Axtell and Meyers, 2018) into miRDP2 and developed new criteria for selecting miRNA candidates of 23/24 nt. All details are in Supplementary Material 2. This update removed many false positives and increased prediction accuracy.
Lastly, we have modified the existing scoring system in the miRDeep-P core algorithm to better fit plant miRNA characteristics (longer precursors and more complicated secondary structures, as stated in Yang and Li, 2011). We have allowed longer precursors with bifurcation in the stem-loop region, which are usually filtered out by other prediction tools, including miRDeep-P. Supplementary Material 4 shows two examples (Ath-MIR157c and Ath-MIR858). This change has greatly increased the sensitivity of miRDeep-P2.
7. REFERENCES
An, J., Lai, J., Sajjanhar, A., Lehman, M. L., and Nelson, C. C. (2014). miRPlant: an integrated tool for identification of plant miRNA from RNA sequencing data. BMC Bioinformatics, 15, 275-278.
Axtell, M.J. and Meyers, B.C. (2018). Revisiting Criteria for Plant MicroRNA Annotation in the Era of Big Data. Plant Cell, 30, 272-284.
Bartel, D.P. (2009). MicroRNAs: Target Recognition and Regulatory Functions. Cell, 136, 215-233.
Evers, M., Huttner, M., Dueck, A., Meister, G., and Engelmann, J. C. (2015). miRA: adaptable novel miRNA identification in plants using small RNA sequencing data. BMC Bioinformatics, 16, 370.
Fahlgren, N., et al. (2007). High-Throughput Sequencing of Arabidopsis microRNAs: Evidence for Frequent Birth and Death of MIRNA Genes. PLoS One, 2, e219.
Friedlander, M.R., et al. (2008). Discovering microRNAs from deep sequencing data using miRDeep. Nat Biotechnol, 26, 407-415.
Lei, J., and Sun, Y. (2014). miR-PREFeR: an accurate, fast and easy-to-use plant miRNA prediction tool using small RNA-Seq data. Bioinformatics, 30, 2837-2839.
Mathelier, A., and Carbone, A. (2010). MIReNA: finding microRNAs with high accuracy and no learning at genome scale and from deep sequencing data. Bioinformatics, 26, 2226-2234.
Meyers, B.C., et al. (2008). Criteria for annotation of plant MicroRNAs. Plant Cell, 20, 3186-3190.
Wark, A.W., Lee, H.J. and Corn, R.M. (2008). Multiplexed detection methods for profiling microRNA expression in biological samples. Angew Chem Int Ed Engl, 47, 644-652.
Yang, X. and Li, L. (2011). miRDeep-P: a computational tool for analyzing the microRNA transcriptome in plants. Bioinformatics, 27, 2614-2615.
Yang, X., Zhang, H. and Li, L. (2011). Global analysis of gene-level microRNA expression in Arabidopsis using deep sequencing data. Genomics, 98, 40-46.
Zhu, Q.H., et al. (2008). A diverse set of microRNAs and microRNA-like small RNAs in developing rice grains. Genome Res, 18, 1456-1465.
Last updated: 21st-April-2020. Current as of miRDP2 version 1.1.4
Zheng Kuang, Xiaozeng Yang
Beijing Academy of Agriculture and Forestry Sciences; The Peking University