isomiR2Function: An Integrated Workflow for Identifying MicroRNA Variants in Plants

In plants, post transcriptional regulation by non-coding RNAs (ncRNAs), in particular miRNAs (19–24 nt) has been involved in modulating the transcriptional landscape in developmental, biotic and abiotic interactions. In past few years, considerable focus has been leveraged on delineating and deciphering the role of miRNAs and their canonical isomiRs in plants. However, proper classification and accurate prediction of plant isomiRs taking into account the relative features by which we define isomiRs, such as templated or non-templated is still lacking. In the present research, we present isomiR2Function, a standalone easily deployable tool that allows for the robust and high-throughput discovery of templated and non-templated isomiRs. Additionally, isomiR2Function allows for identification of differentially expressed isomiRs and in parallel target prediction based on both transcripts or PARE-Seq either using Targetfinder or Cleaveland. isomiR2Function allows for the functional enrichment of the detected targets using TopGO package. Benchmarking of isomiR2Function revealed highly accurate prediction and classification of isomiRs as compared to the previously developed isomiR prediction tools. Additionally, the downstream implementation of additional features allows isomiR2Function to be classified as a single standalone tool for isomiR profiling from discovery to functional roles. All in all, isomiR2Function allows the streamline processing of the miRNA-seq for the identification and characterization of isomiRs with minimal efforts. isomiR2Function can be accessed through: https://github.com/347033139/isomiR2Function.


GSTAr_MTide.pl with files having the same name in 'threads' directory if your
perl is configured wilt '-Dusethreads'. Which will make isomiR2Function.pl work much more efficiently.

ONLINE RESOURCES
To make convenience for users, a list of 21 widely researched species is uploaded online. Their GO backgrounds are stored in the corresponding directories. The list will be updated if we have made online data for a new species. All these data were generated from www.phytozome.net. For reads filtering, we have prepared plant specific tRNA, rRNA and snoRNA sequences. These files are deposited into the directory 'Constructive_RNA' of online version of isomiR2Function. tRNAs and rRNAs are from NCBI Nucleotide database (http://www.ncbi.nlm.nih.gov/nuccore) and plant specific snoRNAs are from Plant snoRNA Database (http://bioinf.scri.sari.ac.uk/cgi-bin/plant_snorna/home). A template of config_file is at the root path of isomiR2Function project too. In that file, we provides information of three datasets from two different tissues for isomiRs identification and differential expression analysis.

FILE PREPARING
For a thorough analysis workflow of isomiR2Function, the user should prepare eight files before running isomiR2Funtion.pl: 1) clean_reads.fa (-t). This file will contains all the small RNAs which will be identified as isomiRs. All the reads will be unique and the headers should be in a format as ">label_rank_xcount". To avoid any potential problem during the downstream analysis, we strongly recommend the user to use 'CleanReads_modified.pl' for adapter trimming, quality control and converting FASTQ data to FASTA. 'CleanReads_modified.pl' calls python package cutadapt for trimming. You should add cutadapt in your PATH environment variable or revise the path to call cutadapt in the script. 2) pre-miRNA.fa (-p) and canonical-miRNA (-a).
These two files will contain the sequences of species specific pre-miRNAs and canonical miRNAs. The header of 'pre-miRNA.fa' and 'canonical-miRNA.fa' should be the same except '-3p' or '-5p' is appended after the name of mature miRNAs, e.g.
'>ath-miR156a-5p' for mature miRNA and '>ath-miR156a' for precursor. Case of letter is not sensitive here. If the subject species is covered by miRBase, you can use the parameter '-e' instead of providing above two files or you can use '-p', '-a' and '-e' at the same time to run isomiR2Function with your newly identified miRNAs along with those deposited in miRBase. 3) genome_index_name. This name of path will direct isomiR2Function to the path of genome indexes. You can generated these files by bowtie-build. If the genome of your subject is not accessible, transcriptome can be used instead. 4) config_file.tbl. This file will contains the details of the dataset you use for analysis, and isomiR2Function will read this file to understand your experiment design better. We have uploaded a template of this file online (see ONLINE RESOURCES). 5) GO_background.tbl. This file will contains transcript sequences of your subject for target prediction. One geneID a line, and their GO terms are appended after delimited by a TAB. Different GO terms are separated by commas.
The geneIDs of this file should be consistent with that in 'transcript.fa'. You can prepare this file from www.phytozome.net. 6) PARE.fa. This file will be needed only if the user call the parameter '-i C'. It contains the degraded sequences of mRNA and the user should not remove the redundancy. 7) transcripts.fa. This file will contain the transcripts of your subject. You can prepare this file from www.phytozome.net.

RUNNING
To start running the isomiR2Function, easily execute the command line like:  (16) -L <int> Min isomiR sequence length (16) -l <int> Max isomiR sequence length (28) -c <int> Max corrupt in seed region (0) -S <int> Start of seed region on original miRNA (2) -E <int> End of seed region on original miRNA (8) -F <int> Log2 foldchange cutoff of differential expression analysis (1) -P <int> P value cutoff of differential expression analysis ( '-i C', -j, -q and -u -q <int> How many mismatch will be allowed when mapping degradome data on transcripts (0), come with '-i C', -m, -j and -u -u <int> Which level of categories should be allowed, 0 to 4 (4), come with '-i C', -m, -j and -q -k <int> Alignment score cutoff of target prediction when calling targetfinder (4) -T <int> Number of cores for parallel analysis (1) -v Keep tmp files remain for debug -h Call out the manual of isomiR2Function

WORKFLOW
The isomiR2Function workflow has seven main steps: Step 1, sequence abundance in each libraries and in total are stored in Hash1 and Hash2, respectively. Simultaneously, canonical miRNAs are mapped on pre-miRNAs using bowtie and their locus information is stored in Hash3. Abundance filtering is carried out here to avoid unnecessary mapping to speed up process.
Step 2, sequences passed the filtering are mapped on pre-miRNAs using bowtie.
Internal and terminal information of these sequences generated by compared to canonical miRNAs is checked. Sequences with allowed 5' and 3' coordinates and seed corruptness are stored in Hash4 as templated isomiRs. The file of unmapped sequences is created by bowtie and used for subsequent analysis.
Step 3, unmapped sequences from above step are filtered with genome or transcriptome sequences by mapping against to decrease false positive results. Then, sequences failed in mapping to other region of the reference are mapped on pre-miRNAs allowing 'n' mismatches, where 'n' is defined by parameter '-s'.
Mapping results were fully analyzed by core algorithm of isomiR2Function. Terminal variations, internal substitution were then sort out from mixed type, separately.
Internal and terminal information as well as seed corruptness of passed sequences is appended to Hash4. For sequences who might originate from multiple pre-miRNAs, only best hits were analyzed because the fewest variation these hits get.
Step 4, Sequences do not hit the pre-miRNAs from above step are trimmed from their 5' or 3' terminal followed by mapping them on pre-miRNAs again allowing no mismatch. The trimming and mapping events are carried on for 'm-n' round, where 'n' is defined by parameter '-r'. Internal and terminal information as well as seed corrupt of the original sequences, who hit the pre-miRNAs after trimming, are appended to Hash4.
Step 5, Hash4 is passed to core algorithm of isomiR2Function. Graphical alignment file is generated and the details of this isomiR are appended after. Sequentially, conditional judgment is carried out, quantification file for isomiRs in different samples are built.
Step 6, EBSeq or DESeq is called by isomiR2Function for differential expression analysis of isomiRs. TAB delimited quantification file is written. Heatmap graph is also constructed for visualization.
Step 7, Targetfinder or CleaveLand pipeline is called by isomiR2Fucntion for target prediction followed by the GO enrichment of these predicted targets of differentially expressed isomiRs. Finally, significantly enriched GO terms are saved in a TAB delimited result file and a DAG figure consists of those GO terms is drawn.

OUTPUT
Templated and non-templated isomiRs are detected and analyzed separately by isomiR2Function. Normally, isomiR2Funstion will generated two sub-directory in the root path of outputs: The lasst one stores the details of parameter setting of running. Both directories will have results showing bellow.
The result files of isomiR2Function can be mainly divided into five part: From the left to right are the results of the original algorithm of isomiR2Function. The first one shows the length distribution of all isomiRs between different samples. The second one is the graphical alignments of isomiRs to corresponding pre-miRNAs.
Above is an example of result.aln. The top sequence is the pre-miRNA followed by the canonical miRNA under it. The below aligned sequences are identified isomiRs.
Non-templated bases are indicated with lower case. Mod suggests the terminal drift of isomiRs. For example, '3'+1;5'-2' means this isomiR has one nucleotide addition at 5' terminal. and two nucleotide deletion at its 3' terminal. Len suggests the length of isomiRs. Start suggests the relative start of isomiRs on pre-miRNAs. Seed suggests the corruptness happening in seed region. Var suggests the base variation of isomiRs.
SS means single SNP, MS means multi-SNP, CV means compounded variation, TS means tandem SNP. For example, SS: 16:A>C means single SNP is detected at the 16th relative site on isomiR converting A to G. The seed corruptness is defined as the following graph.
The third one is a TAB delimited table file for all identified isomiRs. Information about sequence, mismatch, base variation, 5' and 3' drift, sample information, and RPM are stored in this file.
The last one calculates statistic information of identified isomiRs.

miRNA biogenesis and processing analysis
KS-test plots are built for canonical miRNAs with more than one detected isomiRs.
The results are stored in 'ks_test' directory. The x axis indicates different types of isomiRs, the y axis indicates the abundance proportion of corresponding isomiR to the total isomiRs.

Differential expression analysis
isomiR2Function supports two algorithm for differential expression analysis, DESeq and EBSeq. Both algorithm will result three file in 'DE' directory.
The first one is a heatmap graph of top 50 differentially expressed isomiRs. The third one contains the details of isomiRs whose fold changes and p values are within the cutoff. The second one is a TAB delimited table file which contains the names and sequences of differentially expressed isomiRs. If the experiment has replicates. The intersection between DESeq and EBSeq results can be investigated using parameter '-b both'. This analysis is allowed only if experiment has replicates. If not, the isomiR2Function will complain and exit. The results of DESeq depends on whether experiment has replicates too. If does, the differentially expressed isomiRs will be selected by both p value and fold change, if doesn't, then the differentially expressed isomiRs will be selected by only fold change.

Target prediction
isomiR2Function supports two pipeline for target prediction of isomiRs, CleaveLand and targetfinder. Both pipeline will result a TAB delimited result file in 'targets' directory and a file listing the targets of differentially expressed isomiRs in 'topGO' directory, named 'gene_of_interest.txt'. If user calls CleaveLand pipeline for target prediction. Then an additional sub-directory 'Tplots' will be built in 'targets' directory.
All the T-plots of identified targets will be drawn in this directory.
Above is an example of CleaveLand-MTide generated T-plot. x axis indicates the positions on transcripts, y axis indicates the number of identified cleavage happened.
Red spot indicates the cleavage result of identified isomiR::transcript pair. Component will be constructed in 'topGO' directory.

VALIDATION
The practicability of isomiR2Function has been validated by running with two small RNA sequencing (SRR035251 and SRR035252) and one PARE sequencing data (SRR1039524). The inputs, outputs and command line to run the validation can be accessed at: https://drive.google.com/file/d/0B-w2z1YjW6TiRVJ5QTFidGI5MUE/view?usp=shar ing and checked from online file 'validation.tar.bz2'. Canonical and pre-miRNAs were downloaded from miRBase (release 21). Genome and transcript sequences are downloaded from www.phytozome.net.
To estimate the sensitivity and accuracy of isomiR2Function, we compared the behave of isomiR2Function against a previously published tool isomiRID using datasets from Arabidopsis thaliana (SRR035251), Oryza sativa (SRR1033810) and Brachypodium distachyon (SRR504362). Tools are run allowing the following parameters: 1 At most one internal mismatch against pre-miRNA is allowed for isomiR identification. 2 At most four bases of truncation and editing at one terminal is allowed for isomiR identification. 3 isomiRs should be within 18 to 26 nt in length. 4 isomiRs overlaps at least 16 nt with canonical isomiRs are identified as canonical isomiRs. 5 To investigate the sensitivity of them, isomiRs with only one read counts should be identified from the datasets.
In regard of accuracy, we investigate that based on four feature: 1 If the length of isomiR is beyond the interval we set. 2 If the expression of isomiR is counted inaccurately. 3 If the isomiR is contaminated with canonical miRNAs. 4 If the isomiR is attached to correct the pre-miRNA. Based on the validation, isomiR2Function showed great accuracy to all these features but canonical miRNAs are mixed in the identified isomiRs of isomiRID (Table 1). Additionally, there are several isomiRs which are longer than 26 nt in isomiRID results (Table 1). This analysis proved that  In addition to isomiR identification, graphic alignments of isomiRID was not accurate too ( Figure  1). In the example bellow, isomiR 'TTCGGACCAGGCTTCATTTTTTTT' was wrongly aligned in the isomiRID result but not in isomiR2Function result (Figure 1). It was said the edited bases would be in lower case by isomiRID manual. Here, only the last base was in lower case instead of all edited bases in isomiRID result. However, the edited bases were showed in the exactly lower case in isomiR2Function one (Figure 1). Besides ath-miR166a-5p and ath-miR166a-3p, canonical miRNAs identified from other precursor were aligned to ath-MIR166a which might cause confusion in isomiRID result. However, in isomiR2Function one, only ath-miR166a-5p and ath-miR166a-3p were aligned to ath-MIR166a.

Figure 1 Graphic alignments of isomiRID and isomiR2Function
Screenshot of IsomiRID isomiR2Function isomiRs aligned to ath-MIR166a. The same isomiR showed different alignment was highlighted in both isomiR2Function and isomiRID result.

ADVANTAGES
As compared to the previously developed workflows, isomiR2Function allows for the following enhanced functionalities: 1. It allows for customizable length range, seed corrupt tolerance and minimum sequencing depth for isomiR identification; 2.
Identification of isomiR on pre-miRNA not overlapping with canonical miRNAs; 3.
Information on isomiR biogenesis using Ks plots; 4. Support for differential expression analysis of the identified isomiRs using DESeq and EBSeq; 5. Detailed classification of multiple internal substitution. isomiR2Function classifies isomiRs with multiple internal substitution in more detail, i.e. MS for having internal random SNPs, CV for having both internal SNPs and terminal substitutions and TS for having internal tandem SNPs; 6. Support for transcript based target identification using TargetFinder or degradome based PARE target identification using Cleaveland; 7.
Functional enrichment of the isomiR targets for biological implications. Taking into account the implemented functionalities, we believe that isomiR2Functions will facilitate the large scale discovery of isomiRs across the plant species to gain and strengthen the role of the miRNA variants in post-transcriptional regulatory events.