In silico Method in CRISPR/Cas System: An Expedite and Powerful Booster

The CRISPR/Cas system has been at the center of attention in the last few years as a revolutionary gene editing tool with wide application in the investigation of gene function. However, its labor-intensive workflow requires sophisticated pre-experimental and post-experimental analysis, which has become one of the hindrances to the further popularization of practical applications. Recently, the emergence and advancement of in silico methods have played a formidable role in supporting and boosting experimental work. However, tools based on distinct design principles and frameworks have unique characteristics, which can leave users unsure how to choose the most appropriate one for their purpose. In this review, we present a comprehensive overview and comparison of in silico methods from the aspects of CRISPR/Cas system identification, guide RNA design, and post-experimental assistance. Furthermore, we propose hypotheses in light of new trends in technical optimization and hope to provide useful clues for the development of future tools.


INTRODUCTION
The mysterious veil of the genome and transcriptome in diverse organisms is being lifted thanks to extensive sequencing efforts. Even so, the functions of most genes remain unknown (1). The toughest challenge has been to associate phenotypic changes with alterations at the genetic level. The state-of-the-art CRISPR/Cas system for genetic manipulation is an emerging tool for solving this nerve-wracking problem (2). The CRISPR/Cas system is derived from a prokaryotic adaptive immune defense mechanism against exogenous nucleic acids in archaea and bacteria (3), and it follows a base-pairing rule between the target and the guide RNA (gRNA). The role of the gRNA is to steer the Cas enzyme to the chosen position in the presence of a protospacer adjacent motif (PAM) or protospacer flanking sequence (PFS) (4). The PAM/PFS is a recognizable component following the target site that enables precise cleavage of exogenous nucleic acids complementary to the gRNA. Depending on the type of CRISPR/Cas system, the gRNA can be the CRISPR RNA (crRNA), a short non-coding RNA derived from CRISPR arrays, or a synthetic RNA formed from the crRNA and the trans-activating crRNA (tracrRNA). CRISPR/Cas systems are divided into two classes and subdivided into six types and 30 subtypes according to the organization of the Cas effector module, the position of the CRISPR array, and the acquisition module (5). As shown in Figure 1A, type I, III, and IV CRISPR/Cas systems have multi-subunit effector complexes and thereby collectively belong to class 1, while class 2, which contains type II, V, and VI systems, has a simpler architecture composed of only one protein effector (6)(7)(8).
To date, the CRISPR/Cas system has been extensively applied in fundamental studies (9) as well as in clinical practice across multiple diseases (10,11). Of note, the discovery and implementation of a CRISPR/Cas system require an intricate workflow (Figure 1B) that includes CRISPR/Cas system identification and selection, gRNA design, transfection, single-cell clone establishment, clone screening, and systematic mutation analysis (12)(13)(14). Each step consumes considerable time, money, and manpower. Fortunately, advances in computer science create scope for remedying these deficiencies and accelerating the overall procedure (Figure 1B). In silico methods based on different algorithms and frameworks have different merits and suit different applications. Even though the number of in silico tools has grown rapidly in recent years, a comprehensive summary of their roles in the overall procedure, from system identification to application, is lacking, so many biological researchers are likely to be lost when selecting suitable tools for a given purpose. Therefore, an explicit review of the existing tools is both necessary and urgent.
In this review, we aim to summarize the released in silico methods from three major aspects (CRISPR/Cas system identification, gRNA design, and post-experimental assistance), discuss the relative merits, expound their applicability for various purposes, and put forward the possible assumptions for further improvements. We believe that our review is capable of elaborating on the roles of in silico toolkits in CRISPR/Cas system to formulate meaningful guidance for biological researchers and even provide significant clues for future tools development.

CRISPR/Cas SYSTEM IDENTIFICATION
In the adaptation phase, bacteria copy a DNA segment (protospacer) from invading phages or plasmids and insert it at the start of the CRISPR array, downstream of the leader sequence, as a new spacer (Figure 1A) (3,16,17). CRISPR arrays are then transcribed and processed into crRNAs that carry partial genetic information of the invading DNA and are thus able to form the gRNA or directly guide the Cas protein to the intended position (6). Since crRNAs and Cas proteins govern the specificity and the editing efficiency of CRISPR/Cas systems, respectively, the identification and classification of CRISPR/Cas systems composed of different types of crRNAs and Cas proteins are a fundamental prerequisite for downstream applications.

Recognition of CRISPR Arrays That Generate crRNAs
The most important component of the CRISPR/Cas system, the crRNA, is generated from CRISPR arrays (Figure 1A). Therefore, the recognition of efficient CRISPR arrays largely determines the engineering specificity in applications. To date, a variety of computational methods have been proposed to recognize CRISPR arrays from sequence information. One of the earliest tools, PatScan (18), was developed long before the CRISPR/Cas system was applied in gene editing; it searches for fragments homologous to a predefined pattern. However, PatScan was designed to detect general repeats rather than CRISPRs specifically, so it cannot distinguish the spacers from the repeats within a CRISPR array. Later, several dedicated CRISPR identifiers appeared, such as CRISPRFinder (19), PILER-CR (20), and CRT (21). CRISPRFinder (19) uses a suffix tree-based algorithm to find maximal repeats that are separated by non-repeating sequences of similar length. PILER-CR (20), which is based on an alignment matrix, identifies putative CRISPR arrays by searching for local hits of the query genome against itself and refines them using sequence similarity, conservation, and length distribution. Unlike CRISPRFinder and PILER-CR, CRT (21) does not rely on any central data structure but adopts a simple sequential scanning strategy, which gives it a high execution speed that is independent of the number of repeats in the given genome. Afterward, CRISPRDetect (22), based on k-mers and an extension strategy, was proposed; its stated improvement is the use of features of CRISPR loci, especially mutations. CRISPRDetect (22) is more sensitive to short and degenerate repeats because it scans for variant repeats under a low identity threshold in long spacers, but this also introduces the possibility of wrongly segmenting large intact CRISPRs.
The advantages and disadvantages of the abovementioned basic CRISPR identifiers are compared in Table 1.
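As a rough illustration of the sequential scanning idea described for CRT above, the following Python sketch walks along a sequence once and chains exact copies of a k-mer that recur at spacer-sized intervals. It is not the CRT implementation: the function name `find_repeat_arrays`, the parameter defaults, and the restriction to exact repeats are assumptions made for brevity, and real identifiers additionally tolerate degenerate repeats and check that the spacers differ from one another.

```python
def find_repeat_arrays(genome, k=8, min_spacer=4, max_spacer=12, min_repeats=3):
    """Toy sequential scan (hypothetical helper, not the CRT code).

    Returns a list of (repeat, start_positions) for exact k-mers that recur
    at least `min_repeats` times with gaps of min_spacer..max_spacer bases.
    """
    arrays = []
    n = len(genome)
    i = 0
    while n - i >= k:
        seed = genome[i:i + k]
        positions = [i]
        cursor = i
        while True:
            # The next repeat may start anywhere from lo to hi (inclusive).
            lo = cursor + k + min_spacer
            hi = cursor + k + max_spacer
            nxt = genome.find(seed, lo, hi + k)
            if nxt == -1:
                break
            positions.append(nxt)
            cursor = nxt
        if len(positions) >= min_repeats:
            arrays.append((seed, positions))
            i = positions[-1] + k  # continue after the last repeat of this array
        else:
            i += 1
    return arrays
```

Because the scan only looks a bounded distance ahead of each repeat, its cost per position does not depend on how many repeats the genome contains, which mirrors the property attributed to CRT in the text.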
With the diversification of research demands, some tools have been derived from the basic identifiers and tailored to different purposes (Table 2). One of the most popular purposes at present is to explore CRISPR diversity in metagenomic data and to classify CRISPR/Cas systems. Owing to the repetitive nature of CRISPRs and population heterogeneity, it is hard to assemble CRISPRs from metagenomes using the basic tools. Therefore, MinCED (23), MetaCRAST (24), Crass (25), and metaCRT (26) were developed. MinCED, Crass, and metaCRT are all based on the CRT (21) tool and implement de novo detection. MinCED and Crass need no prior knowledge of CRISPR arrays; MinCED only detects spacers in reads without assembly, whereas Crass assembles the reads into arrays. In contrast, metaCRT (26) integrates reference-based and de novo detection. MetaCRAST (24), another reference-based method, searches for repeats matching user-defined templates, which can be identified either by other tools such as CRISPRFinder, PILER-CR, and CRF or by taxonomy; however, its performance is inferior to that of Crass and MinCED owing to poor taxonomic diversity. There are also tools tailored to other purposes. For instance, if users want to compare the CRISPR arrays of different species, CRISPRcompar (27), which comprises CRISPRcomparison and CRISPRtionary and is essentially derived from CRISPRFinder, is likely the best choice. In addition, CRF (28), based on CRT, adds a random forest algorithm as an extra filter for invalid CRISPR arrays, but this learning-based tool may partially lose the ability to discover new CRISPR arrays.

FIGURE 1 | Schematic diagram of the mechanism and workflow of the CRISPR/Cas adaptive immune system. (A) The mechanism of the CRISPR/Cas system. S1: Adaptation stage. The invasive DNA sequence from a phage is cleaved and incorporated at the start of a CRISPR array, which comprises a string of spacers flanked by repeats, forming a new spacer downstream of the leader. S2: CRISPR RNA (crRNA) biogenesis stage. The precursor crRNA transcribed from the CRISPR array is further processed into mature crRNA, which carries the genetic information of the spacer. S3: Interference stage across the six main types of systems. In the type I system (signature protein: Cas3), the multimeric effector, Cascade, binds to target DNA complementary to the crRNA and then recruits Cas3 to generate a single-strand nick. The type II system (signature protein: Cas9) encodes a tracrRNA that hybridizes with the crRNA to form a dual tracrRNA:crRNA complex, which guides the Cas9 enzyme to the target and thus generates blunt double-strand breaks (DSBs). In the type III system (signature protein: Cas10), the Cas10-Cmr/Csm complex recognizes the nascent target RNA, followed by a new enzymatic activity that cleaves the complementary DNA. The type IV system (signature protein: Csf1) remains mostly unknown, although current research has demonstrated crRNA maturation and established its evolutionary connection with the type I system (15). The type V system (signature protein: Cas12) relies solely on the formation of a binary complex between the crRNA and the Cas12 enzyme to identify the target sequence and trigger staggered DSBs. In the type VI system (signature protein: Cas13), the crRNA binds to single-stranded RNA through recognition of the protospacer flanking sequence (PFS) and guides Cas13 to carry out the cleavage. (B) The workflow of CRISPR/Cas-mediated gene editing includes CRISPR/Cas system selection, guide RNA (gRNA) biogenesis and transfection, single-cell cloning and isolation, and downstream analysis. The subheadings under the main title represent the processes in which in silico methods are involved. The flows linking the left and right panels represent the correspondence between them. For example, the red flow shows that the implementation of downstream analysis corresponds to the stage after CRISPR interference.

Incorporation With Cas Protein Detector
Unlike the abovementioned tools, which focus only on CRISPR arrays, recent tools integrate a Cas protein detector to improve classification capacity and enable automated CRISPR/Cas system discovery. These tools determine putative Cas proteins using homologous sequence search tools such as BLAST (38) and HMMER (39), which compare the query protein with the sequences in a database of known proteins. For example, CRISPRmap (34) combines CRT and CRISPRFinder for CRISPR array identification with HMMER for Cas protein annotation. CRISPRdisco (32) incorporates MinCED and BLAST to achieve similar functions. In addition, CRISPRCasFinder, which uses CRISPRFinder for CRISPR array identification, integrates Cas protein detection through the dedicated tool MacSyFinder (40), which in essence relies on HMMER. Apart from the predictors, several databases collect predicted CRISPRs and Cas proteins, such as CRISPRBank (30), CRISPRone (35), and CRISPRCasdb (CRISPRdb) (36).
Although much effort has been invested in CRISPR/Cas system identification and classification, some limitations remain unsolved. On the one hand, identifying CRISPR arrays, especially short arrays, based only on pattern alignment or on limited sequence information is not enough to eliminate noise accurately. As the progression from basic tools to tailored tools suggests, an important trend is to mine and incorporate more architectural and functional features, such as the transcriptional polarity within CRISPRs (41) and regulatory relationships with endogenous genes in the bacterial host (42), to improve prediction performance. On the other hand, current tools for Cas protein detection rely mainly on annotation propagation through homologous sequence searches, which narrows the possibility of discovering novel Cas proteins.

GUIDE RNA DESIGN AND ASSESSMENT
As a key component of the CRISPR/Cas system, the gRNA specifies the target of Cas enzymes together with PAM recognition. The quality of the gRNA largely determines the efficacy and specificity of CRISPR/Cas-mediated editing. To date, several types of RNA have been found to play guiding roles via various mechanisms in different CRISPR/Cas systems (Figure 1), such as the mature crRNA in the CRISPR/Cas12a (formerly Cpf1) system (43) and the hybrid of crRNA and tracrRNA in the CRISPR/Cas9 system (44). In this section, these RNAs with guiding functions are collectively referred to as gRNA.
With the wider application of the CRISPR/Cas system, an increasing number of studies have expressed concern over incidental off-target effects, which may trigger mis-editing at other loci and lead to unforeseeable phenotypic alterations (45,46). Consequently, designing an efficient and functional gRNA with both high on-target efficacy and few off-target mutations has become the focus of much attention. Recent computational efforts have taken a massive step toward high-quality gRNA design. In what follows, we set out the usage and contributions of gRNA designers in two subsections, Overview of the gRNA Designers and Special View Into Off-Target Activity.

Overview of the gRNA Designers
Owing to their simple architecture and superior operability, class 2 CRISPR/Cas systems (Figure 1) have found much wider application. Consequently, almost all current in silico gRNA designers are developed for class 2 systems, and the following description is likewise confined to class 2 CRISPR systems.
According to their underlying principles, we divide the gRNA designers into three major genres (Figure 2). The characteristics of representative tools in each genre are shown in Table 3. 1) Pattern recognition genre (Figure 2A), relying on the base-pairing principle. Tools in this category search a specified genome for a sequence comprising a short PAM and a candidate gRNA of around 20 bp that is complementary to the query sequence. The fewer mismatches a candidate gRNA has, the higher its likelihood of on-target activity. The specific PAM should be predefined because it differs among CRISPR/Cas systems. Another factor influencing the gRNA pattern is the transcription method: the U6 and T7 promoters require G and GG, respectively, at the 5' end of the gRNA (87,88). Some tools, such as CRISPRseek (49) and flyCRISPR (50), take this into account, while others, such as SSFinder (48) and GT-Scan (51), do not. In addition, for studies of individuals, Crisflash (54) can improve accuracy by incorporating user-supplied somatic mutation data into pattern matching.
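The pattern recognition step can be sketched in a few lines of Python. This is a minimal sketch, not the code of any tool named above: it assumes a 3' PAM given as a regular expression (the default `[ACGT]GG` corresponds to the NGG motif commonly used with Cas9), a 20-nt spacer, and a hypothetical function name `find_candidate_grnas`. Hits on the minus strand are reported in reverse-complement coordinates.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_candidate_grnas(sequence, spacer_len=20, pam_regex="[ACGT]GG"):
    """List (strand, start, spacer, pam) for every spacer followed by a PAM.

    A zero-width lookahead is used so that overlapping candidates are all
    reported. `start` on the "-" strand refers to the reverse complement.
    """
    sequence = sequence.upper()
    pattern = re.compile(r"(?=([ACGT]{%d})(%s))" % (spacer_len, pam_regex))
    hits = []
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for match in pattern.finditer(seq):
            hits.append((strand, match.start(), match.group(1), match.group(2)))
    return hits
```

A promoter constraint such as the 5' G required for U6 transcription could be added as a simple filter on the returned spacers, for example `spacer.startswith("G")`.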
2) Feature rule genre (Figure 2B). The subsequent finding that editing activity varies across target sites indicates an inherent disparity in the sensitivity of targets to cleavage (89)(90)(91)(92), and it ushered in a series of explorations to identify the key features that influence targeting efficacy (93,94). These features include the G/C content of gRNAs (either high or low G/C content indicates less activity) (95), the frequency of frameshift mutations (negatively associated with CRISPR efficacy) (96), poly-T sequences (a typical terminator of gRNA transcription) (97,98), the composition of nucleobases involved in Cas binding preference (the presence of a PAM-preceding G and the absence of pyrimidines in the last 4 nt of the gRNA spacer are preferred) (63), exon position (lower efficacy when gRNAs target the terminal coding exon rather than earlier exons) (99), the status of the seed region, that is, the motif- and feature-enriched ∼10-12 nt of the spacer sequence proximal to the PAM (associated with the pairing process) (100, 101), and so on.
Frontiers in Oncology | www.frontiersin.org
Tools in this genre typically integrate several measurable features with the basic pattern recognition approach to provide more information about candidate gRNAs and target sites. Using the feature indexes and corresponding thresholds, users can set their own rules to filter out gRNAs that are unreliable or of no interest. For instance, Cas-Designer (56) lists putative gRNAs along with their G/C proportions and out-of-frame scores, which indicate the frequency of frameshift (out-of-frame) mutations. CRISPR-ERA (60) constructs a simple scoring rule by arbitrarily quantifying and weighting G/C content, poly-T motifs, and target locations. Tools of this genre provide separate assessments or arbitrary combinations of multiple features rather than an integrative analysis of their interacting contributions, which may leave users unsure how to balance potentially discordant results across features. Machine learning algorithms offer a way out of this dilemma.
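A user-defined rule of the kind described above might look like the following Python sketch, which filters spacers on G/C content and on poly-T runs. The thresholds (40-80% G/C, no run of four or more T) and the function names are illustrative assumptions, not values taken from Cas-Designer, CRISPR-ERA, or any other cited tool.

```python
def gc_content(spacer):
    """Fraction of G and C bases in the spacer (0.0 to 1.0)."""
    spacer = spacer.upper()
    return (spacer.count("G") + spacer.count("C")) / len(spacer)


def passes_feature_rules(spacer, gc_min=0.4, gc_max=0.8, max_t_run=3):
    """Return True if the spacer satisfies the (assumed) hand-set rules.

    - G/C content must lie within [gc_min, gc_max]
    - no run of more than `max_t_run` consecutive T, since poly-T can act
      as a transcription terminator
    """
    gc = gc_content(spacer)
    has_poly_t = "T" * (max_t_run + 1) in spacer.upper()
    return gc_max >= gc >= gc_min and not has_poly_t
```

Each rule is evaluated independently here, which is exactly the limitation noted in the text: the sketch says nothing about how the features interact or how they should be weighted.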
3) Machine learning genre (Figure 2C). Because the weights of multiple features remain uncertain, researchers have resorted to mathematical algorithms that systematically integrate features to identify the optimal gRNA. These models differ in their algorithms and in the information contained in their training data. For example, Doench et al. (95) (Rule set 1) observed the depletion rates of gRNAs targeting cell surface markers in mouse and human cells and attributed them to the intrinsic nucleotide composition of the target sequences, which then served as training data for a logistic regression classifier that predicts gRNA activity. Moreover, Azimuth (102) combines changes in the expression of cell surface markers (Rule set 1) (95) with drug resistance pathways (Rule set 2); trained not only on nucleotide composition but also on the secondary structure of gRNAs and the location of target sites relative to the transcription start site (TSS), it shows improved performance. Unlike the above methods, which use phenotypic changes to measure activity, other methods rely on mutations detected by sequencing. CRISPRscan (75), a linear regression model, investigated the effect of nucleotide composition on CRISPR/Cas9 efficacy by taking the gRNA-induced mutation rates of target sequences in zebrafish embryos as the measure of activity. In addition, sgRNA Scorers v2.0 (76), based on a support vector machine, used similar sequencing-derived training data (mutation rates of targets in human HEK293T cells). Likewise, TUSCAN (77) reanalyzed published data and improved prediction performance by adding features of the regions flanking the target and replacing the algorithm with a random forest.
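To make the idea of learning feature weights from data concrete, the sketch below one-hot encodes the nucleotide at each spacer position and fits a logistic regression by plain gradient descent. It is only an illustration of the model family mentioned above: the training procedure, the hyperparameters, and the function names are assumptions, it does not reproduce Rule set 1 or any published model, and real tools are trained on thousands of measured gRNAs with far richer features.

```python
import math

NUCLEOTIDES = "ACGT"


def one_hot(spacer):
    """Encode each position as four 0/1 features (position-specific bases)."""
    vec = []
    for base in spacer.upper():
        vec.extend(1.0 if base == nt else 0.0 for nt in NUCLEOTIDES)
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=500, lr=0.5):
    """Full-batch gradient descent on the logistic loss; returns (weights, bias)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(spacer, w, b):
    """Predicted probability that the spacer is active under the fitted model."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, one_hot(spacer))) + b)
```

The spacers passed to `predict` must have the same length as those used for training, because each weight is tied to a specific position and base.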
To avoid the potential biases caused by manual feature selection in the abovementioned tools based on conventional machine learning algorithms, more recent tools (80,81,86) based on deep learning minimize these biases by automating feature extraction. Among them, DeepCRISPR (86) is particularly noteworthy for unifying on-target and off-target prediction in one framework and for additionally accounting for epigenetic features, despite using phenotype-driven data.
Phenotype-driven models are strongly influenced by target position: targets far from the TSS are less likely to trigger a phenotypic change and may be misclassified as negative. In contrast, sequencing-based models measure genetic mutations more directly and consequently generalize better (77). In short, phenotype-driven models have the upper hand when users are more interested in the functional outcome of gRNA-induced mutations, while sequencing-based models have wider applicability when only genotypic alterations are of interest.
Even though in silico gRNA designers have evolved positively, the performance of machine learning-based tools remains difficult to maintain, because features vary across species and across Cas enzymes, each of which requires its own loading process. Therefore, users are advised to use tools based on feature rules if their data are not suitable for the machine learning algorithms. Beyond the abovementioned genres, gRNA designers also have other distinguishing specialties, such as one-step customization of paired gRNAs (pgRNAs) for large fragment deletion [e.g., CRISPETa (70), pgRNAFinder (73), and GuideScan (82)], special consideration of CRISPR activation or interference (CRISPRa/i) (103) [e.g., SSC (63), CRISPR-ERA (60), and CHOPCHOP v3.0 (71)], application platform, off-target prediction, and so on. These specialties give the tools distinctive abilities in particular fields and thus give users more choices for their specific purposes. Moreover, some commercial tools can also be helpful for their visual interfaces, online consultation, and one-stop ordering services, such as Synthego (https://www.synthego.com/products/bioinformatics/crispr-design-tool), based on the Azimuth algorithm (102), and IDT (https://www.idtdna.com/site/order/designtool/index/CRISPR_CUSTOM), based on its own evaluation algorithm; however, most commercial tools were designed for the popular CRISPR/Cas9 system and provide less support for other types of CRISPR systems. Table 3, which records a detailed comparison of some commonly used gRNA designers, provides a concise reference. Since no tool is omnipotent, the preconditions and the intended purpose should be fully considered before a gRNA designer is selected.

Special View Into Off-Target Activity
Off-target activity leading to mis-editing in unintended regions has been widely reported and can trigger unpredictable adverse outcomes (104,105). Undoubtedly, experimental methods including whole-genome sequencing [e.g., CIRCLE-seq (106), GUIDE-seq (107), DISCOVER-seq (108), Digenome-seq (109), BLESS (110), and HTGTS (111)] and the improved VIVO strategy (112) are relatively robust and accurate for off-target identification. Nonetheless, the labor- and cost-intensive sequencing methods are not affordable for every researcher and are sometimes unnecessary, which has spurred the emergence and progress of in silico methods.
The most typical and convenient in silico strategy for off-target risk evaluation is to align the short gRNA sequence, sometimes together with the PAM, to a reference genome and to determine the number and positions of mismatches by repurposing alignment tools [e.g., Bowtie (113), PatMaN (114), and BWA (115)]; this is exemplified by GT-Scan (51), CRISPR-RT (61), E-CRISP (65), and so on. However, short-read aligners are likely to produce a large proportion of false negatives because of their limits on the maximum number of allowable mismatches. When the number of mismatches in a read exceeds 2, the accuracy of aligners declines drastically (116). A comparison between the gold standard GUIDE-seq (107) and the alignment strategy revealed that numerous high-mismatch off-targets, and even some one-mismatch off-targets, cannot be detected by alignment alone (107). On the other hand, sites with a limited number of mismatches do not necessarily represent authentic off-targets and may be false positives. This is supported by an experiment based on SITE-seq, which found that alignment-based off-targets outnumbered validated off-targets by up to 10-fold (117).
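The mismatch-counting idea behind the alignment strategy can be illustrated with an exhaustive scan. The Python sketch below is an assumption-laden toy rather than a substitute for Bowtie, PatMaN, or BWA: it checks every window on the forward strand only, ignores the PAM and bulges, and uses hypothetical function names. Unlike a heuristic aligner, an exhaustive scan cannot miss a site within the mismatch limit, but it is far too slow for whole genomes.

```python
def count_mismatches(a, b):
    """Hamming distance between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def find_off_targets(grna, genome, max_mismatches=3):
    """Return (position, site, mismatches) for every forward-strand window
    that differs from the gRNA at no more than `max_mismatches` positions.
    A window with 0 mismatches is the perfect (on-target) match.
    """
    grna = grna.upper()
    genome = genome.upper()
    hits = []
    for pos in range(len(genome) - len(grna) + 1):
        site = genome[pos:pos + len(grna)]
        mismatches = count_mismatches(grna, site)
        if max_mismatches >= mismatches:
            hits.append((pos, site, mismatches))
    return hits
```

Raising `max_mismatches` recovers more distant sites at the price of many more candidates, which reflects the trade-off between false negatives and false positives discussed above.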
To reduce both types of error and to enable quantitative evaluation of off-target likelihood, some features and scoring systems have been incorporated into prediction programs (Figure 3). For example, CCTop (67) and CROP-IT (59) combine the mismatch number with the seed region and DNase-sensitive regions, respectively, to grade potential off-target sites using handcrafted rules. Furthermore, mismatches with a few extra bases (DNA bulge) or missing bases (RNA bulge) in the target sequence have been reported to be tolerated (118). COSMID (119) lists the number of bulges rather than incorporating it into the scoring rule because experimentally validated data are lacking. Despite the additional features in the above tools, their off-target search still relies on the alignment strategy, which is not as reliable as the sequencing-based off-target data used by the following tools. By introducing mutated gRNAs into cells and measuring gRNA abundance to quantify off-target activity, CFD (102) showed greater predictive power and has been widely reused in other tools such as CRISPR-Local (84), GuideScan (82), and GPP sgRNA designer (78). In contrast to the discontinued MIT-Broad algorithm (120), whose scanning area is confined to the 20-bp sequence, CFD (102) covers the PAM, as non-canonical PAMs were found to induce potential off-target events (102). Subsequently, researchers confirmed CFD's superior performance by comparison with experimental data (121). However, it should be noted that CFD only aggregates the off-targets within a given gene rather than on a genome-wide scale.
To overcome the drawbacks of handcrafted rules and extend the aggregation scale, recent developers have been more inclined toward machine learning algorithms (Figure 3). CRISTA (122) constructed a random forest model based on an enlarged feature set covering mismatch types (wobble and bulge), chromatin accessibility, DNA enthalpy, and DNA geometry. Regrettably, the complex feature set is a double-edged sword: it enhances prediction performance but also restricts the scope of application. Using simpler features, Elevation (123), a genome-wide aggregation model based on Naive Bayes, provides a more systematic assessment for multi-locus off-target detection. In addition, a state-of-the-art deep learning algorithm has been applied using only sequencing data and achieved relatively better results (124). Deep learning takes fuller advantage of experimental datasets, but the lack of an aggregation function and the narrow feature set remain intractable limitations. The evolution of the original off-target scoring systems is illustrated in Figure 3.
In conclusion, an optimal gRNA should have not only maximal on-target efficacy but also minimal off-target activity, which requires in silico designers that are both accurate and robust. Moreover, incorporating more functional features is key to improving prediction performance. As genetic research advances, additional factors such as histone modification (93,125) and Cas protein variants (126) have been found to exert significant influence on editing efficacy and specificity. Furthermore, what has attracted the most attention recently is individual genetic variation, which has been reported to be associated with either the creation or the destruction of potential off-target activity (127)(128)(129). Therefore, applications of the CRISPR/Cas system, especially for clinical purposes, should ideally be tailored to the individual level to control the risk of deleterious side effects.
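A handcrafted rule that weights mismatches by position can be sketched as follows. The scoring scheme below (multiplying a per-mismatch factor, with a heavier penalty inside a 12-nt PAM-proximal seed) is an invented illustration in the spirit of such rules; the penalty values and the function name are assumptions and do not reproduce the CCTop, CROP-IT, MIT-Broad, or CFD formulas. It assumes a 3' PAM, so the seed is taken as the 3' end of the spacer.

```python
def off_target_score(grna, site, seed_len=12, seed_penalty=0.9, distal_penalty=0.3):
    """Toy position-weighted score between 0 and 1 (1.0 = perfect match).

    Every mismatch multiplies the score by (1 - penalty); mismatches in the
    PAM-proximal seed (the last `seed_len` bases) use the larger penalty.
    All numeric values are illustrative assumptions.
    """
    score = 1.0
    seed_start = len(grna) - seed_len
    for i, (g, s) in enumerate(zip(grna.upper(), site.upper())):
        if g != s:
            penalty = seed_penalty if i >= seed_start else distal_penalty
            score *= 1.0 - penalty
    return score
```

With these assumed penalties, a single PAM-distal mismatch leaves a score of about 0.7, whereas a single seed mismatch drops it to about 0.1, so seed-matching sites rank as the riskier off-target candidates.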

POST-EXPERIMENTAL ASSISTANCE
CRISPR/Cas-mediated high-throughput screening has become a major means of attributing phenotypic changes to large-scale genetic or epigenetic alterations. In screening, the pooled gRNA library is amplified, packaged, and transfected into host cells (130,131). The transfected cells are screened for a phenotype of interest, and the surviving cells are sequenced to measure gRNA abundance. The major challenges then become how to precisely translate the differential gRNA abundances after selection into an evaluation of gene essentiality and how to systematically enumerate and visualize the CRISPR/Cas-induced mutations. Bioinformaticians have provided innovative computational solutions to boost the experimental procedure, as shown in Figure 1B. Below, in silico methods are introduced in three parts: Essential Gene Identification, Decipherment of the CRISPR-Induced Mutations, and Database for Experimental Data Collection.

Essential Gene Identification
Since the CRISPR/Cas-mediated screening strategy was proposed, several kinds of approaches have been put forward to estimate gene essentiality. At the early stage, off-the-shelf tools for RNA-seq expression analysis [e.g., edgeR (132), baySeq (133), and DESeq2 (134)] served as makeshift solutions for CRISPR studies. Algorithms designed for RNA interference (RNAi) screens [e.g., RIGER (135) and RSA (136)] were also used as substitutes. However, these algorithms are not well suited to CRISPR screens because of various deficiencies, including the lack of quality control, poor robustness to variable gRNA coverage per gene, and weak control of the bias toward small sample sizes or gRNAs with small read counts. To fill these gaps, dedicated methods have been emerging steadily (Figure 4, Table 4). The typical strategy (Figure 4) is to compare the read count distribution of each gRNA with that of the control and then aggregate the differences of multiple gRNAs targeting the same gene into an estimate of the gene-level effect.
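The typical strategy can be reduced to two steps, which the following Python sketch makes explicit: a per-gRNA log2 fold change between treatment and control after library-size normalization, followed by aggregation to the gene level. The median is used here purely as a simple stand-in for the aggregation step; it is not the robust rank aggregation of MAGeCK-RRA nor the statistic of any other cited tool, and the function name, pseudocount, and input layout are assumptions.

```python
import math
from collections import defaultdict
from statistics import median


def gene_scores(counts, pseudocount=1.0):
    """Aggregate gRNA read counts into gene-level scores.

    counts: list of (gene, control_reads, treatment_reads), one entry per gRNA.
    Returns {gene: median log2 fold change of its gRNAs}. Negative values
    indicate depletion after selection.
    """
    total_control = sum(c for _, c, _ in counts)
    total_treatment = sum(t for _, _, t in counts)
    per_gene = defaultdict(list)
    for gene, control, treatment in counts:
        # Normalize to reads per million so that libraries of different
        # sequencing depth are comparable, then add a pseudocount.
        control_norm = control / total_control * 1e6 + pseudocount
        treatment_norm = treatment / total_treatment * 1e6 + pseudocount
        per_gene[gene].append(math.log2(treatment_norm / control_norm))
    return {gene: median(lfcs) for gene, lfcs in per_gene.items()}
```

A plain median has no model of count variance, which is one reason the dedicated tools described next fit explicit count distributions before ranking genes.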
MAGeCK-RRA (147) based on the negative binomial model and robust rank aggregation (RRA) is the first tool customized for prioritizing gRNAs, performing gene-level ranking and identifying the enriched pathways. To extend the functions, MAGeCK-RRA (147) was further updated to scMAGeCK (148) for single-cell CRISPR screening (a novel technique combining pooled CRISPR screening with single-cell RNAseq, which enables the identification of gRNAs at single-cell resolution from sequencing by modifying the lentiviral vector) and MAGeCKFlute (137) with optional ranking algorithm (maximum likelihood estimation) (149), gRNA outlier removal by network essentiality scoring tool (150), and various accessory functions including upstream quality control and downstream visualization. For some novices without programming expertise, command-line programs are hard to tame and the graphical  workflow, ENCoRE (141), seems more user-friendly, whereas the rough processing of gene ranking may induce unreliable results. Likewise, a universal analyzer, HiTSelect (138), is designed for both RNAi and CRISPR screens, whereas Poisson distribution used to fit the active gRNA abundance is not applicable because the mean and variance of gRNA count are always not equal. Considering that the variance of gRNA count can be either smaller or greater than the mean, Jeong et al. (146) developed CRISPRBetaBinomial based on beta-binomial distribution model and gained the superior sensitivity as well as lower false-negative rate as expected. Totally different in gene-level statistic, BAGEL (140) and JACKs (143) used the reference sets composed of the identified essential and nonessential genes to analyze the query data. Even though these prior knowledge-based methods reward excellent performance, the required compatibility between reference and query sets and the prohibitive update of the pre-set data remain the critical handicaps for popularization. 
Allowing for the varying effects of gRNAs targeting the same gene especially in CRISPRa/i screens, CRISPhieRmix (145) took a hierarchical mixture model to deconvolute the gRNA distribution and calculate a posterior probability for genes, in which sufficient gRNAs per gene are required to ensure the full discovery of essential genes. Other than the above methods affiliated to typical strategy, the methods in other ways provide more options for particular problems. For example, CERES (144) incorporated copy number effect and thus realized improved specificity in the realm of cancer cells (the left panel of Figure 4). Furthermore, PBNPA (142) (the right panel of Figure 4) permuted gRNA labels to compute gene-level p-values, which may outperform the competitors when encountering the small amounts of gRNAs per gene or low sequencing depth. Similarly, ScreenBEAM (139) is another skillful solution for low-quality data owing to the direct estimation on the gene level. The characteristics of existing essentiality evaluators are listed in Table 4.
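The permutation idea mentioned for PBNPA can be illustrated with a short Python sketch. This is a generic label-permutation test under stated assumptions, not the PBNPA algorithm itself: the gene-level statistic (median log fold change), the one-sided test for depletion, the number of permutations, and the function name are all choices made for the example.

```python
import random
from statistics import median


def permutation_p_value(lfcs, gene_labels, gene, n_perm=2000, seed=0):
    """One-sided permutation p-value for depletion of `gene`.

    lfcs: per-gRNA log fold changes; gene_labels: gene targeted by each gRNA.
    The null distribution is built by drawing, without replacement, as many
    gRNAs as the gene has from the whole library, which is equivalent to
    shuffling the gRNA-to-gene labels.
    """
    rng = random.Random(seed)
    k = gene_labels.count(gene)
    observed = median(l for l, g in zip(lfcs, gene_labels) if g == gene)
    hits = 0
    for _ in range(n_perm):
        if observed >= median(rng.sample(lfcs, k)):
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Because the null distribution comes from the data themselves, no parametric count model is needed, which is consistent with the robustness to few gRNAs per gene or shallow sequencing noted above.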
In general, despite leaving the copy number effect out of consideration, MAGeCK (137) remains the most widely used tool in various biological fields, such as identifying cancer drivers (151), drug targets (152), and pathway components (153). Its prominent advantages over other tools are its all-around service covering both upstream and downstream analyses, relative ease of use, and excellent ranking criteria that deal well with variable gRNA efficacies. Meanwhile, other tools still have a place in the cases they are suited to: ScreenBEAM (139) for low-quality data and ENCoRE (141) for novice users are two representative examples.

Decipherment of the CRISPR-Induced Mutations
Owing to their outstanding feasibility and versatility, type II CRISPR/Cas9 and type V CRISPR/Cas12a dominate practical use. Double-strand breaks (DSBs) created by Cas9 or Cas12a cleavage can be repaired via several pathways, which produce a mixture of mutations. The repair pathways mainly include (1) non-homologous end joining (NHEJ) (154), which is an error-prone repair pathway and may induce random insertions and deletions (INDELs); (2) homology-directed repair (HDR) (155), which relies on a donor template homologous to the sequence around the DSB site to achieve precise editing or correction; and (3) microhomology-mediated end joining (MMEJ) (156), where the single-stranded overhangs generated by the nuclease are annealed at microhomologies (typically 5-25 bp) present both upstream and downstream of the DSB. Two major approaches are used to dissect the mutational outcome. First, some machine learning-based tools, such as inDelphi (157), FORECasT (158), and Lindel (159), use the sequence context to predict the distribution of mutations accurately. However, like other learning-based tools, their application depends heavily on the training set and cannot readily be extended across different CRISPR systems and species. Second, next-generation sequencing (NGS) can not only detect the mutations but also classify the mutation types and measure mutagenesis efficiency. Nonetheless, transforming millions of sequencing signals into quantitative and comparable data remains challenging and needs mathematical aid from in silico tools. The fundamental workflow of these tools is similar to standard high-throughput sequencing analysis, including quality control, adaptor trimming, alignment, and quantification. The main differences among the existing tools are as follows.
1) Alignment strategy. The existing tools adopt either local alignment to the reference amplicons [e.g., CRIS.py (160), CRISPR-DAV (161), and CRISPR-GA (162)] or global alignment to an entire reference genome [e.g., CrispRVariants (163) and ampliconDIVider (164)]. The local strategy is apt to miscount the candidate off-target reads, while the global strategy makes it [...]
2) Deconvolution of the mixed mutations. As mentioned above, three major pathways (NHEJ, MMEJ, and HDR) jointly participate in DSB repair. In contrast to the unpredictable mutations generated by NHEJ, precise modifications generated by HDR and MMEJ are preferred for purposive gene editing. Therefore, classifying the modified alleles is essential for determining the mutant sites and mutagenesis efficiency. The tools adopting the local strategy [e.g., CRISPResso2 (165), CRIS.py (160), CRISPR-DAV (161), and CRISPR-GA (162)] align reads to the expected HDR amplicon and the reference amplicon and then identify the modification status by comparing alignment rates and sequence identities. Moreover, some tools [e.g., ampliconDIVider (164), CRIS.py (160), and CRISPResso2 (165)] enable the quantification of in-frame occurrences and potential splice sites according to mutation location and sequence length. Mutations located in the coding region whose net length change preserves the reading frame (a multiple of 3 bp) are regarded as in-frame, while the others are regarded as frame-shift. Regrettably, no tool for distinguishing MMEJ-induced mutations is available yet.
3) Applicability for base editors. To avoid the random introduction of INDELs in canonical CRISPR/Cas experiments, base editors, which are fusions of a catalytically impaired Cas enzyme to a base deaminase that operates on single-stranded DNA, can directly install point mutations by mediating base conversion without DSB generation (167,168). Conventional tools designed only for INDEL quantification cannot detect the varying combinations of base conversions induced by a base editor. Interestingly, CRIS.py (160) and CRISPResso2 (165) fill this gap by searching for pre-set nucleotide substitution rules.
Additionally, whether a tool provides visualization and on which platform it runs are also worth considering. Detailed information on the existing CRISPR NGS data analyzers is listed in Table 5.
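Two of the classification ideas discussed in this section can be sketched in a few lines: labelling an allele as in-frame or frame-shift from its net length change, and counting the microhomology that flanks a deletion, which is one feature a future MMEJ classifier might use. The sketch below is an illustration under stated assumptions, not the method of CRISPResso2, CRIS.py, or ampliconDIVider; the function names, the 0-based coordinate convention, and the toy reference sequence are all hypothetical.

```python
def frame_effect(ref_len, allele_len):
    """Label an allele by its net length change relative to the reference."""
    delta = allele_len - ref_len
    if delta == 0:
        return "no_length_change"
    # A net change divisible by 3 keeps the downstream reading frame intact.
    return "in_frame" if delta % 3 == 0 else "frame_shift"


def microhomology_length(ref, start, length):
    """Count bases of microhomology flanking a deletion.

    The deletion removes ref[start:start + length] (0-based coordinates).
    The count is the number of positions by which the deletion could be
    shifted left or right without changing the resulting allele, capped at
    the deletion length so that tandem repeats do not inflate it.
    """
    end = start + length
    right = 0
    while end + right < len(ref) and ref[start + right] == ref[end + right]:
        right += 1
    left = 0
    while start - left - 1 >= 0 and ref[start - left - 1] == ref[end - left - 1]:
        left += 1
    return min(left + right, length)


# Hypothetical reference with two GATC copies separated by TTTAA.
ref = "ACGTGATCTTTAAGATCCCGG"

# Deleting one GATC copy plus the intervening TTTAA (9 bp) keeps the frame
# and is flanked by 4 bp of microhomology.
print(frame_effect(len(ref), len(ref) - 9), microhomology_length(ref, 4, 9))

# Deleting GTG at position 2 shows no flanking microhomology.
print(frame_effect(len(ref), len(ref) - 3), microhomology_length(ref, 2, 3))
```

A real analyzer would first have to derive the deletion coordinates from read alignments and would also need to handle insertions and complex alleles, which this sketch deliberately leaves out.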

Database for Experimental Data Collection
As the applications of CRISPR/Cas screening in gene function exploration expand rapidly, so does the need for open databases of validated data where researchers can easily access raw or processed data. To satisfy this urgent need, several repositories have been built (Table 6). Of note, compared with the databases that only record results without any comparison of screening results among different studies [e.g., CRISPRz (171), CrisprGE (172), CRISPRlnc (151), and BioGRID ORCS (176)], GenomeCRISPR (173), based on 84 high-throughput screens, additionally provides intuitive comparisons of gRNA efficacies as well as perturbation phenotypes under specific conditions. Instead of collecting gRNA information, PICKLES (174) reanalyzed the raw screening data and compared the essentiality of a given gene across multiple experiments, tissues, or cells. Another two independent databases tailored for human cancer research are Sanger DepMap (175) and Broad DepMap (144), which record gene dependencies in cancer cell lines derived from the analysis of CRISPR/Cas9 screening data. Furthermore, some databases [e.g., Anti-CRISPRdb (177) and CRISPRminer (178)] record the anti-CRISPR proteins in phages that have been experimentally validated to inhibit the activity of the CRISPR/Cas system and reduce off-target events (179).

CONCLUSION AND PERSPECTIVE
CRISPR/Cas systems have guided researchers through the dark of complex functional annotation, where they were once left flat-footed. However, advances in experimental techniques still cannot make the CRISPR/Cas system effortless and fast to use, which therefore needs essential assistance from in silico methods. This review provides a comprehensive summary and comparison of the released tools from two perspectives: pre-experimental guidance (CRISPR/Cas system identification and gRNA design) and post-experimental analysis (gene essentiality evaluation, decipherment of the experimental outcome, and data collection). The characteristics of tools based on different design principles and frameworks have been described above, which will hopefully guide users to make more reasonable choices for their specific data and purposes.
Unfortunately, the CRISPR/Cas system has not yet reached a satisfying level of performance in practical use. Current strategies for technical improvement mainly address two aspects. On the one hand, the most reliable and effective approach is to optimize the experimental technique, which is well exemplified by the fusion of catalytically impaired Cas enzymes to other engineered proteins to construct less risky systems such as CRISPRa/i (103), base editors (167), and prime editors (180), and by enhancing the efficiency of precise repair (181). Yet experimental improvement cannot cover all facets, let alone guarantee an affordable cost. In such cases, in silico tools, the second aspect, become important, even though there is still a long way to go: how the chromatin environment affects on-target and off-target activities, whether these effects are fixed or vary across tissues and organisms, how to resolve the disparity of training sets in machine learning-based tools that may cause poor versatility, and how to incorporate individual information into personalized gRNA design. To the best of our knowledge, the hypotheses for tool optimization are: (1) For CRISPR/Cas system identification, precisely distinguishing CRISPR arrays from other similar repeats requires the incorporation of more distinctive features, such as the interactions with other genes in the host (42) and the intra-genus conservation (41); (2) For gRNA design, besides feature expansion and algorithm optimization, the individual variance associated with on-target and off-target activities (127)(128)(129) should be taken into account. Current tools such as Crisflash (54) and CRISPR-Local (84), which consider only somatic mutation, are far from satisfactory. It is envisioned that in silico tools covering more individual characteristics, such as chromatin environment, accessibility, and exon expression, promise more reliable prediction, especially for clinical purposes; (3) For gene essentiality evaluation, existing tools are not as all-powerful as expected: they misinterpret the uncertain relationship between the mean and variance of gRNA counts, neglect the copy number effect, or lack accessory functions; (4) For the deconvolution of mutations, combining a microhomology predictor with local alignment to the reference may pave a new way for quantifying MMEJ-induced mutations.
The urgent demand for optimizing in silico methods cannot mask the fact that they have made tremendous contributions to biological research. It is increasingly expected that progress in computational methods will push the CRISPR/Cas system to a higher stage and even help it reach widespread clinical use earlier.