SVInterpreter: A Comprehensive Topologically Associated Domain-Based Clinical Outcome Prediction Tool for Balanced and Unbalanced Structural Variants

With the advent of genomic sequencing, a number of balanced and unbalanced structural variants (SVs) can be detected per individual. Mainly due to incompleteness and the scattered nature of the available annotation data of the human genome, manual interpretation of the SV’s clinical significance is laborious and cumbersome. Since bioinformatic tools developed for this task are limited, a comprehensive tool to assist clinical outcome prediction of SVs is warranted. Herein, we present SVInterpreter, a free Web application, which analyzes both balanced and unbalanced SVs using topologically associated domains (TADs) as genome units. Among others, gene-associated data (as function and dosage sensitivity), phenotype similarity scores, and copy number variants (CNVs) scoring metrics are retrieved for an informed SV interpretation. For evaluation, we retrospectively applied SVInterpreter to 97 balanced (translocations and inversions) and 125 unbalanced (deletions, duplications, and insertions) previously published SVs, and 145 SVs identified from 20 clinical samples. Our results showed the ability of SVInterpreter to support the evaluation of SVs by (1) confirming more than half of the predictions of the original studies, (2) decreasing 40% of the variants of uncertain significance, and (3) indicating several potential position effect events. To our knowledge, SVInterpreter is the most comprehensive TAD-based tool to identify the possible disease-causing candidate genes and to assist prediction of the clinical outcome of SVs. SVInterpreter is available at http://dgrctools-insa.min-saude.pt/cgi-bin/SVInterpreter.py.

Step-by-step example of the SVInterpreter form filling

Output description
The output file is a XLSX file. For translocation, inversion breakpoints localized within different TADs, and insertions, the XLSX file contains two data tables, one per breakpoint or affected region. For inversions affecting only one TAD, deletions, and duplications, only one data table is presented. For CNVs, an additional sheet with the ACMG classification criteria (Riggs et al., 2020) and scores is included on the output XLSX file. Also, for CNVs, an additional XLSX file is made available on the output page, which contains the full CNV database overlap search results, following the CNV-ConTool output format (David et al., 2020).
The genomic information table contains all functional and non-functional genomic elements disrupted by the breakpoint, protein-coding, lincRNAs, lnRNAs, and functional and non-functional genomic elements with expression in GTEx database. These elements are sorted by their genomic position, intercalated with relevant intergenic regions and with the indication of the breakpoint location following the International System for Human Cytogenomic Nomenclature 2020 (McGowan-Jordan et al., 2020). For disrupted genomic elements, it is also indicated in the table, in which exon or intron the breakpoint is located. This location is identified in the corresponding transcript that produces the larger protein, according to Ensembl (Hunt et al., 2018). The intergenic regions are inserted into the table if they contain any single nucleotide polymorphism associated to a phenotypic trait or disease, according to genome wide association studies database (Buniello et al., 2019). Also, along the table, the boundaries of each TAD, and their respective identification relative to the brTAD is showed.
For each functional genomic element, several levels of information are given, divided into nine major categories: i) genes and intergenic regions, where data directly associated to the gene function and structure is presented; ii) clustered interactions and loops, where evidence of gene misregulation is presented; iii) clinical phenotype, where the associated disorders and phenotypical similarity search is showed; iv)gene fusion in cancer presents the state-of-the-art fusion genes and respective type of cancer and frequency; v)Animal Models and associated phenotypic characteristics, where phenotypic characteristics of several knockout animal models can be found; vi) genes associated to infertility, according to studies (Oud et al., 2019); vii) genome-wide association studies (GWAS) results, which are presented for genes and intergenic regions; viii) bibliography; ix) CNV overlap results. The columns that compose the nine categories are described in Supplementary Table 3. To facilitate data interpretation, several fields of the table containing especially meaningful annotation data are automatically highlighted, including: genes included in any panel from PanelAPP (Martin et al., 2019), genes included on the actionable genes list (Kalia et al., 2017), genes with haploinsuficiency index <10% (Huang et al., 2010), triplosensitivity score=3, observed vs expected LoF variants ≤0.3 (Karczewski et al., 2020), GeneHancer cluster of interactions (Fishilevich et al., 2017) or chromatin Loops disrupted by the breakpoint, OMIM inheritance overlapping the inputted inheritance (optional input field) and GWAS data with a p-value ≤5. 0E-7. Additionally, in the header of each table, is presented an hyperlink with the text "UCSC genome browser visualization". Through a personalized UCSC genome browser session, this link provides a graphic representation of the genomic region, helping the user with their interpretation. This session focus on the specific analyzed region, automatically marking the breakpoint or deleted/duplicated/inserted region using the highlight function from the browser. Custom tracks demarcating TADs and loops of the chosen cell-line/tissue are presented, as well as UCSC native tracks, also explored on the output table, including, among others, genes, OMIM genes, geneHancer cluster of interactions, SNPs of GWAS studies and Hi-C interactions heatmap.

2.
Step-by-step tutorial for the application of SVInterpreter, with example

Filling the input form
Step 1: Access the SVInterpreter webpage, via link: https://dgrctools-insa.minsaude.pt/cgi-bin/SVInterpreter.py Step 2: Select the genome version to be used (Supplementary Figure 2 A1). This must be the same version as the coordinates inputted at step 7.
Step 3: Select the Cell-line/Tissue to use as reference for the analysis (Supplementary Figure 2 A2). The reference TADs and chromatin loops comply with the Cellline/Tissue chosen in this step. This choice must be done according to the case's characteristics. When in doubt, we suggest the use of undifferentiated cells: hESC.
Step 4: Insert phenotypic description (Supplementary Figure 2 A3). If there is phenotypic characteristics associated with the SV, they may be inserted here (optional). The characteristics must be inserted as HPO terms IDs (HPO:XXXXXXX), separated by commas. These characteristics are overlapped with the ones associated with disease by SVInterpreter.
Step 5: Select the inheritance of interest (optional) (Supplementary Figure 2 A4). If there is an specific inheritance that the user want to investigate, it can be selected here. On the output table, the diseases associated to the chosen inheritance are highlighted.
Step 6: Select the type of SV to analyze (Supplementary Figure 2 B5). The user must choose between "Balanced Translocation", "Unbalanced Translocation", "Insertion", "Deletion", "Inversion", and "Duplication", if the intention is to analyze an SV; or "Query genomic Region", if the intention is to explore a genomic region of interest. Translocations with small deletions associated to the breakpoint may be analyzed with the options "Balanced Translocation".
Step 7: Fill the chromosome and coordinates of SV to analyze (Supplementary Figure  2 B 6A,6B). Input the chromosome(s) and genomic location(s) of the SV. For translocations and insertions (6A), fields for both affected locations are opened, while for inversions, deletions, duplications, and "Query genomic region", there is only space for one region (6B). For translocations with deletions associated to the breakpoint, instead of simply inserting the breakpoint coordinate, the user may insert the deleted region (start coordinate-end coordinate). Make sure that the coordinates are in the same genome version as the one chosen on step 1.
Step 8: (CNVs only) Perform overlap search with public databases (Supplementary Figure 2 B7). For CNVs, SVInterpreter offers the option of overlapping the inputed CNV (query) with the ones available on public databases (reference). If the user selects this option, a new submenu is open, to select which databases are going to be used, and which strategy of overlap is applied: i) "mutual overlap", where both query and reference need to overlap one another by at least X percent (X also established by the user -default is 70%); ii) "query comprised by reference", where the query has to be completely contained by the reference, independently of the reference size.
Step 9: Select if the analysis must be based on TADs or a user-defined region (Supplementary Figure 2 C). If "TADs" option is chosen, the region to analyze is defined by the TADs not coordinates. When this option is selected, a sliding button is opened, allowing the user to define an interval of TADs to analyze (default is the TAD affected by the breakpoint: brTAD). On the other hand, if the "A specific region" option is chosen, the user is allowed to insert a genomic region (as chromosome: start-end), that must be analyzed, independently of the number of TADs it contains.
Step 10: Click on submit. This will submit the form to SVInterpreter analysis, and open the output page. This page will automatically refresh until the output table is ready for download.

Output table interpretation
Step 11: Check if any of the breakpoints disrupts/deletes/duplicates a genomic element, as a protein-coding gene, lincRNA, etc. This can be easily visualized by: i) the demarcation of the breakpoints in green, on the first column; ii) the color code of the table: red for deleted regions, blue for duplicated regions, yellow and grey to differentiate between the region upstream and downstream of the breakpoint; and iii) the indication of which Exon(s)/Intron(s) is disrupted/deleted/duplicated (if applicable) on the third column, in green. This will be the first genomic element to be investigated.
Step 12: If a gene is disrupted/deleted/duplicated, check the gene-associated information, emphasizing the following: i) dosage sensitivity values (fifth and sixth columns), according to the type of SV; ii) Gtex expression, taking special attention to high expression on tissues associated with the SV-associated phenotype (if applicable, ninth column); iii) Associated phenotypes, respective inheritance, classifications by DDG2P and ClinGen, and, if applicable, the phenotypic overlap values (12th to 15th columns); iv) Genephenotype/disease association in animal models, and GWAS data, with special attention to the phenotypic overlap with the SV-associated characteristics (18th to 22th and 24th columns). If there is some indication of the gene being disease-causing, one may expand the research to the bibliographic references (25th column).
Step 13: Check the brTAD genomic element content. Verify the gene-associated data for the remaining genes of the brTAD. The relevant fields are similar to step 12, adding, GeneHancer and chromatin loop disruption, which may indicate a position effect event (10th and 11th columns). Since, in this case, we are looking for indications of position effect, the evidence of gene association must be strong, as for phenotypic overlap or association with autosomal dominant diseases.
Step 14: (Specifically for CNVs) Check the overlap with public databases. On the last column of the table, on the same line of the breakpoint, the results of the overlap with public databases is presented. This will aid the interpretation of the pathogenicity of the analyzed CNV.
Step 15: If no candidate gene was found, repeat steps 12 and 13 for flanking TADs (TAD-1 and TAD+1). This step may be repeated for the TADs further and further from the brTAD (TAD-2 and TAD+2, then TAD-3 and TAD+3, and so on) until the user feels content with their evaluation.

Example of analysis
As an example, we show the application of SVInterpreter to a translocation t(16;17) -DGRC0016. Since the input form may vary with the type of SV, other examples are showed on the tutorial available on the SVInterpreter webpage.
The filling of the form (steps 1 to 10), is illustrated in Supplementary Video 1.
Regarding the analysis of the result table (Supplementary table 6), we started by checking the disrupted genes: in the chromosome 16, ANKRD11 is disrupted at IVS2, and in chromosome 17, WNT3 gene is disrupted in IVS1. The ANKRD11 gene has a significant pLi and is associated with autosomal dominant KBG syndrome, that presents good phenotypic overlap metrics (PhenSSc 2.31 (P= 0.02 ; MaxSSc 2.91 )). Also, the animal models present consistent phenotypic characteristics. At this step, ANKRD11 seems like a suitable candidate gene. WNT3 also as a significant pLi but is only associated with autosomal recessive Tetraamilia syndrome, which shows a poor phenotypic overlap (PhenSSc 0.788 (P= 0.77 ; MaxSSc 2.91 )). These and the other table information do not support the association of WNT3 with the phenotype. As for the remaining genomic elements, none presented a significant phenotypic overlap with the case, nor were associated to autosomal dominant pathologies and had their GeneHancer cluster of interactions or loops disrupted by the breakpoint. Since the disrupted gene ANKRD11 explains the totality of the phenotype, we concluded our search here. Otherwise we would extend the analysis to TADs -1 and +1.

Supplementary Figures
7 Supplementary Figure 1. Example of a complex rearrangement analysis. (A) A hypothetical complex rearrangement in chromosome 1, involves the excision and insertion of a genomic fragment from the short arm to the long arm, and an inversion. (B) For the analysis with SVInterpreter, the complex rearrangement is subdivided into an insertion and an inversion. Together, the two analyses allow a complete overview of all the regions affected by the complex rearrangement.  (7) CNV overlap search, the user must select which databases to use, and the type of overlap, as described in David et al., (2020). (C) Lastly, the user can choose between using TADs to define the region to analyze (where the user can choose up to 5 TADs upstream (+5) or downstream (-5) the breakpoint) or set the region manually using genomic coordinates.
Retrieving integrated data, namely the association between human genes and their respective orthologs.
http://marrvel.org/ Wang et al., 2017 nd -Not determined 16 Supplementary Genes located inside the TAD; The principal symbol of the gene is presented, with the respective synonyms (if exist) inside brackets. The gene is followed by the PanelAPP associated panels and respective level of evidence, in which case, it's all presented in bold. The gene symbol has a hyperlink to the GeneCard respective page. Intergenic regions and the breakpoint locations are also presented in this column.

Actionable Genes (MAGs)
The American College of Medical Genetics and Genomics actionable genes are presented in this column. If exists, appears in bold.

Breakpoint location; Genome strand
If the gene is disrupted or partially deleted/duplicated, the affected region is presented here. Then the strand of the gene is presented next, as SS for sense, and AS for antisense. Breakpoint location and strand are separated by a semicolon. If information about the disruption is present, this field appears in bold and green.

Gene ID
Gene ID on the OMIM database. The ID has a hyperlink to the respective page on the OMIM website.

HI% ; Triplo
Indicates the Haploinsuficiency index and triplosensitivity score of the gene, separated by a semicolon. If the Haploinsuficiency index is lower than 10% or the triplosensitivity score is equal to 3, this field appears in bold and green.
pLi; o/e score Indicates the probability of loss of function and the observed vs expected score, both values separated by a semicolon. If the observed vs expected score is lower than 0.3, this field appears in bold and green.
Protein entries Direct link to Uniprot database, to a set of proteins associated to the respective gene, in the human species.

Function
Description of the gene function, according to OMIM. The description might not be complete since this field is limited by the number of characters allowed on a XLSX cell.
Top 3 highest TPM (Total TPM; Mean TPM) The three most expressed tissues according to GTEx, in Transcripts per million (TPM). Besides the value of expression of each individual tissue, the total and mean expression are also presented inside brackets and separated by a semicolon.

Clustered interactions
The region covered by the cluster of interactions of each gene, according to GeneHancer. If the region is disrupted by the breakpoint, the text is presented in bold and green.

Loops
Loops identified on the cell line or tissue chosen in the input form. The two genomic regions involved on the Loop are inside brackets, preceded by the genomic elements that they affect, and separated by "&&". If the loop is disrupted by the alteration, the field is presented in bold and green.

Clinical phenotype Assoc. Disorder
Description of the disorder associated to the gene. This field has the direct hyperlink to OMIM, DDG2P or ClinGen, according to the source of the information.
OMIM_ID_inh ID of the OMIM phenotype indicated in the previous column, and the respective inheritance separated by an underscore, with hyperlink to the respective page on the OMIM website. If the inheritance matches the one chosen by the user on the input form (optional), this field appears in bold.
DDG2P class. Classification of the disorders described on the Assoc. Disorder column, according to DDG2P.
ClinGen class. Classification of the disorders described on the Assoc. Disorder column, according to ClinGen.

PhenSSc (P; MaxSSc)
Result of the phenotype similarity search if any phenotype was inputted on the input form (optional). The first score is the similarity score between the inputted phenotype and the disorder described on the Assoc. Disorder column. Next, inside brackets, and separated by a semicolon is the p-value that reflects the probability of this score been obtained by chance and the maximum score that could be attained with the inputted phenotype description. This search is only applicable to disorders described on OMIM, with associated phenotypic description.

Gene fusion in cancer
Gene1 / Gene2 Cytoband Fusion genes found in cancer. The two genes fused are separated by a bar, and followed by the cytoband of the second gene, separated by an underscore. The first gene is always the one being described in this line. The text has hyperlink to the Atlas database or the Mitleman database.
Organ: type, nr. Cases Organ and the type of cancer where the previous fusion gene was found, followed by the number of cases. The text has hyperlink to the Atlas database or the Mitleman database.

Genephenotype/disease associations and animal models
C.elegans Phenotypic characteristics of the knockout results of the Orthologs of the human gene in C.elegans, with link to WormBase. Also a direct link to Uniprot search in C.elegans is provided.

Drosophila
Phenotypic characteristics of the knockout results of the Orthologs of the human gene in fruit fly, with link to FlyBase. Also, a direct link to Uniprot search in fruit fly is provided.

Mouse
Phenotypic characteristics of the knockout results of the Orthologs of the human gene in mouse, with link to MGI. Also, a direct link to Uniprot search in mouse is provided.

Rat
Phenotypic characteristics of the knockout results of the Orthologs of the human gene in rat, with link to RGD. Also a direct link to Uniprot search in rat is provided.

Zebrafish
Phenotypic characteristics of the knockout results of the Orthologs of the human gene in zebrafish, with link to Zfin. Also, a direct link to Uniprot search in zebrafish is provided.

Infertility
Disorder Infertility-associated disorders, that were potentially or confirmed as associated with the gene in question. The disorder is presented in the column, followed by the type of association established, inside brackets.

GWAS data
SNPs -Genetic traits SNP and genetic trait association trough genome wide association studies. The number of SNPs associated to each trait is presented and is followed by the p-value inside square brackets. SNPs with a p-value ≤5.0E-7 are presented in bold and green.

Bibliography
PubMed Link Direct hyperlink to PubMed search of the gene in question, in human.

Only for CNVs
Best Hits For CNVs or query region, the results of the overlap search, according to the preferences of the user, are presented here, in line with the beginning of the CNV. The overleaped CNV, their clinical significance, the percentage of overlap and frequency is presented. One line is presented by tested database (if any CNV falls inside the defined parameters).