Resources for Interpreting Variants in Precision Genomic Oncology Applications

Precision genomic oncology—applying high throughput sequencing (HTS) at the point-of-care to inform clinical decisions—is a developing precision medicine paradigm that is seeing increasing adoption. Simultaneously, new developments in targeted agents and immunotherapy, when informed by rich genomic characterization, offer potential benefit to a growing subset of patients. Multiple previous studies have commented on methods for identifying both germline and somatic variants. However, interpreting individual variants remains a significant challenge, relying in large part on the integration of observed variants with biological knowledge. A number of data and software resources have been developed to assist in interpreting observed variants, determining their potential clinical actionability, and augmenting them with ancillary information that can inform clinical decisions and even generate new hypotheses for exploration in the laboratory. Here, we review available variant catalogs, variant and functional annotation software and tools, and databases of clinically actionable variants that can be used in an ad hoc approach with research samples or incorporated into a data platform for interpreting and formally reporting clinical results.


INTRODUCTION
Genomic technologies and approaches have transformed cancer research and have led to the production of large-scale cancer genomics compendia (1,2). The resulting molecular characterization and categorization of individual samples from such compendia has driven development of molecular subtypes of cancers as well as enhanced understanding of the molecular etiologies of carcinogenesis (3)(4)(5). The development of novel and effective targeted therapies has proceeded in parallel with and been accelerated by deeper, faster, and broader genomic characterization (6), enabling early application of molecular characterization at the point of care to inform clinical decision-making (7)(8)(9)(10) and to address resistance to primary therapy (11). Genomic characterization also has applications in immune approaches to cancer. For example, chimeric antigen receptor T-cell (CAR-T) therapy has shown great success in diseases with well-characterized antigens that are relatively tumor-specific (12) as identified by genomic profiling. Variously referred to as precision oncology (13), genomics-driven oncology (14), genomic oncology, and even simply as precision medicine, the paradigm of applying high-throughput genomic approaches to patient samples is rapidly changing the landscape of oncology care and clinical oncology research. Conventional approaches to clinical trial design may be inadequate due to molecular heterogeneity of tumors derived from a single primary tissue (15), leading to the adoption of basket, umbrella, and hybrid trial designs. A number of studies are ongoing to determine feasibility and potential impact of precision genomic oncology at the point-of-care (16)(17)(18). In addition to studies focused on identifying targetable mutations, immune-based therapeutic approaches are also being informed by HTS applied to patient samples (19)(20)(21).
One of the most recent developments in the field of precision oncology is the approval by the US Food and Drug Administration of pembrolizumab (Keytruda), an anti-PD-1 antibody that functions as a checkpoint inhibitor, for treatment of solid tumors that show genetic evidence of mismatch repair deficiency and, therefore, carry very high mutational burdens (22). Pembrolizumab was previously approved for use in melanoma, but the most recent approval is the first that allows a drug to be used in a non-tissue-specific context in patients showing a specific genomic marker in any solid tumor (23).
As with any clinical testing modality, whether in a research setting or at the point-of-care, a clear understanding of the goals of applying the test is necessary when first designing the test and its validation. However, the flexibility and number of potential data items that arise from even a limited application of HTS has led the US Food and Drug Administration (FDA) to begin to define its regulatory role (24) and, critically, how existing knowledge bases can be applied in real time to address findings from clinical HTS testing (25).
This review aims to provide an organized set of biological knowledge bases with relevance to the interpretation of small variants, defined as single nucleotide variants or short (on the order of 20 base pairs or fewer) insertions and deletions. The catalogs of observed variants section lists large-scale catalogs of variants, useful for filtering known common polymorphisms and identifying previously identified cancer variants. When a variant observed in a clinical sample has not been seen before but appears to affect the protein coding sequence, the functional annotation resources section presents a sampling of some of the most common software and databases for predicting the impact on protein function. Finally, we catalog several data products and knowledge bases that have been developed to provide decision support (with strong disclaimers and caveats) directly linking observed variants to clinical intervention in point-of-care HTS applications. Integrating the various data sources described in this review with variants observed in individual patients can be accomplished with combinations of software tools for the manipulation of variant datasets.

Catalogs of Observed Germline and Somatic Variants
Databases of observed variation in normal populations, diseased individuals, and cancer compendia form the map onto which observed variants in patients are projected. Because of the vast quantities of genomic data and, specifically, DNA variants, there is a tension between providing rich, highly curated information about individual variants and producing the largest possible catalog of variants with manageable levels of curation. This section reviews some of the available catalogs (Table 1) of genomic variation observed in the germline as well as those that appear in tumors as somatic mutations. Note that many of the databases mentioned below overlap in data sources (some nearly completely), but they may differ in the amount and depth of curation, additional metadata added to each variant, speed of updates, and methods or formats for access.

Germline
Comprehensive catalogs of germline variants help establish how frequently a variant is seen in the general population and identify variants that are annotated as cancer associated. In the context of tumor sequencing, common variants are unlikely to be genomic drivers of carcinogenesis and are often filtered from a report of potential somatic variants. This filtering process is particularly important when tumor sequencing is not accompanied by matched normal sequencing. Additional germline databases that catalog disease-associated variants can be useful to begin to address familial risk and potentially pharmacogenomic loci (38,39).
Perhaps the oldest of the variant catalogs, dbSNP contains 325,658,303 individual variant records (build 150, accessed May 30, 2017) and is available in multiple formats, searchable, and linked to records in literature and other data resources and databases. While the vast majority of variants in dbSNP have been observed in individuals without cancer, somatic variants are included and annotated in the database. Because dbSNP is driven by community submission of variants, levels of evidence vary among individual variants. The genome Aggregation Database, or gnomAD (29,40), contains information from 123,136 exomes and 15,496 whole genomes from unrelated individuals sequenced as part of various disease-specific and population genetic studies (accessed May 30, 2017). These data were collected by numerous collaborations and underwent standardized processing and unified quality control; results are accessible as a searchable online database and as a downloadable VCF-format text file. ClinVar (28), maintained by the NIH National Center for Biotechnology Information (NCBI), is a freely available archive for interpretations of clinical significance of variants for reported conditions. Entries in ClinVar are taken directly from submitters and represent the relationship between variants and clinical significance. When multiple submissions concerning a single variant are available, ClinVar supplies high-level summaries of agreement or disagreement across submitters. Importantly, though, clinical significance in ClinVar is reported as supplied by the submitter. The Personal Genome Project (31) provides a limited number of fully open-access genome sequencing results provided by volunteers, together with trait surveys and even some microbiome surveys of participants.
A catalog of germline variants derived from 69 genomes sequenced using the Complete Genomics sequencing platform (30) may be useful for groups who have data generated from the same platform, particularly for identifying sequencing-platform-specific false positive results.
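The filtering of common germline variants described above can be illustrated with a short sketch. This is a minimal example under stated assumptions, not a recommended clinical filter: the `Variant` record, the `filter_common` helper, the 1% allele frequency cutoff, and the example records are all hypothetical, and the population frequency is assumed to have been looked up already in a catalog such as gnomAD.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    """Hypothetical minimal variant record used only for this illustration."""
    chrom: str
    pos: int
    ref: str
    alt: str
    # Population allele frequency from a germline catalog;
    # None means the variant was not found in the catalog.
    population_af: Optional[float] = None


def filter_common(variants: List[Variant], max_af: float = 0.01) -> List[Variant]:
    """Keep variants that are absent from the catalog or rarer than max_af.

    The 1% default is an assumed cutoff; an appropriate threshold depends
    on the assay and the clinical question.
    """
    return [
        v for v in variants
        if v.population_af is None or v.population_af < max_af
    ]


# Illustrative records only; the frequencies are made up for the example.
calls = [
    Variant("7", 140453136, "A", "T", None),      # not seen in the catalog
    Variant("1", 12345, "G", "A", 0.25),          # common polymorphism
    Variant("17", 7577120, "C", "T", 0.00001),    # very rare
]
rare = filter_common(calls)
```

Keeping variants that are absent from the catalog is a deliberate choice here: in tumor-only sequencing, an unseen variant may be either a private germline variant or a true somatic event, so it is retained for downstream review rather than discarded.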

Somatic
Whereas databases of germline variants are useful to filter out variants unlikely to be directly involved in carcinogenesis, databases of somatic variants are useful to identify variants and their frequencies as observed in tumors. In some cases, identified variants may be associated with specific tumor types, offering mechanistic clues, particularly in the rare cancer setting where biological understanding may be limited.
Several catalogs of somatic variants have, at their core, variants derived from The Cancer Genome Atlas (TCGA). These databases vary in the pipelines used to define the variants, the level of annotation associated with individual variants, the proportion of TCGA included, and methods for accessing or querying. Recently, the National Cancer Institute (NCI) has established the Genomic Data Commons (GDC) to harmonize clinical information and genomic results across enterprise cancer datasets (32), particularly those funded by NCI, such as TCGA. In addition to the adult tumors profiled as part of the TCGA, the NCI GDC also contains data from several pediatric tumors profiled as part of the Therapeutically Applicable Research To Generate Effective Treatments (TARGET) project (41). Cancer cell line data from the Cancer Cell Line Encyclopedia (CCLE) are also included (42) in the GDC data collection. The GDC is a modern data platform that provides multiple access methods, including an application programming interface (API) for programmatic access, data file download, and web browser-based text and graphical queries and visualization. The International Cancer Genome Consortium (ICGC) is a large, international collaboration with a collection of 76 studies (including TCGA studies) encompassing 21 tissue primary sites. Like the NCI GDC, the ICGC data portal provides modern data platform approaches to data access, visualization, and query (43). The Catalogue of Somatic Mutations in Cancer (COSMIC) database is perhaps the largest and best-known cancer variant database. It presents a unified dataset consisting of curated cancer variants for specific genes as well as genomic screens from projects, such as TCGA. Several other cancer variant data resources are listed in Table 1.

FUNCTIONAL ANNOTATION RESOURCES
When faced with variants with little or no literature or database support, differentiating variants that are likely to be deleterious, perhaps contributing to carcinogenesis, from those that are likely tolerated by the cell is a critical task, particularly in the setting of clinical precision genomic oncology. Note that determining that a variant is deleterious is not likely to result in a change in diagnosis, prognosis, or therapy. However, prioritizing variants for further study, research interest, and for discussion in forums such as a molecular tumor board is a valuable and necessary aspect of applying genomic technologies in the clinical arena.
A number of algorithms and methods have been developed to predict the effect of observed variants on protein structure and function as well as the potential for clinical impact. These prediction methods utilize features of the variant and its context, such as sequence identity, sequence conservation, evolutionary relationship, protein primary and secondary structure, entropy-based protein stability, and approaches such as clustering based on sequence alignments and machine learning. Some of them are specific to the type of variant or mutation, some to a disease type, and some are more general. Therefore, applying these functional annotation tools and interpreting the results in a clinical or research setting may require significant human curation before a variant is recognized as clinically actionable. Here, we present a review of a representative set of approaches for predicting pathogenicity of different variants. For a comprehensive list of prediction tools and their details, see Table 2. For more detailed scientific and technical explanations of these methods, we refer the reader to a comprehensive review (44).

SIFT
Sorting Intolerant From Tolerant, or SIFT, predicts the functional impact of amino acid substitutions (48). It is one of the earliest variant effect prediction tools and represents the class of prediction algorithms that utilizes protein conservation. It has since been updated and an online version of the tool is available (67). SIFT uses sequence homology, as measured by protein-level conservation, to classify variants as tolerated or deleterious based on the associated protein coding changes. SIFT has served as a benchmark against which other methods are compared because of its relative simplicity. SIFT considers the type of amino acid change induced by a genomic variant and the position at which the change/mutation occurs. SIFT relies on the presence of homologous sequences from which conservation can be determined; variants in proteins for which such sequences are limited will potentially lack robust predictions.
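To make the idea of conservation-based classification concrete, the following toy sketch scores a substitution by how often the new residue appears in a single alignment column, relative to the most frequent residue. It is a deliberate simplification and not SIFT's actual algorithm, which works from a full multiple sequence alignment with more careful probability estimation. The function names, the example column, and the 0.05 cutoff used here are assumptions made for illustration.

```python
from collections import Counter

# Assumed cutoff for this toy example; substitutions scoring below it
# are labeled "deleterious".
TOLERANCE_CUTOFF = 0.05


def substitution_score(column: str, new_residue: str) -> float:
    """Frequency of new_residue in the column, scaled by the most common residue.

    `column` holds one aligned residue per homologous sequence; gaps ("-")
    are ignored. A residue never observed at this position scores 0.
    """
    counts = Counter(residue for residue in column if residue != "-")
    if not counts:
        raise ValueError("alignment column has no residues")
    most_common = max(counts.values())
    return counts.get(new_residue, 0) / most_common


def classify(column: str, new_residue: str, cutoff: float = TOLERANCE_CUTOFF) -> str:
    """Label a substitution as tolerated or deleterious from one column."""
    score = substitution_score(column, new_residue)
    return "tolerated" if score >= cutoff else "deleterious"


# Hypothetical column: leucine in nine homologs, isoleucine in one.
column = "LLLLLLLLIL"
label_isoleucine = classify(column, "I")  # observed among homologs
label_proline = classify(column, "P")     # never observed at this position
```

The sketch also shows why the depth of the alignment matters: with only a handful of homologs, a residue can be unobserved simply because too few sequences were available, which is the limitation noted above.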

PolyPhen-2
Polymorphism Phenotyping v2, or PolyPhen-2, predicts the effect of coding non-synonymous SNPs on protein structure and function and annotates them (47). This algorithm uses a naive Bayes approach to combine information across a panel of 3D structural, sequence-based, and conservation-based features. Trained on two datasets, HumDiv and HumVar, and associated non-deleterious controls, the PolyPhen-2 algorithm represents a class of multivariate prediction algorithms that employ machine learning and multiple features of variant impact.

Mutation Assessor
Mutation Assessor is an algorithm and tool that, like SIFT, uses a conservation-based approach. However, Mutation Assessor also incorporates evolutionary information in an attempt to account for shifts in function between subfamilies of proteins (27), potentially extending the functional annotation of variants to "switch of function" as well as loss or gain of function. By quantifying the impact to conserved residues both globally and within subfamilies (residues that distinguish subfamilies from each other are thought to be less tolerant to change), Mutation Assessor defines a functional impact score to predict which variants are likely to be deleterious.

CONDEL
The CONsensus DELeteriousness, or CONDEL score, is an integrated prediction method for missense mutations that is relatively easy to extend with additional prediction resources (63). Originally implemented as a weighted average of the normalized scores from the output of two computational tools, Mutation Assessor and FATHMM, CONDEL can be extended or adapted to data at hand and represents an "aggregator" approach to variant effect prediction. CONDEL scores can be derived for a limited set of specified mutations via an online web application. The Ensembl database provides a variation of position-specific CONDEL predictions that combine SIFT and PolyPhen-2 for every possible amino acid substitution in all human proteins.
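The "aggregator" idea of a weighted average over normalized scores can be sketched in a few lines. This is an illustration of the general technique only, not the published CONDEL implementation: the tool names, scores, and weights below are hypothetical, and each score is assumed to have been normalized beforehand to the range 0 to 1 with higher values meaning more deleterious.

```python
from typing import Dict


def weighted_consensus(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-tool scores.

    `scores` maps a tool name to its normalized score in [0, 1];
    `weights` maps the same tool names to non-negative weights.
    Only tools present in `scores` contribute, so tools can be
    added or dropped without changing the function.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights for the supplied tools must sum to a positive value")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical normalized scores and weights for two predictors.
scores = {"tool_a": 0.8, "tool_b": 0.4}
weights = {"tool_a": 3.0, "tool_b": 1.0}
consensus = weighted_consensus(scores, weights)  # (0.8*3 + 0.4*1) / 4, about 0.7
```

Because the combination step only needs a score and a weight per tool, extending the consensus with another predictor amounts to adding one entry to each dictionary, which is the extensibility property described above.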

CHASM
Cancer-specific High-throughput Annotation of Somatic Mutations, or CHASM, is a computational method that identifies and prioritizes the missense mutations likely to enhance tumor cell proliferation (65). CHASM uses machine learning to classify putative "driver" cancer mutations as distinct from "passenger" mutations. Training the CHASM model employed in silico simulation to generate realistic "passenger" mutations, specifically modeled to represent variant context and genes that are observed in cancer settings. Multiple features of the variants, including their DNA and protein contexts, were then used to build a machine learning approach that attempted to maximize the specificity of separating driver mutations from passenger mutations. CHASM represents a relatively specific algorithm focused not on "deleteriousness" but, rather, on the likelihood that an observed variant is a cancer "driver."

dbNSFP
Recognizing that applying all of the effect prediction tools available is potentially challenging, the developers of dbNSFP (45) built a database that aggregates predictions for all possible SNVs associated with coding changes (in Gencode gene models). With more than ten different prediction algorithms and extensive additional annotation, this database can be a useful one-stop-shop for adding annotations to variant datasets. The SnpEff suite (described below) can be used in conjunction with dbNSFP to efficiently annotate SNPs with the potential to affect coding genes.

CLINICAL ACTIONABILITY
The ultimate goal for many of the abovementioned resources is to develop an individualized approach to the diagnosis, prevention, and treatment of cancer, or precision oncology. However, despite recent advances in HTS, determining the clinical relevance of experimentally observed cancer variants remains a challenge in the application of HTS in clinical practice. Difficulties in differentiating driver and passenger mutations, lack of standards and guidelines in reporting and interpretation of genomic variants, lack of clinical evidence in associating genomic variants to clinical outcome, lack of resources to disseminate clinical knowledge to the cancer community, and the lack of a precise definition of actionability have been reported to contribute to the bottleneck (68)(69)(70)(71). Comprehensive resources linking experimentally determined cancer variants and clinical actionability have been developed to address some of these challenges and various aspects of translating research results into clinically valuable information to support clinical decisions in precision oncology (see Table 3).
In recognition of the fact that central curation of information regarding actionability is extremely challenging, several of the resources below use crowdsourcing as a means of gathering updates and enhancing curation efforts. In addition to a web interface, some tools provide additional access via API, mobile app, and/or social media tagging to facilitate dissemination of information and enhance accessibility. While some of these tools share similar functions, in the section below, we highlight distinct features and capabilities for a representative set of resources that might be used as a "starter" set for clinical annotation of variants. The myvariant.info database is one of the newest and attempts to provide a "one-stop-shop" for variants. It is included in this section because it has recently incorporated the CIViC and Cancer Genome Interpreter databases. In addition, it provides annotations for SNVs from multiple other data sources (a growing list, so see the site for updates) and aggregates functional annotations for variants present in its database, making it a good all-around tool for cancer variant annotation. It is available as a performant web API only at this time.
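As a rough illustration of how a web API such as myvariant.info might be queried, the sketch below builds a request URL for a single SNV from an HGVS-style genomic identifier. It is a sketch under assumptions rather than a reference client: the `v1/variant` endpoint path, the `fields` parameter, and the `civic` and `cgi` field names reflect our understanding of the service and should be checked against its current documentation, and the helper names are hypothetical. No network request is made here.

```python
from typing import Iterable, Optional
from urllib.parse import quote, urlencode

# Assumed base URL of the variant endpoint; verify against the service docs.
BASE_URL = "https://myvariant.info/v1/variant/"


def hgvs_snv_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build an HGVS-style genomic identifier for a single nucleotide variant."""
    return f"chr{chrom}:g.{pos}{ref}>{alt}"


def variant_url(hgvs_id: str, fields: Optional[Iterable[str]] = None) -> str:
    """Return the request URL for one variant, optionally limiting returned fields."""
    # Percent-encode characters such as ">" while keeping ":" readable.
    url = BASE_URL + quote(hgvs_id, safe=":")
    if fields:
        url += "?" + urlencode({"fields": ",".join(fields)}, safe=",")
    return url


# Illustrative coordinates; the reference genome build must match the service.
variant_id = hgvs_snv_id("7", 140453136, "A", "T")
url = variant_url(variant_id, fields=["civic", "cgi"])
# The URL could then be fetched with urllib.request.urlopen(url) and the
# JSON response parsed with the json module.
```

Restricting the returned fields keeps responses small when only clinical actionability annotations are of interest, at the cost of a second request if functional annotations are needed later.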
Clinical Interpretation of Variants in Cancer (CIViC) is an open access and open source platform for community-driven curation and interpretation of cancer variants. It is based on a crowdsourcing model where individuals in the community can contribute to produce a centralized knowledge base with the goal of disseminating knowledge and encouraging active discussion. Users, including patients, patient advocates, clinicians, and researchers, can participate, along with community editors, in various stages of interpreting the clinical significance of cancer variants using standards and guidelines developed by community experts (68,72).
The Drug Gene Interaction Database (DGIdb) is an open source and open access platform for gene and drug annotation for known interaction and potential druggability. Users can cross-reference genes of interest and drugs against up to 15 sources and in functionally classified gene categories (73,74). Cancer Genome Interpreter (CGI) identifies mutational events that are biomarkers of drug response or interact with known chemical compounds (75). PharmGKB is a pharmacogenomic resource for building clinical implementation and interpretation based on annotating, integrating, and aggregating knowledge extracted from research-level publications. It provides scored clinical annotation, prescription annotation (drug dosing, prescribing information), as well as pharmacokinetics/pharmacodynamics (PK/PD) annotation, with primary literature reference. OncoKB contains information on the clinical implication of specific genetic alterations in cancer. Each variant is annotated from multiple sources and scored using Levels of Evidence: Level 1 includes FDA-approved biomarkers predictive of response to an FDA-approved drug; Level 2 includes variants for which an FDA-approved or standard of care treatment is available; and Levels 3 and 4 contain variants with investigational and hypothetical therapeutic implications, respectively. A similarly structured scoring system is available for indicating therapeutic implications for variants associated with resistance (76). Cancer Driver Log (CanDL), an expert-curated database for potential driver mutations in cancer, employs a similar four-level scoring system based on FDA approval, clinical, pre-clinical, and experimental functional evidence (77).
MyCancerGenome (MCG) is a knowledge resource highlighting the implication of tumor mutation on cancer care. It allows users to access its content via a mobile app and provides patient-focused information. Patients can access a database entitled DNA-mutation Inventory to Refine and Enhance Cancer Treatment (DIRECT) for Epidermal Growth Factor Receptor (EGFR) mutations in non-small cell lung cancer (NSCLC). Personalized Cancer Therapy (PCT) at the MD Anderson Cancer Center is a resource for clinical response associated with cancer variants and aims to facilitate patient involvement in biomarker-related clinical trials. Drug effectiveness is associated with a specific biomarker and scored based on prospective clinical study as well as Food and Drug Administration (FDA) approval.

TOOLS FOR MANIPULATING VARIANT DATASETS
Processing sequence data with the goal of determining variants (somatic or germline) often ends with a file in Variant Call Format (VCF), a loose, self-describing data standard describing variants along a genome, associated statistical and numeric metrics for each variant, and information integrated from data resources such as those described in the preceding sections (81). An ecosystem of tools, listed in Table 4, has been developed for basic transformations, manipulations, merge operations, and for adding transcript, protein, and higher-level functional annotations to variants in a VCF file. The vt and bcftools software suites perform operations such as slicing by genomic coordinate, data compression, and, importantly, variant normalization, rendering variants more readily comparable across resources. Annovar (82,83) and the SnpEff suite (84) add annotations relative to gene annotations, including information about transcript and protein-coding changes. The Ensembl Variant Effect Predictor (VEP) utilizes Ensembl gene models to annotate variants in gene context and offers an interesting plugin architecture that supports adding variant information from resources in Table 1 (85). Recently, several software developers of variant annotation tools have developed a standard for reporting gene-centric annotations that has simplified post-processing of variants after annotation. Finally, tools such as Vcfanno (86) have been developed that can flexibly add fields to variants in a VCF file based on relatively sophisticated logic and data transformations, reducing the number of tools required to bring a new data resource into the annotation pipeline.
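To illustrate the structure of a VCF record, the following minimal sketch parses the eight fixed, tab-separated columns of a single data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is meant only to show the shape of the data: header lines, per-sample genotype columns, and type information declared in the header are ignored, and the `parse_vcf_line` helper and the example line are hypothetical. For real pipelines, an established VCF parsing library or the tools in Table 4 are a safer choice.

```python
from typing import Any, Dict


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Parse the eight fixed columns of one VCF data line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict: Dict[str, Any] = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True  # bare key is a flag

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alts": alt.split(","),  # multiple ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }


# Made-up example line; DP and SOMATIC are typical INFO keys used for illustration.
record = parse_vcf_line("7\t140453136\t.\tA\tT\t50\tPASS\tDP=100;SOMATIC")
```

INFO values are kept as strings in this sketch because their types are declared in the file header, which is one reason the format is described above as self-describing.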

Pragmatic Details
Despite advanced toolsets for manipulating variant files and increasing adoption of available standard formats, practical pitfalls and challenges remain in the basic manipulation of variant datasets. Some data resources are available in multiple formats and not all formats contain identical information. Matching variants between resources and observed variants can be challenging, as some variants can be represented validly in multiple forms. Ideally, variants are cataloged with clarity with respect to a reference genome and, whenever possible, using HGVS nomenclature (90). In spite of increasing awareness and uptake of HGVS standard nomenclature, the critical step of matching variants across tools and databases in assessing clinical significance is still hampered by inconsistencies across tools and databases (91). Particularly when handling clinical samples, an information system that provides results from multiple resources when assessing novel variants, incorporates in silico controls when adding or updating data resources (to avoid introducing errors), and adheres to HGVS nomenclature wherever possible in data processing pipelines can increase the likelihood of discovering potentially relevant variants.
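The problem of one variant having several valid representations can be illustrated with a small allele-trimming sketch. It covers only one part of variant normalization, removing bases shared by the reference and alternate alleles; full normalization of insertions and deletions also requires left-alignment against the reference sequence, which tools such as vt and bcftools perform and which is not attempted here. The `trim_alleles` helper and the example coordinates are hypothetical.

```python
from typing import Tuple


def trim_alleles(pos: int, ref: str, alt: str) -> Tuple[int, str, str]:
    """Remove bases shared by REF and ALT, keeping at least one base in each.

    Shared trailing bases are trimmed first, then shared leading bases;
    the position advances by one for each leading base removed.
    """
    # Trim the common suffix.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim the common prefix and shift the start position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Two representations of the same single-base change at position 101.
padded = trim_alleles(100, "CAT", "CGT")
minimal = trim_alleles(101, "A", "G")
same_variant = padded == minimal
```

Applying the same trimming to both the observed variants and the entries of each data resource before matching gives them a shared key, which reduces missed matches caused purely by representation.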

Where to Start?
This review is meant to be comprehensive, so the reader might wonder "Where do we start?" While it is difficult to make hard-and-fast recommendations about what resources, tools, and databases are "the best" given the lack of gold-standard datasets on which to base such evaluations, annotations in Tables 1-3 are meant to provide context for prioritization. The context for sequencing (clinical or not, targeted mutations, trial setting, or novel variant and biomarker discovery) will also drive annotation pipeline development. Not all data resources need to be added simultaneously when developing a pipeline for annotating cancer variants for precision oncology applications. In a clinical setting, targeting the reporting workflow and working with clinicians to understand the most relevant annotations is the most efficient approach to determining relevant resources for annotation. Developing a modular informatics pipeline, perhaps using a computational workflow framework (https://github.com/pditommaso/awesome-pipeline), that can be easily extended and re-run on previously annotated data is helpful to keep pace with the rapidly changing and growing collection of annotation resources. Newer aggregation resources such as myvariant.info offer a holistic solution (annotation, catalog, and clinical actionability), but with some risk of "lossiness" with respect to the primary resources contained within.
Finally, given the rapid pace of new development in this space, we have established a crowd-sourced list of cancer variant resources for precision medicine available at https://github.com/seandavi/awesome-cancer-variant-databases.
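One way to picture the modular pipeline mentioned above is as an ordered list of independent annotation steps, each of which can be added, replaced, or re-run without touching the others. The sketch below is a hypothetical illustration and not a workflow framework: the `run_pipeline` function, the two annotators, and the in-memory lookup tables standing in for real data resources are all assumptions.

```python
from typing import Any, Callable, Dict, List, Tuple

Variant = Dict[str, Any]
Annotator = Callable[[Variant], Any]

# Hypothetical in-memory stand-ins for real data resources.
KNOWN_AF = {("1", 12345, "G", "A"): 0.25}
ACTIONABLE = {("7", 140453136, "A", "T"): "hypothetical-evidence-entry"}


def variant_key(variant: Variant) -> Tuple[str, int, str, str]:
    """Key used to match a variant against each resource."""
    return (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])


def population_frequency(variant: Variant) -> Any:
    """Look up a population allele frequency; None if the variant is unseen."""
    return KNOWN_AF.get(variant_key(variant))


def actionability(variant: Variant) -> Any:
    """Look up a clinical actionability entry; None if there is none."""
    return ACTIONABLE.get(variant_key(variant))


def run_pipeline(variant: Variant, steps: List[Tuple[str, Annotator]]) -> Variant:
    """Apply each annotation step in order, storing its result under its name."""
    annotated = dict(variant)  # leave the input record unchanged
    for name, annotate in steps:
        annotated[name] = annotate(annotated)
    return annotated


steps = [("population_af", population_frequency), ("actionability", actionability)]
result = run_pipeline({"chrom": "7", "pos": 140453136, "ref": "A", "alt": "T"}, steps)
```

Because each step writes to its own named field, a new resource can be introduced by appending one entry to `steps`, and previously annotated records can be passed through the extended list again.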

Conclusion
Robust sequencing technologies and increasingly reliable bioinformatics pipelines, combined with parallel development of therapeutics and diagnostics, have bolstered the field of precision genomic oncology. However, the sheer number of resources available that can inform the interpretation of small variants is staggering. This review has highlighted a number of important data resources individually. Except for the very few variants with well-established clinical relevance or an associated targeted therapy, data integration remains a significant hurdle to the rapid turnaround required to apply HTS in a clinical context. Expert panel review (the molecular tumor board) has been effective for some groups (13,92,93) while other groups have adopted a protocol-based approach (94). Even when molecularly targetable lesions are identified, barriers to delivering therapy have been observed, limiting the impact of precision genomic oncology in some settings (95). Not covered in this review is the increasing utility of HTS in the burgeoning field of immunotherapy, where early efforts to predict response based on HTS results have been promising (19,96,97).
Some interesting trends are evident in the databases and resources presented in this review that highlight the overarching trends in delivering precision medicine. First is the sheer volume and rapid growth of numbers of observations to learn about the spectrum of variation in cancer and normal genomes. Projects such as gnomAD, COSMIC, and other data sharing efforts enhance precision by cataloging rare variants as well as providing precise estimates of the frequencies of common variants. Second is the use of crowd-sourcing to produce rich clinical annotation (e.g., CIViC) in response to the need for intensive human interaction to interpret the clinical impact of a variant or its relationship to potential medical intervention. On the other hand, with volumes of data ever-increasing, machine learning techniques drive many of the most commonly used approaches for assigning scores for impact of observed variants. As well-annotated datasets and variant catalogs grow, application of machine learning will become both more common and more powerful.
While significant progress has been made in applying technology to precision oncology, cancer arises in an individual after a typically complex and incompletely understood set of oncogenic events that are increasingly observable at the molecular level. Progress in cancer prevention, early detection, diagnosis, prognosis, and treatment is increasingly driven by insight gained through the analysis and interpretation of large genomic, proteomic, and pharmacological knowledge bases. Reductionist approaches to cancer biology can achieve only limited success in understanding cancer biology and improving therapy. Cancer is a disease associated with disruption of normal cellular circuitry and processes that leads to abnormal or uncontrolled proliferative growth, characterized by a complex spectrum of biochemical alterations that affects biological processes at multiple scales, from molecular activity and cellular homeostasis to intercellular and inter-tissue signaling. The cancer research community has made great strides in measuring the oncogenic events that lead to the development of cancer and therapy resistance. Because of the complexity inherent in protein networks, intercellular signaling, cellular heterogeneity, and the dynamic nature of cancer, future progress will require a more holistic approach to precision oncology, including multiscale systems and modeling approaches that address the interrelatedness of the biological processes underlying cancer.

AUTHOR CONTRIBUTIONS
SD initiated the manuscript. SD, KA, and HT all contributed to the writing and editing of the manuscript.

FUNDING
This work was supported by the National Cancer Institute Center for Biomedical Informatics and Information Technology and the National Cancer Institute Center for Cancer Research in the Intramural Research Program at the National Institutes of Health.