Evaluating Clinical Genome Sequence Analysis by Watson for Genomics

Background: Oncologists increasingly rely on clinical genome sequencing to pursue effective, molecularly targeted therapies. This study assesses the validity and utility of the artificial intelligence Watson for Genomics (WfG) for analyzing clinical sequencing results. Methods: This study identified patients with solid tumors who participated in in-house genome sequencing projects at a single cancer specialty hospital between April 2013 and October 2016. Targeted genome sequencing results of these patients' tumors, previously analyzed by multidisciplinary specialists at the hospital, were reanalyzed by WfG. This study measures the concordance between the two evaluations. Results: In 198 patients, in-house genome sequencing detected 785 gene mutations, 40 amplifications, and 22 fusions after eliminating single nucleotide polymorphisms. Breast cancer (n = 40) was the most frequent diagnosis in this analysis, followed by gastric cancer (n = 31), and lung cancer (n = 30). Frequently detected single nucleotide variants were found in TP53 (n = 107), BRCA2 (n = 24), and NOTCH2 (n = 23). MYC (n = 10) was the most frequently detected gene amplification, followed by ERBB2 (n = 9) and CCND1 (n = 6). Concordant pathogenic classifications (i.e., pathogenic, benign, or variant of unknown significance) between in-house specialists and WfG included 705 mutations (89.8%; 95% CI, 87.5–91.8%), 39 amplifications (97.5%; 95% CI, 86.8–99.9%), and 17 fusions (77.3%; 95% CI, 54.6–92.2%). After about 12 months, reanalysis using a more recent version of WfG demonstrated a better concordance rate of 94.5% (95% CI, 92.7–96.0%) for gene mutations. Across the 249 gene alterations determined to be pathogenic by both methods, including mutations, amplifications, and fusions, WfG covered 84.6% (88 of 104) of all targeted therapies that experts proposed and offered an additional 225 therapeutic options.
Conclusions: WfG was able to scour large volumes of data from scientific studies and databases to analyze in-house clinical genome sequencing results and demonstrated the potential for application to clinical practice; however, we must train WfG in clinical trial settings.


INTRODUCTION
Recent progress in precision medicine applied to cancer therapy, i.e., targeted treatment based on the specific molecular features of a patient's tumor, has changed cancer treatment strategies (1,2). With increasing frequency, oncologists are sequencing tumor genomes in clinical practice to identify molecularly targeted treatment options (3). Next-generation sequencing is a powerful tool for building a tumor-specific profile of genomic alterations, such as single nucleotide variants, copy number variants, and gene translocations. The challenge lies in identifying which mutations are pathogenic and actionable. This is generally done by teams of human experts who reference databases of genomic alterations, published literature, regulatory approval data for pharmaceutical agents, and clinical trial protocols to identify patients who may benefit from treatment by molecularly targeted therapies (4,5). This translation of genome sequencing results into potential treatment options is time-consuming, laborious, and requires a high degree of specialization, but the growing list of targeted therapies is making it increasingly essential for patient care.
Currently, even well-resourced hospitals struggle to keep up with the growing need for the interpretation of genome sequencing results. One potential solution to this problem of scaling is IBM Watson for Genomics (WfG), a cloud-based service for evidence gathering and genomic analysis, which can be used to analyze large volumes of genome data and obtain evidence-based answers (6,7). For each patient, WfG reads a data file containing a tumor's genetic variants and evaluates each variant using advanced cognitive analytics against data sources, such as treatment guidelines, peer-reviewed journal articles, and clinical studies. WfG identifies which variants are actionable based on the literature as well as the list of FDA-approved drugs and recruiting clinical trials. The information is presented in a report along with links to the relevant medical literature. The patient's doctor then reviews this information alongside additional clinical evidence to make an informed treatment decision. For this study, only recruiting clinical trials in the United States were used by WfG during the analysis, which might lead to differences in the list of potential therapies for each patient case.
Since April 2013, the National Cancer Center Hospital (NCCH) in Japan has been conducting a clinical genome sequencing project (TOP-GEAR PROJECT). This project was initiated to match patients whose tumors harbor specific molecular aberrations to investigational phase I trials of targeted therapeutics. Preliminary data for this project have already been reported, and in the present study, we used WfG to reanalyze the sequencing results (8). The primary objective of this study was to assess the validity and utility of WfG for analyzing clinical genome sequencing results by comparisons with results obtained by an expert panel composed of multidisciplinary specialists at NCCH.

Patient Selection
Patients with solid tumors who were candidates for phase I trials at the NCCH in Japan and who participated in the clinical genome sequencing project (UMIN000011141) between April 2013 and October 2016 were identified. Characteristics and clinical information for the patients were obtained from their medical records. The study was approved by the institutional review board of the National Cancer Center Hospital (ID: 2012-374). All subjects gave written informed consent in accordance with the Declaration of Helsinki.

In-House Genome Sequencing System
In-house genome sequencing was performed with formalin-fixed, paraffin-embedded tissue samples using an original oncogene panel ("NCC oncopanel," ver. 2.0, containing 90 mutations/amplifications and 10 fusion genes; ver. 3.0, 104 mutations/amplifications and 16 fusions; Supplementary Table 1). Using SureSelect XT reagent and the in-house program, mutations [single nucleotide variants and short insertions and deletions (indels)], gene amplifications, and gene fusions were detected. All detected alterations were checked by manual inspection using the Integrative Genomics Viewer, and the list of altered genes was obtained. Single nucleotide polymorphisms (SNPs) were eliminated from the list by using existing databases. A multidisciplinary team, which included biologists, genome researchers, bioinformaticians, pathologists, and clinicians, annotated each altered gene in the list except for SNPs. For annotation and SNP elimination, ANNOVAR (11), COSMIC (12), 1000 Genomes (13), ESP6500 (14), Human Genetic Variation Database (15), and in-house Japanese germline SNP data were used. In addition, treatment options, i.e., corresponding molecularly targeted drugs, which included investigational drugs, were reported. The detailed methods were described previously (8,16).

Molecular Interpretation by WfG
The same list of altered genes was re-analyzed by WfG. The following information was uploaded to WfG for each case:  Table 2 and Datasheet 1). This analysis was performed twice using different versions of WfG: version 27 (analysis conducted November 2016) and version 33 (September 2017). WfG became commercially available in October 2016.

Comparison of Results Obtained by the Expert Panel and WfG
First, analytical results, including proposed pathogenic gene alterations and targetable pathways, obtained by the expert panel and WfG (ver. 27) were compared. By comparing the results obtained using WfG ver. 27 and ver. 33, the temporal progress of the WfG system was evaluated, with a focus on previously discrepant cases. For comparison to the expert panel, the WfG results were grouped as follows: pathogenic and likely pathogenic mutations were categorized as pathogenic alterations, whereas variants of unknown significance (VUSs) and likely benign/benign mutations were categorized as non-pathogenic alterations.
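The grouping rule described above can be sketched as a small lookup. This is only an illustration under stated assumptions, not the authors' code: the label spellings, the `to_binary` and `concordance` helpers, and the requirement that the panel's calls are already binary are all hypothetical choices made for the example.

```python
# Collapse WfG's five-tier labels into the two groups used for comparison
# with the expert panel. Label spellings are assumptions for this sketch.
PATHOGENIC_TIERS = {"pathogenic", "likely pathogenic"}
NON_PATHOGENIC_TIERS = {"vus", "likely benign", "benign"}


def to_binary(label: str) -> str:
    """Map a five-tier classification to 'pathogenic' or 'non-pathogenic'."""
    key = label.strip().lower()
    if key in PATHOGENIC_TIERS:
        return "pathogenic"
    if key in NON_PATHOGENIC_TIERS:
        return "non-pathogenic"
    raise ValueError(f"unrecognized classification: {label!r}")


def concordance(panel_calls, wfg_labels):
    """Fraction of alterations on which the two binary calls agree.

    panel_calls: binary calls ('pathogenic' / 'non-pathogenic') from the panel.
    wfg_labels:  five-tier labels from WfG, in the same order.
    """
    if len(panel_calls) != len(wfg_labels) or not panel_calls:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == to_binary(w) for p, w in zip(panel_calls, wfg_labels))
    return matches / len(panel_calls)
```

Raising an error on an unknown label, rather than silently treating it as non-pathogenic, keeps a mislabeled variant from inflating the concordance rate.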

Statistics
The 95% confidence interval (CI) for each rate was calculated using R Version 3.4.2 (R Foundation for Statistical Computing, Vienna, Austria). Cohen's kappa statistic was used to measure agreement between the results of the expert panel and WfG.
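The text does not state which interval method was used in R. As a hedged sketch, the following assumes the exact (Clopper-Pearson) binomial interval, which is what R's `binom.test` returns; the function names are hypothetical, and the bounds are found by simple bisection on the binomial distribution function rather than by a statistics library.

```python
import math


def _log_pmf(i: int, n: int, p: float) -> float:
    """Log of P(X = i) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))


def _cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return sum(math.exp(_log_pmf(i, n, p)) for i in range(k + 1))


def _solve(f, target: float, increasing: bool) -> float:
    """Bisection on (0, 1) for f(p) = target, where f is monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve(
        lambda p: 1 - _cdf(k - 1, n, p), alpha / 2, increasing=True)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _solve(
        lambda p: _cdf(k, n, p), alpha / 2, increasing=False)
    return lower, upper
```

Under this assumption, `clopper_pearson(39, 40)` gives roughly 0.868 to 0.999, in line with the 86.8–99.9% interval reported for gene amplifications.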

Gene Mutations Analyzed in WfG ver. 27 and Concordance With the Expert Panel
The same list of altered genes, excluding SNPs, was analyzed using WfG ver. 27, and 235 of 785 mutations, 39 of 40 amplifications, and 9 of 22 fusions were identified as pathogenic (Figures 2, 3, and Supplementary Table 7). We compared the results obtained by WfG ver. 27 and the expert panel. Of 785 gene mutations, 206 mutations (26.2%) were identified as pathogenic by both methods, and 499 mutations (63.6%) were non-pathogenic by both methods (Figure 4A). Of the 235 mutations classified as pathogenic by WfG, 206 (87.7%) were also identified as pathogenic by the panel. In the VUS subgroup, 321 out of 354 mutations (90.7%; 95% CI, 87.2-93.5%) were determined to be non-pathogenic by the panel, and in the likely benign/benign subgroup, 178 out of 196 mutations (90.8%; 95% CI, 85.9-94.5%) were determined to be non-pathogenic by the panel.
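The counts in this paragraph are enough to rebuild the 2 × 2 agreement table for gene mutations and to compute Cohen's kappa from it. The sketch below is illustrative only: the off-diagonal cells are derived here by subtraction from the reported counts rather than quoted from the text, the function name is hypothetical, and the resulting kappa is a recomputation, not a value reported in this section.

```python
def cohens_kappa(both_pos: int, wfg_only: int, panel_only: int, both_neg: int) -> float:
    """Cohen's kappa for two raters making a binary call on the same items."""
    total = both_pos + wfg_only + panel_only + both_neg
    observed = (both_pos + both_neg) / total
    wfg_pos = both_pos + wfg_only
    panel_pos = both_pos + panel_only
    # Agreement expected by chance from each rater's marginal rates.
    expected = (wfg_pos * panel_pos
                + (total - wfg_pos) * (total - panel_pos)) / total ** 2
    return (observed - expected) / (1 - expected)


# Cells rebuilt from the counts reported for WfG ver. 27 (derived, not quoted).
both_pos = 206                           # pathogenic by both methods
both_neg = 499                           # non-pathogenic by both methods
wfg_only = 235 - 206                     # pathogenic by WfG only
panel_only = (354 - 321) + (196 - 178)   # pathogenic by the panel only

kappa = cohens_kappa(both_pos, wfg_only, panel_only, both_neg)
```

The four cells sum to 785, and the two discordant cells (29 and 51) sum to the 80 unmatched mutations; with these inputs kappa comes out at about 0.76.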

Causes of Discrepancies in the Evaluation of Pathogenicity Between the Expert Panel and WfG ver. 27, and Reanalysis by WfG ver. 33
Eighty unmatched mutations were discussed between investigators at the NCCH and at IBM, and the reasons for the discrepancies were identified. Detailed data for genes with unmatched mutations are summarized in Table 2. Of the 80 unmatched mutations, 24 were likely false-positive assessments by WfG ver. 27, and 22 were likely false-negative assessments by WfG. For example, EGFR G719C is a well-known actionable mutation in patients with lung adenocarcinoma (17,18), and patients with FGFR W290C mutation-positive tumors were reported to be sensitive to FGFR inhibitors (19), but these mutations were classified as VUSs by WfG (Supplementary Table 8). In addition, the expert panel identified five VUSs as pathogenic mutations out of an abundance of caution. The expert panel found 14 mutations to be pathogenic based only on the observation of one or more reports in the COSMIC (12) database of their presence in tumors. The lower threshold of pathogenicity held by the expert panel when evaluating VUSs or using COSMIC may explain these 19 discrepant evaluations. Opinions regarding the remaining 13 mutations were divided between investigators because different references were used to determine the degree of pathogenicity, and the pathogenicity of the mutations was not always sufficiently established.
After reanalysis using WfG ver. 33, 46 of the 80 unmatched mutations became concordant with the expert panel results (Figure 4B). The analysis using ver. 27 resulted in many disagreements in TP53 and NOTCH2-3. Ver. 33 was improved with respect to TP53 and NOTCH2-3, and the number of disagreements was substantially reduced. EGFR G719C, FGFR2 ). The number of concordant assignments with respect to mutation pathogenicity after reanalysis is broken down by gene in Figure 6. The proportion of matched cases increased after reanalysis for all subgroups (Figure 5).

Gene Amplifications Analyzed by WfG ver. 27
The concordance of pathogenic evaluations of gene amplifications between the expert panel and WfG ver. 27 was 97.5% (39/40; 95% CI, 86.8-99.9%; Figure 4C). The concordance rate for gene amplifications was high; however, one case of ERBB2 amplification was proposed as pathogenic only by the expert panel. Copy number variation in this gene was found to be more than eight copies; therefore, the authors believe this was a false negative result by WfG. The ver. 33 results for gene amplifications and fusions are not reported separately because they were identical to those obtained with ver. 27.

Fusions Analyzed in WfG ver. 27
Four fusions (KIF5B-RET, CD74-ROS1, CLTC-ALK, and GBA3-ALK) were determined as valid and pathogenic by both the expert panel and WfG ver. 27 (Figure 4D). Three of these fusions are well-known in patients with lung cancer. The fourth, GBA3-ALK, has not been reported previously, and a functional analysis of this fusion has not been conducted; therefore, the role of this fusion in tumorigenesis is not clear. WfG reported an additional five fusions (Supplementary Table 7) as pathogenic. BRAF-SULT4A1 and its reciprocal fusion SULT4A1-BRAF were evaluated as pathogenic; BRAF fusions, such as KIAA1549-BRAF, have been reported in astrocytoma (21), supporting the conclusion that this fusion is likely pathogenic. FIP1L1-PDGFRA is a recurrent fusion in chronic eosinophilic leukemia (22). Although it has not been detected in solid tumors, this result is plausible. However, due to low read frequency, the authors believe that these were sequencing artifacts. Overall concordance was 77.3% (95% CI, 54.6-92.2%) for fusions.

Targeted Drugs
Across all the detected pathogenic gene alterations in this analysis, the expert panel proposed targeted drugs against 113 pathways including 111 gene alterations; WfG ver. 27 proposed drugs targeting 322 pathways and 227 alterations. When drugs targeting the same signaling pathway were proposed for a gene alteration, they were combined into the same pathway-targeted drug for the analysis (Table 9). Thus, WfG proposed about twice as many targetable gene alterations and three times as many targeted drugs. Of the 104 pathways with 102 gene alterations for which targeted drugs were proposed by the expert panel, 88 (84.6%) of the same pathway-targeted drugs were also proposed by WfG (Figure 7). Thus, WfG covered most targeted therapies proposed by the expert panel and offered additional therapeutic options, including investigational drugs in recruiting trials.
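The pathway-level comparison can be sketched as a set operation: drugs proposed for the same alteration and acting on the same pathway collapse into one (alteration, pathway) pair, and coverage is the share of the panel's pairs that WfG also proposed. This is a minimal sketch under stated assumptions, not the authors' procedure; the tuple layout, the function names, and all drug and pathway labels in the usage example are hypothetical placeholders.

```python
def collapse(proposals):
    """Reduce (alteration, drug, pathway) proposals to unique
    (alteration, pathway) pairs, so that several drugs hitting the same
    pathway for one alteration count once."""
    return {(alteration, pathway) for alteration, _drug, pathway in proposals}


def coverage(expert_proposals, wfg_proposals):
    """Return (fraction of expert pairs also proposed by WfG,
    number of pairs proposed by WfG only)."""
    expert_pairs = collapse(expert_proposals)
    wfg_pairs = collapse(wfg_proposals)
    if not expert_pairs:
        raise ValueError("expert panel proposed no targeted drugs")
    shared = expert_pairs & wfg_pairs
    return len(shared) / len(expert_pairs), len(wfg_pairs - expert_pairs)


# Hypothetical toy data; not taken from the study.
expert = [
    ("ERBB2 amplification", "drug_A", "HER2"),
    ("ERBB2 amplification", "drug_B", "HER2"),  # same pathway: collapses
    ("EGFR G719C", "drug_C", "EGFR"),
]
wfg = [
    ("ERBB2 amplification", "drug_D", "HER2"),
    ("MYC amplification", "drug_E", "cell cycle"),
]
covered, extra = coverage(expert, wfg)
```

In the toy data the panel has two unique pairs, WfG matches one of them and adds one of its own. Applied to the counts reported above, the same ratio is 88/104, or 84.6%.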

DISCUSSION
This is the first detailed analysis of how WfG results change over time. WfG exhibited a high level of concordance with our expert panel for the identification of pathogenic gene mutations and amplifications. The update process of WfG was not related to this study and had been performed independently. The difference in performance by WfG from ver. 27 to ver. 33 is primarily due to three factors. First, the WfG knowledge corpus was expanded through ingestion of new articles and the addition of several databases, including OncoKB. Second, a new, more accurate functional annotation tool was incorporated into the software. Third, the system continued to learn through training on additional cases from other clients, leading to an overall improvement in the concordance rate (23). The concordance rate between WfG ver. 27 and the panel for gene mutations was 89.8%. After reanalysis using WfG ver. 33, concordance for gene mutations increased by 4.7 percentage points, to 94.5%. Most of the remaining discrepancies in the assigned pathogenicity of gene mutations can be explained by a lack of evidence supporting the function of genes in cancer. Large datasets for genome alterations in cancer are available from several genome databases, but the pathogenicity has not been sufficiently established for many gene alterations, and the degree of pathogenicity is often unclear. Evaluations of nonsynonymous mutations and in-frame indels in tumor suppressor genes can be quite challenging owing to insufficient information from the literature. The increase in concordance from ver. 27 to ver. 33 demonstrates the cognitive capabilities of WfG, which continues to learn as more cases are interpreted.
Gene alterations were more likely to be identified as pathogenic by the expert panel. In daily clinical practice, it is important to offer additional treatment options for patients with cancer. Therefore, the expert panel may have a lower threshold for determining the degree of pathogenicity. In contrast, WfG offered targeted therapies against about twice as many gene alterations compared with the expert panel. Moreover, WfG covered about 84.6% of targeted therapies proposed by the expert panel. The proposal of investigational drugs as a therapeutic option is usually a difficult challenge. The expert panel required stricter evidence when selecting targeted therapeutic options, whereas WfG is designed to be more inclusive of potential therapies, providing the level of evidence for each potential therapy and enabling the clinician to make the final treatment decision.
At present, WfG is useful for a clinician at a general hospital, although an additional survey of evidence by a clinician is required when evaluating fusions. WfG can also offer additional and beneficial opinions, including investigational drugs, even for a clinician at a cancer specialty hospital. In the near future, genome sequencing of patients' tumors and molecularly targeted therapies may become more common, and a shortage of specialists is expected. Additionally, the amount of data collected per patient is expected to increase. For this study, we conducted targeted sequencing; however, analyses of exome-sequencing data or whole-genome sequencing data will become necessary, further increasing the burden on specialists. The use of cognitive technologies, such as WfG, is a promising option for overcoming this scaling problem.
The use of WfG has some limitations. WfG interpreted .vcf files. The processes of mapping reads from raw sequencing data obtained from MiSeq (fastq) and removing mapping errors were performed by bioinformaticians at NCCH. Our study also had some limitations. In particular, the number of evaluated amplifications and fusions was low for a comprehensive evaluation of accuracy. Additionally, although the validity and utility of WfG were judged based on its agreement with the opinions of experts at our hospital, the pathogenicity of many gene alterations in tumors has not been established, and interpretations were not always correct or clear. The validity and utility in clinical trials, including basket studies of targeted therapies, should be evaluated by accumulating additional clinical data, such as tumor responses or non-responses to matched targeted drugs proposed by WfG. Moreover, the opinions of the expert panel at our institution were based on data obtained from many databases available at the time of the analysis. Thus, it is necessary to analyze differences in results as databases are updated.
In conclusion, WfG showed comparable analytical results for clinical genome sequencing to those of a multidisciplinary team at a cancer specialty hospital. A few discrepancies in gene mutations and amplifications remained after reanalysis, and there was room for improvement in the analysis of gene fusions. WfG development continues, and it demonstrated a significant improvement in mutation assignment from ver. 27 to ver. 33. WfG may be useful in cases where large amounts of genomic data are available, such as whole-exome sequences, and in institutions with an insufficient number of experts in gene analyses. However, additional evaluation of Watson for Genomics in clinical trials is necessary to further validate its assessments and drug recommendations.