Compound Heterozygous Variants in Pediatric Cancers: A Systematic Review

A compound heterozygous (CH) variant is a type of germline variant that occurs when each parent donates one alternate allele and these alleles are located at different loci within the same gene. Pathogenic germline variants have been identified for some pediatric cancer types but in most studies, CH variants are overlooked. Thus, the prevalence of pathogenic CH variants in most pediatric cancer types is unknown. We identified 26 studies (published between 1999 and 2019) that identified a CH variant in at least one pediatric cancer patient. These studies encompass 21 cancer types and have collectively identified 25 different genes in which a CH variant occurred. However, the sequencing methods used and the number of patients and genes evaluated in each study were highly variable across the studies. In addition, methods for assessing pathogenicity of CH variants varied widely and were often not reported. In this review, we discuss technologies and methods for identifying CH variants, provide an overview of studies that have identified CH variants in pediatric cancer patients, provide insights into future directions in the field, and give a summary of publicly available pediatric cancer sequencing data. Although considerable insights have been gained over the last 20 years, much has yet to be learned about the involvement of CH variants in pediatric cancers. In future studies, larger sample sizes, more pediatric cancer types, and better pathogenicity assessment and filtering methods will be needed to move this field forward.


INTRODUCTION
Each year worldwide, ∼300,000 children under the age of 14 are diagnosed with cancer (Sweet-Cordero and Biegel, 2019). Since the 1970s, 5-year survival rates for pediatric cancer patients have steadily increased and are presently over 80% (Phillips et al., 2015). Despite improvements in treatments and survival rates, the causes of most pediatric cancers are still relatively unknown (American Cancer Society 1 ). Recent large-scale studies have helped elucidate the involvement of germline mutations in pediatric cancer development. For example, a 2015 study analyzed mutations across 565 known, cancer-associated genes for 1,120 pediatric cancer patients and found that 8.5% of the patients had an identifiable, pathogenic germline mutation in at least one of these genes (Zhang et al., 2015). Similarly, a 2016 study identified 10% of pediatric cancer patients as having a mutation in a cancer predisposition gene (Parsons et al., 2016). These studies highlight that heritable predisposition does play a critical role in many pediatric cancer cases (Dean and Farmer, 2017). However, these studies also emphasize the need to elucidate additional types of rare, germline variants associated with pediatric cancers. In this review, we focus on compound heterozygous (CH) variants, a type of germline variant that has been understudied in pediatric cancers.
CH variants occur when each parent donates one alternate allele and when the alleles are located at different loci within the same gene (Figure 1; Kamphans et al., 2013). CH variants are particularly relevant to certain types of genes, such as tumor suppressors, where loss-of-function variations are often recessive . Tumor suppressor genes are involved in inhibiting cell division, initiating apoptosis, repairing DNA damage, and suppressing metastasis. When tumor suppressor genes lose function, tumors can arise and existing tumors can become more aggressive (Guo et al., 2014;Wang et al., 2018). Commonly, researchers identify cases where two non-reference alleles at a given genomic locus have been inherited, one from each parent (homozygosity). But in other cases, an individual may inherit two defective copies of a tumor suppressor geneone from each parent, with defects at different loci in the gene-thereby resulting in no functional copies of the tumor suppressor gene and potentially an increased susceptibility to cancer development (Weinberg, 2013).
Sanger sequencing has classically been used to identify CH variants. In this approach, DNA is collected from the patient and his/her parents, and visual analysis of an electropherogram helps to confirm Mendelian inheritance of a CH variant (Piane et al., 2016;Nafisinia et al., 2017). These technologies are most relevant when the researcher wishes to examine one or a few genes. In recent years, detection of CH variants at a larger scale has become feasible with the advent of next-generation sequencing (NGS).
NGS technologies (e.g., Illumina, PacBio) can sequence DNA in a high-throughput manner and thus allow for the examination of many genes in a single run (Tewhey et al., 2011). In order to use NGS data for identifying CH variants, sequencing data must be phased. The process of phasing estimates which chromosome (haplotype) the nucleotides are located on, thereby helping to distinguish between maternally and paternally inherited variants (Choi et al., 2018). If it is known before sequencing that haplotype information will be needed, a laboratory-based, haplotype-estimation method can be used, such as Linked-Read technology by 10X Genomics (Zheng et al., 2016). This approach uses micro-droplet-based dilution to compartmentalize DNA in a random manner and uses a high number of distinct barcodes. This method prevents the partitioned DNA molecules from originating from the same genomic loci. Alternatively, if NGS libraries have been prepared without regard to phase, computer-based phasing algorithms can be used to estimate haplotypes (Browning and Browning, 2007;Delaneau et al., 2013;Loh et al., 2016). These algorithms estimate a patient's haplotypes from genotype data after making inferences from a population-based reference panel and/or inferences from parental inheritance patterns.
FIGURE 1 | Illustration of compound heterozygous variants. Compound heterozygous variants occur when a child has an alternate allele from each parent and the variant is located at different loci within the same gene.
Assessment of CH variant pathogenicity can be accomplished using trusted variant databases, published literature, functional studies, and predictive algorithms (Richards et al., 2015). While each assessment method can be useful, each has unique challenges that may lead to poor pathogenicity assessment. Databases may lack validation data, contain outdated information, or be based on small sample sizes. Published literature may reflect a poor study design or a limited sample size. Functional studies aim to understand the downstream effects of genetic variants (Rodenburg, 2018). For example, gene rescue assays seek to determine whether introducing the wildtype allele into patient derived cells "rescues" the phenotype. However, functional studies may not reflect the true biological environment or may not take full pathways into consideration (Richards et al., 2015). Common variant prediction algorithms, such as SIFT and PolyPhen-2, use nucleotide sequence homology (among other parameters) to predict whether protein function is affected by an amino acid substitution (Ng and Henikoff, 2003;Adzhubei et al., 2010). Predictive algorithms often have low specificity for missense variant prediction (Richards et al., 2015). Therefore, in most cases, it is important to use multiple programs for pathogenicity assessment (Niroula and Vihinen, 2019).
In the following sections, we provide an overview of studies that have identified CH variants in pediatric-cancer patients, provide insights into future directions in the field, and give a summary of available pediatric cancer sequencing data.

Literature Search
On March 20, 2020, we searched PubMed, Google Scholar, and Web of Science for "((pediatric cancer) OR pediatric tumor OR childhood cancer OR childhood tumor) AND ((compound heterozygous) OR compound heterozygosity) AND humans AND Journal Article[ptyp], ", '("pediatric cancer" OR "pediatric tumor" OR "childhood cancer" OR "childhood tumor") AND ("compound heterozygous" OR "compound heterozygosity"), ' and "((pediatric cancer OR pediatric tumor OR childhood cancer OR childhood tumor) AND (compound heterozygous OR compound heterozygosity)), " respectively. The above searches were also filtered to be inclusive of studies published between 1999 and 2019. Based on these criteria, 247, 709, and 33 results were obtained from PubMed, Google Scholar, and Web of Science, respectively. Of the returned results, we examined each article's abstract (and full article text when necessary) to determine whether the researchers had identified CH variants in one or more pediatric cancer patients; we identified 35 total articles that met these criteria (Figure 2).

Evaluation Criteria
We further evaluated the 35 articles to identify whether the authors had used germline tissue for DNA sequencing and described methods for assessing compound heterozygosity. If germline tissue was used for sequencing, and if the authors indicated how compound heterozygosity was determined, we included the study. Of the 35 studies identified, 26 met these criteria and are described in this review (Figure 2). For each of the 26 studies, we obtained additional information such as tumor type, the number of CH variants identified and the genes in which they were identified, the number of patients per cancer type that were included in the study, how compound heterozygosity was determined (i.e., whether through Mendelian inheritance or phasing), the type of sequencing technology used, the use of parental sequence data, and how variant pathogenicity was evaluated (Supplementary Data 1: T1). FIGURE 3 | The number of publications per cancer type pertaining to CH variants in pediatric cancer. The literature on CH variants has covered a wide range of cancer types, especially acute lymphoblastic leukemia, non-Hodgkin's lymphoma, and medulloblastoma. In six publications, at least one patient was diagnosed with more than one cancer type. These "2+ diagnoses" include patients with the following cancer types: glioblastoma + non-Hodgkin's Lymphoma + oligodendroglioma (Bakry et al., 2014), glioblastoma + rectal carcinoma (Bakry et al., 2014), glioblastoma + non-Hodgkin's lymphoma (Chmara et al., 2013) or ALL + rectal adenoma (Herkert et al., 2011), acute myeloid leukemia + medulloblastoma (Scott et al., 2007), colon carcinoma + oligodendroglioma (De Rosa et al., 2000), brain tumor + rhabdomyosarcoma (Quesnel et al., 1999).

Tumor Type Standardization
To ensure consistency of names used to describe the tumor types and to reduce the total number of tumor types to consider, we standardized tumor type names, when possible, using parent terms defined in the National Cancer Institute Thesaurus (Sioutos et al., 2007). For example, we grouped B-cell acute lymphoblastic leukemia and T-cell acute lymphoblastic leukemia under their parent term, "acute lymphoblastic leukemia" (ALL).

Variant Pathogenicity Reassessment and Cancer Pathway Association
Because pathogenicity estimation methods varied widely across the studies and because it had been many years since some articles were published, we reassessed pathogenicity using ClinVar and Variant Effect Predictor (VEP) for all studies that provided variant positions (McLaren et al., 2016;Landrum et al., 2018). We used Kyoto Encyclopedia of Genes and Genomes (KEGG) to determine which of the genes, across all studies, are in a known cancer pathway. We used Catalogue of Somatic Mutations in Cancer (COSMIC) to identify genes with known cancer associations (Kanehisa and Goto, 2000;Tate et al., 2019).
Across all studies, at least one CH variant was identified in 25 genes ( Table 1). Of these genes, 7 are known to be associated with hereditary cancer predisposition ( Table 2). The remaining 18 genes may provide clues about alternative mechanisms of cancer predisposition. For example, one study identified 6 ALL and 13 AML patients with a CH variant in KMT2C (Valentine et al., 2014), and one study identified two hepatoblastoma patients with a CH variant in MUC4 (Zhang et al., 2018). DNA alterations in KMT2C and MUC4 have been observed in the somatic tissue of medulloblastoma and head and neck squamous cell carcinoma patients, respectively (Tate et al., 2019). Therefore, future studies may reveal that germline mutations in these genes also contribute to pediatric cancer predisposition.

Overview of Studies that Identified CH Variants in Genes Associated With Pediatric Cancers
Here we provide an overview of CH variant findings specific to genes that have a known association to cancer predisposition. Of the 25 characterized genes identified across all studies (Table 1), COSMIC classifies 7 as being associated with hereditary predisposition for at least one type of pediatric cancer: ATM, BRCA2, MSH6, PMS2, RECQL4, SDHB, and TP53 ( Table 2).
Below we provide insight about the functions of these genes, prior associations that have been made for non-compound germline variants, and CH variants in these genes.
ATM, BRCA2, and SDHB are known tumor suppressor genes. Non-compound germline variations in these genes have been associated with leukemias, lymphomas, medulloblastomas, and gliomas (Table 2). Similarly, CH variants in ATM have been observed in ALL, astrocytoma, and high-grade glioma (Zhang et al., 2015;Piane et al., 2016;Sharapova et al., 2018). BRCA2 plays important roles in DNA repair, and germline variations in this gene have been associated with breast cancer, ovarian cancer, pancreatic cancer, and leukemia risk ( Table 2; Kanehisa and Goto, 2000;Tate et al., 2019). To date, CH variants in BRCA2 have been observed in medulloblastomas and Wilms tumors (Svojgr et al., 2016;Gröbner et al., 2018;Waszak et al., 2018). Variations in SDHB have been associated with paraganglioma and pheochromocytoma ( Table 2); one patient with paraganglioma had a CH variant in this gene (Majumdar et al., 2010).
MSH6 and PMS2 are both considered to be tumorsuppressor and mismatch repair (MMR) genes (Ripperger and Schlegelberger, 2016;Tabori et al., 2017;Tate et al., 2019). Patients with biallelic germline mutations in an MMR gene (MLH1, MSH2, MSH6, and PMS2) are considered to have a syndrome known as constitutional mismatch repair disease (CMMRD) (Ripperger and Schlegelberger, 2016;Tabori et al., 2017). Germline mutations in MMR genes have been associated with a predisposition to many different types of cancer ( Table 2; Tabori et al., 2017), including hematological malignancies, brain tumors, and digestive tract cancers (Ripperger and Schlegelberger,  (Tate et al., 2019). RECQL4 is involved in many intracellular regulatory pathways and can act either as an oncogene or a tumor suppressor gene (Kellermayer, 2006;Arora et al., 2016). Germline cancer associations for RECQL4 include osteosarcoma, skin basal cell, and skin squamous cell ( Table 2). CH variants in RECQL4 have been associated with osteosarcoma in two studies (Salih et al., 2018;Maciaszek et al., 2019). TP53 is involved in many cancer pathways, and germline variation in this gene has been associated with a wide range of tumor types ( Table 2). One patient with rhabdomyosarcoma + brain tumor had a CH variant in TP53 (Quesnel et al., 1999).
There were a total of 18 CH variants where both alleles were reported as pathogenic or likely pathogenic by the authors of the study (Supplementary Data 1:T2). Because some of the studies were conducted years ago and pathogenicity classifications may have been updated, we re-assessed the pathogenicity of all variants that were provided by the authors of the studies. Using ClinVar and VEP (as SIFT and PolyPhen scores), we were able to confirm pathogenicity for 5 of the 18 CH variants with one or more of our reassessment methods (Supplementary Data 1:T2). These confirmed variants were in MYO18B (Schieffer et al., 2019), BRCA2 (Waszak et al., 2018), FAM83H (Diets et al., 2018), and HIST1H1C (Diets et al., 2018). In addition, we were able to classify 3 CH variants as pathogenic or likely pathogenic that were not indicated as such in the original study. These variants were in RECQL4 (Salih et al., 2018), PMS2 (Herkert et al., 2011), andMSH6 (Scott et al., 2007). For the remaining variants, either only one allele from a CH pair was able to be classified, or no information was available in the databases (Supplementary Data 1:T2).

Observations and Future Directions
Research to date has highlighted that CH variants are observed relatively rarely but occur in many different genes and a diverse array of pediatric tumor types. However, our knowledge of the roles that CH variants play in pediatric cancers is only in its infancy. Prior studies have focused primarily on candidate genes, have been limited to individual cancer types, or have been limited by small sample sizes ( Table 1). Due to these issues, important CH variants may have been missed. For example, in AML, tumor suppressor genes that are known to harbor germline risk variants include BRIP1, FANCA, FANCC, FANCD2, FANCE, FANCF, FANCG, PALB2, and SBDS (Sondka et al., 2018;Tate et al., 2019). However, no study to date has found CH variants in any of these genes for AML patients. Similarly, ATM, NBN, PALB2, PMS2, PTCH1, and SUFU are tumor suppressor and germline risk genes for medulloblastoma, but CH variants in these genes have not been found in prior studies.
One factor that may have contributed to a lack of identifying CH variants in known cancer-predisposition genes may be filtering based on minor allele frequencies (MAF). Commonly, individual alleles with a MAF >1% (or >0.5%) in a control population are excluded from analyses because they are assumed to have benign effects (Richards et al., 2015;Niroula and Vihinen, 2019). Yet, this traditional filtering criterion may be too stringent for identifying pathogenic CH variants. By definition, two pathogenic alleles must be present for a recessive phenotype to manifest itself. Suppose, for example, that one allele in a given gene is relatively rare, having been observed in 0.8% of the population. Now suppose that a second allele is present at a different locus within the same gene and that this allele has a population prevalence of 5%. MAF filtering considers each locus separately, so the latter variant would be excluded by the 1% threshold, causing this potentially pathogenic CH variant to be overlooked. Assuming non-consanguinity and using the probability multiplication rule, the population prevalence of this particular CH variant would be estimated at approximately 0.04% (5% x 0.8%). Although care must be taken to consider other possible combinations of in trans alleles in this gene (Eilbeck et al., 2017), this example illustrates that traditional MAF filtering may be too simplistic for CH variant analysis. Furthermore, researchers must account for any homozygous, non-reference genotypes that have been observed for either allele in healthy individuals (Kamphans et al., 2013).
Further complicating matters, it is difficult to estimate the a priori expectation of finding a CH variant in a given gene. The longer a gene, the higher the probability that two pathogenic alleles will occur within that gene. Accordingly, genome-wide studies may be biased toward identifying CH variants in longer genes. For example, across the 26 studies covered in this review, the average coding sequence (CDS) length for genes with an identified CH variant was 5,610 bases (median: 3,591 bases) ( Table 2). Comparatively, using data from GENCODE, we calculated the average CDS length of protein coding genes across the human genome to be ∼1,796 bases (median: 1,338) (Frankish et al., 2019). Efforts should be made to help alleviate this bias and other confounding factors. For example, Itan et al. developed the Gene Damage Index to prioritize genes based on CDS length, protein complexity, paralog count, and evolutionary pressures (Itan et al., 2015). When prioritizing CH variants at the gene level in this way, the number of false-positive genes may be reduced.

Available Pediatric Cancer Data
The future is primed for more rapid discovery of genetic factors involved in pediatric cancers and a clearer understanding of the roles that CH variants play in pediatric cancer development and progression. Publicly available data are becoming readily accessible for researchers to study, and more data will become available over the next few years. For example, the NIH-funded Gabriella Miller Kids First (GMKF) initiative is sequencing germline DNA from hundreds of trios (affected child and both parents). This initiative is enabling researchers to obtain Illumina-based, whole-genome sequencing (WGS) data for these trios across many childhood cancer types (and other pediatric diseases) (Gabriella Miller Kids First Pediatric Research Program 2 ). Currently, the GMKF repository contains sequencing data for Ewing sarcoma (250 trios) and neuroblastoma (470 trios). In the coming years, the GMKF repository is expected to include data from pediatric cancers such as osteosarcoma, leukemia, enchondromatosis, brain tumors, myeloid malignancies, and others. This repository will prove valuable as parental data may allow for better identification of CH variants and de novo mutations (Francioli et al., 2017;Choi et al., 2018).
Frequently it is infeasible to obtain sequencing data from a proband's parents due to cost limitations or logistical challenges (a parent may not consent to participate or may be unavailable, a parent may be deceased, the child may be adopted, etc.). Projects such as Therapeutically Applicable Research to Generate Effective Treatments (TARGET), St. Jude Cloud, and the Children's Brain Tumor Tissue Consortium (CBTTC) have sequenced DNA for thousands of patients across many pediatric diseases, but sequencing data are available for the proband only in these resources (Children's Hospital of Philadelphia 3 ; Downing et al., 2012;Office of Cancer Genomics, 2013). At the time of this writing, TARGET included matched tumor-normal, multiomic data across six types of pediatric cancer, including AML (n ≈ 50), neuroblastoma (n ≈ 228), Wilms tumor (n ≈ 81), osteosarcoma (n ≈ 89), clear cell sarcoma of the kidney (n ≈ 13), and rhabdoid tumor (n ≈ 43) (Office of Cancer Genomics, 2013). The St. Jude Cloud contained matched tumor-normal data for ∼42 pediatric cancer types/subtypes, which encompassed ∼2,168 patients (Downing et al., 2012). Some of the datasets with the highest number of patients in the St. Jude Cloud included ALL (n ≈ 260), AML (n ≈ 189), HGG (n ≈ 163), and neuroblastoma (n ≈ 135) (Downing et al., 2012). Eight cancer types are represented by over 100 patients, while 14 cancer types had data for 10 or fewer patients. The CBTCC is a collaborative effort dedicated to the study and treatment of pediatric brain tumors (Children's Hospital of Philadelphia). At the time of this writing, CBTCC included genomic data for ∼871 patients across ∼38 pediatric brain tumor types. Although these databases do not contain parental genome data, computer-based algorithms can often determine a variant's parent-of-origin using haplotype reference panels (Browning and Browning, 2011).
Thanks to these public and non-profit efforts, we have entered an era in which researchers can shed more light on the genes and pathways that influence specific pediatric cancers through the use of genome-wide, publicly available data. We advocate that researchers take advantage of these resources.

CONCLUSION
Many discoveries about CH variants and their association with pediatric cancers have been made over the last 20 years; the role of CH variants in cancer and developmental pathways and the prevalence of these variants in pediatric cancers are beginning to be uncovered. Through the works discussed in this review, much insight has been gained. As future studies are conducted on CH variants in pediatric cancers, we anticipate that CH variants will play a more prominent role in elucidating disease mechanisms. This heightened knowledge could expand this field of study and eventually lead to the development of targeted treatments. Furthermore, having an understanding of CH variants and their role in disease development could prove beneficial for disease monitoring. For example, a child with a malignancy who has a germline risk variant in a known predisposition gene and/or key pathway could be more regularly monitored and assessed for additional tumor development (Milanese and Wang, 2019). If risk variants are in genes associated with one particular type of cancer, screening efforts could be more directed than general cancer screening. More specific screening could allow for earlier detection of tumors, sooner treatments, and prophylactic measures. Finally, a heightened knowledge of CH variants could lead to an expansion of our understanding of other pediatric diseases, such as birth defects, and even inherited disorders that arise in adults.

AUTHOR CONTRIBUTIONS
DM wrote the manuscript and performed critical analysis. SP provided edits and critical feedback.