CEACAM Gene Family Mutations Associated With Inherited Breast Cancer Risk – A Comparative Oncology Approach to Discovery

Introduction: Recent studies comparing canine mammary tumors (CMTs) and human breast cancers have revealed remarkable tumor similarities, identifying shared expression profiles and acquired mutations. CMTs can also provide a model of inherited breast cancer susceptibility in humans; thus, we investigated breed-specific whole genome sequencing (WGS) data in search of novel CMT risk factors that could subsequently explain inherited breast cancer risk in humans.
Methods: WGS was carried out on five CMT-affected Golden Retrievers from a large pedigree of 18 CMT-affected dogs. Protein truncating variants (PTVs) detected in all five samples (within human orthologs) were validated and then genotyped in the 13 remaining CMT-affected Golden Retrievers. Allele frequencies were compared to canine controls. Subsequently, human blood-derived exomes from The Cancer Genome Atlas breast cancer cases were analyzed and allele frequencies were compared to Exome Variant Server ethnic-matched controls.
Results: Carcinoembryonic Antigen-related Cell Adhesion Molecule 24 (CEACAM24) c.247dupG;p.(Val83Glyfs*48) was the only validated variant and had a frequency of 66.7% amongst the 18 Golden Retrievers with CMT. This was significant compared to the European Variation Archive (p-value 1.52 × 10⁻⁸) and non-Golden Retriever American Kennel Club breeds (p-value 2.48 × 10⁻⁵). With no direct ortholog of CEACAM24 in humans but high homology to all CEACAM gene family proteins, all human CEACAM genes were investigated for PTVs. A total of six and sixteen rare PTVs were identified in African and European American breast cancer cases, respectively. Single variant assessment revealed five PTVs associated with breast cancer risk. Gene-based aggregation analyses revealed that rare PTVs in CEACAM6, CEACAM7, and CEACAM8 are associated with European American breast cancer risk, and rare PTVs in CEACAM7 are associated with breast cancer risk in African Americans. Ultimately, rare PTVs in the entire CEACAM gene family are associated with breast cancer risk in both European and African Americans with respective p-values of 1.75 × 10⁻¹³ and 1.87 × 10⁻⁴.
Conclusion: This study reports the first association of inherited CEACAM mutations and breast cancer risk, and potentially implicates the whole gene family in genetic risk. Precisely how these mutations contribute to breast cancer needs to be determined, especially considering our current knowledge of the role that the CEACAM gene family plays in tumor development, progression, and metastasis.


INTRODUCTION
Breast cancer is a serious health concern. Amongst both sexes, it globally ranks as the second most commonly diagnosed type of cancer and the second leading cause of cancer-related deaths, accounting for ∼2.1 million new diagnoses and 626,679 deaths in 2018 (Bray et al., 2018). Worldwide, it is also the most common cancer diagnosed in women and the overall leading cause of cancer-related female deaths (Bray et al., 2018). In the United States, 2020 estimates predicted breast cancer to be the leading site of new cancer diagnoses in women and the second leading cause of cancer-related deaths, resulting in 276,480 new diagnoses and 42,170 deaths (American Cancer Society, 2020). Advances in breast cancer research have translated to better disease screening, diagnosis, and treatment, but new research questions continuously arise as time and medical needs progress (Cardoso et al., 2017).
Comparative oncology, which is the study of cancer biology and therapy in spontaneous, naturally occurring cancers in companion animals, provides valuable models of human cancer that have advanced, and will continue to advance, research (Garden et al., 2018). Recent studies comparing canine mammary tumors (CMTs) and human breast cancers have revealed notable tumor similarities, identifying shared expression profiles and acquired mutations (Liu et al., 2014; Ettlin et al., 2017; Lee et al., 2018, 2019; Kim et al., 2019; Gray et al., 2020). CMTs can also provide a model of hereditary breast cancer susceptibility in humans, especially considering similar genetics and familial clustering (Goebel and Merner, 2017; Gray et al., 2020). While most CMT studies investigating inherited risk have focused on identifying genetic variants in orthologs of known human breast cancer risk genes (Goebel and Merner, 2017; Huskey et al., 2020), in this study, we investigate breed-specific whole genome sequencing (WGS) data in search of novel CMT risk factors. WGS studies have been used to make numerous disease gene discoveries in dogs, many of which clearly translated to human health (Gilliam et al., 2014; Guo et al., 2014; Sayyab et al., 2016; Kolicheski et al., 2017; Fyfe et al., 2018; Meurs et al., 2019). Taking a similar approach, we identified a Carcinoembryonic Antigen-related Cell Adhesion Molecule 24 (CEACAM24) protein-truncating variant (PTV) in a Golden Retriever CMT pedigree, which ultimately revealed that rare PTVs in the CEACAM gene family are associated with breast cancer risk in humans. Aberrant expression of many CEACAM genes has previously been associated with tumorigenesis, and CEACAM gene products are recognized as clinically relevant tumor markers (Kuespert et al., 2006; Beauchemin and Arabzadeh, 2013; Han et al., 2020). This is the first association to be reported between CEACAM gene mutations and inherited cancer risk.

Golden Retriever Pedigree and WGS
As previously described by Huskey et al. (2020), blood- or buccal-derived DNA samples were obtained from 18 CMT-affected Golden Retrievers from the Canine Health Information Center (CHIC) DNA repository, and a pedigree was constructed linking all 18 dogs in one large pedigree. Five of those Golden Retrievers (three females and two males) were selected for WGS. This number was influenced by the cost of WGS. Furthermore, aiming to identify breed-specific mutations, distantly related dogs were selected, including two males since male breast cancer is associated with hereditary disease (Huskey et al., 2020). The WGS data were processed through a bioinformatics pipeline (Huskey et al., 2020). Upon alignment to the CanFam3.1 reference genome and annotation using gene predictions from Ensembl build version 75, a script was written to isolate PTVs found in all five Golden Retriever samples. PTVs were defined as single nucleotide variants (SNVs) that resulted in a premature stop codon or abrogated a splice site, and small insertions or deletions (indels) that changed a transcript's reading frame. Upon filtering, the genes with PTVs were classified into two groups: orthologs of human genes and non-orthologs. Polymerase chain reaction (PCR) and Sanger sequencing were carried out to validate the PTVs in human orthologs. CEACAM24 c.247dupG;p.(Val83Glyfs*48) was the only validated variant. Following validation, the 13 remaining CMT-affected Golden Retrievers underwent PCR and Sanger sequencing to determine their mutation status.
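The filtering step described above (keep only PTVs present in every sequenced sample) can be illustrated with a minimal Python sketch. This is not the authors' script, which is not given in the text. The consequence labels, the `Variant` tuple layout, and the function name `shared_ptvs` are all assumptions introduced for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# (chromosome, position, reference allele, alternate allele); hypothetical layout
Variant = Tuple[str, int, str, str]

# Assumed annotation labels for the PTV classes named in the text:
# premature stop codons, abrogated splice sites, and frame-shifting indels.
PTV_CONSEQUENCES = {"stop_gained", "splice_donor", "splice_acceptor", "frameshift"}


def shared_ptvs(
    calls: Dict[str, Iterable[Variant]],
    consequence: Dict[Variant, str],
) -> Set[Variant]:
    """Return the PTVs that were called in every sample of `calls`."""
    per_sample = [
        {v for v in variants if consequence.get(v) in PTV_CONSEQUENCES}
        for variants in calls.values()
    ]
    if not per_sample:
        return set()
    # A variant survives only if it is a PTV in all samples.
    return set.intersection(*per_sample)


# Toy input with made-up coordinates, for illustration only.
toy_calls = {
    "dog_1": [("1", 100, "A", "AG"), ("2", 5, "C", "T")],
    "dog_2": [("1", 100, "A", "AG"), ("3", 9, "G", "A")],
}
toy_consequence = {
    ("1", 100, "A", "AG"): "frameshift",
    ("2", 5, "C", "T"): "missense",
    ("3", 9, "G", "A"): "stop_gained",
}
print(shared_ptvs(toy_calls, toy_consequence))
```

Intersecting per-sample sets mirrors the requirement that a candidate be detected in all five dogs; splitting the result into human orthologs and non-orthologs would be a separate lookup step.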

Canine Controls
As a convenient, publicly available, online canine genetic variant repository, the European Variation Archive was initially used to note the allele frequency of CEACAM24 c.247dupG;p.(Val83Glyfs*48). The European Variation Archive provides high-quality WGS variant calls of over 200 dogs from multiple breeds (breed and sex information was unknown). The data were obtained through Ensembl by accessing the canine gene's "Variant table" under "Genetic Variation"; for a particular variant, "Population genetics" information was given, including European Variation Archive allele frequencies (Zerbino et al., 2018). Furthermore, additional splicing, frame-shifting, and stop gain mutations within the other dog CEACAM genes were investigated through Ensembl transcripts (CEACAM16: ENSCAFT00000044174; CEACAM18: ENSCAFT00000004587; CEACAM20: ENSCAFT00000047731; CEACAM24: ENSCAFT00000047960; CEACAM28: ENSCAFT00000022623). CEACAM1, CEACAM23, and CEACAM30 did not have variant information available in Ensembl for European Variation Archive data.

Canine Statistical Analyses
Upon determining CEACAM24 c.247dupG;p.(Val83Glyfs*48) allele frequencies, p-values were generated using Fisher's exact test in R (v 3.5.1), comparing allele counts in the CMT-affected Golden Retrievers to those in control dogs, including both European Variation Archive and CHIC DNA samples.
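The allele-count comparison can be illustrated with a small, self-contained Python sketch of a two-sided Fisher's exact test on a 2 × 2 table. The study itself used R; this re-implementation is only an illustration. The case counts (24 variant alleles out of 36) assume that the reported 66.7% is an allele frequency across 18 diploid dogs. The control counts are placeholders, not the study's data, so the resulting p-value is not expected to match the published one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability of seeing x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, total)


# Cases: 24 variant and 12 reference alleles (assumed from 66.7% of 36 alleles).
# Controls: 70 variant and 330 reference alleles (placeholder values only).
p_value = fisher_exact_two_sided(24, 12, 70, 330)
print(f"two-sided p = {p_value:.3g}")
```

Working on allele counts rather than dog counts doubles the number of observations per animal, which is why the table is built from chromosomes and not individuals.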

Dog and Human CEACAM Protein Analyses
EMBOSS water alignment (Madeira et al., 2019) was carried out to determine the level of homology between the dog CEACAM24 protein and other dog and human CEACAM proteins. Additionally, InterPro (Hunter et al., 2009) and the Eukaryotic Linear Motif (ELM) resource (Kumar et al., 2020) were used to identify CEACAM domains and binding motifs, respectively.

Human CEACAM Gene Analysis - The Cancer Genome Atlas
Due to the homology of the CEACAM gene family and no direct ortholog of dog CEACAM24 in humans, all human CEACAM family genes were investigated for rare PTVs in The Cancer Genome Atlas (TCGA) breast cancer cohort. Because inherited risk was being investigated, only blood-derived exomes of breast cancer cases were analyzed. Whole-exome binary sequence alignment mapping (BAM) files were downloaded using the Genomic Data Commons (GDC) Data Portal Repository through approved research project #10805. To acquire the samples, the specific filters under the "Cases" category included: Project (TCGA-BRCA), Samples Sample Type (Blood Derived Normal), and Race ("Black or African American" and "White"). The samples were further filtered under the "Files" category, including Experimental Strategy (WXS) and Data Format (BAM). A total of 170 sample files were obtained for African Americans and 650 for European Americans. These files were downloaded using the GDC Data Transfer Tool (version 1.2.0). Only individuals with known ages of breast cancer onset were used in this study; as a result, one European American and two African American BAM files were removed from further bioinformatics processing and statistical analysis. The downloaded BAM files, which had previously been aligned to the hg38 human reference genome, were processed using the remaining steps of a pipeline adapted from the Genome Analysis Toolkit's (GATK's) best practices pipeline (Van der Auwera et al., 2013). Base quality scores were recalibrated using BaseRecalibrator and then HaplotypeCaller was used to generate genome variant calling format (gVCF) files (GATK version 3.6). GenotypeGVCFs was used to merge the individual gVCF files based on ethnicity (GATK version 3.6). The European American files were merged in batches of approximately 200 using GATK's (version 3.6) CombineGVCFs prior to merging into a single VCF file with GenotypeGVCFs. The two ethnicity-specific VCF files were then processed through a variant quality score recalibration using VariantRecalibrator (GATK version 3.6), and, as recommended, SNVs were filtered using a pass filter of 99.5%, and indels were filtered using a slightly lower pass filter of 99.0% (Van der Auwera et al., 2013).
Frontiers in Genetics | www.frontiersin.org

Human Statistical Analyses
Using Fisher's exact test (Sprent, 2011) in R (v 3.5.1), individual PTVs were assessed to compare allele frequency differences between ethnic-specific TCGA breast cancer cases and EVS controls. Fisher's method was used for gene-based and gene family-based aggregation analyses (Fisher, 1925; Sutton et al., 2000). The R tool "sumlog" (in the "metap" package) was used to combine p-values for each aggregation test. To account for the one-sided nature of the Fisher exact test p-values, complements of p-values in the opposite direction were used in the calculations for the Fisher's method aggregation analyses.
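Fisher's method combines k p-values through the statistic X = −2 Σ ln(p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The Python sketch below illustrates that calculation using the closed-form chi-square tail for even degrees of freedom. It is not the "sumlog" implementation. The helper `directional_p` is one assumed reading of the complement step described above (use 1 − p when a variant's effect points away from the cases), and the p-values in the example are invented.

```python
from math import exp, log
from typing import Sequence


def directional_p(p_one_sided: float, enriched_in_cases: bool) -> float:
    """Assumed complement rule: flip p-values whose effect favors controls."""
    return p_one_sided if enriched_in_cases else 1.0 - p_one_sided


def fisher_combined(p_values: Sequence[float]) -> float:
    """Combine independent p-values with Fisher's method.

    For 2k degrees of freedom the chi-square survival function is
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!, so no statistics library
    is needed.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(log(p) for p in p_values)  # this is X / 2
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half_stat / i  # builds (X/2)^i / i! incrementally
        series += term
    return min(1.0, exp(-half_stat) * series)


# Invented one-sided p-values for two variants in one hypothetical gene.
gene_level_p = fisher_combined([
    directional_p(0.01, enriched_in_cases=True),
    directional_p(0.20, enriched_in_cases=False),
])
print(f"combined p = {gene_level_p:.4f}")
```

The same function applies at the gene level (all PTVs in one gene) and at the gene-family level (all PTVs across the CEACAM genes); only the list of input p-values changes.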

Human Mutation Analysis
Mutalyzer was used to determine the effect of frame-shifting and nonsense variants on the encoded protein (Wildeman et al., 2008). Human splicing mutations that affected non-protein-coding exons of the mRNA, specifically in the 3′ untranslated region (UTR), were analyzed using the miRDB tool to identify microRNA binding sites potentially lost due to a splicing mutation (Chen and Wang, 2020). For each gene harboring a splice mutation affecting non-protein-coding exons, microRNA binding sites within the 3′ UTR with a target score of ≥80 were

RESULTS
Upon filtering the WGS data, 12 different PTVs were detected in all five Golden Retrievers, four of which were within human orthologs. Only one PTV, a frame-shifting mutation in CEACAM24 (c.247dupG;p.(Val83Glyfs*48)), was determined to be a true positive upon validation (Figure 1). This mutation had a frequency of 66.7% amongst the 18 Golden Retrievers with CMT in this study (    (Figures 1C,D). Homology analysis revealed that the dog CEACAM proteins were, on average, 43.7% similar to the dog CEACAM24 protein (Table 2 and Figure 2A). Similarly, there were many related functional domains and high homology between the dog CEACAM24 protein and the human CEACAM proteins, averaging 51.9% similarity (Table 2 and Figure 2). This homology, along with the fact that there is no direct human ortholog of dog CEACAM24, prompted all human CEACAM genes (Figure 2B) to be investigated for rare PTVs in the TCGA breast cancer cohort.
A total of six rare PTVs were identified in African American and sixteen in European American breast cancer cases (Supplementary Tables 1, 2). Single variant assessment revealed five variants associated with breast cancer risk, three in each of the European and African American cohorts (Table 3 and Figures 3, 4). One variant, CEACAM7 c.195C>A;p.(Y65X), was associated with breast cancer risk in both ethnicities (Table 3 and Figure 3). Two stop gain mutations in CEACAM4 were associated with African American breast cancer (Table 3 and Figure 3), and two splicing mutations were associated with European American breast cancer, one in CEACAM6 and another within CEACAM8 (Table 3 and Figure 4). Both of those splicing mutations affect non-protein-coding exons in the 3′ UTR and, instead of truncating the protein, potentially disrupt key microRNA binding sites previously associated with cancer (Table 4 and Figure 4). Overall, gene-based aggregation analyses revealed that rare PTVs in CEACAM6, CEACAM7, and CEACAM8 are associated with European American breast cancer risk, and rare PTVs in CEACAM7 are associated with breast cancer risk in African Americans (Table 5). Ultimately, rare PTVs in the entire CEACAM gene family are associated with breast cancer risk in both European and African Americans with respective p-values of 1.75 × 10⁻¹³ and 1.87 × 10⁻⁴ (Table 5).

DISCUSSION
Utilizing a comparative oncology approach, our team identified CEACAM24 c.247dupG;p.(Val83Glyfs*48) in Golden Retrievers with CMT and subsequently determined that rare PTVs in the entire CEACAM gene family were associated with inherited breast cancer risk in humans. We previously described a large Golden Retriever pedigree with segregating CMT, carried out WGS on five selected Golden Retriever cases, and highlighted variants in orthologs of human breast cancer susceptibility genes (Huskey et al., 2020). In this current study, we used the same WGS dataset to identify novel variants that could be influencing Golden Retriever CMT susceptibility. We isolated PTVs found in all five sequenced Golden Retriever samples, and, upon validation, determined the mutation status in the 13 remaining CMT-affected Golden Retrievers within the pedigree. CEACAM24 c.247dupG;p.(Val83Glyfs*48) was the only validated variant and had an allele frequency of 66.7% amongst the 18 CMT-affected dogs. Despite not being recognized as a breed highly affected by CMT, Golden Retrievers have a higher prevalence of cancer compared to many dog breeds, with 65% of Golden Retrievers in the United States succumbing to the disease (Dobson, 2013; Salas et al., 2015; Kent et al., 2018). The Golden Retriever CEACAM24 c.247dupG;p.(Val83Glyfs*48) allele frequency and cancer mortality rate are very similar. The CMT-affected Golden Retrievers within this study can all be linked back to a sire in the United States from the 1950s, which was shortly after the registration of the breed with the American Kennel Club. Since importation to and registration in the United States, Golden Retrievers in Europe and the United States are considered two distinct populations, as breeding between the two continents is rare and unique gene pools have been established due to strict breeding standards and the popular-sire effect (Brackman, 2020).
Cancer mortality in European-bred Golden Retrievers has been reported to be 38.8%, which is much lower than in Golden Retrievers in the United States (65%) (Dobson, 2013; Kent et al., 2018). These differences could be explained by distinct genetic risk factors. The allele frequency of CEACAM24 c.247dupG;p.(Val83Glyfs*48) in the European Variation Archive was 17.3%, which corresponded to a p-value of 1.52 × 10⁻⁸ when compared to our CMT-affected Golden Retrievers from the United States. However, in addition to not knowing breed-specific information in the European Variation Archive, genetic bottlenecks upon importation to the United States need to be acknowledged. Thus, comparing allele frequencies to a United States dog population with known breed status, which can be determined through American Kennel Club registration, was important. Overall, CEACAM24 c.247dupG;p.(Val83Glyfs*48) appears to be common in Golden Retrievers in the United States with an allele frequency of 67.8%, which is not significantly different from the CMT-affected Golden Retriever cases. However, that allele frequency was determined by screening 87 Golden Retrievers from the CHIC repository with unknown disease diagnoses and age at sample submission. This is not ideal for canine cancer studies; older dogs (>8 years of age) with unaffected CMT status are recommended (Tonomura et al., 2015; Hayward et al., 2016). That said, if CEACAM24 c.247dupG;p.(Val83Glyfs*48) truly is a high-frequency allele in Golden Retrievers due to a genetic bottleneck in the United States, it could explain why 65% of Golden Retrievers succumb to cancer (Kent et al., 2018).
Regarding the assessment of other American Kennel Club breeds, an overall CEACAM24 c.247dupG;p.(Val83Glyfs*48) allele frequency of 22.4% was revealed, which was significantly different from CMT-affected Golden Retriever cases. Noting the small sample sizes of each breed, over half of the assessed breeds showed no presence of the variant. However, some breeds contained the variant at higher levels; most notably, Petit Basset Griffon Vendeen, Gordon Setter, Australian Cattle Dog, Siberian Husky, and Dalmatian. Petit Basset Griffon Vendeen, which had the highest allele frequency, has a cancer mortality rate of 33% (Dobson, 2013). In a United Kingdom study, Dalmatians, Gordon Setters, and Siberian Huskies were found to have cancer mortality rates ranging from 19.1 to 31.8% (Dobson, 2013), and Australian Cattle Dogs have a rate of 27% (Petmed, 2014).
CEACAM24 is a part of the dog CEACAM gene family (Figure 2A), which is a subdivision of the immunoglobulin superfamily of cell adhesion molecules (IgCAMs) (Smith and Xue, 1997; Kuespert et al., 2006). All IgCAMs, and hence all CEACAM proteins, are characterized by having at least one immunoglobulin (Ig)-like domain (Figure 2). CEACAM genes have diverse functions in both dogs and humans, including cell-cell adhesion, cell signaling, immunity/inflammation, angiogenesis, and tumor development, progression and metastasis (Kuespert et al., 2006; Kammerer et al., 2007; Kammerer and Zimmermann, 2010; Beauchemin and Arabzadeh, 2013; Han et al., 2020). CEACAM24 c.247dupG;p.(Val83Glyfs*48) abolishes the extracellular region, the transmembrane domain, and part of the cytoplasmic region, including the Ig V-set domain; thus, it is presumed to be a loss-of-function mutation. According to Ensembl, no other stop gain or frame-shifting variants have been identified in dog CEACAM genes. However, one splicing mutation in CEACAM28 (c.1415-2A>G) was identified, which had a 34% allele frequency within the European Variation Archive. The CEACAM gene family is present in many mammalian species but has evolved in a highly species-specific manner, heavily influenced by pathogen/host coevolution (Kammerer et al., 2007; Kammerer and Zimmermann, 2010; Weichselbaumer et al., 2011). Despite phylogenetic discordance of dog and human CEACAM genes (Weichselbaumer et al., 2011), our analyses revealed there is high homology between the dog CEACAM24 protein and the human CEACAM proteins, averaging 51.9% similarity. This homology, along with the fact that there is no direct human ortholog of the CEACAM24 gene, prompted all human CEACAM genes to be investigated for rare PTVs in the TCGA breast cancer cohort.
There are 12 human CEACAM genes, all of which cluster on chromosome 19q13.2-19q13.4. Over the years, genetic markers in that region have been associated with susceptibility to many different types of cancer, including breast cancer (Yin et al., 2002; Nexo et al., 2003, 2008; Vogel et al., 2004; Amin Al Olama et al., 2013; Gao et al., 2018). Nonetheless, inherited mutations in CEACAM genes have yet to be associated with inherited risk of cancer (Zheng et al., 2011; Kammerer et al., 2012; Wang et al., 2015). Aberrant expression of many CEACAM genes has been associated with tumorigenesis, and CEACAM gene products are recognized as clinically relevant tumor markers (Kuespert et al., 2006; Beauchemin and Arabzadeh, 2013; Han et al., 2020). Regarding breast cancer, CEACAM1 has been shown to be down-regulated compared to normal breast tissue (Yang et al., 2015), similar to its expression in prostate (Busch et al., 2002; Liu J. et al., 2020), endometrial (Bamberger et al., 1998), gastric (Takeuchi et al., 2019) and colon cancer (Fournes et al., 2001; Song et al., 2011), identifying it as a tumor suppressor. It has also been demonstrated that CEACAM5 (Iqbal et al., 2017; Powell et al., 2018), CEACAM6 (Maraqa et al., 2008; Tsang et al., 2013; Iqbal et al., 2017; Rizeq et al., 2018), and CEACAM19 (Michaelidou et al., 2013; Estiar et al., 2017) are overexpressed in breast cancer and are associated with enhanced tumor invasiveness and metastasis. Conversely, CEACAM6 and CEACAM8 co-expression inhibits proliferation and invasiveness of breast cancer cells (Iwabuchi et al., 2019). Additionally, CEACAM gene splice variants have been suggested to play a role in breast cancer tumorigenesis (Gaur et al., 2008; Zisi et al., 2020). Lastly, through exome sequencing, Li et al. observed loss of heterozygosity of CEACAM1, CEACAM3, CEACAM5, CEACAM6, CEACAM7, and CEACAM8 in breast cancer tumors that were associated with metastasis, suggesting that this closely linked gene family regulates tumorigenesis and metastasis synergistically (Li et al., 2014). Corroborating those preliminary findings, we have now determined that rare inherited PTVs in the entire CEACAM gene family are associated with breast cancer risk in both European and African Americans with respective p-values of 1.75 × 10⁻¹³ and 1.87 × 10⁻⁴. The p-value generated for African American breast cancer risk was likely influenced by the small sample size in TCGA.
We analyzed blood-derived exomes of European and African American breast cancer cases in TCGA to identify inherited PTVs in all human CEACAM genes, and detected sixteen and six rare PTVs in each ethnicity, respectively. Gene-based analyses determined that rare PTVs in CEACAM6, CEACAM7, and CEACAM8 are associated with European American breast cancer risk, and rare PTVs in CEACAM7 are associated with breast cancer risk in African Americans. CEACAM7, which was associated with breast cancer risk in both ethnicities, has no current link to breast cancer. However, down-regulation of CEACAM7 in hyperplastic polyps and early adenomas represents one of the earliest observable molecular events leading to colorectal tumors (Scholzel et al., 2000). Though CEACAM7 expression was thought to be restricted to the epithelial cells of the colon and pancreas, according to the Human Protein Atlas, glandular cells of the breast have moderate CEACAM7 protein expression (Uhlen et al., 2015; Raj et al., 2021). How CEACAM7 plays a role in breast cancer is currently unknown, but the link could even be indirect and due to expression in non-breast tissue (Ferreira et al., 2019). CEACAM7 c.195C>A;p.(Y65X), which was detected in 10.8 and 4.5% of European and African American cases, respectively, was absent in all EVS controls. It severely truncates the 265-amino-acid protein and results in a loss of the cytoplasmic region, as well as a large portion of the extracellular region, including disruption of the Ig-like and Ig V-set domains. It is likely a loss-of-function mutation (Figure 3).
Rare PTVs in CEACAM6 and CEACAM8 appear to only be associated with European American breast cancer risk. Considering that CEACAM6/8 co-expression inhibits proliferation and invasiveness of breast cancer cells (Iwabuchi et al., 2019), having a rare PTV in one of those two genes may be sufficient to override their synergistic tumor-suppressing relationship. While a number of PTVs were detected in these genes, two splicing mutations, CEACAM6 c.*40+2T>G and CEACAM8 c.*40+2T>G, were individually determined to be associated with European American breast cancer, both of which affect non-coding exons in the 3′ UTR. Both mutations affect the donor site immediately following exon 5 of their respective genes, which contains both coding and non-coding DNA. The mutated donor sites likely affect the downstream sequence of the mature mRNA product, either retaining (all or a part of) intron 5 or removing exon 6, the last non-coding exon, where many microRNA binding sites are located (Figure 4). Based on miRDB rankings, the top five microRNAs that bind to the 3′ UTRs of CEACAM6 and CEACAM8 have previous links to cancer (Table 4); thus, disrupted microRNA binding likely leads to aberrant CEACAM6 and CEACAM8 expression.
Two stop gain mutations in CEACAM4 (c.367C>T;p.R123X and c.424C>T;p.Q142X) were associated with African American breast cancer. These mutations were not detected in European American cases or controls, and are very rare in the general African American population. They were detected in significantly more African American breast cancer cases compared to ethnic-matched controls, suggesting their involvement in African American breast cancer risk. However, gene-based aggregation analyses did not support CEACAM4 as a breast cancer risk gene. Larger African American breast cancer cohorts will need to be studied to validate these findings. Interestingly, in a study of parous women with and without breast cancer, CEACAM4 has been reported to be up-regulated in normal breast compared to breast tumor samples (Balogh et al., 2007). Though race/ethnicity was not revealed in that study, the results suggest that CEACAM4 could be a breast cancer tumor suppressor.
It has long been reported that minimal genetic changes can have radical effects on the function of CEACAM genes (Naghibalhossaini and Stanners, 2004). Residues in CEACAM6 and CEACAM8 have been identified that are critical for CEACAM6 homodimerization as well as the formation of CEACAM6 and CEACAM8 heterodimers, which is important in preventing breast cancer cell proliferation (Kuroki et al., 2001; Iwabuchi et al., 2019). There have also been residues reported in CEACAM1 that are crucial for determining the risk of infection by receptor-binding pathogens (Villullas et al., 2007) and preventing the killing activity of NK cells (Markel et al., 2004). Furthermore, somatic missense mutations in colorectal cancers have been detected in CEACAM1 (Song et al., 2011) and CEACAM5 (Gu et al., 2020), the latter of which has been shown to increase proliferation by inhibiting TGFB signaling and altering the intestinal microbiome. The microbiome has been reported as a new breast cancer risk factor (Fernandez et al., 2018; Eslami et al., 2020). In fact, differences have been reported in the microbiome of normal and cancerous breast tissue, as well as the gut microbiota of breast cancer cases versus controls (Fernandez et al., 2018). Disrupted CEACAM genes could be the underlying mechanism through altered TGFB signaling, bacteria docking, and/or estrogen metabolism (Villullas et al., 2007; Tchoupa et al., 2014; Fernandez et al., 2018; Gu et al., 2020). This study reports the first association of inherited CEACAM mutations and breast cancer risk, and potentially implicates the whole gene family in genetic risk. Precisely how these mutations contribute to breast cancer needs to be determined, especially considering our current knowledge of the role that the CEACAM gene family plays in tumor development, progression, and metastasis.

DATA AVAILABILITY STATEMENT
The WGS data for the five whole genome sequenced CMT-affected Golden Retriever dogs can be obtained from the NCBI SRA repository under BioProject PRJNA745215. TCGA data are available through dbGaP.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by Auburn University Institutional Review Board (IRB) for the Protection of Human Subjects in Research. The patients/participants provided their written informed consent to participate in this study. Ethical review and approval was not required for the animal study because this research did not require ORC - Animal Care & Use (IACUC) approval since only dog DNA was studied upon receipt from the CHIC repository.

AUTHOR CONTRIBUTIONS
AH and NM wrote the manuscript and performed variant and statistical analyses. AH and IM performed PCR for validation and determining mutational frequency. AH performed bioinformatic processing. All authors read and approved the final manuscript.

ACKNOWLEDGMENTS
We would like to acknowledge the Orthopedic Foundation for Animals' CHIC DNA Repository, which provided CMT-affected dog DNA samples. We would also like to thank the Office of Information Technology at Auburn University Hopper High-Performance Computing Cluster for compute time and technical support.