Evolution of Protein-Mediated Biomineralization in Scleractinian Corals

While recent strides have been made in understanding the biological process by which stony corals calcify, much remains to be revealed, including the ubiquity across taxa of specific biomolecules involved. Several proteins associated with this process have been identified through proteomic profiling of the skeletal organic matrix (SOM) extracted from three scleractinian species. However, the evolutionary history of this putative “biomineralization toolkit,” including the appearance of these proteins’ throughout metazoan evolution, remains to be resolved. Here we used a phylogenetic approach to examine the evolution of the known scleractinians’ SOM proteins across the Metazoa. Our analysis reveals an evolutionary process dominated by the co-option of genes that originated before the cnidarian diversification. Each one of the three species appears to express a unique set of the more ancient genes, representing the independent co-option of SOM proteins, as well as a substantial proportion of proteins that evolved independently. In addition, in some instances, the different species expressed multiple orthologous proteins sharing the same evolutionary history. Furthermore, the non-random clustering of multiple SOM proteins within scleractinian-specific branches suggests the conservation of protein function between distinct species for what we posit is part of the scleractinian “core biomineralization toolkit.” This “core set” contains proteins that are likely fundamental to the scleractinian biomineralization mechanism. From this analysis, we infer that the scleractinians’ ability to calcify was achieved primarily through multiple lineage-specific protein expansions, which resulted in a new functional role that was not present in the parent gene.


INTRODUCTION
Scleractinian corals (commonly known as stony or hard corals) are foundation species in the tropical marine ecosystem (Moberg and Folke, 1999). One of their most important roles is reef formation through their ability to create a rigid aragonite exoskeleton by the process of biomineralization. These exoskeletons are valuable as they provide the ecological framework that supports high rates of primary production and permits extensive biological diversity in coral reef ecosystems (Veron et al., 2009), as well as serving as a large reservoir of biogenic calcium carbonate in the ocean (Cohen and McConnaughey, 2003). Scleractinians are among the oldest biomineralizing Metazoa, likely appearing in the late Ordovician (∼445 Million years ago), and becoming highly diverse (Stolarski et al., 2011;Drake et al., 2020). They are represented by different morphologies and spatial distribution (Veron, 2000), and show distinct molecular evolution, with the order split into two major clades, known as the Complexa (complex corals) and Robusta (robust corals) (Romano and Palumbi, 1996;Kitahara et al., 2010;Ying et al., 2018), named for the extent of skeletal calcification-specific patterns of corallite wall construction. While the process of biomineralization in scleractinians has long been studied (reviewed by Drake et al., 2020), its biomolecular mechanisms have only recently begun to be revealed (reviewed by Murdock, 2020), coinciding with advances in genomics and protein identification.
The "biomineralization toolkit" is the collective term for the many specific lipids, polysaccharides, and proteins both documented and hypothesized to be involved in the formation of the biomineral at various stages of an organism's life history, some of which may become embedded in its skeleton (Livingston et al., 2006). Those organic molecules which are retained in the skeleton (the skeletal organic matrix, SOM) directly mediate and regulate the process by which many organisms from across all kingdoms of life form biominerals (Lowenstam and Weiner, 1989;Mann, 2001;Knoll, 2003), with the resulting biominerals exhibiting characteristics different from their abiogenic counterparts (Weiner, 2003;Gal et al., 2015). Out of all the SOM molecules, the most intensively studied are the proteins (Evans, 2019;Clark, 2020;Erwin, 2020;Murdock, 2020). Proteomic studies have shown that different lineages use sets of proteins with similar functional categories, including matrix formers, nucleation assisters, signalers, and remodelers to form their skeletal structure (Marin et al., 2016;Evans, 2019). Although the SOM proteins from distant organisms share common properties (Evans, 2019), each taxon-specific suite appears to have evolved independently through convergent and co-option evolution, resulting in varying contributions by lineage-and species-specific novel proteins, which exhibit contrasting rates of conservation between and within lineages (Drake et al., 2014).
In scleractinians, numerous SOM-related characteristics have been studied (Tambutté et al., 2011), yet only a few proteomic profiling experiments have been conducted, and then solely for tropical species. Extensive proteomic studies using speciesspecific genomes and transcriptomes include those of Stylophora pistillata Peled et al., 2020), Acropora millepora (Ramos-Silva et al., 2013), and Acropora digitifera , which, when combined, revealed over 100 SOM protein, hence members of the "biomineralization toolkit." Similar to previous examinations of various metazoan lineages, scleractinian SOM proteins appear to share functional roles in carbohydrate-binding and catalytic activities (Ramos-Silva and Marin, 2015). Notably, the most extensively studied SOM proteins in scleractinians are the aspartic acid-rich proteins which assist in mineral nucleation and modification (Lowenstam and Weiner, 1989;Marin and Luquet, 2008;Mass et al., 2013;Gavriel et al., 2018;Laipnik et al., 2019), and α-carbonic anhydrases that play a role in both carbon supply and concentration (Bertucci et al., 2013;Zoccola et al., 2016). However, many scleractinian SOM proteins do not contain known functional domains and remain to be functionally characterized. Furthermore, out of all the known scleractinian SOM proteins, only a few were found to be shared between the three species Peled et al., 2020). The identification and characterization of the suite of scleractinian SOM proteins to date has led to the hypothesis that the proteins underlying scleractinian skeleton formation developed through stepwise evolution, supplementing proteins that are conserved across Metazoa with scleractinianspecific and species-specific novel proteins (Ramos-Silva et al., 2013;Takeuchi et al., 2016).
While useful for initial studies, most of the analyses that examined SOM protein diversity across taxa were carried out using heuristic methods of sequence similarity scores (e.g., BLAST), which estimates the phylogenetic relationships between a set of genes by the premise that higher-scoring sequence pairs are likely to have diverged more recently compared to their lower-scoring counterparts (Fitch, 1970;Lafond et al., 2018;Emms and Kelly, 2019). As a preliminary examination, sequence similarity can aid in determining homologous gene groups and are useful for function-related applications (Doyle et al., 2010;Paps and Holland, 2018;Richter et al., 2018); yet, the lack of a phylogenetic analysis based on a species tree limits our understanding of the proteins' origin and evolutionary dynamics, as sequence duplication can result in a high sequence differentiation and subsequently leads to overlooking orthologous sequences (Lafond et al., 2018). To date, Bhattacharya et al. (2016) have published the most intensive phylogenetic study of the previously known scleractinian biomineralization proteins. The authors provided the basis for understanding scleractinian genomic evolutionary information, revealing mechanisms for scleractinians to adapt to changing environments while maintaining the ability to calcify. Recent advances in genome and transcriptome sequencing and the production of more gene databases are increasing our ability to provide a higher resolution comparison of SOM proteins and therefore, a better understanding of their evolutionary dynamics. This will allow extrapolation of the occurrence of the scleractinian "core biomineralization toolkit, " that is, the biomineralization-related proteins that are shared across scleractinian species and, as such, have a fundamental role in the skeleton formation process across the order. While at first glance, the most straightforward method may be to sequence more scleractinian skeletal proteomes, in practice, direct proteomic analyses are often incomplete and time-consuming (Marin et al., 2016;Aguilera et al., 2017;Peled et al., 2020); predictions based on transcriptomic and genomic data, therefore, become essential. However, on their own, such predictions can result in redundancies and overestimations, while at the same time overlooking potential gene candidates due to unresolved and incomplete genomes and transcriptomes (Eisenhaber, 2013;Sinha et al., 2018). Therefore, it is essential to combine both approaches, based on both proteins and DNA/RNA sequencing, to generate a more holistic picture of SOM protein evolution.
Here, we used a phylogenetic approach of the known scleractinian SOM proteins to study their evolution across the metazoan tree of life. As we have used the orthology/paralogy relationships for each protein in one species at a time, our results are independent for each lineage, providing a robust evaluation of their evolution. Our results reveal part of the "core biomineralization toolkit" across scleractinians, comprised of multiple proteins sharing an evolutionary history across distinct species. Since orthologous genes are more likely to share a biological function (Fang et al., 2010;Gabaldón and Koonin, 2013;Altenhoff et al., 2019), our approach might allow us to extrapolate the occurrence of proteins that play a fundamental role in the skeleton formation across scleractinians. However, the major fraction of each species SOM proteins were found to be independently co-opted into their own "biomineralization toolkit" from genes that evolved before the emergence of scleractinians. These were coupled with scleractinian-specific gene family expansions resulting in each scleractinian lineage and species having a unique set of SOM proteins.

Scleractinian SOM Protein Orthogroup and Gene Tree Reconciliation
Our phylogenetic analysis was based on 43 annotated genomes spanning the metazoan kingdom (Supplementary Table 1), with the addition of seven Fungi species and two choanoflagellate species as the outgroups. The outgroups were selected in order to consider the Opisthokonta evolution. More specifically, the Fungi kingdom was included as scleractinians were found to share a complete histidine biosynthesis pathway, which is unique across the Metazoa (Ying et al., 2018), indicating that the consideration of fungi outgroup may be critical. Furthermore, the Choanoflagellata class was also included as it is considered to be the sister group of the Metazoa (King et al., 2008;Schalchian-Tabrizi et al., 2008). The complete dataset includes rigid skeleton/shell forming and non-forming taxa, including representation of major marine calcifying phyla (Mollusca, Echinodermata, Arthropoda, and Cnidaria) as we sought to group the known scleractinian SOM proteins Ramos-Silva et al., 2013;Takeuchi et al., 2016;Peled et al., 2020) into their respective orthogroups. An orthogroup is defined as a set of genes descended from a single gene in the last common ancestor of all the species being considered. We limited our database of known scleractinian SOM proteins to studies for which the skeletal proteomes were sequenced against annotated genomes from the same species, limiting us to representatives of scleractinian SOM proteins from A. digitifera, A. millepora, and S. pistillata. We decided to use a larger proportion of species from the cnidarian phylum and more particularly, within the scleractinian order, as they are the focus of this study. The annotated genomes in the analysis are comprised of datasets with a median BUSCO score of 90.05% (Seppey et al., 2019;Supplementary Table 1). To infer the scleractinian SOM orthogroups, we clustered all the protein-coding sequences from our entire database using OrthoFinder 2.2.7 Kelly, 2015, 2019), to give a total of ∼16,000 orthogroups. After identifying all orthogroups from each sequence in our known SOM protein dataset, we aligned the sequences in each orthogroup separately using MAFFT (Katoh and Standley, 2013), followed by the removal of sequences and regions based on inconsistencies in the consensus alignment; sequences with both fewer than thirty aligned amino acids and less than 50% of the sequence aligned columns with <2 aligned sequences were removed. Gene trees were constructed in IQ-TREE (Nguyen et al., 2015) using the best-fitted model (LG) and discrete Gamma distribution of four rates across site categories (Resulting trees can be found in Supplementary File 1). To infer pairwise orthology relationships and to reconstruct the sequences' evolutionary histories, the gene trees were further rooted and reconciled via the Orthofinder2 pipeline, using the rooted species tree with the topology presented in Figure 1A, which is based on the current knowledge of animal phylogeny (Laumer et al., 2019;Fernández and Gabaldón, 2020). Then we selected orthogroups that include known SOM proteins of A. digitifera, A. millepora, and S. pistillata. Lastly, the known scleractinian SOM protein orthogroups were used for downstream analyses.
As data acquired from de novo transcriptomes are highly fragmented and can lead to misinterpretation of downstream analysis (Emms and Kelly, 2015), we did not include any transcriptomic data sets as part of the Orthofinder2 pipeline. However, the ability to produce a calcium carbonate rigid skeleton is not unique to the scleractinian order within Cnidaria, with several taxa, including hydrozoans and octocorals, demonstrating this ability. Therefore, to further identify common molecular traits between extant taxa, we searched for the known scleractinian SOM proteins putative orthologs in the transcriptome of a representative species, the blue octocoral Heliopora coerulea (Guzman et al., 2018) using Conditional Reciprocal Best BLAST 0.6.6 (CRB-BLAST) (Aubry et al., 2014), that performs complementary BLAST alignments between query and target sequences. Sequences with more than a single hit across query and target of the same scleractinian species, were removed.

Scleractinian SOM Protein Orthogroup Gain and Duplication Patterns
We based our analysis on Dollo's parsimony (Farris, 1977) and phylostratigraphic profiling (Domazet-Lošo et al., 2007) to infer the likely phylogenetic origin of each known scleractinian SOM gene family. Dollo's parsimony, implemented in COUNT 9.1106 (Csûös, 2010), is modeled on Dollo's law (Dollo, 1893) that argues the statistical improbability of an organism to transition into a different state. It leads to a substantial simplification of evolutionary scenarios as it assumes that genes, which have been lost during evolution in a particular lineage, are unlikely to be regained. This heuristic approach enables us to map the scleractinian SOM proteins to the species tree based on the most phylogenetically distant lineage present in their representative orthogroup and to determine if SOM protein gains are the product of a lineage-specific evolution rather than the co-option of pre-existing genes into a skeleton formation role. If an SOM protein was reported as not having an orthogroup, we assigned it as a species-specific protein.
FIGURE 1 | Evolutionary origin of SOM proteins (SOMP) and duplication rates across animal taxa. (A) In the species tree, each internal node represents an ancestral lineage, where tips represent extant species. Accordingly, blue-filled bars near specific internal nodes represent the percentage of known SOM proteins that were gained in a specific ancestral lineage, based on our results. Similarly, the gray-filled bar represents the percentage of known SOM proteins that evolved from a protein whose last duplication occurred at specific ancestral species. Bottom-right bars represent gain at extant species. The mean (x) and standard deviation (s.d.) values are indicated in each bar graph. Black shapes symbolize the different mineral phases of the species that form a rigid skeleton or shell. (B) The heatmap describes the occurrences of each known SOM protein from Stylophora pistillata, Acropora digitifera, and A. millepora across the different orthogroups and their coding sequence homologs among all species in this study. For a comprehensive list of the SOM proteins found in each orthogroup, see Supplementary Table 2. Gene duplication rates were calculated using the phylogenetic birth-and-death model implemented in COUNT 9.1106 (Csûös, 2010). Specifically, the rate model was calculated and optimized under the gain-loss-duplication model with the Poisson distribution at the root. The variation rate across families was set to 4:1:1:4 gamma categories for the edge length, the loss rate, gain rate, and the duplication rate, respectively. The convergence criteria applied were set to 100 rounds for the optimization FIGURE 2 | Rooted gene tree of polycystin-like proteins (OG0006984) representing the independent co-option of orthologous proteins into the SOM of the Acroporidae species. Node points represent the phylum (shape) and order (color). Tips that are labeled red indicate the occurrence of scleractinian known SOM proteins. The highlighted cluster represents a "Scleractinia branch." Node support values indicate percentage bootstrap values. Only values above 75 are indicated. The list represents only "Scleractinia branches" with at least two scleractinian known SOM proteins and their density in the branch. Asterisks symbolize "Scleractinia branches" that contain SOM proteins from species of both the complex and robust scleractinian clades.
rounds with a likelihood threshold of 0.1. This model is based on the assumption that the primary mechanism of gene gain in eukaryotes is genomic duplications, while the possibility of gene gain through horizontal gene transfer (HGT) between different FIGURE 3 | Rooted gene tree of scleractinian coral aspartic acid-rich proteins (OG0001947) representing the scleractinian-specific expansion of the gene family involving species from both the complex and robust scleractinian clades. Node points represent the phylum (shape) and order (color). Tips that are labeled red indicate the occurrence of scleractinian known SOM proteins. The two highlighted clades represent orthologous relationships between scleractinian sequences to the known SOM protein(s) in their respective "Scleractinia branch." Node support values indicate percentage bootstrap values. Only values above 75 are indicated.
nodes of the gene tree is less likely (Csûrös and Miklós, 2006). However, as our methodological approach centered around using gene phylogenies to infer pairwise orthology and paralogy relationships for all genes in the analysis, it is robust to the effect of HGT, genome completeness and variable genome size (Huerta-Cepas et al., 2014;Kelly, 2015, 2019).

SOM-Enriched Branches and Permutation Test Method
In each rooted gene tree (see above), we first detected "Scleractinia branches, " namely: groups of proteins that evolved from a single protein that existed in the last common ancestor of complex/robust scleractinians. For each scleractinian branch, we calculated the following values: (1) the total number of proteins in that branch, PN; (2) the number of SOM proteins known from previous studies Ramos-Silva et al., 2013;Takeuchi et al., 2016;Peled et al., 2020) in that branch, pn; and (3) SOM protein density, pn/PN, if at least one known SOM protein was observed, otherwise density = 0. Our analysis tested whether known scleractinian SOM proteins evolved independently as opposed to having evolved from an SOM-related protein that existed in the most recent common ancestor of scleractinia. We compared our observed "Scleractinia branches" SOM protein density to an expected density obtained by randomly selecting (n1) S. pistillata, (n2) A. digitifera, and (n3) A. millepora proteins (where n1, 2, 3 are equal to the counts of respective observed SOM proteins). The comparison was conducted using a permutation test (n = 1,000 sums), where a P-value was defined as the proportion of cases where the observed sum of density is smaller than the expected sum. We also repeated this test where density = 1 is assigned if species from both the complex and robust scleractinian clade are found in a cluster, and otherwise density = 0.

Scleractinian SOM Protein Evolutionary History
Using OrthoFinder2, 123 known scleractinian SOM proteins were clustered in 72 different orthogroups (i.e., all genes descended from a single gene belonging to the last common ancestor of the tested species). The identification of known scleractinian SOM proteins was based on previously published scleractinian SOM proteomes ( Figure 1B and Supplementary Table 2). The majority of the orthogroups, 54 out of 72, were found to include known SOM proteins from a single species, while twelve orthogroups contain two species and seven orthogroups are represented by all three species (Figure 1B and Supplementary Table 1). In fourteen orthogroups, we have identified the occurrence of scleractinian known SOM protein paralogs (that is, proteins which separated by a duplication event), from one or more species (Figure 1B and Supplementary Table 2). Although most orthogroups included known SOM proteins from a single species, we identified orthologs of those proteins across most scleractinian species with a mean of 67.5 (±1.78) orthogroups per scleractinian species (Figure 1B). Gene gain can involve the co-option of pre-existing molecular traits to serve a new functional role as well as the evolution of lineage-specific genes, through a multiple "birth" model (True and Carroll, 2002;Choi and Kim, 2006;McLennan, 2008;Mello et al., 2018). As such, we evaluated these two categories using a phylostratigraphic approach. We observed an unbalanced gene distribution, where 76% of the SOM proteins are descendants of genes that were gained before the cnidarian diversification ( Figure 1A and Supplementary Table 3). In our analysis, we found only a single scleractinian-specific orthogroup (OG0012228), representing species from both the robust and complex scleractinian clades. This orthogroup contains the S. pistillata SOM protein "Coral Acid-Rich Protein 2" (Supplementary Figure 1), which was observed to have an essential role in the early life stages of scleractinians (Mass et al., 2016;Akiva et al., 2018). However, ∼60% of the SOM proteins, found in the different orthogroups, emerged during the scleractinian evolution due to gene family expansion ( Figure 1A and Supplementary Table 3). Simultaneously, 10.53 and 7.41% of the known SOM proteins from S. pistillata and A. millepora, respectively, were not found in any a specific orthogroup (Supplementary Table 2), suggesting a species-specific evolution.
Calculation of duplication rates across the tested orthogroups (Supplementary Table 3) shows high rates at the Opisthokonta and Choanimalia branches, followed by low rates at the subsequent branches leading to the known SOM proteins' species (Supplementary Figure 2).
Using CRB-BLAST, 21 distinct transcripts from the massive aragonite skeleton forming octocoral, H. coerulea, were identified to be putative orthologs of 27 scleractinian known biomineralization proteins, spanning 20 orthogroups (Supplementary Table 4). All identified H. coerulea transcripts were present in orthogroups with the phylogenetic origin at the metazoan branch or earlier ( Figure 1B). Furthermore, in all of the respective orthogroups, at least two other non-rigid skeleton forming octocoral species were identified ( Figure 1B). In five instances, the H. coerulea transcripts were identified as putative orthologs of more than a single species in the same orthogroup. Those orthogroups include putative enzymes, transporters, and acidic proteins, suggesting functional conservation across lineages. However, due to limitations of the transcriptomic dataset, the evolutionary relationship between sequences was not resolved. Our understanding of the evolutionary dynamics that guide the biomineralization gene repertoire evolution will increase through the growth in annotated genomic datasets.

Identification of Species-Specific vs. Scleractinian-Conserved SOM Protein Evolution
In the 72 gene trees tested, we detected multiple "Scleractinia branches, " namely: groups of scleractinian genes that evolved from a single gene copy of the most recent common ancestor of scleractinians. Subsequently, we identified known SOM proteins in S. pistillata Peled et al., 2020), A. digitifera , and A. millepora (Ramos-Silva et al., 2013) within each "Scleractinia branch." Accordingly, in 12 "Scleractinia branches, " we observed the cross-species conservation of more than a single known SOM protein (Figure 2 and Table 1). Moreover, in eight "Scleractinia branches, " known SOM proteins from both the complex and robust scleractinian clades were identified (Figures 3, 4 and Table  1). A permutation test revealed that this clustering pattern is nonrandom (p < 0.001), with a significantly higher SOM protein density per cluster than expected ( Table 1 and Supplementary  Figure 2). Overall, these results indicate that the evolution of SOM-related functions emerged in the last common ancestor of scleractinians.

DISCUSSION
The mineralized skeleton is a paramount innovation, appearing simultaneously across phyla during the Cambrian Explosion (Murdock and Donoghue, 2011;Erwin, 2020;Murdock, 2020). Currently, there is growing evidence that many animal taxa inherited sets of ancestral genes that were then independently co-opted to guide skeleton formation (Murdock, 2020). Here, we sought to examine the evolutionary history of scleractinian biomineralization proteins across the metazoan tree of life to determine how modern stony corals evolved to form one of the most significant biostructures on Earth (Veron et al., 2009).
Determining the age of the SOM proteins by phylostratigraphy ( Figure 1A) indicates that each scleractinian species' "biomineralization toolkit" has a similar age profile, characterized by progressively descending gene gain toward the tip of the tree. The most substantial proportion of genes for which orthogroups were found, appear to have evolved over 700 million years ago, before the cnidarian diversification, through extensive expansions of gene families. Of these ancestral genes, 96% of orthogroups contain at least one species that does not form a rigid skeleton (Figure 1B). The evidence that the scleractinian SOM proteins have an ancient origin and are shared between rigid skeleton forming and non-forming taxa further supports the hypothesis that the "biomineralization toolkit" evolved by the differential independent co-option of genes that had unrelated skeleton forming functions (Erwin, 2020).
The independent co-option of ancient genes is not only restricted to between-lineage relationships, but is also evident in the same lineage, with taxa utilizing a unique set of proteins with similar functional patterns (Evans, 2019). For example, mollusks have been found to express in their mantle a species-specific unique set of genes that evolved before their evolutionary origins (Aguilera et al., 2017). This tendency for different species to use a separate set of ancient genes that converge toward the same results is also found in stony corals. Orthogroup OG0006984 (Figure 2), was found to include polycystin-like sequences, having calcium binding sites (Rastogi and Liberles, 2005). This gene family has an evolutionary origin going back at least to the Parahoxozoa lineage. Although the SOM proteins were found to be orthologous to sequences from all three scleractinian species with a published SOM proteome, only the Acroporidae family proteins has been identified in the skeleton. In addition, solely an S. pistillata known SOM protein was identified in the CAP-Gly orthogroup (OG0004585, Figure 5) which contain genes that are involved in the transport of vesicles along the cytoskeletal network (Riehemann and Sorg, 1993). Similarly to the polycystin orthogroup, the known SOM protein shares a 1:1 orthology to sequences from the scleractinian species with a published SOM proteome yet, was only identified in the S. pistillata's skeleton. It is noteworthy that this example can be a result of the identification of different protein sets by the use of diverse protein extraction methods, which can lead not only to different yield but also different content of proteins as the various methods are biased toward their own properties (Marin et al., 2016;Klont et al., 2018;Peled et al., 2020). For example, mechanical filtration is biased toward hydrophobic proteins, while acetone precipitation increase the identification of hydrophilic proteins (Thongboonkerd et al., 2002). As such, the use of multiple extraction methods to retrieve SOM proteins makes it difficult to fully compare across species, likely leading to underestimation of the SOM protein repertoire.
Species-specific novel proteins are those with no orthologous relationship outside the species of interest. This class of proteins was also found to contribute to the scleractinian SOM protein assemblage, although to a lesser extent than the co-option of pre-existing molecular traits ( Figure 1A and Supplementary  Table 2). While the variability in their detection is high between our species of interest, as discussed above, it could also be caused by underestimation due to different extraction methods. In mollusks, species-specific proteins have been suggested to play a considerable role in physiological adaptations to environmental changes (Arivalagan et al., 2016) and in the formation of the numerous shell morphologies and properties (Kocot et al., 2016). However, to date, our knowledge of the functional roles of most SOM proteins in scleractinians is still lacking. Therefore, the clarification of novel biomineralization proteins from different species, combined with experimental functional validation, is still required to elucidate their significance.
Gene duplication events play a crucial role in the emergence of novel genes (Singh and Bansal, 2019), and are thought to have contributed to the evolution of morphological and physiological diversity (True and Carroll, 2002;Kondrashov, 2012;Lallemand et al., 2020). The high rates of duplications that occurred before the metazoan diversification (Supplementary Figure 2) suggest that both ancient duplications and the retention of duplicated genes have contributed to the expansion of the gene families. This allowed the emergence of novel functions and possibly promoting the specific evolution of scleractinian SOM proteins. The mechanisms that may lead to gene duplications in scleractinian lineages include tandem duplications, transposable elements, retrotransposition and transduplication, and segmental duplications (Lallemand et al., 2020). Duplicated genes can acquire novel functions, namely, undergoing neofunctionalization and subfunctionalization, where paralogs may carry complementary functions (Thongboonkerd et al., 2002). As we further explain here, it seems that the scleractinian SOM protein evolution is characterized by the gain of SOM-specific genes, supporting a neofunctionalization model.
Despite the many scleractinian SOM protein gene families that appear to be orthologous to diverse phyla, the evolution of scleractinian-specific SOM lineages and their neofunctionalization seem to be the primary force ( Figure 1A). For example, the aspartic acid-rich gene family (Figure 3) is represented by several known SOM proteins from all three scleractinian species with sequenced skeletal proteomes. While known coral SOM proteins in this orthogroup have several orthologous sequences across the cnidarian, molluscan and brachiopod phyla, the orthogroup expansion resulted in the speciation of two distinct scleractinian-specific clusters, each with its own unique last common ancestor (i.e., being spread across the robust and complex clades). A similar example can be found for the metal transport gene tree family (Figure 4), where the various scleractinian SOM proteins share diverse orthologous relationships across many phyla, nonetheless converging into a scleractinian-specific branch. Although the scleractinian "biomineralization toolkit" bound in the skeleton seems to differ between species, the presence of multiple orthologs to known SOM proteins seems to indicate that SOM-related functionality emerged in the last common ancestor of scleractinia ( Table 1). As such, these known SOM orthologs retained a fundamental role in biomineralization and are therefore conserved across scleractinian species.
Going further, we propose that for distinct scleractinian species whose skeletal proteomes have not yet been sequenced, the likelihood of orthologous genes to known SOM proteins found under specific "Scleractinia branches, " will be further identified upon proteomic profiling. Subsequently, the ability to have SOM-related gene markers will increase our capability to predict corals' response to changing environments, without the need to perform proteomic analysis on a large number of coral species. However, we acknowledge that our predictions should be taken with certain caveats. While proteomic data representing both the complex and robust scleractinian clades are available, our resolution may be limited, as very few scleractinian species possess a published SOM proteome profile relative to the ∼1,600 known scleractinian species (WoRMS, 2020). Consequently, we are likely missing groups of SOM proteins conserved across species and affiliated with specific growth forms, life strategies and habitats. As such, we suggest that the type of analysis used here will become more robust with the addition of a diverse representation of scleractinian SOM proteomes that will further help identify the "core biomineralization toolkit" across, and between, the scleractinian order.
Altogether, our results clarify the differing evolutionary dynamics of the scleractinian corals' "biomineralization toolkit" as illustrated in Supplementary Figure 4. First, we provide further evidence that the evolution of a subset of biomineralization proteins in scleractinians is through a stepwise process (Ramos-Silva et al., 2013;Takeuchi et al., 2016;Murdock, 2020). This is evident by the presence of preexisting genes shared by an assortment of skeleton forming and non-forming taxa. It suggests that gene co-option played an integral role in the initial development of an extracellular organic matrix in the last common ancestor of the scleractinian order. Second, the differential independent co-option, through gene duplications, followed by sub and neofunctionalization to form lineage-specific proteins and construct species-specific organic matrix frameworks, can have a significant role in distinct skeleton morphology between species. Third, this would supplement the contribution of novel species-specific proteins, crucially allowing organismal plasticity and adaptation to environmental change. While the presence of such lineageand species-specific key innovations appears to have crucial roles, our results emphasize the importance of the evolutionary dynamics through gene duplications, although the mechanism remains to be revealed.

DATA AVAILABILITY STATEMENT
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://github.com/ OOassafOO/gene-set-overrepresentation-in-trees.

AUTHOR CONTRIBUTIONS
TZ, AM, and TM designed the study. TZ and AM analyzed the data. All authors wrote the manuscript and approved it.

FUNDING
This project was received funding from the European Research Council (ERC) under the European Union's Horizon 2020 Research and Innovation Programme (grant agreement no. 755876 to TM) and from the Israeli Binational Science Foundation (BSF 2016321 to HP and TM). JD was supported by the Zuckerman STEM Leadership Program. Computations presented in this work were performed on the Hive computer cluster at the University of Haifa, which is partly funded by ISF grant 2155/15.