Collagen Binding Proteins of Gram-Positive Pathogens

Collagens are the primary structural components of mammalian extracellular matrices. In addition, collagens regulate tissue development, regeneration and host defense through interaction with specific cellular receptors. Their unique triple helix structure, which requires a glycine residue every third amino acid, is the defining structural feature of collagens. There are 28 genetically distinct collagens in humans. In addition, several other unrelated human proteins contain a collagen domain. Gram-positive bacteria of the genera Staphylococcus, Streptococcus, Enterococcus, and Bacillus express cell surface proteins that bind to collagen. These proteins of Gram-positive pathogens are modular proteins that can be classified into different structural families. This review will focus on the different structural families of collagen binding proteins of Gram-positive pathogens. We will describe how these proteins interact with the triple helix in collagens and other host proteins containing a collagenous domain and discuss how these interactions can contribute to the pathogenic processes.


INTRODUCTION
Collagen is the most abundant protein in the human body and an integral component of the extracellular matrix (ECM) (Shoulders and Raines, 2009). The ECM is a complex proteinaceous network that provides structural support to tissues along with the necessary signaling for cell adhesion, migration, and growth as well as for tissue development and regeneration (Frantz et al., 2010). Collagen plays a critical role in the functional integrity of most tissues including bone, skin, tendon, and cartilage (Burgeson and Nimni, 1992; Frantz et al., 2010). Collagen can also be the target of surface-anchored adhesins and other virulence factors produced by both Gram-positive and Gram-negative pathogens (Harrington, 1996; Singh et al., 2012; Zhang et al., 2015; Duarte et al., 2016; Paulsson and Riesbeck, 2018; Vaca et al., 2020). Of these, the cell wall anchored collagen binding proteins in Gram-positive bacteria have been more extensively studied, and will be reviewed here.
There are 28 identified types of collagens in humans (Table 1; Ricard-Blum, 2011). Each collagen molecule is formed through the interactions of three polypeptides known as α-strands. The α-strands come together to form a canonical right-handed triple helical structure termed the triple helix domain (Kadler et al., 2007; Ricard-Blum, 2011). Triple helices can be formed by association of identical α-strands (homotrimer) or be composed of different α-strands (heterotrimer) (Ricard-Blum, 2011). The triple helix domain is a flexible rod-shaped structure held together through inter-chain hydrogen bonding (Kadler et al., 2007; Shoulders and Raines, 2009; Ricard-Blum, 2011). The triple helix is defined by Gly-X-X' amino acid repeats with X and X' commonly representing proline and 4-hydroxyproline, respectively (Shoulders and Raines, 2009). A glycine is required at every third residue, as any other residue would result in steric hindrance and helix destabilization (Theocharis et al., 2016). Collagens also have non-triple helical domains at their N- and C-termini, which are referred to here as "non-collagenous" domains. In addition to the conventional collagens, several other mammalian proteins contain collagenous domains (Fraser and Tenner, 2008; Zani et al., 2015; PrabhuDas et al., 2017; Casals et al., 2019).
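The Gly-X-X' rule described above lends itself to a simple sequence check. The following is a minimal sketch, assuming a one-letter amino acid string in which O denotes hydroxyproline and assuming the repeat register starts at the first residue; the function names and the example peptides are hypothetical illustrations, not sequences taken from any collagen.

```python
from typing import List


def glycine_repeat_violations(seq: str) -> List[int]:
    """Return 0-based positions where the Gly-X-X' register expects
    glycine (every third residue, starting at index 0) but finds
    another residue. Assumes the register starts at the first residue."""
    seq = seq.upper()
    return [i for i in range(0, len(seq), 3) if seq[i] != "G"]


def is_collagen_like(seq: str) -> bool:
    """True if the sequence has at least one full triplet and no
    glycine violations in the assumed register."""
    return len(seq) >= 3 and not glycine_repeat_violations(seq)


# Hypothetical test peptides (O = hydroxyproline).
ideal = "GPO" * 4          # four perfect Gly-Pro-Hyp triplets
mutant = "GPOGPOAPOGPO"    # glycine replaced by alanine in the third triplet

print(glycine_repeat_violations(ideal))   # []
print(glycine_repeat_violations(mutant))  # [6]
```

A real analysis would also need to handle interruptions of the triple helix, which occur in several collagen classes, so this check is only a first-pass filter.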
Bacterial surface proteins contribute to pathogenic processes and play a critical role in mediating adhesion to host cells and tissues, enabling colonization, invasion, and biofilm formation (Foster et al., 2014; Foster, 2019). In addition, binding of bacterial surface proteins to host ligands can lead to evasion of the host defense systems (Foster et al., 2014; Foster, 2019). In Gram-positive bacteria, different classes of surface proteins exist: (1) lipoproteins, (2) proteins covalently anchored to the cell wall, (3) pilus proteins, (4) non-covalently surface-associated proteins, and (5) transmembrane proteins (Desvaux et al., 2006; Fischetti, 2019). Lipoproteins are proteins covalently attached to membrane lipids via their N-terminus (Desvaux et al., 2006). Cell wall anchored proteins and pilus proteins are anchored to the cell wall by the action of enzymes called sortases (Desvaux et al., 2006). Sortases covalently link proteins to the peptidoglycan through a transpeptidase reaction, and can also assemble surface pili and anchor them to the peptidoglycan layer (Ton-That and Schneewind, 2004; Desvaux et al., 2006; Fischetti, 2019). Lastly, non-covalently surface-associated proteins contain cell wall binding domains (Desvaux et al., 2006; Fischetti, 2019).
Bacterial surface proteins are modular multi-domain proteins that can often be grouped into families based on their structural similarities. Multiple structurally related families of proteins have been identified in the literature (Waldemarsson et al., 2006; Foster et al., 2014; Frost et al., 2017; Foster, 2019; Taglialegna et al., 2020). Notable examples of structural families in Gram-positive bacteria include the MSCRAMMs (microbial surface components recognizing adhesive matrix molecules) (Foster et al., 2014; Foster, 2019), serine-rich repeat proteins (Lizcano et al., 2012), and M-proteins (Fischetti, 2016). In this review, we will describe collagen binding proteins present on the surface of Gram-positive human pathogens. This review will focus on structural families where more than one protein with structural similarity has been reported to bind collagen directly. Some proteins reported in the literature use fibronectin as a bridging molecule to bind collagen, e.g., streptococcal fibronectin binding protein 1 (SfbI) of Streptococcus pyogenes (Dinkla et al., 2003a), and are not covered here.

TYPES OF COLLAGEN
Collagens can be divided into different categories, which include fibrillar collagens, network forming collagens, FACITs (fibril-associated collagens with interrupted triple helices), MACITs (membrane-associated collagens with interrupted triple helices), anchoring fibrils, beaded-filament-forming collagens, and multiplexins (multiple triple-helix domains and interruptions) (Ricard-Blum, 2011; Theocharis et al., 2016). These major classes of collagen will be discussed briefly below. Collagen structure, chain composition, tissue distribution and functions are listed in Table 1.
Network forming collagens include collagen types IV, VIII, and X, with collagen type IV being the archetype (Theocharis et al., 2016). Collagen type IV is found in the basement membrane along with other molecules such as laminin (Hohenester and Yurchenco, 2013; Theocharis et al., 2016). Unlike fibrillar collagens, the non-collagenous domains of these molecules are not cleaved and are utilized to form tail-to-tail interactions with other non-collagenous domains of collagen (Sundaramoorthy et al., 2002). Stabilizing tetramers are also formed via N-terminal head-to-head interactions. Once a mature network is formed, these collagens work to support the surrounding epithelial cell layer (Kadler et al., 2007).
Membrane-associated collagens with interrupted triple helices are transmembrane proteins and contain a short N-terminal cytoplasmic tail, a transmembrane helix, and a collagenous C-terminal extracellular domain. These collagens can act as cellular receptors that facilitate cell adhesion and, upon cleavage, as soluble collagens in the ECM (Ricard-Blum, 2011; Theocharis et al., 2016). Examples of MACITs include collagen types XIII, XXIII, and XXV, which are expressed by several cell types (Kadler et al., 2007; Theocharis et al., 2016).
Beaded filament collagens include collagen types VI, XXVI, and XXVIII with type VI being the most studied (Theocharis et al., 2016). Once these collagens are secreted from the cell, they arrange in an anti-parallel fashion to form dimers. Dimers then form tetramers through interactions with other dimers. Next, tetramers connect by their globular domains to form filaments where globular domains appear as beads (Kadler et al., 2007; Theocharis et al., 2016). Beaded filament collagens are found in various connective tissues, e.g., cartilage, bone, and tendon (Fitzgerald et al., 2013).
Multiplexins include collagen types XV and XVIII and have not been studied extensively (Theocharis et al., 2016). They are localized to vascular and epithelial basement membranes and participate in bridging other collagens to underlying structures (Theocharis et al., 2016).
The defense collagens contain an N-terminal segment, a collagen-like region and a globular recognition domain that recognizes pathogen-associated molecular patterns and danger-associated molecular patterns (Fraser and Tenner, 2008; Casals et al., 2018, 2019). These proteins form multimeric structures and play an important role in pathogen clearance (Fraser and Tenner, 2008; Casals et al., 2018, 2019). The collagen-like regions of defense collagens vary in length and contain G-X-X' repeats where X is often a proline, and X' is often a hydroxylysine or a hydroxyproline (Casals et al., 2019). The collagen-like domains in human defense collagens serve two functions: (1) binding to associated proteases responsible for triggering the complement cascade and (2) binding cell receptors involved in clearance of pathogens and dead cells (Casals et al., 2019).
With the exception of Acb, all CNA-like proteins are anchored directly to the cell wall. Acb is unique in that it is a minor pilus protein of S. gallolyticus (Sillanpää et al., 2009) but has a predicted CNA-like structure. Furthermore, it shares 50-70% sequence identity with Acm, Cna, and Cne (Sillanpää et al., 2009).

Structure
Collagen adhesin (CNA) is the prototype of the collagen-binding MSCRAMMs (Foster et al., 2014; Foster, 2019). CNA-like proteins harbor an N-terminal signal sequence, an A-region, a variable number of characteristic B repeats, a C-terminal cell wall and membrane spanning region and a short cytoplasmic tail (Figure 1). The ligand-binding A-region of CNA-like proteins is further divided into two or three sub-domains: N1, N2, and N3 (Figure 1; Patti et al., 1993; Zong et al., 2005). X-ray crystallography of the Cna and Ace N1N2 sub-domains revealed that these domains adopt IgG-like folds called DEv-IgG and are consequently composed mostly of β-sheets (Figure 2; Foster et al., 2014; Foster, 2019). The N1 and N2 domains are connected by a rather long (10 aa) hydrophobic linker region, which creates a hole of ∼15 Å between the two domains and provides flexibility in domain orientation (Figure 2; Zong et al., 2005). Additionally, proteins in the CNA-like MSCRAMM family have a variable number of B repeats depending upon the protein (Patti et al., 1994a; Kang et al., 2013). One B repeat is ∼180 aa long and is further divided into two ∼90 aa subdomains, D1 and D2. The D subdomains adopt an inverse IgG fold, and together the B repeats are thought to form a stalk projecting the ligand binding region away from the bacterial cell surface (Deivanayagam et al., 2000).

Binding Mechanism
The truncated N2 domain is the minimum collagen-binding region of CNA, although optimal binding is achieved by the N1N2 segment. The CNA N1N2 segment binds collagen type I with an affinity of 54 nM (Patti et al., 1993, 1995; Zong et al., 2005; Ross et al., 2012). Electron microscopy imaging of rCNA with collagen triple helix monomers revealed that CNA binds collagen at multiple sites, without any obvious preference for a "hot spot". Surface plasmon resonance (SPR) studies of rCNA31-344 with synthetic collagen peptides further confirmed its preference for a triple helical structure (Zhang et al., 2015). CNA binds preferentially to cleaved collagen in damaged or inflamed tissues (Madani et al., 2017). CNA-like proteins bind collagen by a "collagen hug" mechanism where the N1N2 segment "hugs" or wraps around the collagen triple helix molecule (Figure 2). A co-crystal structure of CNA in complex with the synthetic collagen peptide (GPO)4GPRGRT(GPO)4, where O is hydroxyproline, provided insights into the molecular basis of this model. The collagen hug binding mechanism is initiated when the collagen triple helix interacts with the shallow groove on the CNA N2 domain. This interaction is of low affinity and involves polar and hydrophobic residues (Zong et al., 2005). The initial interaction leads to structural rearrangements within the N1 domain that reposition N1 closer to the N2 domain, creating a "tunnel-like" structure. Finally, the C-terminal extension of the N2 domain undergoes structural changes and inserts into the N1 domain by β-strand complementation, thus forming a "latch" (Figure 2). The N1 domain of CNA interacts with the middle chain while the N2 domain interacts with the leading and trailing chains of the synthetic collagen peptide. The N1N2 linker region covers the collagen peptide and holds it in place (Zong et al., 2005; Liu et al., 2007).
FIGURE 2 | Collagen hug model. The N1N2 domains of CNA in both apo-form [PDB code: 2F68 (Zong et al., 2005)] and with collagen peptide [PDB code: 2F6A (Zong et al., 2005)] are shown. N2 and N1 domains are shown in dark blue and light teal color, respectively. CNA N1N2 domains are connected together by a linker shown in red. Collagen triple helix is shown in green. C-terminal extension of the N2 domain that forms a latch is shown as dark blue β-strand in N1 domain.
The two-step binding mechanism of CNA to collagen was confirmed by atomic force microscopy studies where a moderate force (∼250 pN) was observed for the initial hydrophobic interaction between collagen and the N2 domain of CNA (Herman-Bausier et al., 2016). After binding collagen, a strong force of ∼1.2 nN was observed for the full interaction. Although B-repeats of CNA do not bind collagen directly, they act as a spring and help withstand the high mechanical stress encountered in vivo (Herman-Bausier et al., 2016).
Although all members of the CNA-like MSCRAMM family appear to bind collagen by a collagen hug mechanism, the proteins show differences in affinity (Table 2) and mechanistic details because of structural variations. For example, CNA has a higher affinity for the collagen triple helix than Ace (Ross et al., 2012). In contrast to the two-step mechanism used by CNA, Ace binds collagen with rapid association and dissociation rates in a one-step binding mechanism (Rich et al., 1999; Ross et al., 2012).

Virulence
Most proteins in the CNA-like MSCRAMM sub-family have been shown to act as virulence factors in experimental bacterial infections. CNA-like proteins target collagen to enhance adhesion of the bacteria to host tissues in early and later stages of infection. For example, CNA is a critical virulence factor of S. aureus in experimental septic arthritis and osteomyelitis models and this role depends on its ability to bind collagen (Patti et al., 1994b; Elasri et al., 2002; Xu et al., 2004b). Although CNA is not required in the initial targeting of joints, it is critical for hematogenous spread of S. aureus leading to bone infections (Elasri et al., 2002). Additionally, more bacteria were isolated from joints of mice infected with collagen-binding cna+ bacterial strains than from those infected with non-collagen-binding strains. Most CNA-like proteins also bind to collagen present in vegetations observed in non-bacterial thrombotic endocarditis, thus leading to infective endocarditis (Hienz et al., 1996; Nallapareddy et al., 2008; Singh et al., 2010). Ace and Acm, the enterococcal CNA-like proteins, are important virulence factors in infective endocarditis (Nallapareddy et al., 2008; Singh et al., 2010). The ace deletion mutant of the E. faecalis OG1RF strain showed decreased colonization of heart valves in a mixed-infection rat endocarditis model compared to the wild type strain. Higher bacterial colony forming unit (CFU) counts were recovered from aortic valve vegetations at 4 h in mono endocarditis infection of rats with ace-expressing E. faecalis OG1RF compared to the ace deletion mutant, indicating a role in early colonization of heart valves (Singh et al., 2010). Similarly, significantly more wild type (WT) E. faecium TX0082 CFUs were recovered from rat vegetations after mixed endocarditis infection compared to the acm deletion mutant E. faecium TX6051. Furthermore, Acm was also shown to enhance early adherence to heart valves (Nallapareddy et al., 2008). On the other hand, CNA's ability to bind collagen is of limited significance in early stages of attachment to traumatized aortic valves, but like Acm and Ace (Nallapareddy et al., 2008), CNA does contribute to establishment of infection at a 24 h time point in both mono and mixed endocarditis infections of rats with S. aureus isolates (Hienz et al., 1996).
Cbm and Cnm are homologous S. mutans proteins with 78% identity in their collagen binding domains. The cnm gene is necessary and sufficient for primary human coronary artery endothelial cell invasion by S. mutans isolates as shown with cnm S. mutans clinical isolates as well as cnm+ Lactococcus lactis (Abranches et al., 2011; Freires et al., 2017). The cnm gene also permits invasion of other non-phagocytic cells like human gingival fibroblasts and human oral keratinocytes (Miller et al., 2015). In addition, cnm+ S. mutans OMZ175 and cnm+ L. lactis outcompeted cnm S. mutans OMZ175 and L. lactis, by 10- and 100-fold, respectively, in ex vivo bacterial adherence to aortic valve sections. Using a rabbit model of infective endocarditis, it was shown that cnm L. lactis mediated attachment to injured endocardium but not to the vegetations (Freires et al., 2017). Similar to Cnm-expressing strains, cbm+ S. mutans attaches to aortic valves and leads to larger vegetations on the impaired heart valve tissue compared to cbm− S. mutans.
In addition, collagen-binding proteins have been implicated in various other infections. For example, CNA has been implicated in the pathogenesis of S. aureus keratitis (Rhem et al., 2000) and orthopedic prosthesis infections (Montanaro et al., 1999). Similarly, Cnm has been implicated in S. mutans-associated cerebral hemorrhage (Tonomura et al., 2016) and colonization of dental pulp (Nomura et al., 2016).

M and M-Like Proteins
M-protein, described by Rebecca Lancefield almost a century ago (Lancefield, 1928), is a major cell wall-anchored protein and virulence factor present on the surface of Group A, B, and C streptococci (GAS, GBS, and GCS) (Dinkla et al., 2007; Barroso et al., 2009; Reissmann et al., 2012). There are ∼250 known M-protein types in GAS based on sequence variation in the first 50 amino acids of the protein. Variations in the M-protein lead to strain-specific immunity and, hence, M-proteins serve as a strain typing marker (Lancefield, 1928). M proteins have multiple functions, including inhibition of phagocytosis and binding to fibrinogen, collagen, complement, and other host proteins (Metzgar and Zampolli, 2011).

Structure
M proteins are multi-domain proteins that adopt an elongated α-helical structure and dimerize to form helical coiled-coil structures, a structural form also seen in mammalian proteins like tropomyosin and myosin (McNamara et al., 2008; Fischetti, 2016). M-protein fibrils are ∼500 Å long and coat the surface of Group A streptococcus (Phillips et al., 1981; Fischetti, 1989). When viewed by transmission electron microscopy, M-protein appears like "fuzz on a tennis ball" (Phillips et al., 1981). All M-proteins contain a signal peptide, a hypervariable region, less variable central domains and a highly conserved C-terminus (Figure 1; Fischetti, 1989).
The prototypic M6 protein consists of a cleavable signal sequence, A repeats, which include the hypervariable region (HVR), B repeats, C repeats, a D-region and an LPXTG motif for sortase-mediated anchoring to the cell wall (Figure 1). The HVR comprises the first 50 amino acids of the mature M protein and varies amongst the different M-proteins. The M6 A-repeat region consists of five repeats of 14 amino acids each, where the central repeats are identical and the end repeats are slightly divergent (Smeesters et al., 2010; Fischetti, 2016). The B-repeat region contains five repeats, each 25 amino acids long (Fischetti, 1989, 2016). The M6 protein contains two C-repeats where each repeat is 35 residues long (Fischetti, 1989, 2016). C-repeats show higher sequence conservation compared to A- and B-repeats. Lastly, the M6 protein contains four D-repeats, each 7 amino acids long (Fischetti, 1989, 2016). Amongst the A, B, C, and D repeat regions, D-repeats show the highest sequence homology to each other for any M protein (Smeesters et al., 2010). Together, A-, B-, C-, and D-repeats form the central helical rod (Fischetti, 1989, 2016, 2019; McNamara et al., 2008).
As observed in tropomyosin and myosin, the coiled-coil nature of a protein molecule comes from heptad repeats, where the first and fourth residues in the register are generally hydrophobic (Fischetti, 1989, 2016; McNamara et al., 2008). Hydrophobic residues form the core of the coiled coil and the remaining residues in the heptad repeats are generally helix promoting (McNamara et al., 2008; Fischetti, 2016). Heptad repeats found in M-proteins are not perfect, which leads to irregularities and instabilities of the coiled-coil region (McNamara et al., 2008; Macheboeuf et al., 2011). McNamara et al. found that destabilizing residues in the coiled-coil region of M1 protein promote conformational dynamics, which are required for binding of M1 protein to fibrinogen (McNamara et al., 2008; Stewart et al., 2016). These irregularities in the heptad repeats also form the basis for sub-division of the protein into A-, B-, and C-repeats (Fischetti, 2019).
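The heptad rule described above (hydrophobic residues at the first and fourth positions of each seven-residue repeat) can be illustrated with a short scoring routine. This is a minimal sketch, not a coiled-coil predictor: the set of residues counted as hydrophobic, the function names, and the toy sequence are all assumptions made for illustration.

```python
# Residues treated as hydrophobic here; this set is an assumption,
# since hydrophobicity scales differ between authors.
HYDROPHOBIC = set("AVILMFWY")


def core_hydrophobic_fraction(seq: str, offset: int = 0) -> float:
    """Fraction of heptad positions 1 and 4 (often called a and d)
    occupied by hydrophobic residues, for a register in which
    position 1 falls on index `offset`."""
    core = [res for i, res in enumerate(seq.upper())
            if (i - offset) % 7 in (0, 3)]
    if not core:
        return 0.0
    return sum(res in HYDROPHOBIC for res in core) / len(core)


def best_register(seq: str) -> int:
    """Return the offset (0-6) that maximizes the core hydrophobic
    fraction; ties resolve to the smallest offset."""
    return max(range(7), key=lambda off: core_hydrophobic_fraction(seq, off))


# Hypothetical idealized heptad, repeated three times.
toy = "LKELEDK" * 3
print(core_hydrophobic_fraction(toy))  # 1.0
print(best_register("K" + toy))        # 1
```

In a scan of this kind, the imperfect heptads of M-proteins would appear as stretches where no single register keeps the core fraction high, which is one way to picture the irregularities discussed above.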
Sequence and structural variations amongst M-proteins are common. Homologous recombination in M-protein leads to differences in the frequency and length of the repeats and an overall variation in size (Fischetti, 1989). As a result, A- and B-repeats are not present in all M-proteins and when present, their sizes can vary. However, all M-proteins contain C-repeats and their total number can vary from two to four (Smeesters et al., 2010). The sequence variations between M-proteins lead to functional differences and hence not all M-proteins possess all the functional capabilities described in the literature.

Binding Mechanism
Amongst the >250 known types of M-proteins, about 20 have been shown to bind collagen (Table 3). M-proteins bind directly to the triple helical regions of collagen (Nitsche et al., 2006; Barroso et al., 2009; Dinkla et al., 2009; Bober et al., 2011; Reissmann et al., 2012) with the exception of the M1 protein, which also interacts with the globular domain of collagen type VI (Bober et al., 2010). Rotary shadowing electron microscopy revealed that M3 protein binds collagen type IV at two different sites: one located on cyanogen bromide fragment 3 (CB3) and the other at a site 20 nm away from the 7S domain (Eble et al., 1993). CB3 is a fragment of collagen type IV that maintains its triple helix and is generated after cleavage of collagen with cyanogen bromide (Eble et al., 1993). When expressed on the surface of a heterologous non-collagen binding host (Streptococcus gordonii GP1221), M-proteins from GCS and Group G streptococci (GGS) enabled GP1221 to bind to collagen type IV at the same level as GCS and GGS (Barroso et al., 2009).
Peptide associated with rheumatic fever (PARF) is an eight-residue motif present in the hypervariable A region of some M- and M-like proteins (Dinkla et al., 2007; Barroso et al., 2009; Reissmann et al., 2012). Based on careful examination of multiple M-proteins from 69 isolates, a consensus sequence of the PARF motif was determined to be (A/T/E)XYLXX(L/F)N where charged amino acids are preferred at positions 2, 5, and 6, with at least one of the charged amino acids containing a basic side chain (Barroso et al., 2009; Reissmann et al., 2012). A PARF motif is required for binding of these M-proteins to collagen (Dinkla et al., 2007; Reissmann et al., 2012), as one or two substitutions of the conserved residues in the PARF motif abolish binding to collagen type IV (Reissmann et al., 2012). However, additional data suggest that the binding of M-proteins to collagen can be more complicated and extends beyond the PARF motif. First, a series of recombinant truncated PARF-containing versions of an M-protein bind collagen with significantly different affinities (Dinkla et al., 2007). A full-length recombinant M-protein of GGS called "fibrinogen-binding protein of G streptococci" (FOG) binds to collagen type IV with a KD of 6 nM, whereas a truncated FOG protein containing A- and B-repeats binds collagen type IV with a 24-fold higher KD, and a FOG protein containing the A-region only binds collagen type IV with a 200-fold higher KD compared to the full-length FOG protein (Dinkla et al., 2007). Similarly, a truncated recombinant FOG protein binds collagen type I with a 20-fold higher KD than the full-length recombinant FOG protein (Nitsche et al., 2006). Furthermore, Reissmann et al. (2012) identified M-proteins with PARF motifs that did not bind collagen type IV. Interestingly, M-proteins stG120.1, stG120.0, and stGM220 all contain the same PARF motif but only stG120.1 binds collagen type IV, while all three proteins bind fibrinogen. Moreover, the M1-protein lacks a PARF motif (Reissmann et al., 2012) but still binds to the triple helix of collagen types I and IV (Bober et al., 2011) and globular domains of collagen type VI (Bober et al., 2010).
M-proteins binding to different types of collagens can have different consequences. Binding of M-proteins to collagen type IV leads to aggregation of collagen on the surface of the bacteria (Dinkla et al., 2003b, 2007; Barroso et al., 2009), which is not observed with the interaction of collagen type I with M-protein (Barroso et al., 2009). Expression of M-protein on the surface of a heterologous host leads to collagen type IV aggregation, demonstrating that the M-protein alone is sufficient for collagen aggregation.
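The PARF consensus (A/T/E)XYLXX(L/F)N given above translates directly into a pattern search. The sketch below assumes one-letter amino acid sequences, treats D, E, K, R, and H as charged and K, R, and H as basic (a simplifying assumption), and uses a made-up peptide rather than a real M-protein sequence; the function and key names are hypothetical. As the text notes, a motif hit alone does not establish collagen binding.

```python
import re

# Consensus (A/T/E)XYLXX(L/F)N; the lookahead allows overlapping hits.
PARF_PATTERN = re.compile(r"(?=([ATE][A-Z]YL[A-Z]{2}[LF]N))")

# Simplifying assumptions about which residues count as charged or basic.
CHARGED = set("DEKRH")
BASIC = set("KRH")


def find_parf(seq: str) -> list:
    """Return one dict per PARF-like motif: 0-based start, the motif,
    how many of motif positions 2, 5, and 6 are charged, and whether
    at least one of those positions is basic."""
    hits = []
    for match in PARF_PATTERN.finditer(seq.upper()):
        motif = match.group(1)
        variable = [motif[1], motif[4], motif[5]]  # positions 2, 5, 6
        hits.append({
            "start": match.start(),
            "motif": motif,
            "charged_at_2_5_6": sum(res in CHARGED for res in variable),
            "has_basic": any(res in BASIC for res in variable),
        })
    return hits


# Hypothetical peptide for illustration only.
print(find_parf("MKAEYLKRLNQ"))
# [{'start': 2, 'motif': 'AEYLKRLN', 'charged_at_2_5_6': 3, 'has_basic': True}]
```

Reporting the charge preference separately from the match keeps the strict consensus and the softer preference at positions 2, 5, and 6 apart, since the source describes the latter as a preference rather than a requirement.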

Virulence
M or M-like proteins are major virulence factors of streptococci and their roles in streptococcal pathogenesis have been reported on extensively (Oehmcke et al., 2010; Smeesters et al., 2010; Frost et al., 2017; Fischetti, 2019). In this review article, we will focus on the contribution of the M-protein:collagen interaction to the pathogenesis of streptococci. Binding of M-proteins to collagen can have two consequences: (1) mediating bacterial adhesion to connective tissues and (2) inducing collagen auto-immunity.
M-protein binding to collagen is important in the colonization of human skin by streptococci (Nitsche et al., 2006). When incubated with human dermis ex vivo, higher CFU counts were recovered from a GGS strain expressing FOG protein compared to a FOG-deficient strain. Incubation of the bacteria with collagen type I decreased adherence of the FOG expressing strain to human dermis, thereby also suggesting that the interaction of FOG with collagen type I enables adhesion.
Acute rheumatic fever (ARF) and rheumatic heart disease are antibody-mediated autoimmune sequelae that can develop after a streptococcal infection (Tandon et al., 2013; Carapetis et al., 2016). Binding of M-protein to collagen has been shown to be a relevant factor in developing ARF (Dinkla et al., 2003b, 2007; Barroso et al., 2009). Binding of M or M-like protein to collagen type IV can lead to production of antibodies binding the collagen molecule (Dinkla et al., 2003b, 2007; Barroso et al., 2009). Analysis of mouse sera obtained from immunization with recombinant M or M-like protein led to identification of two distinct antibody populations: anti-collagen type IV antibodies and anti-M protein antibodies. These distinct antibodies did not cross-react with each other (Dinkla et al., 2007), indicating that collagen type IV autoimmunity was not generated through molecular mimicry. In addition, sera of ARF patients contain antibodies that specifically recognize the CB3 region of collagen type IV and the collagen-binding region of the M3 protein (Dinkla et al., 2007). The N-terminal half of the protein containing the PARF motif is required for generating autoimmunity (Dinkla et al., 2007). Immunization of mice with full-length FOG led to a significantly higher titer of anti-collagen type IV antibodies compared to mice immunized with FOGB2-C2, a region of FOG that does not bind collagen (Dinkla et al., 2007). Similar results have been obtained with other M-proteins (Dinkla et al., 2007; Barroso et al., 2009). While auto-antibodies to collagen type I have not been demonstrated, given the structural similarities between the collagens, anti-collagen type IV antibodies potentially could also react with other collagen types.

EMERGING FAMILIES OF COLLAGEN-BINDING PROTEINS
Numerous collagen-binding proteins of Gram-positive pathogens have been reported in the literature but their mechanisms of collagen binding are unclear. We have identified three emerging families of collagen-binding proteins where, although one or more family members bind to collagen, a clear picture of how these proteins bind to collagen is not yet available.
von Willebrand Factor A-Domain Containing Proteins
von Willebrand factor (vWF) is a host glycoprotein found in blood, blood vessel ECM, and platelet α-granules. vWF is a large modular protein that contains two binding sites for collagen located in the A1 and A3 domains. The A3 domain of vWF binds collagen types I and III whereas the A1 domain binds collagen types IV and VI. Crystal structures of both A1 and A3 domains show a central β-sheet composed of six β-strands and flanked on both sides by α-helices (Huizinga et al., 1997; Emsley et al., 1998). These domains are structurally similar to the I-domain of some integrin α-chains, including the collagen-binding α1-, α2-, α10-, and α11-chains. The collagen-binding α-chain integrins also contain a metal ion-dependent adhesion site (MIDAS) important for ligand binding (Lee et al., 1995; Huizinga et al., 1997; Emsley et al., 1998).
Structural homologs of vWF A-domains, called vWA domains, have been found in minor pilus proteins that bind to ECM proteins and host cells. These pilus proteins include RrgA from Streptococcus pneumoniae (Izore et al., 2010), GBS104 from Streptococcus agalactiae (Krishnan et al., 2013), PilA from S. agalactiae (Konto-Ghiorghi et al., 2009; Banerjee et al., 2011), SpaC from Corynebacterium diphtheriae (Mandlik et al., 2007), and EbpA from E. faecalis (Nielsen et al., 2012). Most structural information about bacterial vWA domains comes from crystal structures of the RrgA and the GBS104 proteins (Izore et al., 2010; Krishnan et al., 2013). RrgA and GBS104 are homologs that share 51% sequence identity with each other and have a similar domain organization (Krishnan et al., 2013). Both proteins contain an N-terminal signal sequence, four D domains named D1, D2, D3, and D4, and a C-terminal sorting signal (Figure 1). The primary sequence of both D1 and D2 domains is non-contiguous, and is divided into two regions, one present in the N-terminal half and the other present in the C-terminal half of the protein (Figure 1). The two regions fold back on each other to form the tertiary structure of the D1 and D2 domains. The D3 domain is inserted in between the two regions encoding the D1 and D2 domains and the D4 region is located distal to the C-terminal half of the D1-D2 domain (Figure 1). It is worth noting that while RrgA and GBS104 are structural homologs, other pilus proteins containing vWA domains like PilA have a different overall domain organization (Mandlik et al., 2007; Konto-Ghiorghi et al., 2009; Izore et al., 2010; Banerjee et al., 2011; Nielsen et al., 2012; Krishnan et al., 2013).
The D3 domains of both RrgA and GBS104 adopt a structure similar to the vWF A-domain. These D3 domains consist of a central β-sheet flanked by α-helices on both sides, as seen in the vWF A-domain and the integrin I-domain (Figure 3A; Huizinga et al., 1997; Emsley et al., 1998; Izore et al., 2010; Krishnan et al., 2013). In addition, both RrgA and GBS104 have two arms inserted into the vWA-domain that are absent in the A-domains of vWF and the I-domain of integrins (Lee et al., 1995; Huizinga et al., 1997; Emsley et al., 1998; Izore et al., 2010; Krishnan et al., 2013). The first arm of RrgA contains two β-hairpins folded together to form an elongated arm (Figure 3A). The second arm of RrgA consists mostly of loops along with one short hairpin and two α-helices (Figure 3A; Izore et al., 2010; Krishnan et al., 2013). The two inserted arms project away from the core of the domain and extend the length of the protein (Figure 3A; Izore et al., 2010). The D3 domain of the two bacterial proteins also contains a MIDAS motif (Figure 3B) present in the I-domain of integrins but absent in the vWF A-domains. Amongst the pilus proteins with a vWA-domain, RrgA, PilA, and GBS104 have been reported to bind collagen. RrgA binds collagen type I, fibronectin, and laminin (Hilleringmann et al., 2008; Moschioni et al., 2010). However, the D3 vWA domain alone was not able to bind ECM proteins; the full-length RrgA protein is required for binding (Moschioni et al., 2010). RrgA binds to collagen type I with a weaker force than expected for a ligand:receptor interaction. It has been suggested that the low binding force might help the pilus adhere and detach under physiological flow conditions. However, kinetic data for RrgA and collagen type I are lacking, and the suggested consequences of low binding force await elucidation (Becke, 2019). Although recombinant PilA has been reported to bind collagen, its role in S. agalactiae collagen binding is not clear (Banerjee et al., 2011; Dramsi et al., 2012). Similarly, GBS104 has been reported to bind collagen, but the interaction in a solid-phase binding assay is weak and does not reach saturation, indicating that the interaction of GBS104 and collagen type I may not be specific or have functional relevance (Krishnan et al., 2013).
Two different binding regions in the vWA domain-containing pilus proteins have been proposed: the vWA-domain with the MIDAS motif and the U-shaped cradle formed by the inserted arms (Figure 3B; Izore et al., 2010). The apo crystal structure of the vWA-domain with the MIDAS motif revealed a trench-like region formed by the two inserted arms and the MIDAS motif present on the central β-sheet (Izore et al., 2010; Krishnan et al., 2013). Based on structural comparison with co-crystals of integrin α2β1 and a synthetic triple helix peptide, the trench-like region has been proposed to be the collagen binding site (Figure 3B; Emsley et al., 2000; Izore et al., 2010). The vWA-domain and the integrin I-domain undergo a conformational change during binding events, transitioning from a closed form to an open form. Participation of the trench-like region and a change in conformation upon ECM binding were confirmed using an open form of the GBS104-D3 domain stabilized by a disulfide bridge. The open form of the GBS104-D3 domain alone was sufficient for binding to fibronectin, whereas the closed form of the D3 domain showed no binding (Krishnan et al., 2013). The vWA-domains of pilus proteins vary considerably in their primary sequence, with the most variation in the inserted arms (Konto-Ghiorghi et al., 2009; Izore et al., 2010; Krishnan et al., 2013). Therefore, despite structural similarities, these pilus proteins have been suggested to bind different ligands with different affinities (Izore et al., 2010; Krishnan et al., 2013). A second binding site is the U-shaped cradle formed by the two inserted arms joining together at the tip of the protein (Figure 3B). This cradle contains basic residues and has been proposed to bind negatively charged molecules like glycosaminoglycans attached to ECM proteins (Izore et al., 2010). While the vWA domain is critical for virulence (Konto-Ghiorghi et al., 2009; Nielsen et al., 2012), evidence that the vWA domain of RrgA is responsible for binding to collagen is lacking.

Leucine Rich Repeat Containing Proteins
Leucine rich repeats (LRRs) are protein recognition motifs present in eukaryotic proteins with diverse functions (Kobe and Kajava, 2001). Small leucine rich proteoglycans (SLRPs) in mammals are an example of LRR proteins and play important roles in collagen fibrillogenesis (Kalamajski and Oldberg, 2010). LRR containing proteins have been found in some pathogenic bacteria (e.g., Yersinia pestis and Listeria monocytogenes) as well as in plants, animals, and fungi (Kobe and Kajava, 2001). Each repeat is 20-29 aa long, and multiple repeats are often present in tandem, forming an overall curved shape in which β-sheets lie on the concave side and α-helices are often on the convex side (Kobe and Kajava, 2001).
The extracellular matrix protein (Emp) of S. aureus is a 340 aa long secreted protein with a 26 aa long signal peptide at the N-terminus. Emp binds collagen type I with a K_D of 27 nM. Emp is structurally intriguing, as it is not predicted to be multi-domained. When viewed through a transmission electron microscope, the Emp monomer was revealed to form a horseshoe-type structure with an 8 nm diameter. Interestingly, even though it lacks leucine repeats, structure prediction through I-TASSER identified leucine rich repeat proteins as the top ten structural analogs (Geraci et al., 2017).
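To put the reported affinity in perspective, the following is a minimal worked sketch that assumes a simple reversible 1:1 binding model. That model is an assumption made here for illustration only; the binding stoichiometry and model are not stated in this text. The symbol θ (fraction of collagen binding sites occupied) and the free Emp concentration [Emp] are likewise introduced here for illustration.

```latex
\theta = \frac{[\mathrm{Emp}]}{K_D + [\mathrm{Emp}]}, \qquad K_D = 27\ \mathrm{nM}
\quad\Longrightarrow\quad
\theta\,\big|_{[\mathrm{Emp}] = 27\ \mathrm{nM}} = \tfrac{1}{2}, \qquad
\theta\,\big|_{[\mathrm{Emp}] = 270\ \mathrm{nM}} = \tfrac{10}{11} \approx 0.91
```

Under this assumption, half of the sites would be occupied when the free Emp concentration equals the dissociation constant, and a tenfold excess over it would occupy roughly 91% of the sites.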
While molecular details of Slr and Emp binding to collagen have not been studied beyond the confirmation of their interaction, their intriguing overall structural similarities point toward an emerging LRR-containing or LRR-like protein family that binds collagen. Given that several human LRR proteins [e.g., decorin (Schönherr et al., 1995), fibromodulin (Font et al., 1998)] interact with collagen, it is not surprising that bacterial LRR proteins bind collagen. Additional studies are needed to determine the residues that mediate the interaction and to compare these interactions with those between host LRR proteins and collagen.

Sgo0707 N1-Domain Containing Proteins
Streptococci express multiple surface proteins that have been reported to bind collagen. One emerging family of collagen-binding proteins in Streptococci is related to the N1-domain of the Sgo0707 protein from S. gordonii. The Sgo0707 protein, which has been shown to bind collagen, contains an N-terminal signal sequence, a 419 aa long N-terminal region, eight repeats of 84 aa, five repeats of 88 aa, a unique domain, an LPXTG cell wall sorting signal and a transmembrane helix (Figure 1; Nylander et al., 2013). The 419 aa long N-terminal region is divided into two domains: N1 and N2 (Figure 1). Both the N1 and N2 domains adopt a β-sandwich with anti-parallel β-sheets (Nylander et al., 2013), where β-sheet 1 contains nine β-strands and β-sheet 2 contains eight strands (Figure 4A). The N1 domain also contains two small subdomains A and B. The N2 domain consists of two β-sheets of five β-strands and a third small sheet of three strands and adopts a DeV-IgG fold also observed in the N1N2 domains of CNA (Nylander et al., 2013).
A search for proteins with similar N1 domains identified the variable domains in two Ag I/II family proteins: SpaP from S. mutans and SspB from S. gordonii. Both these domains are predicted to adopt a similar structure despite having only 10% sequence identity to N1 of Sgo0707 (Forsgren et al., 2009; Larson et al., 2010; Nylander et al., 2013). These proteins have a different domain organization than Sgo0707, with an N-terminal signal sequence, alanine rich repeats, a variable domain, proline rich repeats, a C-terminal domain, and an LPXTG cell wall sorting signal (Forsgren et al., 2009; Larson et al., 2010). All three proteins form an extended conformation with the putative collagen binding domain (N-region of Sgo0707 and variable domain of SpaP and SspB) predicted to be located at the tip of the protein (Forsgren et al., 2009; Larson et al., 2010; Nylander et al., 2013).
Docking of the collagen triple helix to the Sgo0707 N1N2 domain identified two different potential binding sites (Figures 4B,C). The first binding site is on top of the N1 domain in the open cleft formed by the two subdomains in the N1 domain (Figure 4B). This site has a higher negative surface potential compared to the SspB and SpaP proteins, and lacks the metal ion located in the cleft found in both of the Ag I/II proteins (Forsgren et al., 2009; Larson et al., 2010; Nylander et al., 2013). A second putative collagen-binding site is formed by the loops of the N1 domain and a β-sheet of the N2 domain, which together form a concave surface where collagen can dock (Figure 4C; Nylander et al., 2013). The concave site consists of mostly non-polar residues (Nylander et al., 2013).
All three proteins (Sgo0707, SspB, and SpaP) have been implicated in collagen binding. Binding of the three proteins to collagen type I was shown in a bacterial adhesion assay using deletion mutants (Love et al., 1997, 2000; Nylander et al., 2013). Deletion of the sgo0707 gene in S. gordonii DL1 decreased collagen type I binding by 40% compared to the WT strain (Nylander et al., 2013). Similarly, an isogenic deletion mutant of the sspB gene in S. gordonii and of the spaP gene in S. mutans showed decreased binding to collagen type I compared to WT strains (Love et al., 1997, 2000). Additionally, binding of S. gordonii DL1 to collagen type I in a bacterial adhesion assay was inhibited by recombinant N-region of Sgo0707, thus narrowing down the N-region as the collagen binding partner (Nylander et al., 2013). While the three proteins have been implicated in collagen binding, their direct binding to collagen has not been demonstrated. The three proteins only share structurally similar N1-domains as both SspB and SpaP lack the N2 domain found in Sgo0707 (Love et al., 1997, 2000; Nylander et al., 2013). Do the three proteins bind collagen at the cleft on top of the N1-domain? Further studies are required to narrow down the collagen-binding site in these proteins and to determine if they form a structural family of proteins that bind collagen.

CONCLUDING REMARKS
Gram-positive pathogens utilize their interactions with the ECM for tissue colonization and to establish infections in the host. Molecular insight into these interactions can pave the way for the design of novel anti-infectives. However, studies of collagen-binding proteins in Gram-positive pathogens are in their infancy and do not provide a complete picture of the different binding mechanisms involved. Further structural studies are required to fully understand the molecular basis for the interaction between bacterial collagen-binding proteins and the triple helix of collagen. In particular, the interaction of emerging collagen-binding protein families with collagen needs to be further characterized using biochemical and microbiological techniques to determine which family members bind collagen. Mammalian proteins containing a collagen-like region play a role in host defense. Collagen-binding host proteins, e.g., the LRR proteoglycan decorin, bind soluble host defense collagens (Krumdieck et al., 1992). Interaction of bacterial collagen-binding proteins with soluble defense collagens can provide an opportunity for pathogens to evade the host immune response. CNA binds collagen as well as C1q, a complement protein with a collagen-like region (Kang et al., 2013). The classical complement pathway is initiated upon recognition of pathogen-bound antibodies by the C1 complex, which consists of C1q, C1r, and C1s. The C1q protein contains the globular recognition domain and binds pathogen-bound antibodies. C1r and C1s are proteases that are required for the complement cascade. C1r and C1s bind the collagen-like stalk of C1q (Mortensen et al., 2017). CNA uses its interaction with C1q for immune evasion by interfering with the interaction between C1r and C1q and thus deactivating the C1 complex (Kang et al., 2013).
Interaction of collagen-binding bacterial proteins with other host proteins containing collagen-like regions, especially soluble human defense collagens, is an understudied area. While acknowledging that not all collagen-binding proteins will bind soluble defense collagens and vice versa, future studies focusing on the interaction between bacterial collagen-binding proteins and host defense collagens will lead to a better understanding of the pathogenic mechanisms utilized by Gram-positive bacteria.

AUTHOR CONTRIBUTIONS
SA and JG wrote the manuscript. SA and MH edited the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
This work was supported by NIH R01 grant AI020624-35.