Properties of Cavities in Biological Structures—A Survey of the Protein Data Bank

We performed a PDB-wide survey of proteins to assess their cavity content, using the SPACEBALL algorithm to calculate the cavity volumes. In addition, we determined the hydropathy character of the cavities. We demonstrate that the cavities of most proteins are hydrophilic, but smaller proteins tend to have cavities with hydrophobic walls. We propose criteria for distinguishing between cavities and pockets, and single out proteins with the largest cavities.


INTRODUCTION
Cavities appear in many biological structures (Andrews and Tata, 1971; Martin et al., 1991; Jin and Brennan, 2002; Hartl et al., 2011). Cavities are observed in single-domain proteins (Marion et al., 2007), in multimeric protein aggregates, in virus capsids (Zandi et al., 2004; Zlotnick, 2005; Michel et al., 2006; Robbins, 2010, 2013; Roos et al., 2010), and in still larger complexes, such as the ribosomes. Biological cavities may enclose space completely, as in the majority of icosahedral virus capsids. Usually, however, the closure is not complete since there are openings or connections to the outside solvent. This situation is encountered, e.g., in the Pathogenesis-Related class 10 proteins (PR-10) (Fernandes et al., 2013). In the ribosome, the opening is a part of the peptide exit channel.
It has been estimated that about 1% of structured proteins are endowed with cavities (Williams et al., 1994). The cavities may or may not be occupied by solvent molecules (Hubbard et al., 1994;Williams et al., 1994), and it is not clear what factors are responsible for that. It is known that in the case of the PR-10 proteins, the cavities serve as reservoirs for small-molecule ligands, but in general, the cavities may play many different roles. For instance, the ribosomal exit channel supports the formation of secondary structures in the nascent proteins, while viral cavities encapsulate, and may help to pack the genomic nucleic acid. The presence of a cavity in thermophilic proteins influences their stability. The stability is also affected by the character of the hydropathy of the internal cavity walls. Hydropathically neutral cavities are expected to prevent reversible protein unfolding, whereas hydrophobically lined cavities destabilize folded structures (Xue et al., 2019).
To gain insights into the properties and roles of protein cavities, we conducted two surveys: (1) of 24,280 single-chain protein structures from the CATH database (Dawson et al., 2017; Lewis et al., 2018) and (2) of all 160,233 structures released by the Protein Data Bank (PDB) (Berman et al., 2000) on February 9, 2020, with 148,516 of them corresponding to proteins without any admixture of nucleic acids. In the former case, we calculated the volume of the cavities within each single chain deposited in the CATH database, but if there were several chains of the same protein, we considered only the case with the largest cavity. In the latter case, we considered all kinds of possible cavities: within the component subunits as well as cavities created within the complete oligomeric structure. We discuss the CATH-based survey separately because these structures are of good quality. The CATH proteins constitute a subset of the full PDB set. The results for all PDB proteins can be found at our website: http://www.ifpan.edu.pl/chwastyk/spaceball. The objective of our studies is to gain an overview of the known protein structures from the point of view of their internal cavities.

FIGURE 1 | Explanation of the SPACEBALL algorithm used for the determination of the cavity position and volume in a two-dimensional cross-section of a protein with the PDB code 1u8e (green ball-and-stick model). The cavity is part of the white area, covered with the red squares (transparent or shaded). The blue circle represents a water probe. The protein is placed in a cuboid box that is divided into a grid (thin blue lines). The probe is placed at each grid point and we check if there is overlap between the probe and protein atoms. The grid points without any overlap are counted as belonging to the cavity (shaded red squares). We also count the transparent red squares which are encompassed by the probe sitting on the allowed grid points, even if the probe placed on the transparent square would overlap with the protein atoms. The blue squares indicate the lattice points where the probe was placed to define an exterior of the protein. The squares shaded in blue indicate the grid points of the lattice where the probe can be placed without any overlap with the protein atoms. The transparent blue squares indicate the grid points that are encompassed by the probe when it is placed on the allowed blue grid points. The transparent blue squares mark the outer surface of the protein.
Our survey is focused on identifying structures with the largest cavities, and on determining the hydropathy levels of the cavities. There are many hydropathy scales available (Palliser and Parry, 2001; Kapcha and Rossky, 2014). We have chosen the scale constructed by Kyte and Doolittle (1982) as it seems to be the most widely used. The specific hydropathy values that we derive for the cavities are expected to depend on the choice of the scale. However, the relations among the calculated values and the resulting trends are expected not to be very sensitive to the choice. The cavities of the PR-10 proteins (Fernandes et al., 2013) are usually hydrophobic. This means that they can accommodate hydrophobic ligands as they are excluded from the hydrophilic cytosol. It is not clear, however, whether these proteins are typical or unusual in this respect. Here, we show that the PR-10 proteins represent, in fact, a minority as most protein cavities found by us are hydrophilic. The PR-10 proteins have been well-studied before so they may serve as a benchmark in our studies.

FIGURE 2 | (Top) Volume of the cavities $V_C$ as a function of the total volume $V_T$ of the corresponding proteins. The proteins with the most hydrophobic and the most hydrophilic cavities as well as some outlying structures are marked by their PDB codes. Ten most hydrophobic and ten most hydrophilic structures are marked in green and blue, respectively. Among the most hydrophobic structures there is one group of proteins at a similar position in the scatterplot; they have been grouped together and marked by an oval (hb1): 2jag (29.76 kDa), 2onk (303.31 kDa), 2nq2 (130.20 kDa), 2npj (44.66 kDa), 2b2h (42.09 kDa). Seven of the most hydrophilic structures and eight of the most hydrophobic structures with different folds are shown in the panels below. The protein structures are in green and the cavities are in red. Nine PR-10 proteins considered separately in this survey are also grouped together in one oval: 2bk0 (32.67 kDa), 2wql (66.84 kDa), 2flh (72.47 kDa), 1txc (34.29 kDa), 1tw0 (32.79 kDa), 1vjh (27.75 kDa), 1qmr (17.33 kDa), 1llt (17.39 kDa), 1xdf (33.85 kDa).

FIGURE 3 | Histogram of the values of the parameter $s = S_{CP}/S_C$ obtained in this survey. The $s$ parameter defines the degree of cavity closure. The mean (standard deviation) of this parameter is $\bar{s}(\sigma) = 0.36(0.05)$. Pockets correspond to $s < \bar{s} - 3\sigma$, which means $s < 0.21$. The inset shows a schematic representation of $S_{CP}$ (blue) and $S_C$ (blue combined with black). The protein with the most buried cavity is chain A of the iron (

Frontiers in Molecular Biosciences | www.frontiersin.org

MATERIALS AND METHODS
There are many programs and algorithms that allow one to detect, define, and calculate the geometrical parameters of cavities. We discuss them in Chwastyk et al. (2014). All of them have to address the problem of how to delineate a cavity from the external environment of the protein. The choice of the method affects the estimate of the volume of the cavity. In addition, one usually has to start with a visual identification of the location of a cavity. Thus, these methods are not fully objective. Several years ago, we proposed a more objective approach to the problem of cavity-volume determination (Chwastyk et al., 2014, 2016) by using an algorithm that we named SPACEBALL. We define the cavity as a region that is surrounded by atoms and into which no water molecule can enter when moving along a straight line from the outside. This definition holds for a static structure but is also valid for any conformation that arises through thermal fluctuations. The size of the cavity depends on the conformation. Thus, thermal averages of the cavity volume can be obtained by considering sets of conformations that correspond to a given temperature. In order to detect a cavity in a protein structure, we place the structure in a cuboid box with a regular grid of lattice points, as shown schematically in Figure 1. The default lattice constant is set at $a = 0.2$ Å. Each of the six walls of the box is the source of a "rain" of water molecules. The rain is modeled by a network of beads of radius $r_w = 1.42$ Å corresponding to the water molecules. When a water molecule moves in a given direction, it marks the grid points it has visited. It stops when the sphere with the radius $r_w$ overlaps with any of the spheres associated with the atoms of the considered biomolecular structure. The radii of the atomic spheres are taken as the van der Waals radii compiled in the classic book by Pauling (1960). All of the unmarked points define the interior of the structure.
In the next step, we put the water-molecule probe on the remaining (not visited) grid points, and check whether the probe overlaps with the molecular structure. If it does, we count such points as belonging to the structure. The total number of such points, when multiplied by $a^3$, determines the total volume of the structure, $V_T$. If there is no overlap, such a grid point is counted as belonging to the cavity. The total number of these points, multiplied by $a^3$, determines the total volume of the cavity, $V_C$. If the interior of the structure is divided into separate chambers, then the volume of the largest chamber is taken as representing the cavity volume of the structure.

FIGURE 4 | The representation of three main radii $R_1$, $R_2$, $R_3$ (marked by the black arrows) associated with the eigenvalues of the tensor of inertia of the cavity shell of the 1kmp (90.28 kDa) protein. The blue and green dots represent the hydrophilic and hydrophobic amino acids, respectively. The yellow arrow represents the hydropathy vector, $\vec{h}$. The gray lines indicate the calculation box and offer a perspective.
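The grid procedure described in this section can be illustrated with a minimal Python sketch. It is not the SPACEBALL implementation: the function name `cavity_volume` and its arguments are hypothetical, the atomic radii must be supplied by the caller, and several steps of the text are deliberately left out (averaging over rotations, adding grid points swept by the probe sphere, and selecting the largest chamber). The sketch only marks grid points reached by the axis-aligned "rain" and then splits the unvisited points into cavity and structure.

```python
import numpy as np

def cavity_volume(centers, radii, a=0.6, r_w=1.42):
    """Simplified rain-on-a-grid estimate of (V_C, V_T) in cubic angstroms.

    centers: (N, 3) atom coordinates; radii: (N,) atomic radii (assumed inputs).
    a: lattice constant; r_w: probe radius (values quoted in the text).
    """
    centers = np.asarray(centers, dtype=float)
    radii = np.asarray(radii, dtype=float)
    pad = radii.max() + 2.0 * r_w            # leave free space around the structure
    lo = centers.min(axis=0) - pad
    hi = centers.max(axis=0) + pad
    axes = [np.arange(l, h + a, a) for l, h in zip(lo, hi)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)

    # A grid point is blocked if a probe centered there overlaps any atom.
    blocked = np.zeros(grid.shape[:3], dtype=bool)
    for c, r in zip(centers, radii):
        blocked |= ((grid - c) ** 2).sum(axis=-1) < (r + r_w) ** 2

    # Rain from the six walls: walk along each axis from both ends and
    # mark points as visited until the first blocked point is met.
    visited = np.zeros_like(blocked)
    for axis in range(3):
        b = np.moveaxis(blocked, axis, 0)
        v = np.moveaxis(visited, axis, 0)    # view: updates `visited` in place
        v |= np.cumsum(b, axis=0) == 0
        v |= np.cumsum(b[::-1], axis=0)[::-1] == 0

    interior = ~visited
    n_cavity = np.count_nonzero(interior & ~blocked)    # probe fits: cavity
    n_structure = np.count_nonzero(interior & blocked)  # probe overlaps: structure
    return n_cavity * a**3, n_structure * a**3
```

For a closed shell of atoms the first returned value approximates the volume accessible to the probe center inside the shell, while for a single atom it is zero because every free grid point is reached by the rain.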
The accuracy of the results depends on the selection of the lattice constant. Theoretically, the smaller the value of $a$, the more accurate the results but also the lower the efficiency of the calculations. In our previous studies (Chwastyk et al., 2014, 2016), we chose $a = 0.2$ Å. Nevertheless, we found that such a small value of the lattice constant is not optimal in the context of a large-scale survey. Instead, in the present work, we use $a = 0.6$ Å. This value is still smaller than the probe radius $r_w = 1.42$ Å, so the final result is still correct though somewhat less precise. Previously, we have also shown (Chwastyk et al., 2014, 2016) that to obtain accurate results it is necessary to average the results over a number of rotations of the macromolecule within the box. Instead of the 25 rotations recommended before, we now implement five random rotations for each structure. We found that this approximation is sufficient for the purpose of the present surveys. Amino acids that are considered as belonging to the cavity shell were selected by calculating the distances between the grid points that define the cavity surface and the surrounding amino acids. The area of the cavity surface was calculated by using the SPACEBALL algorithm but this time for the pseudo-structure created by water molecules placed on the grid points defining the cavity. This allowed us to select the points on the cavity surface. The amino acid at the smallest distance in a given direction was considered as a part of the cavity shell. This procedure used the Python MDAnalysis package (Michaud-Agrawal et al., 2011; Gowers et al., 2016). The grid points without any protein atoms along the line connecting the cavity with the outside of the protein are considered as entrances to the chamber.
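The averaging over random rotations mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `random_rotation` and `rotation_average` are hypothetical names, `volume_fn` stands for any user-supplied routine that maps coordinates to a volume, and drawing the rotation from a QR decomposition of a Gaussian matrix is our choice, not necessarily the one used in SPACEBALL.

```python
import numpy as np

def random_rotation(rng):
    """Random proper rotation matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix the sign ambiguity of the decomposition
    if np.linalg.det(q) < 0:         # exclude reflections
        q[:, 0] = -q[:, 0]
    return q

def rotation_average(volume_fn, coords, n_rot=5, seed=0):
    """Mean and standard deviation of volume_fn over n_rot random orientations."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    center = coords.mean(axis=0)
    values = []
    for _ in range(n_rot):
        rot = random_rotation(rng)
        rotated = (coords - center) @ rot.T + center   # rotate about the centroid
        values.append(volume_fn(rotated))
    return float(np.mean(values)), float(np.std(values))
```

The default `n_rot=5` mirrors the five rotations used in the present survey; a grid-based volume estimate depends on the orientation of the structure in the box, so the returned standard deviation gives a rough measure of that sensitivity.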
All of the results presented in this manuscript were obtained in the CATH-based survey. Only single protein chains were considered. Non-protein parts were removed.

Geometrical Properties of Cavities
For each of the analyzed proteins, we determined the position of the largest cavity, its volume $V_C$, and we identified the residues that form the cavity shell. Moreover, we calculated the total volume $V_T$ of the whole protein.

TABLE 1 | PDB ID, cavity volume $V_C$, surface of sites that are in the immediate contact with the protein $S_{CP}$, $s = S_{CP}/S_C$, where $S_C$ is the total surface of the cavity, parameter $w$, radius of gyration $R_g$ and hydrophobicity $H$ for 50 structures with the largest hydrophilic and hydrophobic cavities.

Usually, the boundary between a cavity and the external environment is not marked by protein atoms, but it is defined by the protein shape. This means that the cavity is open to at least some extent. To distinguish such cases from proper, fully enclosed cavities, we will refer to such formations as pockets. There is no rigorous definition of an internal molecular pocket in the literature. Based on our experience, we propose the following distinction between a pocket and a true cavity. We calculate the fraction $s = S_{CP}/S_C$, where $S_C$ is the total surface of the cavity, and $S_{CP}$ is the surface of sites that are in the immediate contact with the protein. This is illustrated in the inset of Figure 3: $S_{CP}$ is indicated in blue, and $S_C$ as a combination of blue and black. The black line corresponds to the closing cap of the cavity. The protein is shown in green and the cavity in red. For a Gaussian approximation of the distribution of the $s$ values calculated for all structures considered in our survey and presented in Figure 3, we obtained the mean value of $\bar{s} = 0.36$ and the standard deviation of $\sigma = 0.05$. We define a pocket as corresponding to the situation where $s < \bar{s} - 3\sigma$, i.e., when $s < 0.21$. This criterion means that most of the cavity is exposed to the solvent. The results presented in Figure 3 change our view of cavities in real proteins.
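The pocket criterion stated above amounts to a one-line threshold test. The following Python sketch uses the mean and standard deviation quoted in the text as defaults; the function names are hypothetical and serve only as an illustration.

```python
def closure_fraction(s_cp, s_c):
    """s = S_CP / S_C: fraction of the cavity surface in contact with the protein."""
    if s_c <= 0:
        raise ValueError("total cavity surface S_C must be positive")
    return s_cp / s_c

def pocket_threshold(mean=0.36, sigma=0.05, n_sigma=3.0):
    """Threshold mean - n_sigma * sigma; the defaults are the survey values."""
    return mean - n_sigma * sigma

def is_pocket(s, mean=0.36, sigma=0.05, n_sigma=3.0):
    """True if the formation is a pocket, i.e., s lies below the threshold."""
    return s < pocket_threshold(mean, sigma, n_sigma)
```

With the default values the threshold evaluates to 0.21, so a formation with 15% of its surface in contact with the protein would be classified as a pocket, whereas the cavity with $s = 0.94$ would not.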
In the literature (Benkaidali et al., 2013), cavities in proteins are defined as a space buried inside the protein, and connected to the outside environment by channels. Some cavities, however, arise very close to the outside protein surface, and are very well connected with the outside environment. This can be captured by the parameter $s$ described above, which is equal to 1 for a closed sphere, and is much smaller for fairly open cavities. We find that there are only thirteen proteins with cavities with $s > 0.85$ and similar folds. They correspond to the points shown at the top of Figure 2. The largest of them had $s = 0.94$, and corresponded to chain A of the iron(III) dicitrate transport protein FecA (PDB: 1kmp). Its structure corresponds to the top leftmost panel of the structures shown in the figure.
To describe the shape of a cavity, we introduce two parameters: $R_g$ and $w$. Here,

$$R_g = \sqrt{\frac{1}{N_C}\sum_{k=1}^{N_C} r_k^2},$$

where $N_C$ is the number of cavity-surface residues, i.e., the protein's amino acids which are in contact with the cavity, and $\vec{r}_k$ is their position vector with respect to the center of mass of these residues. The parameter $w$ that characterizes the nature of the shape of the cavity depends on all three main radii, $R_\alpha$, associated with the eigenvalues of the tensor of inertia (Foote and Raman, 2000) $D_\alpha$ characterizing the cavity wall: $R_\alpha = \sqrt{D_\alpha/N_C}$, as represented in Figure 4. $R_1$ is the smallest radius and $R_3$ the largest. The parameter $w$ is defined as

$$w = \frac{\Delta R}{\bar{R}},$$

where $\bar{R} = \frac{1}{2}(R_1 + R_3)$ and $\Delta R = R_2 - \bar{R}$. Spherical shapes correspond to $w$ being close to 0. The tensor of inertia is calculated using all atomic masses of residues belonging to the surface of the cavity. Elongated cigar-like shapes yield substantial positive values of $w$ because then $R_2$ is close to $R_3$ and $\Delta R \sim \frac{1}{2}(R_2 - R_1)$. Substantial negative values of $w$ indicate planar shapes as then $R_2 \sim R_1$ and $\Delta R \sim \frac{1}{2}(R_1 - R_3)$. The values of the geometrical parameters calculated for 50 structures with neutral, and the most hydrophobic or hydrophilic cavities are presented in Tables 1, 2.
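The two shape descriptors can be computed from the positions of the shell residues as in the sketch below. It rests on assumptions that should be read as such: `shape_parameters` is a hypothetical name, every site is given unit mass (the text uses atomic masses), and $w$ is taken as the ratio of $R_2 - \bar{R}$ to $\bar{R}$.

```python
import numpy as np

def shape_parameters(positions):
    """Return (R_g, w) for a set of shell positions, assuming unit masses."""
    pos = np.asarray(positions, dtype=float)
    n = len(pos)
    r = pos - pos.mean(axis=0)                 # coordinates relative to the centroid
    r_g = np.sqrt((r ** 2).sum() / n)          # radius of gyration

    # Tensor of inertia with unit masses: sum_k (|r_k|^2 * 1 - r_k r_k^T)
    inertia = (r ** 2).sum() * np.eye(3) - r.T @ r
    d = np.clip(np.linalg.eigvalsh(inertia), 0.0, None)   # ascending eigenvalues
    r1, r2, r3 = np.sqrt(d / n)                # main radii, R_1 <= R_2 <= R_3

    r_bar = 0.5 * (r1 + r3)
    w = (r2 - r_bar) / r_bar
    return r_g, w
```

In this convention, points arranged along a line give $w$ close to 1, points on a planar ring give a negative $w$, and an isotropic arrangement gives $w$ close to 0, in line with the interpretation given in the text.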
The results for all of the calculated structures can be found on our website at http://info.ifpan.edu.pl/chwastyk/spaceball.

Chemical Properties of Cavities
In the first approach, we calculated the degree of cavity hydrophobicity, $H$, in analogy to how it is done for whole proteins, except that now we consider only the residues that are on the surface (forming the wall) of the cavity. Specifically,

$$H = \frac{1}{N_{CP}}\sum_{i=1}^{N_{CP}} q_i,$$

where $q_i$ is the hydropathy index of residue $i$, and $N_{CP}$ is the total number of residues that create the shell. We used $q_i$ values as determined by Kyte and Doolittle (1982). Moreover, we define the hydropathy vector, $\vec{h}$, of a cavity shell similarly to Cieplak et al. (2014) but again by taking only the shell residues into account:

$$\vec{h} = \frac{1}{N_{CP}}\sum_{i=1}^{N_{CP}} q_i \vec{\delta}_i,$$

where $\vec{\delta}_i$ is the position vector of residue $i$ with respect to the center of mass of the cavity shell. The hydropathy vector calculated for the cavity shell of the 1kmp (90.28 kDa) protein is presented by the yellow arrow in Figure 4.

Figure 5 presents the results using a color code for cavity hydrophobicity. The scatterplots present the hydrophobicity of the cavities of all structures, taking into account the thickness of the cavity shell defined as $V_C/V_T$. We see that the proteins with the most hydrophilic cavities are those with the biggest cavities, which constitute their total interiors. Scatterplots that present explicitly the value of the cavity hydrophobicity as a function of the cavity volume or the volume of the whole protein are presented in Figure 6. We see in the scatterplot at the bottom that there are big proteins with neutral cavities [for example: 2p8n (

Figure 7 shows the homogeneity of the hydrophobicity of the considered cavities. The scatterplot shows the absolute value of the hydropathy vector $|\vec{h}|$ as a function of cavity hydrophobicity. As expected, the biggest values of the hydropathy vectors (indicating large hydropathy gradients across the cavity) are found mostly for proteins with hydrophilic cavities. This suggests that the strongly hydrophilic cavities are important for signal transduction (Harley et al., 1998).
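The cavity hydrophobicity $H$ and the hydropathy vector $\vec{h}$ defined in this section can be evaluated as in the following sketch. The dictionary holds the Kyte and Doolittle (1982) hydropathy indices; the function name `cavity_hydropathy` and the input format are hypothetical, and the center of the shell is taken as the unweighted mean of the residue positions, which is a simplifying assumption.

```python
# Kyte-Doolittle hydropathy indices (positive values are hydrophobic).
KYTE_DOOLITTLE = {
    "ILE": 4.5, "VAL": 4.2, "LEU": 3.8, "PHE": 2.8, "CYS": 2.5,
    "MET": 1.9, "ALA": 1.8, "GLY": -0.4, "THR": -0.7, "SER": -0.8,
    "TRP": -0.9, "TYR": -1.3, "PRO": -1.6, "HIS": -3.2, "GLU": -3.5,
    "GLN": -3.5, "ASP": -3.5, "ASN": -3.5, "LYS": -3.9, "ARG": -4.5,
}

def cavity_hydropathy(shell):
    """Return (H, h) for a cavity shell.

    shell: list of (residue_name, (x, y, z)) tuples for the shell residues.
    H is the mean hydropathy index; h is the list of three components of
    the hydropathy vector relative to the (unweighted) shell center.
    """
    n = len(shell)
    q = [KYTE_DOOLITTLE[name] for name, _ in shell]
    big_h = sum(q) / n
    center = [sum(pos[k] for _, pos in shell) / n for k in range(3)]
    h = [
        sum(q_i * (pos[k] - center[k]) for q_i, (_, pos) in zip(q, shell)) / n
        for k in range(3)
    ]
    return big_h, h
```

For example, a shell made of one isoleucine and one arginine on opposite sides of the cavity has $H = 0$ but a non-zero $\vec{h}$ pointing toward the hydrophobic residue, which illustrates why both quantities are reported.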

DISCUSSION
We start our analysis of the hydrophobicity of cavities in the examined proteins by considering the 10 most hydrophobic and the 10 most hydrophilic cavities. The selected proteins are listed in Tables 3, 4, respectively.
When considering the biochemical functions of the proteins with the most hydrophobic cavities, we can see that most of them are responsible for selective and non-covalent interaction between identical proteins (identical protein binding), with any proteins or complexes, even containing non-protein molecules (protein binding), with chloride ions (Cl⁻) (chloride ion binding), with any metal ion (metal ion binding), or with anions, charged atoms or groups of atoms with negative net charge (anion binding). The exceptions to this observation are the pre-protein translocase secY subunit from M. jannaschii and the ammonium transporter from A. fulgidus, which enable protein transfer across the cell membrane (protein transmembrane transporter activity) without a specific binding function. Proteins from this group are generally responsible for transport phenomena. For example, the ammonia channel protein from E. coli catalyzes the transport of a single molecular species across the membrane (uniporter activity). Such transport is independent of the movement of any other molecular species. Some proteins enable active transport of a solute across the membrane by a mechanism whereby two or more species are transported together in opposite directions in a tightly coupled process. Such a process does not have to be directly linked to a source of energy other than chemiosmotic energy (antiporter activity). A similar process where molecular species are transported in the same direction (symporter activity) is also enabled by one of the considered proteins, the proton glutamate symport protein from P. horikoshii.
The proteins considered here also enable the cross-membrane transfer of ammonium (ammonium transmembrane transporter activity), glutamate (glutamate:sodium symporter activity), the L-aspartate anion from aspartic acid (L-aspartate transmembrane transporter activity), other amino acids (amino acid:sodium symporter activity), chloride ions (chloride transmembrane transporter activity), and the transmembrane transfer of a chloride ion by a voltage-gated channel (voltage-gated chloride channel activity). From the biological point of view, the selected proteins with the most hydrophobic cavities are responsible for the transport of various structures, such as ions (nitrate, chloride, ammonium, etc.), carbon dioxide, inorganic anions or even amino acids or proteins from, to or between cells across the membrane. The ammonia channel protein is also responsible for processes that form an integrated mechanism by which a cell detects the depletion of the primary nitrogen source, usually ammonia, and then activates genes to scavenge the last traces of the primary nitrogen source and to transport and metabolize alternative nitrogen sources. The proteins from this group are embedded within the phospholipid bilayer. In summary, proteins with the most hydrophobic cavities are usually responsible for binding and molecular transport processes.

TABLES 3, 4 | The first column shows the protein name and its source organism (italicized); the second column lists the PDB codes of particular proteins, the resolution of the structure determination (in parentheses) and the total structure weight; the third column lists the degree of cavity hydrophobicity; the fourth, fifth, and sixth columns report the main biochemical function, the main biological function and the protein localization in the cell, respectively. The last column lists the references for the presented data.

FIGURE 9 | The most hydrophobic proteins are listed in Table 3 and are marked in green in Figures 2, 5-7. The most hydrophilic proteins (negative tail of the histogram) are listed in Table 4 and are marked in blue in Figures 2, 5-7. The histogram shows that most of the cavities in proteins deposited in the PDB are hydrophilic.
By inspecting the 10 proteins with the most hydrophilic cavities listed in Table 4, we infer that five of them are membrane components. The iron(III) dicitrate transport protein FecA from E. coli and the ferripyoverdine receptor from P. aeruginosa are signaling proteins. Moreover, from the biological point of view they are responsible for iron ion or siderophore transport. This means that they maintain iron ion homeostasis. Similarly, the vitamin B12 transporter BtuB from E. coli is responsible for ion and vitamin transmembrane transport. There are also six proteins that are responsible for catalytic processes. Two of them, the periplasmic trehalase from E. coli and sialidase A from Streptococcus, are also membrane proteins. From the biological point of view, they participate in catabolic and metabolic processes. The next two proteins, the phenol hydroxylase component from P. stutzeri and the O. aries (sheep) lactoperoxidase, are assigned to the extracellular region. They are necessary for oxidation-reduction processes. Moreover, the sheep lactoperoxidase plays a role in metal ion binding. The last two of the catalytic proteins, cellobiohydrolase from C. thermocellum and pectate lyase from B. subtilis, are responsible for metal ion binding, but their most important functions are the cellulase and pectate lyase activities; thus, they are responsible for catabolic processes. The last protein from this group is a protein from the adeno-associated virus. As a component of the viral capsid, this protein is different from the proteins described above, but it is still related to a membrane-like behavior because it is responsible for the permeabilization of the host organelle membrane and is then involved in the viral entry into the host cell.
Our results, although obtained with a lower accuracy, are comparable to the precise results of the cavity volume calculations for the PR-10 proteins presented in our previous work (Chwastyk et al., 2016). Our selection of the proteins with cavities and pockets differs from that of the study by Gao and Skolnick (2013) of structures deposited in the PDB, which is based on protein-ligand binding and structural comparison methods. We provide a new definition of a pocket which is more precise than just a "ligand binding site" (Gao and Skolnick, 2013). Moreover, we add information about the chemical properties of the pockets considered in that paper.
We emphasize that the results have been obtained from the analysis of single chains of various CATH proteins. We should point out that cavities often appear not only within single protein subunits but also within complete quaternary structures. One such example is the cross-linked human hemoglobin (HbA) presented in Figure 8. The full quaternary structure, with a central cavity measuring 7.634 ± 0.129 nm³, is composed of four chains, each containing cavities that are an order of magnitude smaller. A similar situation can be observed in more complex structures, like the capsid of the turnip yellow mosaic virus (TYMV), which is formed from three different protein subunits. None of them contains any cavity. The volume of the cavity within the virus capsid, however, is 6731.10 ± 99.12 nm³.

CONCLUSIONS
We conducted a survey of 24,280 protein structures from the CATH database. For each of the considered structures we calculated the net hydropathy index. The results are presented as a histogram in Figure 9. The most surprising result is that, unlike in the PR-10 proteins, most of the cavities are hydrophilic. Moreover, the largest cavities are also hydrophilic. On the other hand, the smallest cavities (in small proteins) are highly hydrophobic.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.