Do Bacterial “Virulence Factors” Always Increase Virulence? A Meta-Analysis of Pyoverdine Production in Pseudomonas aeruginosa As a Test Case

Bacterial traits that contribute to disease are termed “virulence factors” and there is much interest in therapeutic approaches that disrupt such traits. What remains less clear is whether a virulence factor identified as such in a particular context is also important in infections involving different host and pathogen types. Here, we address this question using a meta-analytic approach. We statistically analyzed the infection outcomes of 81 experiments associated with one well-studied virulence factor: pyoverdine, an iron-scavenging compound secreted by the opportunistic pathogen Pseudomonas aeruginosa. We found that this factor is consistently involved with virulence across different infection contexts. However, the magnitude of the effect of pyoverdine on virulence varied considerably. Moreover, its effect on virulence was relatively minor in many cases, suggesting that pyoverdine is not indispensable in infections. Our work supports theoretical models from ecology predicting that disease severity is multifactorial and context dependent, a fact that might complicate our efforts to identify the most important virulence factors. More generally, our study highlights how comparative approaches can be used to quantify the magnitude and general importance of virulence factors, key knowledge informing future anti-virulence treatment strategies.


INTRODUCTION
Understanding which bacterial characteristics contribute most to disease is a major area of research in microbiology and infection biology (Rahme et al., 1995; Jimenez et al., 2012; Lebeaux et al., 2014). Bacterial characteristics that reduce host health and/or survival are considered "virulence factors." Such factors include structural features like flagella and pili that facilitate attachment to host cells (Josenhans and Suerbaum, 2002; Kazmierczak et al., 2015), as well as secreted products like toxins and enzymes that degrade host tissue (Vasil et al., 1986; Lebrun et al., 2009), or siderophores that scavenge iron from the host (Miethke and Marahiel, 2007). Research on virulence factors has not only increased our fundamental understanding of the mechanisms underlying virulence, but has also identified potential novel targets for antibacterial therapy. There is indeed much current interest in developing "anti-virulence" drugs to disrupt virulence factor production, the idea being that by simply disarming pathogens rather than killing them outright, we could ostensibly elicit weaker selection for drug resistance (Clatworthy et al., 2007; Pepper, 2008; Rasko and Sperandio, 2010).
Although our understanding of different types of virulence factors and their interactions is continuously deepening, it is still unclear just how generalizable this assembled knowledge is. It is often assumed, for reasons of parsimony, that a given structure or secreted molecule central to the virulence of a particular bacterial strain in a specific host context will similarly enhance virulence in another bacterial strain, or in a different host (Dubern et al., 2015). Yet, ecological theory predicts that the effects of a given trait will frequently vary in response to the environment (Lambrechts et al., 2006). In the context of infections, this may be particularly true for opportunistic pathogens, which face very heterogeneous environments: they can live in environmental reservoirs (e.g., soil, household surfaces), as commensals of healthy hosts, or, when circumstances allow, as pathogens, causing serious infections in a range of different hosts and host tissues (Kurz et al., 2003;He et al., 2004;Calderone and Fonzi, 2001). Opportunistic pathogens underlie many hospital-acquired infections, especially in immune-compromised patients (Gaynes and Edwards, 2005;Obritsch et al., 2005;Länger and Kreipe, 2011), and the treatment of such infections is often challenging because, as generalists, such pathogens are pre-selected to be tenacious and highly adaptable. Thus, in designing new antivirulence drugs against opportunistic pathogens, we need to know not only whether the targeted trait is indeed associated with pathogenicity, but also the generality of this association across different pathogen strain backgrounds, host species, and infection types.
To address this issue, we introduce a meta-analysis approach, which allows us to quantify variation and overall effects of virulence factors across host environments. As a test case, we focus on pyoverdine, a siderophore secreted by the opportunistic pathogen Pseudomonas aeruginosa to scavenge iron from the host environment (Visca et al., 2007). Table 1 provides an overview of the workflow of our meta-analysis, where we combined the outcomes of 81 individual virulence experiments from 24 studies (Meyer et al., 1996;Takase et al., 2000;Xiong et al., 2000;Gallagher and Manoil, 2001;Ochsner et al., 2002;Silo-Suh et al., 2002;Salunkhe et al., 2005;Harrison et al., 2006;Attila et al., 2008;Papaioannou et al., 2009;Zaborin et al., 2009;Carter et al., 2010;Nadal Jimenez et al., 2010;Oliver, 2011;Romanowski et al., 2011;Feinbaum et al., 2012;Okuda et al., 2012;Imperi et al., 2013;Kirienko et al., 2013;Ross-Gillespie et al., 2014;Dubern et al., 2015;Lin et al., 2015;Lopez-Medina et al., 2015;Minandri et al., 2016, see also Tables S1, S2 in the Supplemental Material). Using a weighted meta-analysis approach, we were able to investigate the evidence for pyoverdine's contribution to virulence across eight host species, including vertebrates, invertebrates and plants, five tissue infection models and various P. aeruginosa genotypes. We chose pyoverdine production as the model trait for our analysis because: (i) it has been extensively studied across a range of Pseudomonas strains (Meyer et al., 1997); (ii) its virulence effects have been examined in a large number of host species; (iii) P. aeruginosa is one of the most troublesome opportunistic human pathogens, responsible for many multi-drug resistant nosocomial infections (Hauser and Rello, 2003;Hirsch and Tam, 2011); and (iv) multiple anti-virulence drugs have been proposed to target pyoverdine production and uptake (Kaneko et al., 2007;Imperi et al., 2013;Ross-Gillespie et al., 2014). 
The applied question here, then, is whether targeting pyoverdine could generally and effectively curb pathogenicity.

Literature Search
We conducted an extensive literature search, using a combination of two online databases: Web of Science and Google Scholar. The following terms were used to search abstracts and full texts: "aeruginosa" in combination with "pyoverdin" or "pyoverdine" and in combination with "virulence" or "infection" or "pathogen" or "disease" or "mortality" or "lethality." This search was first performed on May 19th 2014 and it was repeated periodically until Aug 1st 2016 in order to include more recent publications. In addition, the reference lists of all shortlisted studies were scanned for relevant publications. We further contacted the corresponding authors of several publications to ask for unpublished datasets. Ultimately, no unpublished datasets were included in the final meta-analysis (see Supplementary Tables S1 and S3 for details).

Inclusion Criteria
The database search yielded a total of 529 studies, and we identified 10 additional records through other sources. These 539 studies were then scanned for relevant content according to the following set of inclusion criteria. Studies were considered potentially eligible for inclusion if they contained original research, were written in English and provided data that compared the virulence of a wildtype pyoverdine-producing P. aeruginosa strain with that of a mutant strain demonstrating impaired pyoverdine production. We defined virulence as a decrease in host fitness, measured as an increase in mortality or tissue damage when infected with bacteria. We defined "wildtype strains" as strains that were originally clinical isolates, have been widely used in laboratories as virulent reference strains and have not been genetically modified. Strains with impaired pyoverdine production included strains that were completely deficient in pyoverdine production and strains that were only partially deficient, i.e., that produced less than the wildtype strain under identical experimental conditions. We considered both genetically engineered knock-out strains and clinical isolates with reduced pyoverdine production. Since pyoverdine is not directly encoded in the genome but is synthesized via non-ribosomal peptide synthesis (Visca et al., 2007), there are no studies that modified the functioning of pyoverdine through targeted mutagenesis. There were 32 original publications containing 120 experiments that satisfied our criteria and were thus considered appropriate for in-depth examination.
We screened these 120 experiments using a second set of rules to identify those experiments that contain comparable quantitative data, which is essential for a meta-analysis. Inclusion criteria were: (i) virulence was measured directly (and not inferred indirectly via genetic analysis); (ii) virulence was measured in vivo (and not in vitro via virulence factor production); (iii) virulence was measured quantitatively as direct damage to the host caused by bacterial infections, and not by indirect or qualitative measures such as bacterial growth performance in the host, threshold infective dose required to kill a host, the damage associated with virulence factor administration, or resistance to macrophage-like predation (Ryan et al., 2009; Jeukens et al., 2014; Kirienko et al., 2015); and (iv) absolute virulence data were presented (and not only data scaled relative to the wildtype without information on the absolute risk of mortality, since effect sizes cannot be calculated from such data). This second set of rules was fulfilled by 24 original publications containing 81 individual experiments (see Tables S1, S2 in the Supplemental Material). For an overview of the whole selection process, see Figure S1 in the Supplemental Material.

TABLE 1 | Overview of the workflow of our meta-analysis.
1. Prediction
Pyoverdine-defective mutants cause less virulence than wildtype strains.
2. Systematically search for relevant studies
We searched (see details in main text) for any reports of experiments featuring monoclonal infections of whole live host organisms with P. aeruginosa strains known to vary in pyoverdine phenotype, where virulence was quantified in terms of host mortality.
3. Extract and standardize effect sizes and their standard errors
For each case reporting host survival, we calculated the (log) ratio of mortality odds from pyoverdine-mutant infections vs. wildtype infections, i.e., the (log) odds-ratio.
4a. Check heterogeneity across studies
Our assembled effect sizes were more heterogeneous than expected from chance, even when we allowed that some of this variation could be due to random noise.
4b. Consider putative moderator variables (optional)
We tested for evidence of distinct sub-groups in our dataset, within which the effects might be more homogeneous. We identified four putative moderators and codified each study for the following: (i) host taxon; (ii) infection type; (iii) strain background; and (iv) level of pleiotropy expected, given the particular mutation(s) involved.
4c. Check for publication bias (optional)
We found that smaller/lower-powered studies were more likely to report large effect sizes in support of the hypothesis, whereas larger/higher-powered studies tended to report smaller effect sizes.
5. Derive mean effect size(s); quantify influence of moderator variables (if applicable)
Despite the steps taken (see above), our dataset still showed substantial heterogeneity. Estimates of mean effect sizes (in/across subgroups) and moderator coefficients should therefore be viewed as best approximations.

Data Extraction and Effect Size Calculations
From all of these 81 experiments, we extracted information on: (i) the host organism; (ii) the type of infection; (iii) the observation period of infected hosts; (iv) the identity of the control (wildtype) strain; (v) the identity of the pyoverdine-deficient strain; (vi) the mutated gene in the pyoverdine-defective strain; (vii) the mutation type (e.g., insertion/deletion); (viii) the sample size used for the wildtype and mutant experiments; and (ix) the relevant virulence measure (host survival or tissue damage) for wildtype and mutant strains (see Table S1 in the Supplemental Material).
This information was used to categorize the experiments and identify potentially important moderator variables (see below). Next, we extracted quantitative data from these experiments so we could calculate effect sizes for the virulence associated with pyoverdine production. For mortality assays, we extracted raw counts of how many individuals (i.e., whole animals or seedlings) died and how many survived following infection with a dose of P. aeruginosa wildtype or, alternatively, a mutant strain known to be deficient for pyoverdine production. For experiments on tissue damage, we extracted information on the number of individuals with and without the symptoms related to tissue damage (e.g., a lesion in an organ). In cases with zero counts (i.e., either all or none of the individuals in a particular treatment group died or experienced tissue damage), we converted counts to 0.5 to avoid having zero denominators in the subsequent calculations of the (log-odds ratio) effect sizes (Cox, 1970). In cases where data from multiple time-points or survival curves were available, we concentrated on the time point with the largest difference between the wildtype and the mutant infection.
Using this count data, we calculated the effect size for each experiment as the log-odds-ratio $\ln\left(\frac{m_{\text{virulent}} / m_{\text{non-virulent}}}{w_{\text{virulent}} / w_{\text{non-virulent}}}\right)$, where $m_{\text{virulent}}$ and $w_{\text{virulent}}$ are the number of individuals that died or experienced tissue damage when infected by the mutant and the wildtype strain, respectively, and $m_{\text{non-virulent}}$ and $w_{\text{non-virulent}}$ are the number of individuals that survived or remained unharmed by the infection. Information on the sample size was used to calculate the 95% confidence interval for each effect size and for weighting effect sizes relative to one another (see details below). Where experiments reported a range of sample sizes, we used the arithmetic mean. Some studies reported only a minimum sample size. In those cases, we used this number. For experiments using C. elegans, infections were often carried out on replicate petri dishes in a large number of individuals. In these cases, we used the total number of individual worms used in each treatment group as sample size, and not the number of replica plates.
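The effect size calculation described above can be illustrated with a minimal Python sketch. It is not the authors' code (their analysis was run in R). The function name and the example counts are hypothetical. The standard error formula (square root of the summed reciprocal cell counts) is the usual large-sample approximation and is an assumption here, since the text does not state which formula was used.

```python
import math

def log_odds_ratio(m_virulent, m_non_virulent, w_virulent, w_non_virulent):
    """Return (log_or, se, ci_low, ci_high) for one experiment.

    m_* are counts from mutant infections, w_* from wildtype infections.
    """
    counts = [m_virulent, m_non_virulent, w_virulent, w_non_virulent]
    # Zero counts are replaced by 0.5, as described in the text.
    a, b, c, d = [0.5 if x == 0 else float(x) for x in counts]
    log_or = math.log((a / b) / (c / d))
    # Assumed: standard large-sample standard error of a log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # 95% confidence interval from the normal approximation.
    return log_or, se, log_or - 1.96 * se, log_or + 1.96 * se

# Hypothetical counts: 2 of 10 hosts die with the mutant, 8 of 10 with the wildtype.
effect, se, low, high = log_odds_ratio(2, 8, 8, 2)
print(round(effect, 3), round(se, 3))
```

A negative value means the mutant was less virulent than the wildtype, matching the sign convention used for the effect sizes in this study.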

Additional Analysis: Effect of Pyoverdine on Growth in Mammalian Hosts
Twelve experiments conducted in mammals that were excluded from our main meta-analysis contained data on in vivo growth of wildtype and pyoverdine-deficient strains. Ten of these came from two publications that also supplied virulence measures meeting our inclusion criteria, while the remaining two came from studies that did not supply any virulence measures (see Table S4). While growth is at best an indirect proxy for virulence, the role of pyoverdine in facilitating in-host growth is an interesting question to ask when considering the in vivo role of this molecule. Therefore, we conducted a second, much smaller meta-analysis of these 12 datasets to investigate whether pyoverdine production affects P. aeruginosa growth in vivo. We calculated the standardized mean differences in growth (bacterial cell counts per gram of tissue or ml of blood) between wild-type and mutant infections for this set of studies (Table S4, Figure S2). This secondary dataset shows a similar pattern to the main meta-analysis shown in Figure 1.
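As an illustration of a standardized mean difference, the sketch below computes a bias-corrected version (Hedges' g) from group means, standard deviations and sample sizes. The text does not say which variant was used, so the choice of Hedges' g, the direction of the difference (mutant minus wildtype), the function name and the example numbers are all assumptions.

```python
import math

def standardized_mean_difference(mean_wt, sd_wt, n_wt, mean_mut, sd_mut, n_mut):
    """Return (g, se): Hedges' g for mutant vs. wildtype growth and its standard error."""
    # Pooled standard deviation across the two groups.
    pooled_sd = math.sqrt(
        ((n_wt - 1) * sd_wt ** 2 + (n_mut - 1) * sd_mut ** 2) / (n_wt + n_mut - 2)
    )
    d = (mean_mut - mean_wt) / pooled_sd
    # Small-sample correction factor (assumed; turns Cohen's d into Hedges' g).
    j = 1 - 3 / (4 * (n_wt + n_mut) - 9)
    g = j * d
    var_g = j ** 2 * ((n_wt + n_mut) / (n_wt * n_mut) + d ** 2 / (2 * (n_wt + n_mut)))
    return g, math.sqrt(var_g)

# Hypothetical log10 cell counts per gram of tissue.
g, se = standardized_mean_difference(10.0, 2.0, 5, 8.0, 2.0, 5)
print(round(g, 3), round(se, 3))
```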

Basic Analytical Approach
Analyses were performed in R version 3.3.1 (R Development Core Team, 2016), using functions from packages "meta" (Schwarzer, 2016) and "metafor" (Viechtbauer, 2010). We used the "metabin" function to transform the count data into the (log-) odds ratio described above. We then weighted these values by the inverse of their respective squared standard errors, and pooled them to obtain a single distribution of effect sizes. We reasoned that the variability of the effect sizes in our dataset probably reflects more than simple sampling error around a single true mean. Rather, we assume that our effect sizes represent a random sample from a larger distribution comprising all possible true effect size estimates. As such, we inferred that a random effects meta-analysis would be more appropriate for our dataset than a fixed effects model (for further discussion, see Borenstein et al., 2009, 2010). In a random effects meta-analysis, we partition the total heterogeneity observed in our dataset (described by the statistic Q) into two constituent parts: within-experiment variation (ε) and between-experiment variation (ζ). The latter component, scaled appropriately to account for the weightings intrinsic to meta-analysis, is quantified as the $\tau^2$ statistic. There are several different algorithms one can use to effect this partitioning of variance. We chose a restricted maximum likelihood (REML) approach. The use of a random model, rather than a simpler fixed model, affects the weights accorded to each constituent effect size, which in turn changes our estimates for pooled means and their associated errors. We further slightly broadened confidence intervals and weakened test statistics using Knapp and Hartung's algorithm (Knapp and Hartung, 2003), a widely used and conservative adjustment designed to account for the inherent uncertainty associated with the partitioning of heterogeneity we perform in the course of fitting a random effects model.
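To make the weighting logic concrete, the following Python sketch pools effect sizes under a random-effects model. It deliberately substitutes the closed-form DerSimonian-Laird moment estimator for the REML estimator used in this study, and it omits the Knapp and Hartung adjustment, so its output would not reproduce the reported values. The function name and the example inputs are hypothetical.

```python
import math

def random_effects_pool(effects, std_errors):
    """Pool effect sizes with inverse-variance weights and a DerSimonian-Laird tau^2."""
    variances = [se ** 2 for se in std_errors]
    w = [1 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed_mean = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed_mean) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Between-experiment variance (moment estimator, truncated at zero).
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights add tau^2 to each experiment's sampling variance.
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se_pooled = math.sqrt(1 / sum(w_star))
    return {"pooled": pooled, "se": se_pooled, "tau2": tau2, "q": q, "df": df}

# Hypothetical log odds ratios and standard errors from four experiments.
result = random_effects_pool([-2.8, -0.4, -1.5, 0.3], [1.1, 0.6, 0.8, 0.9])
print(result)
```

Because $\tau^2$ is added to every experiment's variance, the random-effects weights are more even than the fixed-effect weights, which is why the choice of model changes both the pooled mean and its error.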
We assessed the degree of residual heterogeneity in our dataset using statistics $I^2$ and H. $I^2$ estimates the approximate percentage of total variability across experiments that is attributable to unexplained heterogeneity, as opposed to simple sampling error (chance). It is calculated as $100\% \times (Q - df)/Q$, where Q is Cochran's heterogeneity statistic and df its associated degrees of freedom. H is directly related to $I^2$ (see equation in Higgins and Thompson, 2002) and reports "excess" heterogeneity as a fold difference compared to the baseline amount of variability we would have expected if the sample were homogeneous (Higgins and Thompson, 2002).
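Both statistics follow directly from Q and its degrees of freedom, as the short Python sketch below shows. The definition of H as the square root of Q/df follows Higgins and Thompson (2002). Truncating $I^2$ at zero is the usual convention and is assumed here. The function name and input values are hypothetical.

```python
import math

def heterogeneity_stats(q, df):
    """Return (I2 in percent, H) from Cochran's Q and its degrees of freedom."""
    # I^2 = 100% * (Q - df) / Q, truncated at zero when Q is smaller than df.
    i2 = max(0.0, 100.0 * (q - df) / q) if q > 0 else 0.0
    # H = sqrt(Q / df): fold excess over the variability expected by chance.
    h = math.sqrt(q / df)
    return i2, h

# Hypothetical example: Q is twice its degrees of freedom.
print(heterogeneity_stats(160.0, 80))
```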
Both metrics described above indicated considerable residual heterogeneity in our dataset, so we inferred that, beyond random and sampling error, some measurable characteristics of the experiments in our dataset could be contributing, in predictable ways, to the observed heterogeneity of our assembled effect sizes. We therefore extended our basic analysis to take into consideration four potential moderator variables, which we describe below.

Stratification by Moderator Variables
We selected and defined moderator variables on the basis of (a) our a priori expectation that they may be important, and (b) the availability of data. The four we investigated were host taxon, infection type, the wildtype strain background, and the expected level of pleiotropy associated with mutations involved. In cases where information was missing for a specific moderator variable, we contacted the authors to obtain additional information. For each moderator, we defined the following relevant subgroups.

Host Organism
Because immune responses and the chemistry of infected tissue/fluid will vary between host taxa, host taxon could reasonably be expected to influence both wild-type levels of virulence and the contribution of siderophores to growth and/or virulence. We first split experiments into broad taxonomic units (mammals, invertebrates, plants), and then classified hosts by genus.

Infection Type
Similarly, different organs or tissue types, even within the same host, could provide different environments and nutrient regimes for bacteria. We classified experiments according to the organ or body region targeted by the infection. Major categories include infections of the host organisms' respiratory tract, digestive system, skin (including burn wounds), and infections that generated a non-localized infection of the body cavity (systemic infection). Experiments that did not fit in any of these categories, such as infections of whole seedlings, were classified as "other infection types."

Wildtype Strain Background
Four different P. aeruginosa wildtypes (PAO1, PA14, FRD1, and PAO6049) were used for infection experiments. These are known to differ from one another genetically and phenotypically: for instance, PA14 is generally found to be more virulent than PAO1 across a range of lab models. As background genotype and virulence levels might affect the impact of siderophore mutation on virulence, we included wild-type strain in our analysis. Although it is well established that even standard strains such as PAO1 can substantially differ between labs, there was not enough information available to take such strain-level variation into account.

FIGURE 1 | Forest plots depicting the variation in effect size across experiments on pyoverdine as a virulence factor in P. aeruginosa. All panels display the same effect sizes originating from the 81 experiments involved in the meta-analysis, but grouped differently according to four moderator variables, which are: (A) host taxon; (B) infection type; (C) wildtype strain background; and (D) the likelihood of pleiotropy in the pyoverdine-deficient strain. Effect sizes are given as log-odds-ratio ± 95% confidence interval. Negative and positive effect sizes indicate lower and higher virulence of the pyoverdine-deficient mutant relative to the wildtype, respectively. The scale of the circular markers is proportional to each experiment's relative weighting in the context of a (fixed) meta-analysis. As this weighting is the inverse variance associated with an observed effect, less "noisy" experiments are accorded larger weights. Diamonds represent the mean effect sizes (obtained from meta-regression analysis) for each subgroup of a specific moderator variable. IDs of the individual experiments are listed on the Y-axis (for details, see Table S1 in the Supplemental Material). The numbers in brackets on the Y-axis correspond to the citation number of the corresponding publication.

Frontiers in Microbiology | www.frontiersin.org

Likelihood of Pleiotropy
The focal phenotype investigated in this meta-analysis is the production of pyoverdine, the main siderophore of P. aeruginosa. Mutants exhibiting reduced or no pyoverdine production can be generated either by deleting a specific pyoverdine-synthesis gene, or through untargeted mutagenesis (e.g., UV light, Hohnadel et al., 1986). The latter mutants are likely to have mutations in other genes unrelated to pyoverdine synthesis. These mutations are typically unknown but could also affect virulence. In principle, even single gene deletions can have pleiotropic effects on the phenotype, via disruption of interactions with other genes. Depending on the locus in question, certain genetic modifications are more likely than others to have pleiotropic effects, and so to affect the expression of other virulence factors. To account for these complications, we inferred on a case-by-case basis whether the mutation used was likely to only induce a change in (or loss of) pyoverdine production (i.e., pleiotropy less likely) or was likely to induce a change in other phenotypes as well (i.e., pleiotropy more likely). In the biosynthesis of pyoverdine, multiple enzymes are involved in non-ribosomal peptide synthesis (Visca et al., 2007). Two gene clusters, the pvc operon and the pvd locus, encode proteins involved in the synthesis of the chromophore and peptide moieties, respectively (Stintzi et al., 1999; Visca et al., 2007). In most of these genes, a mutation or deletion leads to a complete loss of pyoverdine production, and most likely does not affect any other trait. Accordingly, we assigned mutants carrying mutations in these genes to the category "pleiotropy less likely." An exception is pvdQ, a gene coding for a periplasmic hydrolase, which is required for pyoverdine production, but is also involved in the degradation of N-acyl-homoserine lactone quorum-sensing molecules (Nadal Jimenez et al., 2010). Strains with deletions in this gene were therefore assigned to the category "pleiotropy more likely." Other strains falling into this category included: (i) mutants where the key regulator of pyoverdine synthesis, PvdS, was deleted, leading to deficiencies in toxin and protease production, in addition to a complete loss of pyoverdine production (Beare et al., 2003); (ii) strains that carry a deletion in a central metabolic gene and only coincidentally show no (or strongly reduced) pyoverdine production; (iii) double mutants that carry deletions in both the pyoverdine and the pyochelin synthesis pathway (pyochelin is the secondary siderophore of P. aeruginosa, Braud et al., 2009); and (iv) pyoverdine mutants created via non-targeted (e.g., UV) mutagenesis.
Using the stratification described above, we focused on estimating mean effect sizes and their associated confidence intervals within all subgroups represented in our full dataset (i.e., the diamonds in Figure 1). To this end, we fitted a series of four univariate models, one for each moderator. Each model was a random-effects meta-analysis (as above) but we constrained the level of between-experiment heterogeneity ($\tau^2$) to be common across all subgroups (i.e., each factor level of the moderator under examination).

Comparison of Moderator Variables' Relative Influence
To better understand the relative importance of the four moderator variables described above, we focused on a reduced "core" dataset, for which we excluded experiments belonging to rare or poorly characterized subgroups to generate a smaller but more homogenous dataset. We excluded experiments involving plants and/or undefined wildtype strains (n = 6), experiments reporting tissue damage as a measure of virulence (n = 12), and experiments where the hosts were likely not colonized by bacteria but died from exposure to bacterial toxins (n = 8). This resulted in a core dataset comprising 55 experiments.
Using this dataset, we then fitted meta-regression models that simultaneously considered the contributions of multiple moderator factors. Our models were able to estimate moderators' additive effects only, because even in this "core" dataset, with its comparatively better data coverage across the remaining subgroups, the distribution of data across different combinations of factor levels was still too patchy to permit a proper investigation of moderators' interactive effects. Moderators' alterations of the expected (i.e., baseline) effect size could be quantified as coefficients, which could, when standardized as t-statistics, be tested for significant differences from zero. In addition, we could test whether, collectively, the inclusion of moderators in our meta-analysis model significantly reduced the residual heterogeneity relative to a situation with no moderators.
To estimate what share of the residual heterogeneity in our dataset could be individually attributable to each of the respective moderators, we performed a series of likelihood ratio tests comparing, in each case, a full model including all four moderators, against a reduced model that excluded one of the moderators. Variance component estimation in these models used maximum likelihood instead of REML because nested REML models cannot be compared in this way. From each pairwise comparison, we obtained a pseudo-$R^2$ value, which reflects the difference in $\tau^2$ (between-experiment heterogeneity) between the two models, scaled by the $\tau^2$ of the simpler model.
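The two quantities obtained from each pairwise model comparison can be written out as a short Python sketch. It assumes the $\tau^2$ estimates and log-likelihoods of the fitted models are already available, since fitting the models is outside its scope. The function names and input values are hypothetical, and truncating the pseudo-$R^2$ at zero is an assumed convention.

```python
def pseudo_r2(tau2_reduced, tau2_full):
    """Share of between-experiment heterogeneity accounted for by the dropped moderator."""
    if tau2_reduced <= 0:
        return 0.0
    # Difference in tau^2 between the models, scaled by tau^2 of the simpler model.
    return max(0.0, (tau2_reduced - tau2_full) / tau2_reduced)

def likelihood_ratio_statistic(loglik_full, loglik_reduced):
    """Test statistic comparing nested models fitted by maximum likelihood."""
    return 2.0 * (loglik_full - loglik_reduced)

# Hypothetical values for a full model and a model lacking one moderator.
print(pseudo_r2(tau2_reduced=2.0, tau2_full=1.5))
print(likelihood_ratio_statistic(loglik_full=-95.0, loglik_reduced=-99.0))
```

The likelihood ratio statistic is then compared against a chi-square distribution whose degrees of freedom equal the number of coefficients removed with the moderator.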

Potential Within-Study Bias
The most likely source of potential directional bias within studies is the genotype of the siderophore mutant used. One unpublished dataset was excluded partly because the siderophore mutant used carried a mutation in a locus that is not essential for siderophore production (Table S3), and so experiments using this mutant may be biased toward finding no effect of mutation on virulence. Sixteen included datasets used pvdS mutants; because pvdS positively regulates the expression of other virulence factors as well as siderophores, experiments using this mutant may be biased toward finding an effect of mutation on virulence. Similarly, three included datasets used a UV-generated mutant, PAO6609 (PAO9) which carries mutations in several loci affecting growth or virulence (F. Harrison, A. McNally, A. Da Silva & S. P. Diggle, unpublished data). Because we included likelihood of pleiotropy as a moderator variable in our analysis, we should be able to partition out between-study variance caused by the use of mutants that risk a bias toward positive results.

Testing for Signs of Publication Bias
To test for putative publication bias in our dataset, we compared effect sizes against their respective standard errors, the idea being that if there is no bias, there should be no link between the magnitude of the result from a given experiment, and the "noisiness" or uncertainty of that particular result. If there is bias, we could find an overrepresentation of noisier experiments reporting higher magnitude results. Using the "metabias" function of the R package "meta," we performed both (weighted) linear regressions and rank correlations to test for this pattern (Begg and Mazumdar, 1994; Egger et al., 1997).
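The regression idea behind this test can be sketched in plain Python: effect sizes are regressed on their standard errors with inverse-variance weights, and a slope that differs from zero indicates that noisier experiments report systematically different effects. This is a simplified illustration in the spirit of Egger et al. (1997), not a reimplementation of the R function used in this study. It returns only the coefficients, without a significance test, and the function name and data are hypothetical.

```python
def regress_effect_on_se(effects, std_errors):
    """Weighted least-squares fit of effect size on standard error.

    Returns (intercept, slope). Requires that the standard errors are not all equal.
    """
    w = [1 / se ** 2 for se in std_errors]  # inverse-variance weights
    sw = sum(w)
    x_bar = sum(wi * x for wi, x in zip(w, std_errors)) / sw
    y_bar = sum(wi * y for wi, y in zip(w, effects)) / sw
    sxx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, std_errors))
    sxy = sum(wi * (x - x_bar) * (y - y_bar)
              for wi, x, y in zip(w, std_errors, effects))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data in which noisier experiments report more negative effects.
ses = [0.2, 0.5, 1.0, 1.5]
effects = [-0.5 - 2.0 * se for se in ses]
print(regress_effect_on_se(effects, ses))
```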

Literature Search and Study Characteristics
We searched the literature for papers featuring infections of whole live host organisms with P. aeruginosa strains known to vary in pyoverdine phenotype. Following a set of inclusion/exclusion rules (see materials and methods for details), we were able to include data from a total of 81 experiments from 24 original papers in our meta-analysis (Table 1; see also Figure S1 and Tables S1, S2 in the Supplemental Material). These experiments featured a range of host organisms, including mammals (mice and rabbits, n = 37), the nematode Caenorhabditis elegans (n = 32), insects (fruit fly, silk worm and wax worm, n = 8) and plants (wheat and alfalfa, n = 4). Experiments further differed in the way infections were established and in the organs targeted. The most common infection types were gut (n = 34), systemic (n = 16), respiratory (n = 13) and skin infections (n = 6), but we also included some other types of infections (n = 12). Each experiment compared infections with a control P. aeruginosa strain (which produced wildtype levels of pyoverdine) to infections with a mutant strain defective for pyoverdine production. The most common control strains used were PAO1 (n = 58) and PA14 (n = 19), which are both well-characterized clinical isolates. However, some experiments used less well-characterized wildtype strains, such as FRD1 (n = 2) and PAO6049 (n = 2). Twenty-eight experiments used mutant strains with clean deletions or transposon Tn5 insertions in genes encoding the pyoverdine biosynthesis pathway. In these cases, pleiotropic effects are expected to be relatively low, i.e., presumably only pyoverdine production was affected. The other 53 experiments used mutants where pleiotropic effects were likely or even certain. For example, some mutant strains carried mutations in pvdS, which encodes the main regulator of pyoverdine synthesis that also regulates the production of toxins and proteases (Ochsner et al., 1996; Wilderman et al., 2001). Others carried mutations in pvdQ, encoding an enzyme known to degrade quorum-sensing molecules in addition to its role in pyoverdine biosynthesis (Nadal Jimenez et al., 2010).

Mean Effects across and Within Subgroups
We combined data from the set of experiments described above in a meta-analysis to determine the extent to which pyoverdine's effect on virulence varied across four moderator variables: (i) host taxa, (ii) tissue types, (iii) pathogen wildtype background, and (iv) pyoverdine-mutation type. To obtain a comparable measure of virulence across experiments, we extracted in each instance the number of cases where a given infection type did or did not have a virulent outcome (i.e., dead vs. alive, or with vs. without symptoms) for both the mutant (m) and the wildtype (w) strain for each experiment (see materials and methods for details). We then took as our effect size the log-odds-ratio, i.e., $\ln\big((m_{\text{virulent}}/m_{\text{non-virulent}})/(w_{\text{virulent}}/w_{\text{non-virulent}})\big)$ (see Table S2 in the Supplemental Material), a commonly-used measure especially suitable for binary response variables like survival (Szumilas, 2010).
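As an illustration of this effect size, the following minimal Python sketch computes the log-odds-ratio and its usual large-sample standard error from the four counts of a single experiment. It is not the code used for the analysis: the function names and the example counts are hypothetical, and the sketch assumes that all four counts are non-zero.

```python
import math


def log_odds_ratio(m_vir, m_non, w_vir, w_non):
    """Log-odds-ratio of a virulent outcome, mutant (m) relative to wildtype (w).

    Negative values mean the mutant was less virulent than the wildtype.
    Assumes all four counts are greater than zero.
    """
    return math.log((m_vir / m_non) / (w_vir / w_non))


def log_odds_ratio_se(m_vir, m_non, w_vir, w_non):
    """Large-sample standard error: square root of the summed reciprocal counts."""
    return math.sqrt(1 / m_vir + 1 / m_non + 1 / w_vir + 1 / w_non)


# Hypothetical experiment: 4 of 20 hosts die with the mutant,
# 12 of 20 hosts die with the wildtype.
lor = log_odds_ratio(4, 16, 12, 8)
se = log_odds_ratio_se(4, 16, 12, 8)
print(f"log-odds-ratio = {lor:.3f}, standard error = {se:.3f}")
```

In this hypothetical case the odds of death are lower for the mutant than for the wildtype, so the log-odds-ratio is negative, which is the direction expected if pyoverdine contributes to virulence.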
Consistent with the theoretical prediction that host-pathogen interactions and host ecology are important modulators of virulence, we found considerable variation in the effect sizes across experiments and subgroups of all moderators (Figure 1). Pyoverdine-deficient mutants showed substantially reduced virulence in invertebrate and mammalian hosts, whereas there was little evidence for such an effect in plants (Figure 1A). Overall, evidence for pyoverdine being an important virulence factor was weak for taxa with a low number of experiments (i.e., for plants, and the insect models Drosophila melanogaster and Galleria mellonella). We found that pyoverdine-deficient mutants exhibited reduced virulence in all organs and tissues tested, with the exception of plants (Figure 1B). Comparing the effect sizes across wildtype strain backgrounds, we see that pyoverdine deficiency reduced virulence in experiments featuring the well-characterized PA14 and PAO1 strains (Figure 1C), whereas the reduction was less pronounced in experiments with less well-characterized wildtype strains. This could be due to sampling error (only a few experiments used these strains) or it may be that these strains really behave differently from PA14 and PAO1. Finally, we observed that the nature of the pyoverdine-deficiency mutation matters (Figure 1D). Infections with strains carrying well-defined mutations known to exclusively (or at least primarily) affect pyoverdine production showed a relatively consistent reduction in virulence. Conversely, where mutants were poorly defined, or carried mutations likely to affect other traits beyond pyoverdine, the virulence pattern was much more variable, with both reduced and increased virulence relative to wildtype infections (Figure 1D). We posit that at least some of the differences in observed virulence between these mutants and their wildtype counterparts were likely due to pleiotropic differences in phenotypes unrelated to pyoverdine.
Figure 1 highlights that we are dealing with an extremely heterogeneous dataset: a random-effects meta-analysis of the full dataset without moderators yielded heterogeneity measures $I^2$ = 97.92% (97.16-98.48) and H = 6.93 (5.93-8.10), where values in brackets indicate the 95% confidence limits associated with each estimate. Much of the variation we observe is probably due to other factors beyond those explored in Figure 1. The issue is that (a) we do not know what all these additional factors might be, and (b) the probably patchy distribution of experiments across the levels and ranges of these other factors would leave us with limited power to test for their effects. Accordingly, we decided to focus our attention on quantifying the impact of the four previously described moderators by using a more homogeneous core dataset (n = 55), where rare and poorly characterized subgroups were removed. Specifically, we excluded experiments involving plants and/or undefined wildtype strains (n = 6), experiments reporting tissue damage as a measure of virulence (n = 12), and experiments where the hosts were likely not directly colonized by bacteria but died from exposure to bacterial toxins (n = 8). This leaves us with a core dataset comprising only those experiments where animal host models were infected with strains from a well-defined PA14 or PAO1 wildtype background, and survival vs. death was used as a virulence endpoint.
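For readers unfamiliar with these heterogeneity measures, the sketch below shows how they are commonly derived from Cochran's Q statistic, assuming the standard definitions H = sqrt(Q / df) and I² = (Q - df) / Q. Whether the analysis software used exactly these estimators is an assumption here, and the effect sizes and variances in the example are hypothetical.

```python
import math


def heterogeneity(effects, variances):
    """Return Cochran's Q, H and I^2 for a set of effect sizes.

    Uses inverse-variance weights and the common definitions
    H = sqrt(Q / df) and I^2 = (Q - df) / Q, truncated at zero.
    """
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    h = math.sqrt(q / df)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, h, i2


def i2_from_h(h):
    """I^2 expressed through H: I^2 = (H^2 - 1) / H^2, truncated at zero."""
    return max(0.0, (h * h - 1.0) / (h * h))


# Hypothetical log-odds-ratios and their sampling variances.
q, h, i2 = heterogeneity([-2.0, -0.5, 0.3, -3.1], [0.2, 0.1, 0.3, 0.5])
print(f"Q = {q:.2f}, H = {h:.2f}, I^2 = {100 * i2:.1f}%")

# Under these definitions, the reported H = 6.93 corresponds to an I^2 of about 97.9%.
print(f"I^2 implied by H = 6.93: {100 * i2_from_h(6.93):.2f}%")
```

Under these definitions the two reported measures carry the same information: an I² close to 100% means that almost all of the observed variation in effect sizes reflects real differences between experiments rather than sampling error.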

Relative Importance of Moderator Variables
Using this restricted dataset, we fitted a series of meta-regression models to test for significant differences between subgroups of our moderator factors, and we also estimated the share of total variance in effect sizes that is explained by each moderator variable (Figure 2). These models revealed that infection type is the variable that explains the largest share of total variance (25.4%). For instance, in systemic infection models the pyoverdine-defective mutants showed strongly reduced virulence compared to the wildtype, whereas this difference was less pronounced in gut infections. Host taxon explained only 8.2% of the total variance in effect sizes, and there was no apparent difference in the mean effect size between invertebrate and mammalian host models. Finally, the wildtype strain background and the likelihood of pleiotropy in the mutant strain each explained less than 1% of the overall effect size variation, and accordingly, there were no apparent differences between subgroups (Figure 2). This was interesting, because we predicted a priori that mutations with pleiotropic effects on other virulence factors could introduce within-study bias toward a greater effect of siderophore loss on virulence. Note that even with the inclusion of these moderator factors in the model, substantial heterogeneity remained in our restricted dataset [$I^2$ = 96.15% (94.23-97.43), H = 5.10 (4.16-6.24)].

Publication Bias
In any field, there is a risk that studies with negative or unanticipated results may be less likely to get published (e.g., in our case, pyoverdine-deficient mutants showing no change or increased levels of virulence, Dwan et al., 2008). Especially when negative or unanticipated results are obtained from experiments featuring low sample sizes (and thus high uncertainty), the scientists responsible may be less inclined to trust their results, and consequently opt not to publish them. This pattern could result in a publication bias, and an overestimation of the effect size. To test whether such a publication bias exists in our dataset, we plotted the effect size of each experiment against its (inverted) standard error (Figure 3). If there is no publication bias, we would expect to see an inverted funnel, with effect sizes more or less evenly distributed around the mean effect size, irrespective of the uncertainty associated with each estimate (i.e., position on the y-axis). Instead, we observed a bias in our dataset, with many lower-certainty experiments that show strongly negative effect sizes (i.e., supporting the hypothesis that pyoverdine is important for virulence; Figure 3) but a concomitant paucity of lower-certainty experiments that show weakly negative, zero or positive effect sizes (i.e., not supporting the hypothesis).
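One common way to quantify such funnel-plot asymmetry is a weighted linear regression of effect size on standard error, with each experiment weighted by its inverse variance. The sketch below implements this Egger-type variant in plain Python. It is an illustration only: the function names and data are hypothetical, and the regression shown in Figure 3 may have been parameterized differently.

```python
def weighted_linear_fit(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def funnel_asymmetry(effects, std_errors):
    """Regress effect sizes on their standard errors, weighting by 1 / SE^2.

    A slope near zero is what a symmetric funnel would give. A negative slope
    means that less precise experiments report more negative effect sizes.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    return weighted_linear_fit(std_errors, effects, weights)


# Hypothetical data in which low-certainty experiments show stronger effects.
std_errors = [0.2, 0.4, 0.8, 1.2, 1.6]
effects = [-0.6, -0.9, -1.8, -2.4, -3.5]
intercept, slope = funnel_asymmetry(effects, std_errors)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

In this parameterization the intercept can be read as the effect size expected for a hypothetical experiment with no sampling error, while the slope describes how strongly the reported effect depends on the uncertainty of the experiment.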

What Have We Learned from this Meta-Analysis?
Our meta-analysis reveals that pyoverdine-deficient strains of the opportunistic pathogen P. aeruginosa typically showed reduced virulence across a wide range of host species and bacterial genotypes. This confirms that iron limitation is a unifying characteristic of the host environment, making siderophores an important factor for pathogen establishment and growth within the host (Parrow et al., 2013;Becker and Skaar, 2014). However, we also saw that the extent to which pyoverdine deficiency reduced virulence varied considerably, and was quite modest in many instances. Pyoverdine-deficient mutant strains were typically more benign, owing to a reduced capacity for in vivo growth and/or a reduced capacity for inflicting damage on their host. Nonetheless, these mutants were typically still able to establish a successful infection, and, in many cases, could still kill their host (Romanowski et al., 2011;Kirienko et al., 2013;Ross-Gillespie et al., 2014;Lopez-Medina et al., 2015). These results support ecological theory predicting that the effect of a certain phenotype (i.e., producing pyoverdine in our case) should vary in response to the environment (i.e., the host and infection context, Lambrechts et al., 2006).
Our findings have direct consequences for any therapeutic approaches targeting this particular virulence factor. Because pyoverdine seems to be generally involved with virulence, treatments inhibiting pyoverdine production could have wide applicability and be effective against different types of infections across a wide host range. However, given the variation observed and pyoverdine's generally modest effect on virulence, the clinical impact of such treatments would likely vary across infection contexts, and be limited to attenuating rather than curing the infection. This would mean that for P. aeruginosa infections, at least, therapies targeting siderophore production could be helpful but should probably still be accompanied by other therapeutic measures, and applied for instance in combination with an antibiotic treatment (Banin et al., 2008). It is certainly promising that pyoverdine seems to have a more consistent (Figure 1A) and more prominent (Figure 2, although not significant) effect in mammalian compared to invertebrate hosts. From this observation, one could infer that pyoverdine may have potential as a target for infection control in humans.
Our work demonstrates how meta-analyses can be used to quantitatively synthesize data from different experiments carried out at different times by different researchers using different designs. Such an analytical approach goes beyond a classical review, where patterns are typically summarized in a qualitative manner. For instance, a recent study proposed that three different virulence factors (pyocyanin, protease, swarming) of P. aeruginosa are host-specific in their effects (Dubern et al., 2015). Here we use a meta-analytic approach to quantitatively derive estimates of the overall virulence potential of a given bacterial trait and investigate variables that affect infection outcomes. We assert that such quantitative comparisons are essential to identify those virulence factors that hold greatest promise as targets for effective broad-spectrum anti-virulence therapies.
FIGURE 2 | Test for differences between subgroups of moderator variables with regard to the effect sizes for pyoverdine as a virulence factor in P. aeruginosa. Our baseline condition for all comparisons is the following: gut infections in invertebrate hosts, using the P. aeruginosa wildtype strain PA14 vs. a pyoverdine-deficient PA14 mutant with a low expected level of pleiotropy. The effect size for this baseline scenario is set to zero. All other scenarios had more extreme (negative) effect sizes, and are therefore scaled relative to this baseline condition. Comparisons reveal that virulence in pyoverdine-deficient strains was significantly more reduced in systemic compared to gut infections, and that most effect size variation is explained by the infection type. There were no significant effect size differences between any of the other subgroups. Bars show the difference in log odds-ratio (± 95% confidence interval) between the baseline and any of the alternate conditions. Values given in brackets indicate percentage of effect size heterogeneity explained by a specific moderator.
Our finding that effect sizes vary considerably across our assembled experiments provides a different perspective compared to that which one would obtain from a cursory reading of the literature. For instance, the first study investigating pyoverdine in the context of an experimental infection model (Meyer et al., 1996) reported that pyoverdine is essential for virulence. Although this experiment and its message have been widely cited (including by ourselves), it may no longer be the strongest representative of the accumulated body of research on this topic. As we see in Figure 1, the effect size it reports is associated with a high uncertainty due to a comparatively low sample size. Moreover, the observed effect cannot unambiguously be attributed to pyoverdine because an undefined UV-mutagenized mutant was used. We highlight this example not to criticize it, but rather because it serves to demonstrate why drawing inferences from (appropriately weighted) aggregations of all available evidence is preferable to focusing solely on the results of a single study.

Areas of Concern
Our meta-analytic approach not only provides information on the overall importance of pyoverdine for P. aeruginosa virulence, but it also allows us to identify specific gaps in our knowledge. For example, let us consider which types of studies were conspicuously absent from our dataset. First, most experiments in our dataset employed acute infection models, even though P. aeruginosa is well known for its persistent, hard-to-treat chronic infections. This raises the question to what extent insights on the roles of virulence factors important in acute infections can be transferred to chronic infections. In the case of pyoverdine, we know that in chronically-infected cystic fibrosis airways, pyoverdine production is often selected against (Wiehlmann et al., 2007;Jiricny et al., 2014;Andersen et al., 2015). Although the selective pressure driving this evolutionary loss is still under debate (current explanations include pyoverdine disuse, competitive strain interactions and/or a switch to alternative iron-uptake systems, Marvig et al., 2014;Andersen et al., 2015;Kümmerli, 2015), this example illustrates that the role of pyoverdine might differ in acute vs. chronic infections.
FIGURE 3 | Association between effect sizes and their standard errors across 81 experiments examining the role of pyoverdine production for virulence in P. aeruginosa. In the absence of bias, we should see an inverted funnel-shaped cloud of points, more or less symmetrically distributed around the mean effect size (vertical dotted line). Instead, we see an over-representation of low-certainty experiments associated with strong (negative) effect sizes. This suggests a significant publication bias: experiments with low-certainty and weak or contrary effects presumably do exist, but are under-represented here (note the absence of data points in the cross-hatched triangle). Effect sizes are given as log-odds-ratio. Each symbol represents a single experiment. Symbol colors and shapes stand for different host organisms (red circles, invertebrates; blue squares, mammals; green diamonds, plants). Large symbols denote the experiments included in the core dataset. The solid shaded area represents the 95% confidence interval for the weighted linear regression using the complete dataset. Note that due to the stronger weights accorded to high certainty experiments (i.e., the points toward the top of the plot), many of the lower-weighted (higher-uncertainty) points toward the bottom of the plot lie quite far from the regression line and also outside the confidence interval.
Second, our comparative work shows that experiments were predominantly carried out with the well-characterized strains PAO1 and PA14. While these strains were initially isolated from clinical settings, they have subsequently undergone evolution in the laboratory environment (Bragonzi et al., 2009;Klockgether et al., 2011;Frydenlund Michelsen et al., 2015), and might now substantially differ from the clinical strains actually causing acute infections in hospitals. Therefore, while we found no overall differences between the lab strains used in our data set, we argue that it would still be useful to carry out additional studies on a range of clinical isolates to be able to make firm conclusions on the general role of pyoverdine as a virulence factor.
Finally, our data analysis revealed that low-certainty studies showing no or small effects of pyoverdine on virulence were under-represented in our data set, which points toward a systematic publication bias. It remains to be seen whether such biases are common with regard to research on virulence factors, and whether they result in a general overestimation of the effect these factors have on host survival or tissue damage. With regard to pyoverdine, further studies are clearly needed to obtain a more accurate estimate of the true effect size.
In addition to these issues of data availability, all meta-analyses unavoidably involve intrinsic assumptions and subjective decisions that can further influence the resulting outputs. For instance, although we have used the standard log-odds-ratio as our common metric, related metrics like risk ratios, while typically highly correlated, can sometimes produce different patterns in a meta-analysis. In the present case, however, using risk ratios instead does not qualitatively affect the patterns we observed nor alter our conclusions. Furthermore, to facilitate calculation of a log-odds-ratio in cases where zero counts appear as denominators, we have in this study adopted the common, yet ultimately arbitrary, convention of replacing these zero denominators with 0.5. Had we used a different value as a substitute, say 0.1 or 0.9, the estimates from our resulting models would have been different, at least quantitatively. A more fundamental issue is that when estimating population-level statistics across a collection of experiments, we typically assume that we are comparing like with like. Here, we have intentionally brought together a very diverse set of studies, and even though we have translated their individual effect sizes into a common metric and also stratified them by some of their major defining characteristics, their individual effect sizes nonetheless remain highly heterogeneous. In effect, we are knowingly combining apples and pears, because we think that the resulting fruit salad is still something that is worth taking a look at. Alternative or additional ways of slicing up the data could yield models with lower residual heterogeneity, but then poor data coverage in combinations of subcategories could limit the accuracy of any parameter estimates we want to extract from such models. In light of these issues, we advise readers to focus on the overall patterns our models reveal, rather than the specific values of the estimates they generate.
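The sensitivity to the zero-count substitute can be illustrated with a short sketch. The counts are hypothetical and the function name is ours. As a simplifying assumption, the sketch replaces every zero count with the substitute, rather than only those appearing as denominators, so that the logarithm is always defined.

```python
import math


def log_odds_ratio_corrected(m_vir, m_non, w_vir, w_non, substitute=0.5):
    """Log-odds-ratio (mutant vs. wildtype) with zero counts replaced.

    Assumption for this sketch: every zero count is replaced by `substitute`.
    """
    m_vir, m_non, w_vir, w_non = (
        substitute if count == 0 else count
        for count in (m_vir, m_non, w_vir, w_non)
    )
    return math.log((m_vir / m_non) / (w_vir / w_non))


# Hypothetical experiment: the wildtype kills all 10 hosts (no survivors),
# while the mutant kills 3 of 10 hosts.
for substitute in (0.1, 0.5, 0.9):
    lor = log_odds_ratio_corrected(3, 7, 10, 0, substitute=substitute)
    print(f"substitute = {substitute}: log-odds-ratio = {lor:.2f}")
```

In this hypothetical case the direction of the effect is the same for all three substitutes, but its magnitude changes noticeably, which is why the choice matters quantitatively more than qualitatively.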

Guidelines for Future Studies
While our study demonstrates the strength of quantitative comparative approaches, it is important to realize that extracting effect sizes is one of the biggest challenges in any meta-analysis. This challenge was particularly evident for the experiments we found, which profoundly varied in the way data was collected and reported. As a consequence, we had to exclude many studies because they used measures of virulence that were only reported by a minority of studies, or because their reporting of results was unclear (for a selected list of examples, see Table S3 in the Supplemental Material). To amend this issue for future studies, we would like to first highlight the problems we encountered and then provide general guidelines of how data reporting could be improved and standardized. One main problem we experienced was incomplete data reporting (i.e., mean treatment values, absolute values and/or sample size was not reported), which prevents the calculation of effect sizes and uncertainty measures. Another important issue was that different studies measured virulence using very different metrics. Some measured virulence at the tissue level (i.e., the extent of damage inflicted), while others focused on the whole host organism. Others focused on the dynamics of the bacteria themselves, taking this as a proxy for the eventual damage to the host. There were both quantitative measures (e.g., extent of damage), and qualitative measures (e.g., assignments to arbitrary categories of virulence). Survival data was sometimes presented as a timecourse, sometimes as an endpoint; sometimes as raw counts, sometimes as proportions. In most cases, the time scales over which survival was assessed were fairly arbitrary. 
Compiling such diverse measures of virulence is not simply time consuming, but it also generates extra sources of heterogeneity in the dataset, which might interfere with the basic assumptions of meta-analytical models (Borenstein et al., 2009, 2010). How can these problems be prevented in future studies? We propose the following. (a) Whenever possible, time-to-event data (e.g., death, organ failure, etc.) should be recorded in a form that preserves both the outcome and the times to event per subject. (b) The number of replicates used (hosts) and a measure of variance among replicates must be provided to be able to calculate a confidence estimate for the experiment. (c) If data are scaled in some way (e.g., relative to a reference strain), the absolute values should still be reported, because these are crucial for the calculation of effect sizes. Finally, (d) studies leading to unexpected or negative results (e.g., no difference in virulence between a wildtype and a mutant) should still be published, as they are needed to estimate a true and unbiased effect size. In summary, all findings, irrespective of their magnitude or polarity, should be presented "as raw as possible" (e.g., in Supplementary files or deposited in online data archives). This will make comparisons across studies much easier and will provide a useful resource for future meta-analytic studies.

CONCLUSIONS
Currently, bacterial traits are subject to a binary categorisation whereby some are labeled as virulence factors while others are not. We demonstrate that traits' effects on virulence are anything but binary. Rather, they strongly depend on the infection context. Our study affirms meta-analysis as a powerful tool to quantitatively estimate the overall effect of a specific virulence factor and to compare its general importance in infections across different bacterial strains, hosts, and host organs. Such quantitative comparisons provide us with a more complete picture on the relative importance of specific virulence factors. Such knowledge is especially valuable for opportunistic pathogens, which have a wide range of virulence factors at their disposal, and infect a broad range of host organisms (Kurz et al., 2003;He et al., 2004;Calderone and Fonzi, 2001). Meta-analytical comparisons could thus inform us on which traits would be best suited as targets for anti-virulence therapies. Ideal traits would be those with high effect sizes and general importance across pathogen and host organisms.

AUTHOR CONTRIBUTIONS
EG, FH, RK, and AR-G conceived the study; EG conducted the literature search and compiled the data set; AR-G conducted statistical analysis; EG, FH, RK, and AR-G interpreted the data and wrote the paper.

FUNDING
Forschungskredit Candoc of the University of Zurich (EG). Swiss National Science Foundation PP00P3-139164 (AR-G, EG, and RK). Novartis Foundation for Medical and Biological Research (AR-G and RK). University of Warwick (FH).