Federated Statistical Analysis: Non-parametric Testing and Quantile Estimation

The age of big data has fueled expectations for accelerating learning. The availability of large data sets enables researchers to achieve more powerful statistical analyses and enhances the reliability of conclusions, which can be based on a broad collection of subjects. Often such data sets can be assembled only with access to diverse sources; for example, medical research may combine data from multiple centers in a federated analysis. However, these hopes must be balanced against data privacy concerns, which hinder sharing raw data among centers. Consequently, federated analyses typically resort to sharing data summaries from each center. The limitation to summaries carries the risk that it will impair the efficiency of statistical analysis procedures. In this work we take a close look at the effects of federated analysis on two very basic problems: nonparametric comparison of two groups and quantile estimation to describe the corresponding distributions. We also propose a specific privacy-preserving data release policy for federated analysis with the $K$-anonymity criterion, which has been adopted by the Medical Informatics Platform of the European Human Brain Project. Our results show that, for our tasks, there is only a modest loss of statistical efficiency.


INTRODUCTION
The ability to analyze large sets of medical data has clear potential for improving health care. Often, though, a large patient base is available only by combining data from multiple centers, and privacy concerns hinder the sharing of raw data among centers. Federated data analysis addresses the privacy concerns by limiting data release from a center to summary statistics, without revealing the raw data. The analysis must then rely on the summary statistics. Methods for federated analysis have been proposed in the machine learning literature, but little research has been done to examine the consequences of the methods for statistical inference. Our goal in this paper is to fill some of the gap, comparing federated approaches for some basic statistical analyses. A simple example may help to set the stage. The Kaplan-Meier estimator is one of the most widely used tools in the analysis of medical data. The resulting survival curves show the event times of all subjects and thus compromise privacy.
The roots of our work are in the European Human Brain Project ("HBP"). Data sharing is a major priority for the HBP, but must be fully consistent with the GDPR. Salles et al. (2018) spelled out a detailed Opinion and Action Plan on 'Data Protection and Privacy' for the HBP. The plan gives important guidelines and a sound administrative framework for data protection, but does not provide technical solutions. One such solution was adopted in the design of the Medical Informatics Platform ("MIP"), the HBP vehicle for federated, multi-institutional data analysis. Specifically, any data table exported from a member institution for use in federated analysis on the MIP must have at least 10 subjects in any cell of the table. The Kaplan-Meier curve, cited earlier, requires for each event time a table with the size of the risk set and the number who had an event; the latter will usually be less than 10 and so will not satisfy the privacy constraint.
In section 3 we propose a method for data summary that works specifically with the MIP restriction, generating tables in a federated manner. We then address two particular statistical problems: (i) use of the nonparametric Mann-Whitney U test (henceforth "MWU") (Mann and Whitney, 1947) to test the hypothesis that there is no difference between two groups (in section 4) and (ii) quantile estimation to describe the corresponding distributions (in section 5). Discussion and conclusions are in section 6.

RELATED WORK
Most of the research on federated data analysis has focused on algorithmic issues, under the general header of Federated Learning. These works emphasize aspects of computation and communication efficiency, security, and the adjustment of machine learning algorithms to federated settings. Several recent surveys provide good summaries (Yang et al., 2019; Li et al., 2021; Kairouz et al., 2021). Notable examples are McMahan et al. (2016), who presented the "FederatedAveraging" algorithm, which combines local stochastic gradient descent on each client with a server that performs model averaging, and "FedProx", suggested by Li et al. (2019), which also deals with heterogeneous data. Nasirigerdeh et al. (2020) created sPLINK, a system used to conduct genome-wide association studies in a federated manner while respecting privacy. Algorithms such as linear and logistic regression were adjusted to the federated setting using data summaries from different data centers. Duan et al. (2019a, 2019b) presented privacy-preserving distributed algorithms ("ODAL" and "ODAL2") to perform logistic regression. With a focus on efficient communication, they made these one-shot algorithms, i.e., using only one information transfer from each center; by contrast, most algorithms are iterative and require multiple transfers. Liu and Ihler (2014) considered federated maximum likelihood estimation for parameters in exponential family distribution models. Their idea was to combine local maximum likelihood estimates by minimizing the Kullback-Leibler divergence. Their method yields a federated estimator that outperforms any other linear combination in various scenarios and is equivalent to the global MLE when the underlying distribution belongs to the full exponential family.
Some related statistical literature is concerned with distributed computing, in which the data are centralized but so large that calculations are split over multiple servers in parallel to accelerate them. For example, Rosenblatt and Nadler (2016) showed that the estimator from averaging estimates from $m$ servers is as accurate as the centralized solution when the number of parameters $p$ is fixed and the amount of data $n \to \infty$.
Many data sets contain sensitive private information that must legally and ethically remain unexposed. Several measures of privacy have been proposed, one of which is the degree of anonymization, i.e., the extent to which one is able to identify an individual from the records in the data and link the sensitive information to that individual. One well-known criterion for anonymization is K-anonymity (Samarati and Sweeney, 1998). A dataset is K-anonymous if each data item cannot be distinguished from at least $K - 1$ other data items. Fulfilling this criterion introduces fuzziness into the data that makes it less likely to expose a certain individual. Another popular criterion is differential privacy, in which querying a database must not reveal too much information about a specific individual's record in it (Dwork, 2006).
One of the techniques to achieve K-anonymity is generalization. For example, one could release that $K$ patients were between age 10 and 30 instead of releasing the ages of each of these patients. This appears to be the motivation for the privacy rule of the MIP, mentioned earlier, that any data table exported from a member institution for use in federated analysis must have at least 10 subjects in any cell of the table.

THE BINNING ALGORITHM
This section describes a procedure for constructing a K-anonymous federated summary table when two groups are compared with respect to a numerical variable. We denote the groups by $x$ and $y$ and use the terms control and treatment for them. The summary table will have $B$ bins, with the $b$th bin given by $(c_{b-1}, c_b]$, and observation frequencies $f_{bx}$ for the control group and $f_{by}$ for the treatment group. The table preserves K-anonymity in that it is constructed from frequency tables released from the centers in which all cell counts are either 0 or are $\geq K$.
Here is an outline of our table construction process. We proceed sequentially to add information from each center, beginning with the largest center and proceeding in decreasing order of sample size. The initial summary table meets the cell count constraint while attempting to minimize the width of the cells. Data from the other centers are then added, generating new bins if it is possible to do so without violating the privacy constraint. Existing bins are never removed. When cell counts from a new center are strictly between 0 and $K$, neighboring bins are combined and their total count is redistributed among the bins that were combined (see Algorithm 1 for details).

Binning the largest center
The process proceeds (arbitrarily) from small to large values. The first bin is, initially, from $a_0$ to $a_1$, where $a_0$ is the minimal value in the data and $a_1$ is the smallest data value for which $[a_0, a_1]$ has at least $K$ observations from one group and either 0 or at least $K$ observations from the other group. The next tentative bin limit, $a_2$, is found in the same way, looking at the interval $(a_1, a_2]$. This continues so long as a new bin limit can be found. When a limit cannot be found, the number of unbinned data in at least one group is strictly between 0 and $K$. Tentatively extend the upper limit of the previously formed bin to the maximal value of this group as the next limit. The unbinned data from the other group might permit continuation of the process, blocking off new bins in which that group has counts of at least $K$, versus counts of 0 for the first group. When that group has fewer than $K$ unbinned data, replace the last bin limit by the maximal value in the second group. (See Algorithm 2 in the Appendix for details.)
The initial bin boundaries $a_0, \ldots, a_B$ produced by the algorithm above are actual data values and, unless many subjects share the same value, violate the privacy condition. There is a simple fix for $a_1, \ldots, a_{B-1}$. All values in the $j$th bin are $\leq a_j$ and all values in the $(j+1)$st bin are $> a_j$. So we can replace $a_j$ by $c_j = w a_j + (1 - w) v_{j+1}$, where $v_{j+1}$ is the smallest value in the $(j+1)$st bin and $w$ is a uniform random variable on $(0, 1)$. The extreme boundaries $a_0$ and $a_B$ are the minimum and maximum in the data, so a different approach is needed. One option is to take $c_0 = -\infty$ and $c_B = \infty$. Another option is to impose natural limits; for example, if by definition a variable cannot assume negative values, we could choose $c_0 = 0$. A final option is to extend the bin limits by "privacy buffers". To make these reasonably close to the data, we base them on the observed gaps between successive observations in the extreme bin. For example, compute $c_B$ as $a_B + \bar{d}_B$, where $\bar{d}_B$ is the mean difference between consecutive data points in the last bin. (If $\bar{d}_B = 0$, then $c_B = a_B$, but this is now privacy preserving, as all observations in the last bin are equal to one another, with at least $K$ in each group that has data.) Similarly, compute $c_0$ as $a_0 - \bar{d}_1$.
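The boundary masking and the "privacy buffer" option described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `private_limits` and `mean_gap` are hypothetical, the raw interior limits are assumed to have been found already, and the weight is drawn on (0, 1] rather than (0, 1) so that the masked limit always stays strictly below the first value of the next bin.

```python
import bisect
import random

def mean_gap(bin_values):
    """Mean difference between consecutive sorted values in one bin."""
    if len(bin_values) < 2:
        return 0.0
    return (bin_values[-1] - bin_values[0]) / (len(bin_values) - 1)

def private_limits(sorted_values, interior_limits, rng=None):
    """Mask raw bin limits a_1..a_{B-1} and add buffers at both ends.

    sorted_values: pooled observations of the center, sorted ascending.
    interior_limits: raw limits a_1..a_{B-1}, each an actual data value.
    """
    rng = rng or random.Random()
    masked = []
    for a_j in interior_limits:
        # v_next is the smallest observation strictly above a_j
        v_next = sorted_values[bisect.bisect_right(sorted_values, a_j)]
        w = 1.0 - rng.random()  # uniform on (0, 1], an assumption of this sketch
        masked.append(w * a_j + (1.0 - w) * v_next)
    first_bin = [v for v in sorted_values if v <= interior_limits[0]]
    last_bin = [v for v in sorted_values if v > interior_limits[-1]]
    c_0 = sorted_values[0] - mean_gap(first_bin)
    c_B = sorted_values[-1] + mean_gap(last_bin)
    return [c_0] + masked + [c_B]
```

Any value between $a_j$ and $v_{j+1}$ assigns the same observations to the same bins, so the randomization hides the data value without changing the table.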

Joining additional centers
A new algorithm is needed to add the data from a new center, preserving all bin boundaries from the first center. Simply increasing the frequency counts in each current bin is not viable, as the incremental table from the new center will typically not be K-anonymous. Further, the incremental counts for some existing bin might be so large that data from the new center could actually be used to split it into two or more bins.
Algorithm 4 is used to add the information from a new center to an existing summary table. We first iterate over the current bins, creating finer bins if possible. Then we remove any counts that are not K-anonymous by combining and redistributing data from adjacent cells. Pseudo-code for Algorithm 4, and for two algorithms called by it, is given in the Appendix.
Splitting an existing bin into two bins forces us to reallocate the previous frequencies. We do so proportionally to the relative frequencies from the new center. For example, suppose a bin with a current count of 27 for one group is split into two new bins, which have equal counts at the new center. Then we split the 27 equally between the two new bins, adding 13.5 to each. Note that this procedure can result in counts that are not integers.
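The proportional reallocation can be written as a one-line rule. The sketch below uses a hypothetical helper name, `split_counts`, and assumes the new center has at least one observation of the group in the bin being split.

```python
def split_counts(current_count, new_center_counts):
    """Reallocate the count of a bin that is being split into sub-bins.

    current_count: the count already in the summary table for this bin.
    new_center_counts: the new center's counts in each of the sub-bins.
    The shares follow the new center's relative frequencies.
    """
    total = sum(new_center_counts)
    if total <= 0:
        raise ValueError("the new center has no data in this bin")
    return [current_count * c / total for c in new_center_counts]
```

With the numbers from the text, `split_counts(27, [12, 12])` gives 13.5 for each new bin; the value 12 is an arbitrary choice, since only the ratio of the new center's counts matters.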
After creating new bins wherever possible, we iterate again and fix bins where the new center has frequencies strictly between 0 and $K$. Proceeding from bin 1 to bin $B$, these non-private bins are combined with the next bin to the right until all counts from the new center are either 0 or at least $K$. Then the total counts are distributed among the original bins proportionally to the relative frequencies of the bins in the current table. Table 4 in the Appendix shows an example that illustrates how the algorithm works.
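A rough sketch of this combine-and-redistribute step for one group is given below. It is not the Appendix pseudo-code: the function name `fix_new_center_counts` is hypothetical, and the handling of a deficient block at the right end (merging it leftwards) and of blocks with no current counts (equal shares) are assumptions of this sketch.

```python
def fix_new_center_counts(current, new, k):
    """Make the new center's counts K-anonymous, one group at a time.

    current: counts per bin in the existing summary table.
    new: counts per bin from the new center (same bins).
    Returns redistributed new-center counts, possibly fractional.
    """
    blocks = []  # each block is [first_bin, last_bin, new_center_total]
    b = 0
    while b < len(new):
        end, total = b, new[b]
        # grow the block to the right until its total is 0 or at least k
        while 0 < total < k and end + 1 < len(new):
            end += 1
            total += new[end]
        blocks.append([b, end, total])
        b = end + 1
    # assumption: a deficient trailing block is merged into its left neighbor
    while len(blocks) > 1 and 0 < blocks[-1][2] < k:
        last = blocks.pop()
        blocks[-1][1] = last[1]
        blocks[-1][2] += last[2]
    fixed = [0.0] * len(new)
    for first, last, total in blocks:
        weight = sum(current[first:last + 1])
        for i in range(first, last + 1):
            # shares follow the current table; equal shares if it is empty
            share = current[i] / weight if weight > 0 else 1.0 / (last - first + 1)
            fixed[i] = total * share
    return fixed
```

Only block totals, each 0 or at least $K$, are needed from the new center; the split inside a block uses information already in the federated table.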

Joining the last bin
The extreme bin limits $c_0$ and $c_B$ must be compared with the minimum and maximum values, respectively, in the new center. If the new center has a more extreme data value, we need to revise these bin limits. We do so by applying the buffer method that was used to find $c_0$ and $c_B$ in the largest center, but now adding buffers that depend only on the data in the extreme bin from the new center.

TESTING
This section considers the problem of hypothesis testing with federated data, studying the common problem of determining whether numerical outcomes from two groups come from the same distribution (the null hypothesis, $H_0$) or whether one group has larger values than the other. The standard choice is the independent samples t-test, which requires the mean, the standard deviation and the number of observations in each group. All of these are privacy-preserving summary statistics, so the t-test can still be used with federated data. However, the t-test relies on the assumption, often invalid, that the data are normally distributed. We consider here the standard non-parametric alternative, the Mann-Whitney U ("MWU") test (Mann and Whitney, 1947), or, equivalently, the Wilcoxon rank sum test.

The Mann-Whitney U test
The MWU statistic $U$ is based on comparing every observation in one group with every observation in the other. Denote the observations in the two groups by $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$. The variance of $U$ under $H_0$ depends on $N = n + m$, on the number $D$ of distinct values in the data, and on $t_r$, the number of observations that share the $r$th distinct value ($r = 1, \ldots, D$); the second term of the variance corrects for the presence of ties in the data. If $Y \stackrel{d}{=} c + X$, $c \in \mathbb{R}$, the distribution of $U$ is stochastically increasing as a function of $c$. The power of the test depends on $P(Y > X)$ and is high when this probability differs from 0.5.
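For concreteness, here is a minimal sketch of the textbook form of the statistic with the usual tie-corrected null variance. It assumes the convention that $U$ counts the pairs with $Y > X$ and gives tied pairs weight one half; the scaling or centering used in the paper may differ, and the function name `mwu` is hypothetical.

```python
from collections import Counter

def mwu(x, y):
    """Textbook Mann-Whitney U with tie-corrected null variance.

    Returns (U, null mean, null variance), where U counts pairs with
    y > x and each tied pair contributes 1/2 (assumed convention).
    """
    n, m = len(x), len(y)
    u = sum(1.0 if yj > xi else 0.5 if yj == xi else 0.0
            for xi in x for yj in y)
    big_n = n + m
    # t_r: number of observations sharing the r-th distinct value
    tie_term = sum(t ** 3 - t for t in Counter(list(x) + list(y)).values())
    var = (n * m * (big_n + 1) / 12.0
           - n * m * tie_term / (12.0 * big_n * (big_n - 1)))
    return u, n * m / 2.0, var
```

The double loop makes explicit why the pooled statistic needs raw values from both groups side by side, which is exactly what a federated analysis cannot provide across centers.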
The MWU test involves direct comparison of each data point in one group with each data point from the other group. As this includes comparisons of observations from different centers, it is impossible to compute the pooled MWU statistic in a federated analysis. Two broad options are possible for federated analysis.
• Compute the MWU statistic separately for each center and then combine them across centers.
• Generate a federated table summarizing the data from all the centers and then compute the MWU statistic on the federated table.
The next subsections present options for combining center-specific MWU statistics and the second analysis option, used in conjunction with our federated binning algorithm.

Sum of U-statistics
Denote by $U_l$ the MWU statistic from the $l$th center, based on $n_l$ and $m_l$ observations from the two groups, with $N_l = n_l + m_l$, and denote by $V_l$ its variance under $H_0$. A simple way to form a federated test statistic is to sum the individual statistics over the centers and normalize the sum by its standard deviation, leading to $\sum_l U_l / (\sum_l V_l)^{0.5}$ (equation 1).

Weighted average of U-statistics
A simple generalization is to replace the sum of the statistics by a weighted sum, with an optimal choice of weights. It is convenient to do this using the normalized test statistics for each center, $Z_l = U_l / V_l^{0.5}$, and to form the weighted test statistic from them. The weights can be chosen to maximize the power of the test when the null hypothesis is not true; the power depends on the standardized effects $\delta_l$ in the centers. For the MWU statistic, the standardized effect in center $l$ can be expressed in terms of $P_l^+ = P(Y > X)$ and $P_l^- = P(Y < X)$ in center $l$. Although the formula permits the probability difference to vary over centers, the natural basis for defining the weighted sum statistic is to assume a constant difference, in which case the optimal weights depend on the sample sizes and, if present, the extent of tied data. See the Appendix (Maximizing power for the weighted MWU statistic) for the derivation of the weights.
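One standard way to form a normalized weighted sum is a Stouffer-type combination, sketched below. This is an illustration, not necessarily the paper's exact formula: it assumes that the $Z_l$ are independent and approximately normal with mean $\delta_l$ and unit variance, and the names `weighted_z` and `expected_shift` are hypothetical.

```python
from math import sqrt

def weighted_z(z, w):
    """Weighted sum of center statistics, scaled to unit null variance."""
    return sum(wi * zi for wi, zi in zip(w, z)) / sqrt(sum(wi * wi for wi in w))

def expected_shift(delta, w):
    """Mean of the weighted statistic when Z_l has mean delta_l.

    Power grows with this shift. By the Cauchy-Schwarz inequality it is
    largest when the weights are proportional to the standardized effects.
    """
    return weighted_z(delta, w)
```

For example, with `delta = [1, 2, 3]` the shift is `sqrt(14)` for weights proportional to `delta` and only `6 / sqrt(3)` for equal weights, which is the sense in which the sum test loses power when centers differ in size.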

Fisher's method
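Fisher's method (Fisher, 1932) combines the p-values from independent samples. The corresponding statistic is $-2 \sum_l \log p_l$, where $p_l$ is the p-value from the MWU test result in the $l$th center. Under $H_0$, with independent centers, the statistic has a chi-square distribution with twice as many degrees of freedom as there are centers.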
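A short sketch of Fisher's combination using only the standard library follows. The function name `fisher_combined_p` is hypothetical; the closed-form tail probability used here is valid because the chi-square degrees of freedom are even.

```python
from math import exp, factorial, log

def fisher_combined_p(p_values):
    """Fisher's statistic -2 * sum(log p_l) and its combined p-value.

    For k independent p-values the statistic is chi-square with 2k
    degrees of freedom under the null hypothesis. For even degrees of
    freedom the upper tail is exp(-s/2) * sum_{j<k} (s/2)^j / j!.
    """
    stat = -2.0 * sum(log(p) for p in p_values)
    half, k = stat / 2.0, len(p_values)
    p_combined = exp(-half) * sum(half ** j / factorial(j) for j in range(k))
    return stat, p_combined
```

Because only the size of each $p_l$ enters, two centers with strong effects in opposite directions still produce a small combined p-value.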

Federated table MWU statistic
We can compute the MWU statistic from the federated summary table generated by the algorithm described in section 3. The table will have $B$ bins whose frequencies are $f^x_i$ and $f^y_i$. The frequencies sum to the total amount of data over all the centers, but need not be integers.
The MWU statistic for the federated table, $U_{fed}$, compares observations on the basis of their bins, whose endpoints are $c_0 < c_1 < \cdots < c_B$: an observation in a higher bin is counted as larger than an observation in a lower bin, and observations in the same bin are treated as tied. The variance of $U_{fed}$ can be computed from the formula in section 4.1, keeping in mind that all observations in the same bin are tied.
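A minimal sketch of a bin-based MWU computation is shown below. It assumes the textbook convention in which $U$ counts pairs with the treatment value larger and tied pairs count one half; the paper's own scaling may differ, and the function name `binned_mwu` is hypothetical.

```python
def binned_mwu(fx, fy):
    """MWU statistic from a two-group frequency table.

    fx, fy: frequencies (possibly fractional) of the control and the
    treatment group in each bin, ordered from the lowest bin upward.
    Returns (U, null mean, null variance); every bin is one tie group.
    """
    n, m = sum(fx), sum(fy)
    u, below = 0.0, 0.0
    for fx_b, fy_b in zip(fx, fy):
        # treatment values in this bin exceed all control values in
        # lower bins and tie with the control values in the same bin
        u += fy_b * (below + 0.5 * fx_b)
        below += fx_b
    big_n = n + m
    tie_term = sum((a + b) ** 3 - (a + b) for a, b in zip(fx, fy))
    var = (n * m * (big_n + 1) / 12.0
           - n * m * tie_term / (12.0 * big_n * (big_n - 1)))
    return u, n * m / 2.0, var
```

The computation needs only the table, so it can be run at the federated node after the binning step without any further calls to the centers.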

Comparison of the tests
A simulation study was used to compare the different federated MWU tests to an analysis of the combined data. Our goals are to assess how the federated analysis affects the power of the tests, and to use the power analysis to compare the testing methods. We also vary the simulation settings to examine how the results and comparisons are affected by the number of centers in the study and by heterogeneity across centers.
We simulated situations with 1500 observations in each group, divided over 3, 5 or 10 centers, with the number of observations unbalanced among the centers (see Table 1).
The observations at center $l$ were generated as $X_{il} = \alpha_l + \epsilon_{il}$ for the control group and $Y_{jl} = \alpha_l + \beta_l + \epsilon_{jl}$ for the treatment group. The possibility that centers may differ from one another is represented by $\alpha_l \sim N(0, \sigma_\alpha^2)$. The difference between treatment and control at center $l$ is $\beta_l \sim N(\delta, \sigma_\beta^2)$, where $\delta$ is the overall difference and $\sigma_\beta$ represents heterogeneity of the treatment effect across centers. The terms $\epsilon_{il}, \epsilon_{jl} \sim N(0, 1)$ are random errors. All random variables are independent of one another.
We simulated experiments with several different combinations of input parameters. We chose $\sigma_\alpha \in \{0, 0.1, 0.2\}$ and $\sigma_\beta \in \{0, 0.05, 0.06\}$ to achieve between-center variance, and $\delta \in \{0, 0.05, 0.1\}$. Including $\delta = 0$ allowed us to verify that the tests remain reliable when both groups have the same mean. Note, however, that the variance is slightly larger for the treatment group if $\sigma_\beta > 0$, so that this setting does not fully match the null hypothesis of identical distributions. We found that actual type 1 errors are then inflated from their nominal values. The fraction of p-values below 0.05 (0.01) was approximately 0.08 (0.025). The inflation was slightly weaker when more centers were included and was slightly larger only for Fisher's test. The additional bias of Fisher's test is not surprising, as it is sensitive to the existence of an effect within a center, but not to having a consistent direction of the effect.
Figure 1 compares methods when $\delta \neq 0$ across different parameters and numbers of centers. See also Supplementary Table S1. The federated table and weighted tests have p-value distributions that are very similar to those from combining all the data, indicating almost no loss of power. The sum test has higher p-values, hence consistently lower power. The p-values with Fisher's method are a bit higher when the treatment effect is consistent across centers ($\sigma_\beta = 0$). When the effect is not consistent, they are lower. However, as already seen, Fisher's test in this case fails to preserve type 1 error, with a bias toward low values.
Figure 2 focuses on how closely the federated test results compare with those from the combined test (i.e., using the full data) by comparing the p-values of each method on the same simulated data set. The Y axis presents $\log(p_{iv}/p_{is})$, where $i$ represents the simulation number, $s$ is the combined test and $v$ is the federated test.
Across all the settings, the weighted test most closely replicates the p-value of the combined test. The federated table is also similar, but more variable, especially when $\delta \neq 0$. In the top left panel, where $H_0$ is true, all methods are similar to the combined test. However, adding treatment heterogeneity (top right panel) induces negative bias in the p-values from Fisher's test and increases the variance of the log ratio for that test and for the sum. In all the settings with center heterogeneity ($\sigma_\alpha > 0$), the sum test typically gave slightly higher p-values than the combined test, hence had lower power.

To assess the power of the tests as a function of the effect size, we measured the p-values over a set of 4 increasing values of $\delta$, when $\sigma_\alpha = 0.1$ and $\sigma_\beta = 0.05$. Figure 3 compares the methods to the unconstrained test using $\log(p_{iv}/p_{is})$ (Y axis), where $i$ represents the simulation number, $s$ is the unconstrained method and $v$ is the other method. Again the weighted test is most similar to the combined test, followed by the federated table. Table ?? shows quantiles of the p-value distributions with 10 centers. The quantiles for the weighted test are consistently the lowest ones; with even the modest heterogeneity present here, they are lower even than those for the combined test. Similar quantiles were found for 3 and for 5 centers, indicating that, for the settings we examined, the number of centers has little effect on power.

ESTIMATION
This section considers the problem of quantile estimation when data are located in different centers. Quantile estimates are valuable for directing visual summaries of data distributions such as histograms or Kaplan-Meier plots. Standard methods for computing sample quantiles cannot be used, as they begin by ordering all the data, violating privacy. We propose and compare several methods for federated quantile estimation. Throughout we denote by $F(x)$ the CDF and by $Q_p = F^{-1}(p)$ the $p$th quantile of the distribution.

Federated estimates using the quantile loss
A quantile can be estimated as the solution to a minimization problem,
$$\hat{Q}_{p,Loss} = \arg\min_q \left[ (p - 1) \sum_l \sum_{i:\, x_{il} < q} (x_{il} - q) + p \sum_l \sum_{i:\, x_{il} \geq q} (x_{il} - q) \right],$$
where the target function is the quantile loss function and $x_{il}$ is observation $i$ at center $l$. The optimization can be carried out on federated data by returning function and gradient values from each center, proceeding iteratively to compute $\hat{Q}_{p,Loss}$. The need for an iterative algorithm to minimize the loss has the drawback of communication inefficiency.
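The iterative scheme can be sketched as a bisection on the sign of the summed subgradients. This is an illustrative choice of optimizer, not necessarily the one used in the paper; the function names and the search interval `[lo, hi]` are assumptions, and each loop iteration stands for one communication round with every center.

```python
import numpy as np

def center_gradient(x, q, p):
    """Subgradient of one center's quantile loss at q.

    This is the only quantity the center returns in this sketch.
    """
    x = np.asarray(x, dtype=float)
    below = np.sum(x < q)
    return (1.0 - p) * below - p * (x.size - below)

def federated_quantile(centers, p, lo, hi, tol=1e-8):
    """Minimize the summed quantile loss by bisection on its subgradient.

    centers: list of per-center data arrays (accessed only through
    center_gradient). lo, hi: an interval assumed to contain the quantile.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g = sum(center_gradient(x, mid, p) for x in centers)
        if g < 0:      # loss still decreasing: the minimizer is to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Each returned subgradient is a function of the number of local observations below `q`, so a sequence of calls traces out the local empirical CDF.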
A more serious concern is that the quantile loss compromises privacy. The loss function within each center is piecewise linear with a change in derivative at each data value in the center. Thus the information from a collection of calls can be used to recover the original data values at the federated node.
Despite the privacy violation, we will include $\hat{Q}_{p,Loss}$ in the subsequent comparisons as a benchmark.
It is possible to exploit the loss function to compute approximate quantile estimators that are differentially private (?).

Estimating quantiles from the federated data using the Yeo-Johnson transformation
The binning algorithm we introduced in Section 3 can be used to compute a federated estimate of $Q_p$ that is K-anonymous. A naive estimate is the smallest bin limit with cumulative frequency greater than $100p\%$ of the data. However, restricting the estimate of $Q_p$ to the set of bin limits is an obvious drawback, especially for quantiles in the tails of the distribution. A simple improvement is to interpolate the estimated CDF from one bin limit to the next. Linear interpolation corresponds to the assumption of a uniform distribution within each bin. That may be reasonable for bins in the center of the data. However, it is not likely to work well in the tails, especially in the most extreme bins. We did attempt to use linear interpolation, but the results were poor and are not reported here.
We propose here a more sophisticated interpolation method based on the Yeo-Johnson transformation ("YJ") (Yeo and Johnson, 2000), a power transformation used to achieve a distribution that is closer to the normal. The approach extends the well-known Box-Cox transformation (Box and Cox, 1964) to also handle variables that can take on negative values. The transformation is defined by
$$h_\lambda(x) = \begin{cases} ((x + 1)^\lambda - 1)/\lambda & \text{if } \lambda \neq 0 \text{ and } x \geq 0 \\ \log(x + 1) & \text{if } \lambda = 0 \text{ and } x \geq 0 \\ -((-x + 1)^{2 - \lambda} - 1)/(2 - \lambda) & \text{if } \lambda \neq 2 \text{ and } x < 0 \\ -\log(-x + 1) & \text{if } \lambda = 2 \text{ and } x < 0 \end{cases} \qquad (5)$$

YJ Table method
In this method the goal is to find values of $\lambda$, $a_0$ and $a_1$ for which the transformed bin limits approximately match a normal distribution with mean $a_0$ and standard deviation $a_1$, that is, $h_\lambda(b_k) \approx a_0 + a_1 \Phi^{-1}(\hat{F}(b_k))$, where $b_k$ is a bin limit, $\hat{F}$ is the estimator of the distribution function from the federated table, $\Phi$ is the standard normal CDF and $h_\lambda(x)$ is the YJ transformation (Yeo and Johnson, 2000). The quantile $Q_p$ is then estimated by applying the inverse transformation to the matching normal quantile, $h_\lambda^{-1}(a_0 + a_1 \Phi^{-1}(p))$. Given $\lambda$, we can compute $a_0$, $a_1$ using linear regression. To estimate $\lambda$, we use the idea that an effective transformation $h_\lambda$ should have transformed quantiles that are linearly related to the YJ-estimated quantiles. This can be achieved by choosing $\lambda$ to maximize the correlation between them, where the values of $X$ we use are the interior bin limits $b_1, \ldots, b_{K-1}$.
Note that the range of the transformation in equation 5, which is the domain of its inverse, is $(-\infty, -1/\lambda)$ if $\lambda < 0$, all of $\mathbb{R}$ if $0 \leq \lambda \leq 2$, and $(1/(2 - \lambda), \infty)$ if $\lambda > 2$. To ensure that the inverse transformation is defined on all of $\mathbb{R}$, we set the constraint $0 \leq \lambda \leq 2$ for equation 8.
The YJ Table method is a "one pass" algorithm, calling the data only to produce the federated summary table. Thus it enjoys full communication efficiency.
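The table-based procedure can be sketched as a grid search over $\lambda$ followed by a straight-line fit. The sketch below is one plausible reading of the method, not the authors' code: the function names, the grid of 201 values on $[0, 2]$, and the use of the correlation between transformed bin limits and standard normal quantiles are all assumptions.

```python
import numpy as np
from statistics import NormalDist

def yj(x, lam):
    """Yeo-Johnson transformation h_lambda(x), elementwise."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    neg = ~pos
    out[pos] = (np.log1p(x[pos]) if abs(lam) < 1e-12
                else ((x[pos] + 1.0) ** lam - 1.0) / lam)
    out[neg] = (-np.log1p(-x[neg]) if abs(lam - 2.0) < 1e-12
                else -((1.0 - x[neg]) ** (2.0 - lam) - 1.0) / (2.0 - lam))
    return out

def yj_inverse(y, lam):
    """Inverse of h_lambda, valid on the whole real line for 0 <= lam <= 2."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    out = np.empty_like(y)
    pos = y >= 0
    neg = ~pos
    out[pos] = (np.expm1(y[pos]) if abs(lam) < 1e-12
                else (lam * y[pos] + 1.0) ** (1.0 / lam) - 1.0)
    out[neg] = (-np.expm1(-y[neg]) if abs(lam - 2.0) < 1e-12
                else 1.0 - (1.0 - (2.0 - lam) * y[neg]) ** (1.0 / (2.0 - lam)))
    return out

def yj_table_quantile(limits, cum_prob, p, lam_grid=None):
    """Estimate Q_p from interior bin limits and the table CDF at them."""
    if lam_grid is None:
        lam_grid = np.linspace(0.0, 2.0, 201)   # assumed grid
    std = NormalDist()
    z = np.array([std.inv_cdf(c) for c in cum_prob])
    # pick the lambda whose transformed limits are most linear in z
    best = max(lam_grid, key=lambda lam: np.corrcoef(z, yj(limits, lam))[0, 1])
    a1, a0 = np.polyfit(z, yj(limits, best), 1)  # slope, intercept
    target = np.array([a0 + a1 * std.inv_cdf(p)])
    return float(yj_inverse(target, best)[0])
```

Everything here runs at the federated node on the summary table alone, which is what makes the method one pass.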

YJ Likelihood method
The parameters in the YJ transformation can also be estimated by maximum likelihood. Denoting by $x_{il}$ the observations from center $l$ and by $N$ the total number of observations, the log likelihood is, up to an additive constant,
$$\ell(\lambda) = -\frac{N}{2} \log \hat{\sigma}_\lambda^2 + (\lambda - 1) \sum_l \sum_i \operatorname{sign}(x_{il}) \log(|x_{il}| + 1),$$
where
$$\hat{\mu}_\lambda = \frac{1}{N} \sum_l \sum_i h_\lambda(x_{il}), \qquad \hat{\sigma}_\lambda^2 = \frac{1}{N} \sum_l \sum_i \left( h_\lambda(x_{il}) - \hat{\mu}_\lambda \right)^2.$$
For a fixed value of $\lambda$, the log likelihood requires only summary statistics from each center, so it can be computed in a federated manner. This can be embedded in a simple optimization routine that maximizes the log likelihood over $\lambda$.
As with the quantile loss, the YJ likelihood method employs an iterative algorithm, and thus is not communication efficient. However, unlike the quantile loss, the YJ log likelihood for each center is not a simple function of the data that can be immediately inverted to recover data values. Thus the privacy violations of the quantile loss do not occur here.
Once we have the estimate $\hat{\lambda}$, we can again use summary statistics from the centers to compute $\hat{\mu}_{\hat{\lambda}}$ and $\hat{\sigma}_{\hat{\lambda}}$. The resulting quantile estimator is $h_{\hat{\lambda}}^{-1}(\hat{\mu}_{\hat{\lambda}} + \hat{\sigma}_{\hat{\lambda}} \Phi^{-1}(p))$, where $\Phi$ is the standard normal CDF. The likelihood maximization is iterative, so it requires multiple communication steps with each center. By contrast, the methods based on the federated table are "one pass", requiring just one call to each center. This communication inefficiency of the maximum likelihood method can be reduced by submitting to each center a grid of possible $\lambda$ values. The centers then return the moments needed to compute the log likelihood for each value in the grid. The resulting estimate of $\lambda$ can either be the best value among those in the grid or the maximizer of an empirical fit to the relationship between the log likelihood and $\lambda$. The result is an approximate, one pass MLE.
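The grid version of the likelihood method can be sketched as follows. The code assumes the standard Yeo-Johnson profile log likelihood (up to an additive constant) and takes the best grid value as the estimate; the function names and the tuple returned by each center are hypothetical.

```python
import numpy as np

def yj(x, lam):
    """Yeo-Johnson transformation h_lambda(x), elementwise."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    neg = ~pos
    out[pos] = (np.log1p(x[pos]) if abs(lam) < 1e-12
                else ((x[pos] + 1.0) ** lam - 1.0) / lam)
    out[neg] = (-np.log1p(-x[neg]) if abs(lam - 2.0) < 1e-12
                else -((1.0 - x[neg]) ** (2.0 - lam) - 1.0) / (2.0 - lam))
    return out

def center_summaries(x, lam_grid):
    """Runs inside one center and returns sums only, never raw values."""
    x = np.asarray(x, dtype=float)
    h = np.array([yj(x, lam) for lam in lam_grid])  # one row per lambda
    jac = float(np.sum(np.sign(x) * np.log1p(np.abs(x))))
    return x.size, h.sum(axis=1), (h ** 2).sum(axis=1), jac

def federated_yj_mle(summaries, lam_grid):
    """Combine center summaries and maximize the log likelihood on the grid.

    Returns (lambda_hat, mu_hat, sigma_hat) on the transformed scale.
    """
    lam_grid = np.asarray(lam_grid, dtype=float)
    n = sum(s[0] for s in summaries)
    s1 = sum(s[1] for s in summaries)
    s2 = sum(s[2] for s in summaries)
    jac = sum(s[3] for s in summaries)
    mu = s1 / n
    var = s2 / n - mu ** 2
    loglik = -0.5 * n * np.log(var) + (lam_grid - 1.0) * jac
    best = int(np.argmax(loglik))
    return float(lam_grid[best]), float(mu[best]), float(np.sqrt(var[best]))
```

Because the summaries are plain sums, the federated result coincides with the one computed from the pooled data, up to rounding.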

Constructing summary tables from quantile estimates
Federated quantile estimates can be used to generate an alternative summary table, which presents a collection of quantiles. See table ?? for an example, with estimates from optimizing the quantile loss and the YJ likelihood.

Simulation Results
We compared the three quantile estimators using a simulation configuration similar to that in the testing section. As the quantiles are univariate summaries, we generated data and estimated quantiles only in one group. Another important difference is that the form of the underlying distribution affects the estimation results. In particular, methods may vary when faced with long rather than short tails. To gain insight into this issue, we chose the Gamma as the base distribution for assessing the quality of quantile estimation.
Each simulated data set included 1500 observations, spread across 3, 5 or 10 centers exactly as described in Table 1. The observations were generated from the following model: $x_{il} = \epsilon_{il} \exp(\alpha_l)$, where $x_{il}$ is observation $i$ at center $l$, with $\alpha_l \sim N(0, \sigma_\alpha^2)$ and $\epsilon_{il} \sim \mathrm{Gamma}(r, 1)$, $r \in \{4, 10\}$. The skewness of the Gamma distribution is $2/\sqrt{r}$, so the smaller value of $r$ has a longer right tail.
For the Gamma data, heterogeneity across centers was induced using scale rather than location shifts. The value of $\sigma_\alpha$ was chosen to achieve between-center heterogeneity similar in extent to that in section 4. There the key term was the ratio $\sigma_\alpha / \sigma_\epsilon$, which was taken to be 0, 0.1 or 0.2. With Gamma data, the standard deviation of the homogeneous data is proportional to the median, so the analogous choice is to set $\sigma_\alpha$ through a relative heterogeneity parameter $\phi$, with $\phi$ similar to the values chosen above. We used only $\phi = 0.1$ in our simulations for quantile estimation.
For each combination of the parameters, 2000 simulations were run. The true quantile $Q_p$ for each simulation was computed from the mixture (over centers) distribution by solving the equation below, with $l$ as the center index:
$$\sum_l \frac{n_l}{N} \Gamma_r\left( Q_p \exp(-\alpha_l) \right) = p,$$
where $N$ is the number of observations from all centers, $n_l$ is the number of observations in center $l$ and $\Gamma_r$ is the standard Gamma CDF with shape parameter $r$. A dominant part of the quantile estimation errors is the natural variability of the underlying Gamma distribution. As the standard deviation for Gamma($r$, 1) is $\sqrt{r}$, we summarized results via the normalized estimation error $(\hat{Q}_p - Q_p)/\sqrt{r}$, where $\hat{Q}_p$ is the estimator of $Q_p$.
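Solving for the true mixture quantile is a one-dimensional root-finding problem. The sketch below uses bisection and, to stay within the standard library, the closed-form Gamma CDF for integer shape (the Erlang case), which covers $r \in \{4, 10\}$; the function names are hypothetical.

```python
import math

def erlang_cdf(x, r):
    """CDF of Gamma(r, 1) for integer shape r."""
    if x <= 0:
        return 0.0
    return 1.0 - math.exp(-x) * sum(x ** k / math.factorial(k) for k in range(r))

def mixture_quantile(p, r, alphas, sizes, tol=1e-10):
    """Solve sum_l (n_l / N) * Gamma_r(q * exp(-alpha_l)) = p for q.

    alphas: center effects alpha_l; sizes: center sample sizes n_l.
    """
    n_total = sum(sizes)

    def cdf(q):
        return sum(n / n_total * erlang_cdf(q * math.exp(-a), r)
                   for a, n in zip(alphas, sizes))

    lo, hi = 0.0, 1.0
    while cdf(hi) < p:          # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol:        # bisection on the increasing mixture CDF
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The scale factor $\exp(-\alpha_l)$ inside the CDF reflects the model $x_{il} = \epsilon_{il} \exp(\alpha_l)$, in which centers differ by scale rather than location.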
The simulation results for estimating $Q_{0.98}$ are shown in Figure ??. This quantile is presented separately, as it is the most challenging case, in the right tail of a right-skewed distribution. Results for $Q_{0.02}$, $Q_{0.25}$, $Q_{0.5}$ and $Q_{0.75}$ are depicted in Figure ??. Further detail is provided in Tables S.2, S.3, S.4 and S.5 (supplementary file), which give, respectively, the estimated bias and standard deviation, the mean squared error (MSE), and the ratio of squared bias to variance for all the methods and all the quantiles.
The YJ data estimator achieved lower MSE than the quantile loss estimator. For the extreme quantiles, the decrease in MSE ranged from 14% to 44%. The "one pass" YJ table estimator was very accurate for estimating the median and the quartiles, but lost efficiency for the extreme quantiles with the more skewed of the two Gamma distributions and when the number of centers was large. In that setting, the estimator for $Q_{0.02}$ suffered from negative bias and its MSE was almost 3 times as large as for the quantile loss estimator; the MSE for $Q_{0.98}$ was about 80% larger.
For the settings we studied, variance was the dominant component of MSE. Bias was a substantial problem only in a small number of cases. The YJ methods had large positive bias for $Q_{0.98}$ when $r = 4$; however, when $r = 10$, and the distribution is itself closer to normal, the bias was negligible.

SUMMARY
In this work we presented novel methods for federated data analysis and investigated their statistical properties.We proposed a simple algorithm for creating K-anonymous data tables in one-and two-group problems and we compared federated approaches for the nonparametric Mann-Whitney U (MWU)test and for estimating quantiles.Our federated data table is created in a "one pass" format, so that it is communication efficient.
For the MWU test, we found that the most powerful method was the weighted average of the MWU statistics from the individual centers, with weights reflecting the sample sizes.This statistic is also communication efficient, gives very similar p-values to those from the combined data and has the advantage of adjusting for inter-center heterogeneity, effectively treating each center as a block.The test based on our federated table was less effective.However, the p-value distributions from the table were only slightly worse than those from the combined data and the weighted average, indicating only a small loss of statistical power.
The fully optimized YJ method consistently had the lowest MSE of the methods we compared. For the extreme quantiles, it improved by 14% to 44% over the quantile loss estimator. The "one pass" YJ table estimator had almost identical MSE for estimating the median and the quartiles, but lost efficiency for the extreme quantiles when the number of centers was large. The increase in MSE was more substantial (almost 80%) with the more skewed of the two Gamma distributions we studied. This is not surprising: our method exploits a transformation to normality and is less successful when the distribution is further from the normal.
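To make the idea of estimating a quantile through a transformation to normality concrete, here is a minimal sketch using the standard Yeo-Johnson transformation and its inverse. It treats the parameter `lam` as already known and estimates the mean and standard deviation by simple moments; it does not reproduce how the parameters are fitted in the methods compared above, and the function names are hypothetical.

```python
from math import exp, log
from statistics import NormalDist, fmean, pstdev


def yeo_johnson(y, lam):
    """Standard Yeo-Johnson transformation of a single value."""
    if y >= 0:
        return log(y + 1) if lam == 0 else ((y + 1) ** lam - 1) / lam
    return -log(1 - y) if lam == 2 else -((1 - y) ** (2 - lam) - 1) / (2 - lam)


def yeo_johnson_inverse(t, lam):
    """Inverse transformation (no guard for values outside its range)."""
    if t >= 0:
        return exp(t) - 1 if lam == 0 else (lam * t + 1) ** (1 / lam) - 1
    return 1 - exp(-t) if lam == 2 else 1 - (1 - (2 - lam) * t) ** (1 / (2 - lam))


def yj_quantile(data, p, lam):
    """Quantile estimate: normal quantile on the transformed scale, mapped back."""
    transformed = [yeo_johnson(v, lam) for v in data]
    mu, sigma = fmean(transformed), pstdev(transformed)
    z = NormalDist().inv_cdf(p)
    return yeo_johnson_inverse(mu + sigma * z, lam)


# Hypothetical right-skewed sample and an assumed value of lam.
q98 = yj_quantile([0.3, 0.8, 1.1, 1.9, 2.4, 3.7, 6.2], p=0.98, lam=0.5)
```

If the transformed data are far from normal, the normal quantile on the transformed scale is a poor approximation, and the error is largest in the tails; this matches the loss of efficiency seen for the extreme quantiles.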
Research on federated data analysis should address statistical efficiency and not just algorithmic efficiency. Our work opens this avenue, but much more could be done. Here are some examples. Our method creates the bins in a way fitted to a one-dimensional variable; the construction of a federated summary table could be extended to multiple variables and to higher dimensions. This would be needed, for example, to produce a federated analogue of a scatter plot. Our findings suggest that heterogeneity can harm the federated analysis. Methods are needed to identify heterogeneity and to account for it in the analysis. The investigation of quantile estimators could be extended to a wider class of distributions. Our implementation of the YJ method applies a single transformation to the distribution. For quantiles in the

    end if
end for
f1_student = fix_student_frequencies(f1_teacher, f1_student)
f2_student = fix_student_frequencies(f2_teacher, f2_student)
Output {new_bins, f1_student + f1_teacher, f2_student + f2_teacher} ▷ + here means element-wise sum

Maximizing power for the weighted MWU statistic
The power expression for the weighted MWU statistic is derived below. Equation (1) follows because the vector most correlated with δ is δ itself.
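The last step can be sketched as follows, under assumptions that are ours and may differ from the full derivation: the standardized center statistics $Z_l$ are independent and approximately $N(\delta_l, 1)$, $w = (w_1, \dots, w_L)$ is the weight vector and $\delta = (\delta_1, \dots, \delta_L)$ collects the center-level shifts. The normalized weighted statistic then has unit variance, and its mean is bounded by the Cauchy-Schwarz inequality:

$$
T_w = \frac{\sum_{l} w_l Z_l}{\sqrt{\sum_{l} w_l^2}} \sim N\!\left(\frac{w^\top \delta}{\lVert w \rVert},\, 1\right),
\qquad
\frac{w^\top \delta}{\lVert w \rVert} \le \lVert \delta \rVert,
$$

with equality if and only if $w$ is a positive multiple of $\delta$. Under these assumptions the power of a one-sided test based on $T_w$ is increasing in its mean, so it is maximized by weights proportional to $\delta$.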

Illustration of federated table construction
The example in Table 2 illustrates the algorithm with $K = 10$. In the first table we see the current bins with limits 0 < 2 < 4 < 6 < 8 and columns $f_x$, $f_y$ from binning the data at the new center with those limits.
In the top panel of Table 2, the new center has at least 20 observations between 4 and 6 in both groups, so this bin can potentially be divided into two new bins with privacy-preserving counts that are closer to 10. In this case we apply Algorithm 1 to the observations from the new center and succeed in creating new bins, as illustrated in the middle table of Table 2. The new bin limit is found from the new center's data, in exactly the same way as the bin limits were found for the first center. Splitting (4, 6] into two bins forces us to reallocate the previous frequencies, 10 and 14. We do so using the relative frequencies of the new center. The relative frequencies are $10/20 = 1/2$ for each bin, so we allot $10 \cdot 1/2 = 5$ to both new bins in group x and $14 \cdot 1/2 = 7$ to both new bins in group y.

Table 7. Ratio: the estimated ratio between the squared bias and the variance for the quantile estimators across the different simulation settings.

Figure S1 (in the supplementary file) shows the distributions of p-values for all the tests in the null setting $\delta = 0$. The left panel includes heterogeneity across centers ($\sigma_\alpha = 0.1$), but no effect heterogeneity, and shows a uniform distribution for all the tests, as desired. The right panel adds a small amount of effect heterogeneity ($\sigma_\beta = 0.05$). This results in a slightly wider spread of p-values for all the tests, so

Figure 1. Comparison of methods across different parameters and number of centers. Panels represent the number of centers, the Y-axis presents the p-values and the X-axis the parameters ($\delta$, $\sigma_\alpha$, $\sigma_\beta$). The different methods are color-coded.

Figure 2. The figure shows $\log(p_{iv}/p_{is})$ on the Y-axis, where $s$ is the combined data analysis. The panels correspond to the different parameter settings for ($\delta$, $\sigma_\alpha$, $\sigma_\beta$). The number of centers is on the X-axis and the methods are color-coded.

4.6 Comparison of the tests

Figure 3. Comparison of methods across different parameters and number of centers. Rows correspond to the number of centers and segments within rows represent hyperparameter configurations. The Y-axis represents the p-values. The methods are color-coded.
Binning the largest center

Table 1. Observations per center

For the Mann-Whitney test, only the order of the observations is important, so any distribution can be used to simulate the data. Our model generates control group observations at center $l$ as $x_{il} = \epsilon_{il} + \alpha_l$ and treatment group observations as

5.2 Estimating quantiles from the federated data using the Yeo-Johnson transformation

Here we apply the single group version of the algorithm, which gives a summary table that has $B$ bins with endpoints $b_0 < b_1 < b_2 < \cdots < b_B$ and frequencies $f_{x,k}$. Let $F_{x,i}$ denote the cumulative distribution for the federated table at $b_i$.
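The cumulative distribution of a binned table, and a quantile read off from it, can be sketched as below. The quantile here uses plain linear interpolation within a bin, which is a simple baseline chosen for illustration and not the Yeo-Johnson table estimator; the function names and the example table are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def table_cdf(freqs):
    """Cumulative proportions at the upper endpoints b_1, ..., b_B."""
    total = sum(freqs)
    return [c / total for c in accumulate(freqs)]


def interpolated_quantile(endpoints, freqs, p):
    """Quantile by linear interpolation in the bin where the CDF reaches p."""
    cdf = [0.0] + table_cdf(freqs)      # values at b_0, b_1, ..., b_B
    i = max(bisect_left(cdf, p), 1)     # smallest i >= 1 with cdf[i] >= p
    lo, hi = cdf[i - 1], cdf[i]
    frac = 0.0 if hi == lo else (p - lo) / (hi - lo)
    return endpoints[i - 1] + frac * (endpoints[i] - endpoints[i - 1])


# Hypothetical table: endpoints b_0..b_4 and the four bin frequencies.
endpoints = [0, 2, 4, 6, 8]
freqs = [10, 20, 20, 10]
median = interpolated_quantile(endpoints, freqs, 0.5)
```

The frequencies need not be integers, so the same code applies to a federated table whose cells contain reallocated counts.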

Table 2. Illustration of the algorithm for joining a new center

After creating new bins wherever possible, we iterate again and fix bins containing non-private frequencies, that is, non-zero counts that are less than $K$. This is illustrated in Table 2. The middle table, obtained after splitting the bin (4, 6], contains non-private counts from the new center: 3 and 7 in (0, 2] and (2, 4] from x, and 4 in (0, 2] from y. This is solved in the bottom table of Table 2 by summing the frequencies in both cells and then using the relative frequencies from the current table to reallocate them. Note that the "count" added to an existing cell will not necessarily be an integer.
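The two reallocation steps in this illustration can be sketched as follows. The first function reproduces the arithmetic for splitting the bin (4, 6]; the second pools the new center's counts over a chosen set of cells and spreads them using the current table's relative frequencies. The function names are hypothetical, the cells to pool are passed in by the caller, and the current-table counts 12 and 18 in the example are invented for illustration because the table itself is not reproduced here.

```python
def split_previous_count(previous_count, new_center_counts):
    """Allot a previous bin count to sub-bins in proportion to the new center's counts."""
    total = sum(new_center_counts)
    return [previous_count * c / total for c in new_center_counts]


def pool_and_reallocate(new_counts, current_counts, cells):
    """Pool new-center counts over `cells` and reallocate by current-table proportions."""
    pooled = sum(new_counts[i] for i in cells)
    base = sum(current_counts[i] for i in cells)
    fixed = list(new_counts)
    for i in cells:
        fixed[i] = pooled * current_counts[i] / base
    return fixed


# Splitting (4, 6]: the new center has 10 observations in each sub-bin.
x_split = split_previous_count(10, [10, 10])   # previous count in group x
y_split = split_previous_count(14, [10, 10])   # previous count in group y

# Fixing the non-private counts 3 and 7 in group x; the current-table
# counts 12 and 18 are hypothetical.
x_fixed = pool_and_reallocate([3, 7], [12, 18], cells=[0, 1])
```

The reallocated values are what get added to the existing cells, which is why the resulting table entries can be non-integer.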

Table 4. Bias: the estimated bias of the quantile estimators across the different simulation settings.

Table 5. SD: the estimated standard deviation of the various quantile estimators across the different simulation settings.

Table 6. Error: the estimated MSE of the various quantile estimators across the different simulation settings.