HYPOTHESIS AND THEORY article

Front. Artif. Intell., 24 February 2023
Sec. Machine Learning and Artificial Intelligence
Volume 6 - 2023 | https://doi.org/10.3389/frai.2023.1144886

Optimal blending of multiple independent prediction models

Peter Taraba*

  • Independent Researcher, Fort Lauderdale, FL, United States

We derive the blending coefficients for the optimal blend of multiple independent prediction models with normal (Gaussian) error distributions, as well as the variance of the final blend. We also provide lower and upper bound estimates for the final variance, and we compare these results with machine learning with counts, where only binary information is used for every feature (the feature says yes or no only) and the majority of features agreeing together makes the decision.

Introduction

Participants of the Netflix competition used model blending heavily; refer, for example, to Töscher et al. (2009), Amatriain (2013), Xiang and Yang (2009), Coscrato et al. (2020), Koren (2009), Jahrer et al. (2010), and Bothos et al. (2011). Ensemble modeling (blending) was popular not only in the Netflix competition but is also used for other machine learning problems such as image processing: for the CIFAR-10 dataset, refer to Abouelnaga et al. (2016) and Bruno et al. (2022); for the MNIST dataset, refer to Ciresan et al. (2011). Ensemble modeling is also used in many other fields; for example, refer to Schuhen et al. (2012) and Ardabili et al. (2020). In this study, we derive blending coefficients based on the variances of the different models with only the assumption of model independence. While the formulas for the final variance of the blended model and for its coefficients are already stated in Kay (1993) without proof (Equations 6.7 and 6.8 in Section 6.4), we provide proofs both for the blending coefficients and for the variance of the combined model, as well as lower and upper bound estimates for the final variance based on the minimal and maximal variance of all the combined models. In the last section, we also compare these results with machine learning with counts, where only binary information from the features is used to make the decision, and we show very similar conclusions.

Let $\hat{y}_{k,j}$ be a prediction of model $k \in [1, N]$ for element $j \in [1, M]$, where $N$ is the number of different independent models and $M$ is the number of measurements we have:

$$\hat{y}_{k,j} = y_j + r_{k,j},$$

where $y_j$ is the expected prediction and $r_{k,j}$ is a realization of a random variable with normal distribution $R_k \sim N(0, \sigma_k^2)$, which has a zero average (the expected value of the variable is 0). In this study, we derive optimal blending coefficients $\alpha_k$ such that the blended prediction $\hat{y}_B$ is optimal:

$$\hat{y}_{B,j} = \sum_{k=1}^{N} \alpha_k \hat{y}_{k,j} = y_j \sum_{k=1}^{N} \alpha_k + \sum_{k=1}^{N} \alpha_k r_{k,j} = y_j + \sum_{k=1}^{N} \alpha_k r_{k,j}$$

with minimum variance $\sigma_B^2$, where $\sum_{k=1}^{N} \alpha_k = 1$.

Blending two independent models

Here, we consider two independent models

$$\hat{y}_{1,j} = y_j + r_{1,j}, \qquad \hat{y}_{2,j} = y_j + r_{2,j},$$

where $R_1 \sim N(0, \sigma_1^2)$ and $R_2 \sim N(0, \sigma_2^2)$. We derive $\hat{\alpha} \in [0, 1]$ for which we get the optimal blended model

$$\hat{y}_{B,j} = \alpha (y_j + r_{1,j}) + (1 - \alpha)(y_j + r_{2,j}) = y_j + \alpha r_{1,j} + (1 - \alpha) r_{2,j}.$$

It is a well-known fact that the combination $\alpha R_1 + (1 - \alpha) R_2$ of two independent random variables $R_1 \sim N(0, \sigma_1^2)$ and $R_2 \sim N(0, \sigma_2^2)$ has a normal distribution $N(0, \sigma_B^2)$, where $\sigma_B^2 = \alpha^2 \sigma_1^2 + (1 - \alpha)^2 \sigma_2^2$. For the mean, we get:

$$E(Y_B) = \frac{1}{M} \sum_{j=1}^{M} \left( \alpha r_{1,j} + (1 - \alpha) r_{2,j} \right) = \alpha E(R_1) + (1 - \alpha) E(R_2) = 0$$

and for the variance we get:

$$E(Y_B^2) = \frac{1}{M} \sum_{j=1}^{M} \left( \alpha r_{1,j} + (1 - \alpha) r_{2,j} \right)^2 = \alpha^2 \frac{1}{M} \sum_{j=1}^{M} r_{1,j}^2 + 2 \alpha (1 - \alpha) \frac{1}{M} \sum_{j=1}^{M} r_{1,j} r_{2,j} + (1 - \alpha)^2 \frac{1}{M} \sum_{j=1}^{M} r_{2,j}^2.$$

Finally, as $R_1$ and $R_2$ are independent (the covariance $\frac{1}{M} \sum_{j=1}^{M} r_{1,j} r_{2,j}$ is zero), we can write:

$$\sigma_B^2 = E(Y_B^2) = \alpha^2 \sigma_1^2 + (1 - \alpha)^2 \sigma_2^2.$$

To find the optimal blending parameter (we are looking for a minimum, and the function is quadratic in $\alpha$ with $\sigma_1^2 + \sigma_2^2 > 0$, hence convex with a single minimum), we compute where the derivative of the variance of the blended model is zero:

$$\frac{\partial \sigma_B^2}{\partial \alpha} = 2 \hat{\alpha} \sigma_1^2 - 2 (1 - \hat{\alpha}) \sigma_2^2 = 0,$$

from which

$$\hat{\alpha} = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} \qquad (1)$$

and the optimal variance will be:

$$\sigma_B^2(\hat{\alpha}) = \frac{\sigma_1^2 \sigma_2^4}{(\sigma_1^2 + \sigma_2^2)^2} + \frac{\sigma_2^2 \sigma_1^4}{(\sigma_1^2 + \sigma_2^2)^2} = \frac{\sigma_1^2 \sigma_2^2 (\sigma_1^2 + \sigma_2^2)}{(\sigma_1^2 + \sigma_2^2)^2} = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} \qquad (2)$$

In Figure 1, we show how the variance changes for different blending parameters $\alpha$; the optimal value of the blending parameter $\alpha$ (blue dot) matches the minimum of the simulated variance. The script is in Appendix 1.

FIGURE 1

Figure 1. Red line—variance for different $\alpha$. Green line—optimal variance $\sigma_B^2$. Blue dot—optimal $\alpha$ with its value $\sigma_B^2(\hat{\alpha})$. Python script is in Appendix 1.
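
As a quick numerical check (a minimal sketch, not the Appendix 1 script; the variances below are arbitrary example values), the following Python snippet simulates two independent models, scans the blending parameter, and compares the simulated minimum with Equations (1) and (2):

import numpy as np

rng = np.random.default_rng(0)
M = 200_000                               # number of measurements
sigma1_sq, sigma2_sq = 1.0, 4.0           # arbitrary example variances

y = rng.uniform(-1.0, 1.0, M)                        # expected predictions y_j
y1 = y + rng.normal(0.0, np.sqrt(sigma1_sq), M)      # model 1 predictions
y2 = y + rng.normal(0.0, np.sqrt(sigma2_sq), M)      # model 2 predictions

alphas = np.linspace(0.0, 1.0, 101)
variances = [np.var(a * y1 + (1 - a) * y2 - y) for a in alphas]

alpha_hat = sigma2_sq / (sigma1_sq + sigma2_sq)             # Equation (1)
var_hat = sigma1_sq * sigma2_sq / (sigma1_sq + sigma2_sq)   # Equation (2)

print("alpha minimizing simulated variance:", alphas[int(np.argmin(variances))])
print("theoretical optimal alpha:          ", alpha_hat)
print("simulated variance at optimal alpha:", np.var(alpha_hat * y1 + (1 - alpha_hat) * y2 - y))
print("theoretical optimal variance:       ", var_hat)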

Blending three independent models

Now, we consider three independent models

$$\hat{y}_{1,j} = y_j + r_{1,j}, \qquad \hat{y}_{2,j} = y_j + r_{2,j}, \qquad \hat{y}_{3,j} = y_j + r_{3,j},$$

where $R_1 \sim N(0, \sigma_1^2)$, $R_2 \sim N(0, \sigma_2^2)$, and $R_3 \sim N(0, \sigma_3^2)$.

Here, we optimally blend the first two models from the previous section:

$$\hat{y}_{4,j} = y_j + \hat{\alpha} r_{1,j} + (1 - \hat{\alpha}) r_{2,j} = y_j + r_{4,j},$$

where $R_4 \sim N\left(0, \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}\right)$, and then we find the blending parameter $\hat{\beta}$ for $\hat{y}_{3,j}$ and $\hat{y}_{4,j}$ such that

$$\hat{y}_{B,j} = y_j + \hat{\beta} r_{3,j} + (1 - \hat{\beta}) r_{4,j}. \qquad (3)$$

Based on Equation (1), we get

$$\hat{\beta} = \frac{\frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}}{\sigma_3^2 + \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}} = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2}.$$

Plugging this back into Equation (3), we get

$$\hat{y}_{B,j} = y_j + \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2} r_{3,j} + \left( 1 - \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2} \right) r_{4,j}$$
$$\hat{y}_{B,j} = y_j + \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2} r_{3,j} + \frac{(\sigma_1^2 + \sigma_2^2) \sigma_3^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2} \left( \hat{\alpha} r_{1,j} + (1 - \hat{\alpha}) r_{2,j} \right)$$
$$\hat{y}_{B,j} = y_j + \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2} r_{3,j} + \frac{(\sigma_1^2 + \sigma_2^2) \sigma_3^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2} \left( \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} r_{1,j} + \left( 1 - \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2} \right) r_{2,j} \right)$$
$$\hat{y}_{B,j} = y_j + \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2} r_{3,j} + \frac{\sigma_2^2 \sigma_3^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2} r_{1,j} + \frac{\sigma_1^2 \sigma_3^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2} r_{2,j},$$

which is symmetric, meaning the order of the model combination is irrelevant. Finally, for $\hat{\alpha}_1$, $\hat{\alpha}_2$, and $\hat{\alpha}_3$, we get:

$$\hat{\alpha}_1 = \frac{\sigma_2^2 \sigma_3^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2}, \qquad \hat{\alpha}_2 = \frac{\sigma_1^2 \sigma_3^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2}, \qquad \hat{\alpha}_3 = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2}.$$

Combining the second and third models first and then combining the result with the first model would lead to the same optimal blending parameters. The order of the combination is inconsequential. Additionally, for the final variance, we get from Equation (2)

$$\sigma_B^2(\hat{\alpha}) = \frac{\sigma_3^2 \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}}{\sigma_3^2 + \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}} = \frac{\sigma_1^2 \sigma_2^2 \sigma_3^2}{\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2}.$$

In Figure 2, we show how the variance changes for different blending parameters $\alpha_1$ and $\alpha_2$, with $\alpha_3 = 1 - \alpha_1 - \alpha_2$; the optimal value of the blending parameters $(\alpha_1, \alpha_2, 1 - \alpha_1 - \alpha_2)$ (blue dot) matches the minimum of the simulated variance. The script is in Appendix 2.

FIGURE 2

Figure 2. Grid consisting of red dots—variance for different $\alpha_1$ and $\alpha_2$. Grid consisting of green dots—optimal variance $\sigma_B^2$. Blue dot—optimal $\alpha_1$, $\alpha_2$, and $1 - \alpha_1 - \alpha_2$ with its value $\sigma_B^2(\hat{\alpha})$. Python script is in Appendix 2.
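
As a worked example with arbitrarily chosen variances $\sigma_1^2 = 1$, $\sigma_2^2 = 2$, and $\sigma_3^2 = 4$ (not values from the article), the common denominator is $\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_3^2 + \sigma_2^2 \sigma_3^2 = 2 + 4 + 8 = 14$, so

$$\hat{\alpha}_1 = \frac{8}{14} \approx 0.571, \qquad \hat{\alpha}_2 = \frac{4}{14} \approx 0.286, \qquad \hat{\alpha}_3 = \frac{2}{14} \approx 0.143, \qquad \sigma_B^2(\hat{\alpha}) = \frac{1 \cdot 2 \cdot 4}{14} \approx 0.571.$$

The most accurate model receives the largest weight, and the blended variance is smaller than the smallest individual variance $\sigma_1^2 = 1$.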

Blending N independent models

Now that we have formulas for two and three different models, we prove formulas for N independent models with normal distributions:

$$\hat{y}_{k,j} = y_j + r_{k,j},$$

where $R_k \sim N(0, \sigma_k^2)$. We combine these models as follows:

$$\hat{y}_{B,j} = \sum_{k=1}^{N} \alpha_k \hat{y}_{k,j},$$

where $\sum_{k=1}^{N} \alpha_k = 1$ and $\alpha_k > 0$ for $k \in [1, N]$.

First, we show that model independence is still needed.

Lemma 1. Having $N$ independent models with normal distributions $R_k \sim N(0, \sigma_k^2)$ for $k \in [1, N]$, when combined as $r_{B,j} = \sum_{k=1}^{N} \alpha_k r_{k,j}$, the variance of $R_B$ is $\sigma_B^2 = \sum_{k=1}^{N} \alpha_k^2 \sigma_k^2$.

Proof.

$$\sigma_B^2 = E\left( \left( \sum_{k=1}^{N} \alpha_k r_{k,j} \right)^2 \right) = \frac{1}{M} \left( \sum_{k=1}^{N} \sum_{j=1}^{M} \alpha_k^2 r_{k,j}^2 + 2 \sum_{k=1}^{N} \sum_{l=k+1}^{N} \sum_{j=1}^{M} \alpha_k \alpha_l r_{k,j} r_{l,j} \right) = \sum_{k=1}^{N} \alpha_k^2 \frac{1}{M} \sum_{j=1}^{M} r_{k,j}^2 + 2 \sum_{k=1}^{N} \sum_{l=k+1}^{N} \alpha_k \alpha_l \frac{1}{M} \sum_{j=1}^{M} r_{k,j} r_{l,j}$$

As the models are independent (the covariance $\frac{1}{M} \sum_{j=1}^{M} r_{k,j} r_{l,j}$ is zero for $l \neq k$), we get

$$\sigma_B^2 = \sum_{k=1}^{N} \alpha_k^2 \frac{1}{M} \sum_{j=1}^{M} r_{k,j}^2 = \sum_{k=1}^{N} \alpha_k^2 \sigma_k^2,$$

which ends the proof.

Theorem 2. Having $N$ independent models with normal distributions $R_k \sim N(0, \sigma_k^2)$ for $k \in [1, N]$, we get an optimal blend with parameters

$$\hat{\alpha}_k = \frac{\prod_{j=1, j \neq k}^{N} \sigma_j^2}{\sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2},$$

and the blend of these independent models follows the normal distribution $N(0, \sigma_B^2)$ with variance

$$\sigma_B^2 = \frac{\prod_{j=1}^{N} \sigma_j^2}{\sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2}.$$

Proof. For N = 2, we have shown this in Section 2. Now we use induction: if the statement is true for N, then it is also true for N + 1.

Remark. We have also shown this for three models in Section 3, but it is not needed for the induction; Section 3 only serves as motivation for how to derive the final formulas for N models.

We combine two normal distributions, $N\left(0, \frac{\prod_{j=1}^{N} \sigma_j^2}{\sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2}\right)$ (assuming the statement is true for N) and $N(0, \sigma_{N+1}^2)$. From Equation (2) (Lemma 1 is incorporated in this equation), we get

$$\sigma_B^2 = \frac{\frac{\prod_{j=1}^{N} \sigma_j^2}{\sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2} \sigma_{N+1}^2}{\frac{\prod_{j=1}^{N} \sigma_j^2}{\sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2} + \sigma_{N+1}^2} = \frac{\sigma_{N+1}^2 \prod_{j=1}^{N} \sigma_j^2}{\prod_{j=1}^{N} \sigma_j^2 + \sigma_{N+1}^2 \sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2} = \frac{\prod_{j=1}^{N+1} \sigma_j^2}{\sum_{i=1}^{N+1} \prod_{j=1, j \neq i}^{N+1} \sigma_j^2}$$

and hence, we have shown that the optimal variance formula is valid for N + 1. Now we must show the same for the optimal coefficients. From Equation (1), we get

$$\hat{\alpha} = \frac{\sigma_{N+1}^2}{\frac{\prod_{j=1}^{N} \sigma_j^2}{\sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2} + \sigma_{N+1}^2} = \frac{\sigma_{N+1}^2 \sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2}{\sum_{i=1}^{N+1} \prod_{j=1, j \neq i}^{N+1} \sigma_j^2}$$

and hence,

$$\hat{\alpha}_{N+1} = 1 - \hat{\alpha} = 1 - \frac{\sigma_{N+1}^2 \sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2}{\sum_{i=1}^{N+1} \prod_{j=1, j \neq i}^{N+1} \sigma_j^2} = \frac{\sum_{i=1}^{N+1} \prod_{j=1, j \neq i}^{N+1} \sigma_j^2 - \sigma_{N+1}^2 \sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2}{\sum_{i=1}^{N+1} \prod_{j=1, j \neq i}^{N+1} \sigma_j^2} = \frac{\prod_{j=1, j \neq N+1}^{N+1} \sigma_j^2 + \sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N+1} \sigma_j^2 - \sigma_{N+1}^2 \sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2}{\sum_{i=1}^{N+1} \prod_{j=1, j \neq i}^{N+1} \sigma_j^2} = \frac{\prod_{j=1, j \neq N+1}^{N+1} \sigma_j^2}{\sum_{i=1}^{N+1} \prod_{j=1, j \neq i}^{N+1} \sigma_j^2},$$

which proves the formula for $\hat{\alpha}_{N+1}$. Finally, to show the same for $\hat{\alpha}_k$ for $k \in [1, N]$:

$$\hat{\alpha}_k = \hat{\alpha} \frac{\prod_{j=1, j \neq k}^{N} \sigma_j^2}{\sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2} = \frac{\sigma_{N+1}^2 \sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2}{\sum_{i=1}^{N+1} \prod_{j=1, j \neq i}^{N+1} \sigma_j^2} \cdot \frac{\prod_{j=1, j \neq k}^{N} \sigma_j^2}{\sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2} = \frac{\sigma_{N+1}^2 \prod_{j=1, j \neq k}^{N} \sigma_j^2}{\sum_{i=1}^{N+1} \prod_{j=1, j \neq i}^{N+1} \sigma_j^2} = \frac{\prod_{j=1, j \neq k}^{N+1} \sigma_j^2}{\sum_{i=1}^{N+1} \prod_{j=1, j \neq i}^{N+1} \sigma_j^2},$$

which ends the proof.
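
To make Theorem 2 concrete, the following Python sketch (with arbitrarily chosen example variances, not part of the original appendices) computes the coefficients $\hat{\alpha}_k$ and the blended variance $\sigma_B^2$ and checks them against a Monte Carlo simulation. Note that dividing the numerator and denominator of $\hat{\alpha}_k$ by $\prod_{j=1}^{N} \sigma_j^2$ shows these are the familiar inverse-variance weights, $\hat{\alpha}_k = (1/\sigma_k^2) / \sum_{i=1}^{N} (1/\sigma_i^2)$, with $\sigma_B^2 = 1 / \sum_{i=1}^{N} (1/\sigma_i^2)$.

import numpy as np

def optimal_blend(variances):
    # Coefficients and blended variance from Theorem 2
    variances = np.asarray(variances, dtype=float)
    prod_all = np.prod(variances)
    prod_except = prod_all / variances          # product of all variances except the k-th
    denom = prod_except.sum()
    return prod_except / denom, prod_all / denom

rng = np.random.default_rng(1)
sigmas_sq = [1.0, 2.0, 4.0, 8.0]                # arbitrary example variances
alphas, var_blend = optimal_blend(sigmas_sq)

# Monte Carlo check: blend independent zero-mean Gaussian errors
M = 500_000
errors = np.column_stack([rng.normal(0.0, np.sqrt(v), M) for v in sigmas_sq])
blended = errors @ alphas

print("coefficients:", alphas, "sum:", alphas.sum())
print("theoretical blended variance:", var_blend)
print("simulated blended variance:  ", blended.var())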

Going to infinity

If we can generate infinitely many independent models with distributions $R_i \sim N(0, \sigma^2)$ (same variance), the final variance will be

$$\sigma_B^2 = \lim_{N \to +\infty} \frac{\prod_{j=1}^{N} \sigma^2}{\sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma^2} = \lim_{N \to +\infty} \frac{\sigma^{2N}}{N \sigma^{2(N-1)}} = \lim_{N \to +\infty} \frac{\sigma^2}{N} = 0,$$

which means we can combine all these models to get a perfect prediction with no errors. Naturally, creating an infinite number of independent models (with zero covariances) is a difficult, if not impossible, task in real applications.

Theorem 3. Having $N$ independent models with normal distributions $R_k \sim N(0, \sigma_k^2)$ for $k \in [1, N]$ with variances $\sigma_k^2 \leq \sigma_M^2$, where $\sigma_M^2$ is their maximum variance, combining them optimally with the coefficients from Theorem 2 gives a combined variance $\sigma_B^2 \leq \frac{\sigma_M^2}{N}$.

Proof. We use induction again. For N = 2, we get

$$\sigma_B^2 = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} \leq \frac{\sigma_M^2}{2}.$$

This is true as

$$\sigma_1^2 \sigma_2^2 + \sigma_1^2 \sigma_2^2 \leq \sigma_M^2 \sigma_1^2 + \sigma_M^2 \sigma_2^2,$$

because

$$\sigma_1^2 \sigma_2^2 \leq \sigma_M^2 \sigma_1^2$$

and

$$\sigma_1^2 \sigma_2^2 \leq \sigma_M^2 \sigma_2^2.$$

Now if it is true for N, then it is true also for N + 1. If

$$\sigma_{B,N}^2 \leq \frac{\sigma_M^2}{N},$$

then

$$\sigma_{B,N+1}^2 \leq \frac{\sigma_M^2}{N+1}.$$

That is true as

$$\sigma_{B,N+1}^2 = \frac{\sigma_{B,N}^2 \sigma_{N+1}^2}{\sigma_{B,N}^2 + \sigma_{N+1}^2} \leq \frac{\sigma_M^2}{N+1},$$

because

$$N \sigma_{B,N}^2 \sigma_{N+1}^2 + \sigma_{B,N}^2 \sigma_{N+1}^2 \leq \sigma_M^2 (\sigma_{B,N}^2 + \sigma_{N+1}^2)$$

as both this

$$N \sigma_{B,N}^2 \sigma_{N+1}^2 \leq \sigma_M^2 \sigma_{N+1}^2$$

and this

$$\sigma_{B,N}^2 \sigma_{N+1}^2 \leq \sigma_M^2 \sigma_{B,N}^2$$

are true, which ends the proof.

This proof means that if we combine infinitely many independent models with distributions $R_i \sim N(0, \sigma_i^2)$, where the variances satisfy $\sigma_i^2 \leq \sigma_M^2$, we get the variance:

$$\sigma_B^2 = \lim_{N \to +\infty} \frac{\prod_{j=1}^{N} \sigma_j^2}{\sum_{i=1}^{N} \prod_{j=1, j \neq i}^{N} \sigma_j^2} \leq \lim_{N \to +\infty} \frac{\sigma_M^2}{N} = 0.$$

Combining infinitely many independent models whose variances are bounded from above leads to a perfect prediction with variance zero.

It can be shown in the same way as in Theorem 3 that the combined variance is also bounded from below (as the proof is almost identical, we omit it here). If all distributions $R_k \sim N(0, \sigma_k^2)$ for $k \in [1, N]$ have their variances in the interval $\sigma_k^2 \in [\sigma_{\min}^2, \sigma_{\max}^2]$, then their combined variance will lie in the interval $\sigma_B^2 \in \left[ \frac{\sigma_{\min}^2}{N}, \frac{\sigma_{\max}^2}{N} \right]$.
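
These bounds can be checked numerically. The following sketch (with an arbitrary random choice of variances, not taken from the article) draws variances from an interval $[\sigma_{\min}^2, \sigma_{\max}^2]$ and verifies that the optimally blended variance of Theorem 2, written equivalently as $1 / \sum_{i=1}^{N} (1/\sigma_i^2)$, lies between $\sigma_{\min}^2 / N$ and $\sigma_{\max}^2 / N$:

import numpy as np

def blended_variance(variances):
    # 1 / sum(1/sigma_k^2), equivalent to the Theorem 2 formula
    return 1.0 / np.sum(1.0 / np.asarray(variances, dtype=float))

rng = np.random.default_rng(2)
var_min, var_max = 0.5, 3.0                      # arbitrary interval for the variances
for N in (2, 5, 10, 100):
    variances = rng.uniform(var_min, var_max, N)
    var_b = blended_variance(variances)
    assert var_min / N <= var_b <= var_max / N   # lower and upper bounds
    print(f"N={N:4d}  lower={var_min/N:.4f}  blended={var_b:.4f}  upper={var_max/N:.4f}")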

Similar conclusion with machine learning with counts

When it comes to using only counts (each feature says yes or no only) in machine learning for predictions, as shown in Taraba (2021) (see Section 7) on a nine-feature example, we can come to the same conclusion as in the previous section: an infinite number of features can lead to perfect prediction with no error. While the previous approach is statistical, machine learning with counts uses Pascal's triangle and a binomial raised to infinity to show this. We use the binomial expansion

$$1 = (p + (1-p))^n = \sum_{i=0}^{n} \binom{n}{i} p^{n-i} (1-p)^i,$$

where $p$ is the probability of a feature being correct. As we want an odd number of features to be able to make a decision purely on the counts (each feature says yes or no), we replace $n$ with $2k + 1$:

$$1 = (p + (1-p))^{2k+1} = \sum_{i=0}^{2k+1} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i.$$

This can be split into two parts, one with the probability that the majority of features is correct, $P_{\mathrm{correct}}$, and one with the probability that the majority of features is incorrect:

$$1 = (p + (1-p))^{2k+1} = P_{\mathrm{correct}} + P_{\mathrm{incorrect}},$$

where

$$P_{\mathrm{correct}} = \sum_{i=0}^{k} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i$$

and

$$P_{\mathrm{incorrect}} = \sum_{i=k+1}^{2k+1} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i.$$

To show that an infinite number of features can lead to perfect prediction, we have to show that $P_{\mathrm{correct}}$, the probability that the majority of features is correct (at least $k + 1$ of them), goes to 1 for all $p \in (0.5, 1]$:

$$\lim_{k \to \infty} \sum_{i=0}^{k} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i = 1.$$

We start by showing the simpler case first: when $p = 0.5$, $P_{\mathrm{correct}}$ and $P_{\mathrm{incorrect}}$ are equal and $P_{\mathrm{correct}} = P_{\mathrm{incorrect}} = 0.5$. To show this, we can write

$$P_{\mathrm{correct}, p=0.5} = \sum_{i=0}^{k} \binom{2k+1}{i} 0.5^{2k+1} = 0.5^{2k+1} \sum_{i=0}^{k} \binom{2k+1}{i}$$

and

$$P_{\mathrm{incorrect}, p=0.5} = \sum_{i=k+1}^{2k+1} \binom{2k+1}{i} 0.5^{2k+1} = 0.5^{2k+1} \sum_{i=k+1}^{2k+1} \binom{2k+1}{i}$$

and these are equal, as $\sum_{i=0}^{k} \binom{2k+1}{i} = \sum_{i=k+1}^{2k+1} \binom{2k+1}{i}$, because $\binom{2k+1}{i} = \binom{2k+1}{2k+1-i}$ for $i \in \{0, 1, \ldots, k\}$. As $P_{\mathrm{correct}, p=0.5} = P_{\mathrm{incorrect}, p=0.5}$ and their sum is 1, it follows that

$$1 = P_{\mathrm{correct}, p=0.5} + P_{\mathrm{incorrect}, p=0.5} = 2 P_{\mathrm{correct}, p=0.5}$$

and hence,

$$P_{\mathrm{correct}, p=0.5} = P_{\mathrm{incorrect}, p=0.5} = 0.5.$$
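
This symmetry is easy to verify numerically (a minimal sketch with an arbitrary choice of k):

from math import comb

k = 10                                    # arbitrary example
n = 2 * k + 1
p_correct = 0.5**n * sum(comb(n, i) for i in range(0, k + 1))
p_incorrect = 0.5**n * sum(comb(n, i) for i in range(k + 1, n + 1))
print(p_correct, p_incorrect)             # both print 0.5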

Now that we have shown what happens when p = 0.5, we show the main limit theorem for p ∈ (0.5, 1].

Theorem 4. $\lim_{k \to \infty} \sum_{i=0}^{k} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i = 1$ for all $p \in (0.5, 1]$.

Proof. To show that $\lim_{k \to \infty} \sum_{i=0}^{k} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i = 1$ for all $p \in (0.5, 1]$, we will instead show that $\lim_{k \to \infty} \sum_{i=k+1}^{2k+1} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i = 0$; as their sum is 1, the claim follows.

First, we can rewrite $\sum_{i=k+1}^{2k+1} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i$ as

$$\sum_{i=k+1}^{2k+1} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i = \sum_{i=0}^{k} \binom{2k+1}{k+1+i} p^{2k+1-(k+1+i)} (1-p)^{k+1+i}.$$

It is obvious that the numbers in Pascal's triangle are decreasing after the middle:

$$\frac{\binom{2k+1}{k+1+i+1}}{\binom{2k+1}{k+1+i}} = \frac{\frac{(2k+1)!}{(k+2+i)!\,(k-i-1)!}}{\frac{(2k+1)!}{(k+1+i)!\,(k-i)!}} = \frac{k-i}{k+i+2} < 1$$

for $i \in \{0, 1, \ldots, k-1\}$, and hence, we can write

$$\sum_{i=0}^{k} \binom{2k+1}{k+1+i} p^{2k+1-(k+1+i)} (1-p)^{k+1+i} < \binom{2k+1}{k+1} p^{k} (1-p)^{k+1} \sum_{i=0}^{k} \left( \frac{1-p}{p} \right)^i.$$

As $p \in (0.5, 1]$, we have $\frac{1-p}{p} \in [0, 1)$, and hence, we can write

$$\binom{2k+1}{k+1} p^{k} (1-p)^{k+1} \sum_{i=0}^{k} \left( \frac{1-p}{p} \right)^i = \binom{2k+1}{k+1} p^{k} (1-p)^{k+1} \frac{1 - \left( \frac{1-p}{p} \right)^{k+1}}{1 - \frac{1-p}{p}}.$$

With that we can finally look at the original limit and write

$$\lim_{k \to \infty} \sum_{i=k+1}^{2k+1} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i \leq \lim_{k \to \infty} \binom{2k+1}{k+1} p^{k} (1-p)^{k+1} \frac{1 - \left( \frac{1-p}{p} \right)^{k+1}}{1 - \frac{1-p}{p}}.$$

As $\lim_{k \to \infty} \left( \frac{1-p}{p} \right)^{k+1} = 0$ for $p \in (0.5, 1]$, because $\frac{1-p}{p} \in [0, 1)$, we can write

$$\lim_{k \to \infty} \sum_{i=k+1}^{2k+1} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i \leq \frac{1}{1 - \frac{1-p}{p}} \lim_{k \to \infty} \binom{2k+1}{k+1} p^{k} (1-p)^{k+1} = \frac{p}{2p-1} \lim_{k \to \infty} \binom{2k+1}{k+1} p^{k} (1-p)^{k+1}.$$

Now we look at $\lim_{k \to \infty} \binom{2k+1}{k+1} p^{k} (1-p)^{k+1}$. We can take a member $\binom{2(k+r)+1}{(k+r)+1} p^{k+r} (1-p)^{(k+r)+1}$ and compare it with its successor for $r' = r + 1$, namely $\binom{2(k+r+1)+1}{(k+r+1)+1} p^{k+r+1} (1-p)^{(k+r+1)+1}$, by division:

$$\frac{\binom{2(k+r+1)+1}{(k+r+1)+1} p^{k+r+1} (1-p)^{(k+r+1)+1}}{\binom{2(k+r)+1}{(k+r)+1} p^{k+r} (1-p)^{(k+r)+1}} = \frac{\frac{(2(k+r)+3)!}{(k+r+2)!\,(k+r+1)!}}{\frac{(2(k+r)+1)!}{(k+r+1)!\,(k+r)!}} \, p (1-p) = \frac{(2(k+r)+3)(2(k+r)+2)}{(k+r+2)(k+r+1)} \, p (1-p) < 4 p (1-p) < 1,$$

for all $r \geq 0$ and $p \in (0.5, 1]$, as $p(1-p) \in [0, 0.25)$. This means we are multiplying a finite number $\binom{2k+1}{k+1} p^{k} (1-p)^{k+1}$, which lies in the interval $[0, 1]$, infinitely many times by factors that are at least 0 and at most $4p(1-p) < 1$, hence

$$\lim_{k \to \infty} \binom{2k+1}{k+1} p^{k} (1-p)^{k+1} = 0$$

and hence, the original limit

$$\lim_{k \to \infty} \sum_{i=k+1}^{2k+1} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i \leq 0$$

as well. As $\sum_{i=k+1}^{2k+1} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i \geq 0$, it follows that

$$\lim_{k \to \infty} \sum_{i=k+1}^{2k+1} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i = 0.$$

Remark. It is worth mentioning that $p$ is fixed and chosen from the interval $(0.5, 1]$; we are not looking at the limit $p \to 0.5$ but at the limit $k \to \infty$. For $p = 0.5$, we already know that $P_{\mathrm{correct}, p=0.5} = P_{\mathrm{incorrect}, p=0.5} = 0.5$.

As

$$\lim_{k \to \infty} \sum_{i=0}^{k} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i + \lim_{k \to \infty} \sum_{i=k+1}^{2k+1} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i = 1,$$

it follows that $\lim_{k \to \infty} \sum_{i=0}^{k} \binom{2k+1}{i} p^{2k+1-i} (1-p)^i = 1$, which ends the proof.

This can be summarized for all $p \in [0, 1]$ (illustrated in Figure 3) as

$$P_{\mathrm{correct}} = \begin{cases} 0 & \text{for } p \in [0, 0.5) \\ 0.5 & \text{for } p = 0.5 \\ 1 & \text{for } p \in (0.5, 1]. \end{cases}$$

FIGURE 3

Figure 3. Different outcomes for $P_{\mathrm{correct}}$ for different $p$ intervals when $k$ goes to ∞.

This can be understood intuitively by plotting $P_{\mathrm{correct}}$ with increasing $k$ in Figure 4.

FIGURE 4

Figure 4. Different outcomes for $P_{\mathrm{correct}}$ for different $p \in [0.0, 1]$ with increasing $k \in [0, \ldots, 51]$. Python script is in Appendix 3.
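
A minimal sketch of such a computation (not the Appendix 3 script; the grids of p and k values are arbitrary choices) evaluates $P_{\mathrm{correct}}$, the probability that at least $k + 1$ of the $2k + 1$ features are correct, with the binomial survival function:

from scipy.stats import binom

def p_correct(p, k):
    # Probability that the majority of 2k+1 independent features,
    # each correct with probability p, is correct (at least k+1 correct).
    n = 2 * k + 1
    return binom.sf(k, n, p)      # P(X >= k+1) for X ~ Binomial(n, p)

for k in (0, 5, 25, 51):
    print(k, [round(p_correct(p, k), 3) for p in (0.3, 0.5, 0.55, 0.7)])
# As k grows, the values move toward 0 for p < 0.5, stay at 0.5 for p = 0.5,
# and move toward 1 for p > 0.5.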

It is worth mentioning that with weak features with $p = 0.52$ and only $n = 15001$ of them (no need to go to infinity), $P_{\mathrm{correct}}$ is already close to 1 (see Figure 5).

FIGURE 5

Figure 5. Different outcomes for $P_{\mathrm{correct}}$ for different $p \in [0.5, 0.6]$ with increasing $n$ for $n \in \{5001, 10001, 15001\}$. Python script is in Appendix 4.
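
This particular value is easy to check directly (a minimal check using the same majority-vote probability as in the sketch above):

from scipy.stats import binom

# n = 15001 features (k = 7500), each independently correct with p = 0.52
print(binom.sf(7500, 15001, 0.52))   # prints a value very close to 1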

Next, we show that $P_{\mathrm{correct}}$ increases if the probability of any of the features increases its value, $p_i' > p_i$. For the case $k = 0$ ($n = 1$), this is simple, as

$$P_{\mathrm{correct}}(\mathrm{new}) = p_1' > p_1 = P_{\mathrm{correct}}(\mathrm{old}).$$

For three features, it is also trivial, as

$$P_{\mathrm{correct}}(\mathrm{new}) = p_1' p_2 p_3 + p_1' p_2 (1-p_3) + p_1' (1-p_2) p_3 + (1-p_1') p_2 p_3$$

can be rewritten into a $p_1'$-independent part and a $p_1'$-dependent part

$$P_{\mathrm{correct}}(\mathrm{new}) = p_2 p_3 + p_1' \left( p_2 (1-p_3) + (1-p_2) p_3 \right)$$

and from that, it immediately follows that $P_{\mathrm{correct}}(\mathrm{new}) > P_{\mathrm{correct}}(\mathrm{old})$, as $0 \leq p_1, p_1', p_2, p_3 \leq 1$ and $p_1' > p_1$. To show this for the general case, not only for the cases $k \in \{0, 1\}$ ($n = 2k+1 \in \{1, 3\}$), we first write the general formula:

$$1 = \sum_{i_1=0}^{1} \sum_{i_2=0}^{1} \cdots \sum_{i_{2k+1}=0}^{1} \prod_{j=1}^{2k+1} \left( p_j (-1)^{i_j + 1} + (1 - i_j) \right). \qquad (4)$$

This can once again be rewritten into two parts: $P_{\mathrm{correct}}$, when the majority of features say yes (are correct), and $P_{\mathrm{incorrect}}$, when the majority of features say no (are incorrect):

$$P_{\mathrm{correct}}(p_1, \ldots, p_{2k+1}) = \sum_{i_1=0}^{1} \cdots \sum_{i_{2k+1}=0}^{1} m(i_1, \ldots, i_{2k+1}) \prod_{j=1}^{2k+1} \left( p_j (-1)^{i_j + 1} + (1 - i_j) \right),$$

where

$$m(i_1, \ldots, i_{2k+1}) = \begin{cases} 0 & \text{for } \sum_{j=1}^{2k+1} i_j \leq k \\ 1 & \text{for } \sum_{j=1}^{2k+1} i_j > k. \end{cases}$$

Theorem 5. Increasing the probability of one of the features, $p_1' > p_1$, increases the final correct probability of all features: $P_{\mathrm{correct}}(p_1', p_2, \ldots, p_{2k+1}) > P_{\mathrm{correct}}(p_1, p_2, \ldots, p_{2k+1})$.

Proof. We show this for increasing the first probability, as the probabilities can be reordered and their order does not matter for the final correct probability.

Now again, as in the example $k = 1$, $n = 3$, $P_{\mathrm{correct}}$ can be split into a $p_1$-independent and a $p_1$-dependent part. The dependent part will contain only $p_1$ and no $(1 - p_1)$: for every term with $(1 - p_1)$ there is also the exact same case with $p_1$, in which even more features are correct; these two can be joined, and hence the term belongs to the independent part. Hence,

$$P_{\mathrm{correct}}(p_1, p_2, \ldots, p_{2k+1}) = P_I + p_1 P_D,$$

and

$$P_{\mathrm{correct}}(p_1', p_2, \ldots, p_{2k+1}) = P_I + p_1' P_D,$$

where $P_I$ and $P_D$ are fixed and non-negative (as all $p_i \geq 0$ and $(1 - p_i) \geq 0$) and depend only on $p_2, \ldots, p_{2k+1}$, and hence $P_{\mathrm{correct}}(p_1', p_2, \ldots, p_{2k+1}) = P_I + p_1' P_D > P_I + p_1 P_D = P_{\mathrm{correct}}(p_1, p_2, \ldots, p_{2k+1})$ as $p_1' > p_1$, which ends the proof.

Theorem 6. $P_{\mathrm{correct}}(p_1, p_2, \ldots, p_{2k+1}) \geq \sum_{i=0}^{k} \binom{2k+1}{i} p_{\min}^{2k+1-i} (1 - p_{\min})^i$, where $p_{\min} = \min\{p_1, p_2, \ldots, p_{2k+1}\}$.

Proof. This follows directly from using Theorem 5 multiple times, increasing every probability one at a time from the initial value $p_{\min}$:

$$P_{\mathrm{correct}}(p_1, p_2, \ldots, p_{2k+1}) \geq P_{\mathrm{correct}}(p_{\min}, p_2, \ldots, p_{2k+1}) \geq P_{\mathrm{correct}}(p_{\min}, p_{\min}, \ldots, p_{2k+1}) \geq \cdots \geq P_{\mathrm{correct}}(p_{\min}, p_{\min}, \ldots, p_{\min}) = \sum_{i=0}^{k} \binom{2k+1}{i} p_{\min}^{2k+1-i} (1 - p_{\min})^i,$$

which ends the proof.

Theorems 6 and 4 were needed in order to be able to say that combining an infinite number of features with probabilities $p_i \in (0.5, 1]$ (i.e., $\min\{p_1, \ldots, p_{2k+1}, \ldots\} > 0.5$) will lead to perfect prediction with no error for machine learning with counts, just as in the previous Section 5, where the same problem was examined from the statistical point of view (Gaussian distributions). It is important to say once again that these separate probabilities $p_i$ have to be independent, as otherwise Equation (4) would not be valid.
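
Theorems 5 and 6 can be spot-checked by brute-force enumeration for a small number of features (a minimal sketch with arbitrarily chosen probabilities):

from itertools import product
from math import comb

def majority_correct(probs):
    # Majority-vote success probability for independent binary features,
    # feature j being correct with probability probs[j]; len(probs) must be odd.
    n = len(probs)
    total = 0.0
    for outcome in product((0, 1), repeat=n):        # 1 = feature correct
        if sum(outcome) > n // 2:                    # majority correct
            term = 1.0
            for p, i in zip(probs, outcome):
                term *= p if i == 1 else (1.0 - p)
            total += term
    return total

probs = [0.6, 0.7, 0.55, 0.8, 0.65]                  # arbitrary example, n = 5, k = 2
k = len(probs) // 2
p_min = min(probs)
lower_bound = sum(comb(2 * k + 1, i) * p_min ** (2 * k + 1 - i) * (1 - p_min) ** i
                  for i in range(k + 1))

print(majority_correct(probs), ">=", lower_bound)             # Theorem 6
print(majority_correct([0.75] + probs[1:]), ">",
      majority_correct(probs))                                # Theorem 5 (p1 increased)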

Conclusion and discussion

We have derived the blending coefficients for an ensemble of multiple independent prediction models with normally distributed errors. This manuscript was mainly inspired by the Netflix competition, in whose final stages multiple teams joined their efforts to increase the accuracy of their final predictor, and blending turned out to be essential to win the competition in a very short time during the final stage. This method was used not only in the Netflix competition but also for other datasets in machine learning, such as MNIST and CIFAR-10 in image processing. We have also shown that having an infinite number of independent predictors with variances bounded from above is sufficient to achieve perfect prediction. While deep learning is very popular these days, one should not forget to include more features (going wider) when in need of an improvement in accuracy.

Looking at a similar problem, and more specifically machine learning with counts, where we only count how many features are for and against and make a decision based on a voting mechanism with a majority-vote winner, we have shown once again that an infinite number of independent features will lead to perfect prediction when using only features which have more than 50% accuracy.

Naturally, independent features are hard to find in practice, and further study could address how to convert dependent (correlated) features into independent ones in order to achieve as high an accuracy as possible, or how to combine these dependent features together and what theoretical accuracies can be achieved. It could also be of interest how to combine features which are not binary (for and against), but have more than two possible outcomes.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Acknowledgments

We would like to thank the reviewers for their suggestions and comments which led to the improvement of the manuscript.

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frai.2023.1144886/full#supplementary-material

References

Abouelnaga, Y., Ali, O. S., Rady, H., and Moustafa, M. (2016). “Cifar-10: KNN-based ensemble of classifiers,” in 2016 International Conference on Computational Science and Computational Intelligence (CSCI) (Las Vegas, NV), 1192–1195.

Amatriain, X. (2013). “Big & personal: data and models behind netflix recommendations,” in BigMine '13 (Chicago, IL).

Ardabili, S., Mosavi, A., and Várkonyi-Kóczy, A. R. (2020). “Advances in machine learning modeling reviewing hybrid and ensemble methods,” in Engineering for Sustainable Future, ed A. R. Várkonyi-Kóczy (Cham: Springer International Publishing), 215–227.

Bothos, E., Christidis, K., Apostolou, D., and Mentzas, G. (2011). “Information market based recommender systems fusion,” in Proceedings of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems, HetRec '11 (New York, NY: Association for Computing Machinery), 1–8.

Bruno, A., Moroni, D., and Martinelli, M. (2022). Efficient Adaptive Ensembling for Image Classification. Technical report, ISTI Working Paper, 2022. Consiglio Nazionale delle Ricerche.

Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2011). “Convolutional neural network committees for handwritten character classification,” in 2011 International Conference on Document Analysis and Recognition, 1135–1139.

Coscrato, V., de Almeida Inácio, M. H., and Izbicki, R. (2020). The NN-stacking: feature weighted linear stacking through neural networks. Neurocomputing 399, 141–152.

Jahrer, M., Töscher, A., and Legenstein, R. (2010). “Combining predictions for accurate recommender systems,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10 (New York, NY: Association for Computing Machinery), 693–702.

Kay, S. M. (1993). Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice-Hall, Inc.

Koren, Y. (2009). The bellkor solution to the netflix grand prize. Netflix Prize Docu. 81, 1–10.

Schuhen, N., Thorarinsdottir, T. L., and Gneiting, T. (2012). Ensemble model output statistics for wind vectors. Month. Weath. Rev. 140, 3204–3219. doi: 10.1175/MWR-D-12-00028.1

Taraba, P. (2021). Linear regression on a set of selected templates from a pool of randomly generated templates. Mach. Learn. Appl. 6:100126. doi: 10.1016/j.mlwa.2021.100126

Töscher, A., Jahrer, M., and Bell, R. M. (2009). The bigchaos solution to the netflix grand prize. Netflix Prize Docu. 1–52.

Xiang, L., and Yang, Q. (2009). “Time-dependent models in collaborative filtering based recommender system,” in 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, Vol. 1 (Milan), 450–457.

Keywords: blending of independent models, normal distributions, machine learning with counts, Gaussians, going wider

Citation: Taraba P (2023) Optimal blending of multiple independent prediction models. Front. Artif. Intell. 6:1144886. doi: 10.3389/frai.2023.1144886

Received: 15 January 2023; Accepted: 06 February 2023;
Published: 24 February 2023.

Edited by:

Georgios Leontidis, University of Aberdeen, United Kingdom

Reviewed by:

Kristina Sutiene, Kaunas University of Technology, Lithuania
Jolita Bernatavičienė, Vilnius University, Lithuania

Copyright © 2023 Taraba. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Peter Taraba, taraba.peter@mail.com
