
Edited by: Juergen Prestin, University of Lübeck, Germany

Reviewed by: Nadiia Derevianko, University of Göttingen, Germany; Michael Gnewuch, Osnabrück University, Germany

This article was submitted to Mathematics of Computation and Data Science, a section of the journal Frontiers in Applied Mathematics and Statistics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The distribution of data points is a key component in machine learning. In most cases, one uses min-max normalization to obtain nodes in [0, 1] or Z-score normalization. For the latter, we consider the L_{2} space of functions with the standard normal distribution as integration weight. Subsequently, we are able to apply the explainable ANOVA approximation for this basis and use it on Z-score normalized data.

In machine learning, the scale of our features is a key component in building models. When we work with data from applications, we have to accept it as it is; in most cases, we cannot control where the nodes lie. Take, e.g., recommendations in online shopping. We are only able to analyze the customers that actually exist and what they bought in the shop. However, the features may lie on immensely different scales: if we measure, e.g., the time a customer spent in the shop in seconds as well as their age in years, the result is one scale containing values of thousands of seconds and another ranging up to 90 years. Bringing those features onto similar scales through normalization may significantly improve the performance of our model.

Two common methods for data normalization are min-max normalization and Z-score normalization (also called standardization).
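As a minimal sketch of the two normalization methods (the function names are illustrative, not from the article), applied column-wise to features on very different scales:

```python
import numpy as np

def min_max_normalize(x):
    """Map each feature (column) linearly to the interval [0, 1]."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo)

def z_score_normalize(x):
    """Center each feature to mean 0 and scale it to standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Toy data: time in the shop (seconds) and age (years)
data = np.array([[1200.0, 25.0], [4800.0, 63.0], [300.0, 41.0]])
print(min_max_normalize(data))  # every column now lies in [0, 1]
print(z_score_normalize(data))  # every column has mean 0, std 1
```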

The explainable ANOVA approximation method introduced in [

We aim to achieve this by using the transformation ideas from [

with the probability density of the standard normal distribution φ(x) = (2π)^{−1/2} e^{−x^{2}/2}.

The cumulative distribution function of the standard normal distribution is given by Φ(x) = (2π)^{−1/2} ∫_{−∞}^{x} e^{−t^{2}/2} dt.
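The function Φ can be evaluated via the error function, Φ(x) = (1 + erf(x/√2))/2; a short sketch cross-checked against the Python standard library:

```python
from math import erf, sqrt
from statistics import NormalDist

def Phi(x: float) -> float:
    """CDF of the standard normal distribution,
    Phi(x) = (1 + erf(x / sqrt(2))) / 2."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Cross-check against the standard library implementation
assert abs(Phi(1.0) - NormalDist().cdf(1.0)) < 1e-12
print(Phi(0.0))  # 0.5 by symmetry
```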

(see

Combining this transformation with the half-period cosine basis allows for fast multiplications in the grouped transformations and makes the ANOVA approximation method applicable for Z-score normalized data.

Cumulative distribution function Φ of the standard normal distribution from Equation (3).

As an example, we apply this approach to a dataset about the detection of forest fires, see [

In this section, it is our goal to construct a complete orthonormal system in the space

We aim to construct the basis using transformation ideas from [

with ||k||_{0} := |supp k| and supp k := {s : k_{s} ≠ 0} for functions f: [0, 1]^{d} → ℝ and frequencies k ∈ ℕ_{0}^{d}. As transformation from the interval [0, 1] to ℝ, we apply the inverse cumulative distribution function Φ^{−1} in each variable to obtain

with the inverse transformation being

The transformation is related to inverse transform sampling, see, e.g., [
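The coordinate-wise transformation and its inverse can be sketched as follows (illustrative helper names; `NormalDist.inv_cdf` plays the role of Φ^{−1}):

```python
from statistics import NormalDist

_std = NormalDist()  # standard normal distribution, mean 0, sigma 1

def to_real_line(y):
    """Map nodes y in (0, 1)^d to R^d by applying Phi^{-1} in each variable."""
    return [_std.inv_cdf(t) for t in y]

def to_unit_cube(x):
    """Inverse transformation: map x in R^d back to (0, 1)^d via Phi."""
    return [_std.cdf(t) for t in x]

y = [0.1, 0.5, 0.975]
x = to_real_line(y)
# The round trip recovers the original nodes
assert all(abs(a - b) < 1e-9 for a, b in zip(to_unit_cube(x), y))
```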

Commutative diagram of the function and the transformations.

Let Φ and Φ^{−1} be as in Equations (5) and (6), respectively. Then

As the functional determinant, we obtain

This proves the first equality. For the second equality, we use an analogous procedure.

Lemma 2.1 is not surprising since we have based the transformation on the cumulative distribution function Φ from Equation (3). In the following, we obtain the new orthonormal system.

^{−1} is an isometric isomorphism between

In summary, we have constructed a complete orthonormal system

Transformed basis functions

In this section, we briefly summarize the interpretable ANOVA (analysis of variance) approximation method and the idea of grouped transformations, see [. We consider functions f: ℝ^{d} → ℝ from

and through Parseval's identity

The classical ANOVA decomposition, cf. [

The function can then be uniquely decomposed as

into

It is our goal to obtain information on how important the ANOVA terms f_{u} are with respect to the function f.

Note that we have the special case

The GSI motivates the concept of effective dimensions, specifically the superposition dimension. For a given α ∈ [0, 1], it is defined as

The superposition dimension d^{(sp)} tells us that we can explain the α-part of the variance of f with ANOVA terms f_{u} of order |u| ≤ d^{(sp)}.
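Assuming the term variances σ²(f_u) are given, the superposition dimension can be computed by a simple search over the interaction orders; a sketch with a hypothetical helper:

```python
def superposition_dimension(variances, alpha):
    """Smallest d_s such that the ANOVA terms f_u with |u| <= d_s explain
    at least the alpha-part of the total variance.

    variances: dict mapping a subset u (frozenset of variable indices,
    excluding the empty set) to the term variance sigma^2(f_u)."""
    total = sum(variances.values())
    for ds in range(0, max((len(u) for u in variances), default=0) + 1):
        explained = sum(v for u, v in variances.items() if len(u) <= ds)
        if explained >= alpha * total:
            return ds
    return None

# Toy example: most of the variance lives in low-order terms
var = {frozenset({1}): 0.5, frozenset({2}): 0.3,
       frozenset({1, 2}): 0.15, frozenset({1, 2, 3}): 0.05}
print(superposition_dimension(var, 0.9))  # 2: orders <= 2 explain 95%
```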

Using subsets of ANOVA terms

A specific idea for the truncation comes from the superposition dimension d^{(sp)} in Equation (10): we take only those variable interactions into account that contain d_{s} or fewer variables, i.e., the subset of ANOVA terms is

Here, we call d_{s} a superposition threshold. Since d_{s} does not necessarily coincide with the superposition dimension d^{(sp)}, we use a distinct name. The number of terms in U_{d_{s}} grows only polynomially in d for fixed d_{s} < d,

which reduces the curse of dimensionality.
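Enumerating the truncated set of ANOVA terms is straightforward; a minimal sketch (illustrative helper name), including a check that the count is the polynomially growing sum of binomial coefficients:

```python
from itertools import combinations
from math import comb

def terms_up_to_order(d, ds):
    """All variable subsets u of {1, ..., d} with 1 <= |u| <= ds."""
    return [u for k in range(1, ds + 1)
            for u in combinations(range(1, d + 1), k)]

# For a fixed superposition threshold ds, the number of terms is
# sum_{k=1}^{ds} binom(d, k), which grows only polynomially in d
d, ds = 12, 2
U = terms_up_to_order(d, ds)
assert len(U) == sum(comb(d, k) for k in range(1, ds + 1))  # 12 + 66 = 78
print(len(U))
```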

In the following, we argue why the truncation by a superposition threshold d_{s} works well in relevant cases. For the approximation of functions that belong to a weighted space H_{k}, bounds on the superposition dimension d^{(sp)} for α ∈ [0, 1] are known, see, e.g., [

In terms of real data from applications, the situation is quite different. Here, we cannot assume in complete generality that we have a low superposition dimension. However, there are many application scenarios where numerical experiments successfully showed that this is indeed the case, see, e.g., [. In such cases, we can choose a small superposition threshold d_{s} for truncation and validate on our test data.

In this section, we briefly discuss how the approximation is numerically obtained and how we can interpret the results. We assume a given subset of ANOVA terms U ⊆ U_{d_{s}}. We are given scattered data in the form of a node set X ⊂ ℝ^{d} with |X| = M and values y_{i} ≈ f(x_{i}) which we want to approximate.

First, we truncate _{U}

with order-dependent parameters N_{|u|} ∈ ℕ, |u| ≤ d_{s}, for every ANOVA term f_{u}, u ∈ U, and the frequency index set I(u) := [

Now, taking the union

However, the coefficients c_{k}(

cf. [

We solve the problem in Equation (13) using the iterative LSQR solver [

with the identity matrix I ∈ ℝ^{|I(U)| × |I(U)|}. Note that we always have a unique solution in this case since the matrix

has full column rank. However, the solution depends on the regularization parameter λ.

We apply the matrix-free variant of LSQR, i.e., we never explicitly construct the matrix but realize the matrix-vector multiplications via at most d_{s}-dimensional NFCTs, which results in an efficient algorithm. For more details, we refer to [
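The article solves the regularized problem matrix-free via NFCTs; as a small dense sketch of the same least-squares structure (a hypothetical helper, not the article's implementation), the regularization can be realized by stacking the identity block, which guarantees full column rank for λ > 0:

```python
import numpy as np

def ridge_lstsq(F, y, lam):
    """Solve min ||F c - y||^2 + lam ||c||^2 by stacking the augmented
    system [F; sqrt(lam) I], which has full column rank for lam > 0."""
    m, n = F.shape
    A = np.vstack([F, np.sqrt(lam) * np.eye(n)])
    b = np.concatenate([y, np.zeros(n)])
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    return c

rng = np.random.default_rng(0)
F = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
c = ridge_lstsq(F, y, lam=1e-2)
# The solution satisfies the normal equations (F^T F + lam I) c = F^T y
assert np.allclose((F.T @ F + 1e-2 * np.eye(5)) @ c, F.T @ y)
print(c.shape)
```

As in the article, the solution depends on the regularization parameter λ; the dense matrix here merely stands in for the fast transform.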

One key fact is that the nodes

We use the global sensitivity indices

for any pair u_{1}, u_{2} ∈

In order to rank the influence of the variables x_{1}, x_{2}, …, x_{d}, we use the ranking score

for
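One simple way to turn global sensitivity indices into a variable ranking is to sum, for each variable, the indices of all terms containing it (an illustrative, unweighted score; the article's exact ranking score may differ):

```python
def attribute_ranking(gsi):
    """Rank variables by summing the global sensitivity indices of all
    ANOVA terms containing the variable; normalized to sum to 1.

    gsi: dict mapping a subset u (frozenset of variable indices) to its
    global sensitivity index rho(u, f)."""
    score = {}
    for u, rho in gsi.items():
        for i in u:
            score[i] = score.get(i, 0.0) + rho
    total = sum(score.values())
    return {i: s / total for i, s in score.items()}

# Toy example: variable 1 dominates both alone and in the interaction
gsi = {frozenset({1}): 0.6, frozenset({2}): 0.2, frozenset({1, 2}): 0.2}
r = attribute_ranking(gsi)
print(sorted(r.items(), key=lambda kv: -kv[1]))
```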

In this section, we describe how to obtain a set of ANOVA terms U from given data (X, y), X ⊂ ℝ^{d}, |X| = M, a superposition threshold d_{s}, and parameters N_{|u|}, |u| ≤ d_{s}, cf. Equation (12), to obtain

From the approximation, we obtain global sensitivity indices for the terms u ∈ U_{d_{s}} and an attribute ranking.

One obvious method is the truncation of an entire variable x_{i},

A different method is

Here, ε_{|u|} denotes the |

In summary, it is necessary to interpret the information from the approximation

We now apply the previously described method to the dataset [

We group the 12 attributes into 4 categories as in [

Attributes of the forest fires dataset and their corresponding groups.

No. | Group | Attribute | Description
---|---|---|---
1 | spatial (S) | X | x-coordinate (1 to 9)
2 | spatial (S) | Y | y-coordinate (1 to 9)
3 | temporal (T) | month | month of the year (1 to 12)
4 | temporal (T) | day | day of the week (1 to 7)
5 | FWI | FFMC | FFMC code
6 | FWI | DMC | DMC code
7 | FWI | DC | DC code
8 | FWI | ISI | ISI index
9 | meteorological (M) | temp | outside temperature in °C
10 | meteorological (M) | RH | outside relative humidity in %
11 | meteorological (M) | wind | outside wind speed in km/h
12 | meteorological (M) | rain | outside rain in mm/m^{2}

Codes of the FWI with their base components from the weather data according to [

Fine Fuel Moisture Code (FFMC) | temperature, relative humidity, wind, rain |

Duff Moisture Code (DMC) | temperature, relative humidity, rain |

Drought Code (DC) | temperature, rain |

Initial Spread Index (ISI) | wind, FFMC |

In terms of pre-processing, we apply a Z-score normalization to the M data nodes. In the following, we do not use all of the variables, but build models based only on some groups as denoted in

We chose the superposition threshold d_{s} = 2, cf. Equation (11), and, therefore, needed to detect optimal choices for the parameters N_{1} and N_{2} from Equation (12) (see

with

We are able to outperform the previously applied methods for every subset of attributes in both the MAD and the RMSE error. Notably, the difference in the RMSE, which penalizes larger deviations in the burned area more strongly than the MAD, is much more significant.
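The two error measures used here can be sketched as follows; the toy example shows why a single large miss affects the RMSE much more than the MAD:

```python
from math import sqrt

def mad(y_true, y_pred):
    """Mean absolute deviation between observations and predictions."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large deviations more than MAD."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

y_true, y_pred = [0.0, 1.0, 10.0], [0.0, 1.0, 4.0]
print(mad(y_true, y_pred), rmse(y_true, y_pred))  # 2.0 vs. about 3.46
```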

MAD and RMSE (in brackets) for the best performing model in the corresponding attribute subset (

Method | S T FWI | S T M | FWI | M
---|---|---|---|---
Naive | 18.61 (63.7) | 18.61 (63.7) | 18.61 (63.7) | 18.61 (63.7)

MR | 13.07 (64.5) | 13.04 (64.4) | 13.00 (64.5) | 13.01 (64.5) |

DT | 13.46 (64.4) | 13.43 (64.6) | 13.24 (64.4) | 13.18 (64.5) |

RF | 13.31 (64.3) | 13.04 (64.5) | 13.38 (64.0) | 12.93 (64.4) |

NN | 13.09 (64.5) | 13.92 (68.9) | 13.08 (64.6) | 13.71 (66.9) |

SVM | 13.07 (64.7) | 13.13 (64.7) | 12.86 (64.7) | 12.71 (64.7) |

ANOVA |

Optimal parameter choices for the experiments from

Subset | N_{1} | N_{2} | |I(U)| | λ
---|---|---|---|---|

S T FWI | 2 | 6 | 149 | e^{9} |

S T M | 2 | 10 | 261 | e^{10} |

FWI | 2 | 4 | 23 | e^{8} |

M | 2 | 8 | 47 | e^{7} |

While we replicated the setting of [, we computed the attribute ranking with N_{1} = N_{2} = 2 and λ = 1.0.

Attribute ranking with N_{1} = N_{2} = 2 and λ = 1.0.

Global sensitivity indices with N_{1} = N_{2} = 2 and λ = 1.0 (sorted). The green indices belong to sets

The attributes 3, 7, and 9 are clearly the most important. They represent the month of the year (3), the DC code of the FWI (7), and the outside temperature (9). Using only these three attributes and superposition threshold d_{s} = 2, we computed an approximation with N_{1} = 2, N_{2} = 10, and λ = e^{8}. The resulting model yielded a MAD of 12.64 and an RMSE of 45.57 with 30 repetitions of 10-fold cross-validation as before. In summary, the most important information of our problem is contained in only three attributes, and we also obtained a better performing model using only those three attributes.

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author/s.

Both authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

DP acknowledges funding by Deutsche Forschungsgemeinschaft (German Research Foundation)—Project-ID 416228727—SFB 1410. MS was supported by the German Federal Ministry of Education and Research grant 01IS20053A. The publication of this article was funded by Chemnitz University of Technology.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

The authors thank their colleagues in the research group SAlE for valuable discussions on the contents of this paper. Moreover, we thank the reviewers for their valuable comments and suggestions.

^{s} in weighted spaces with POD weights

L_{2}-norm sampling discretization and recovery of functions from RKHS with finite trace.