Performance Assessment of Metaheuristic Algorithms for Structural Optimization Taking Into Account the Influence of Algorithmic Control Parameters

Metaheuristic optimization algorithms are strongly present in the literature on discrete optimization. They typically 1) use stochastic operators, making each run unique, and 2) often have algorithmic control parameters that have an unpredictable impact on convergence. Although both 1) and 2) affect algorithm performance, the effect of the control parameters is mostly disregarded in the literature on structural optimization, making it difficult to formulate general conclusions. In this article, a new method is presented to assess the performance of a metaheuristic algorithm in relation to its control parameter values. A Monte Carlo simulation is conducted in which several independent runs of the algorithm are performed with random control parameter values. In each run, a measure of performance is recorded. The resulting dataset is limited to the runs that performed best. The frequency of each parameter value occurring in this subset reveals which values are responsible for good performance. Importance sampling techniques are used to ensure that inferences from the simulation are sufficiently accurate. The new performance assessment method is demonstrated for the genetic algorithm in matlab R2018b, applied to seven common structural optimization test problems, where it successfully detects unimportant parameters (for the problems at hand) while identifying well-performing values for the important parameters. For two of the test problems, a better solution is found than the best solution reported so far in the literature.


INTRODUCTION
Metaheuristic algorithms are widespread in the literature on discrete and combinatorial optimization. They operate by applying stochastic (as opposed to deterministic) operators to explore the design space and to guide the search toward optimal designs, implying the need for statistical techniques to properly assess their performance. Nowadays, formal statistical analysis is considered a prerequisite for papers on non-deterministic optimization algorithms (Le Riche and Haftka, 2012;Haftka, 2016).
Metaheuristic algorithms are typically controlled by a number of algorithmic parameters, which can be modified to tune the search procedure to the optimization problem at hand. These parameters are referred to as control parameters in this article. The performance of a metaheuristic algorithm depends on three factors: 1) the specific optimization problem at hand, 2) the values of the control parameters, and 3) the random variability inherent to stochastic algorithms.
In the structural optimization community, it is common practice to benchmark newly proposed algorithms against one or more reference algorithms in order to assess their performance. Such benchmark studies are usually based on a simulation in which multiple optimization runs are performed (to compensate for the effect of the random seed) on a series of representative test problems (to compensate for the effect of the optimization problem). However, the effect of the control parameters is mostly ignored: their values are either chosen by intuition or convention, taken from other studies, or motivated by ad hoc experimentation, usually limited to changing one parameter value at a time (Chiarandini et al., 2007). From that moment on they remain fixed, although Bartz-Beielstein (2006) has shown that conclusions regarding algorithm performance do not necessarily apply to other parameter values.
Considering the wide range of metaheuristic algorithms currently in use, it is important to be able to assess their performance in an objective manner. For example, in a benchmarking context it would be fair to tune the control parameter values of all algorithms under consideration-using comparable resources-before moving on to the actual comparison (Eiben and Smit, 2011). Moreover, data on optimal control parameter values across multiple studies might reveal search strategies that work particularly well for a specific kind of optimization problem. In this regard, Hooker (1995) made the distinction between competitive and scientific testing, which, in the context of metaheuristic algorithms, is still very relevant today. Besides finding out which algorithm is better (competitive testing), it is important to investigate why this is the case (scientific testing). Hooker describes the former as "antiintellectual," as he believes that the emphasis on competition does not contribute to the sort of insight that will lead to the development of better algorithms in the long run.
Various methods exist to assess the performance of metaheuristic algorithms, including performance profiles (Dolan and Moré, 2002), data profiles (Moré and Wild, 2009), a probabilistic performance metric based on the effect size in statistics (Gomes et al., 2018) and statistical analysis methods by e.g. Charalampakis and Tsiatas (2019) and Georgioudakis and Plevris (2020). However, these methods are rarely used in a structural optimization context and there is no consensus on which method to use instead. A common approach is to perform several independent runs of the algorithm and to summarize the results using descriptive statistics, but the number of different runs and the statistics reported vary from study to study.
There is a trend in metaheuristic optimization to either reduce the number of (external) control parameters, as for example in the Newton metaheuristic algorithm by Gholizadeh et al. (2020), or to develop algorithms that update their parameters on-the-go, although these often have fixed parameters as well (Hutter et al., 2009b). Whether this is beneficial depends on the robustness of the algorithm with respect to the control parameters. An algorithm with many control parameters, whose values have little effect on performance or can be easily selected using rules of thumb, may still be preferable to an algorithm with few control parameters that need careful tuning. Therefore, a good performance assessment method should integrate the sensitivity of the algorithm with respect to its control parameter configuration.
This article presents a new method, designed to assess the performance of a metaheuristic algorithm in relation to the values of its control parameters. The scope is limited to control parameters whose values are constant throughout the optimization. The method is based on Monte Carlo simulation. The general idea is to perform multiple optimization runs, where the values of the control parameters vary randomly. From the resulting dataset, a subset of well-performing runs (e.g. the 10% best) is studied. The occurrence frequency of the various parameter values in this subset indicates which parameter values give rise to good performance. The threshold for "good" performance is gradually tightened, making the results more relevant, but also less accurate, as the well-performing subset will contain fewer samples. An adaptive importance sampling technique is used to sample good parameter values more often, ensuring sufficient accuracy throughout the procedure.
The proposed method is mathematically rigorous and has the advantage of being able to process any number of control parameters, both numerical and categorical, at the same time, presenting the results in a visual and intuitive way. It is intended as a tool to help developers of metaheuristic algorithms to test and benchmark their algorithm. Four possible applications are given below.
• The method can be used to investigate the sensitivity of an algorithm with respect to the control parameters. It identifies parameters that do not have a significant effect on performance and detects well-performing parameter values for the important parameters, helping developers to focus on the relevant parts of their algorithm. • The method can also be used to select appropriate default values for the control parameters of a new algorithm, similar to the way in which the optimal and the default matlab values are compared for the genetic algorithm in Section 4. • The method records the probability of achieving different levels of performance when parameter values are randomly selected from a set of user-defined "reasonable" options. The results can be compared with other algorithms to make statements about the overall performance in case no good parameter values would be known and the parameter configuration would be chosen arbitrarily. • As a by-product, the method may generate new benchmark results for the community, as evidenced by the improvement over the best known solution for two of the test cases in Section 4.
The remainder of the article is structured as follows. Section 2 gives an outline of the state of the art on parameter tuning methods, providing an overview of ways to account for the influence of the control parameters. Section 3 presents the problem definition and describes the proposed methodology. Section 4 presents two numerical experiments, based on the genetic algorithm that is built into matlab, for seven test problems that are commonly used in the structural optimization literature. Section 5 provides final conclusions.

STATE OF THE ART IN PARAMETER TUNING
Automated parameter tuning is a well-studied problem in the field of computational intelligence and machine learning. The following paragraphs summarize the state of the art.

Early Work
Research on the influence of control parameters started not long after the introduction of genetic algorithms and was initially mainly theoretical. Examples include Bäck and Schwefel (1993) and Rechenberg (1973). However, such theoretical models were either overly complex or required severe simplifications, making them of limited use in practical settings (Myers and Hancock, 2001). It became customary to study the effect of the control parameters through experimentation, using empirical models to test different parameter settings. Some authors, such as De Jong (1975), attempted to formulate universally valid control parameter configurations, until the No Free Lunch theorems (Wolpert and Macready, 1997) shattered the illusion of finding parameter values that are universally applicable.

Meta-Optimization
Alternatively, automatic tuning methods have been studied that tune the parameters for a single specific problem. A popular example is meta-optimization, which follows from the observation that choosing control parameters for maximal performance is an optimization problem in itself. In the metalevel problem, the control parameters are the design variables, and the performance of the algorithm is the objective function (often called the performance landscape). Depending on the nature of the control parameters, the meta-level problem is a mixed variable stochastic optimization problem that can be solved for example by iterated local search (ParamILS by Hutter et al. (2009b)) or by another metaheuristic algorithm (Nannen and Eiben, 2007;Ansótegui et al., 2009). The meta-optimizer can be applied either directly to the performance landscape (Grefenstette, 1986) or to a surrogate model of it (Audet and Orban, 2004).

Design of Experiments, Model-Based Optimization and Machine Learning
A popular class of tuning methods is based on the principles of Design of Experiments (DOE), which aims to maximize the information obtained from experiments in highly empirical fields of science such as biology and psychology. Common approaches include factorial designs, coupled with analysis of variance or regression analysis, e.g. Adenso-Diaz andLaguna (2006), Coy et al. (2001) and Rardin and Uzsoy (2001). The DOE methodology was later replaced by the Design and Analysis of Computer Experiments (DACE) methodology (Sacks et al., 1989), specifically created for deterministic computer algorithms (Bartz-Beielstein et al., 2004). Bartz-Beielstein et al. (2005) introduced Sequential Parameter Optimization (SPO), which uses Kriging techniques to build a predictive model of the performance landscape based on an increasing number of observations that are selected through an expected improvement criterion. Hutter et al. (2009a), Hutter et al. (2010) gave suggestions to further improve the method. Hutter et al. (2011) later introduced Sequential Model-based Algorithm Configuration (SMAC), extending model-based approaches to categorical parameters and sets of multiple problem instances using random forests (collections of regression trees). Independently, Dobslaw (2010) combined the DOE methodology with artificial neural networks as a basis for a parameter tuning framework.

Model-free Algorithm Configuration
In a separate line of work, statistical hypothesis testing has been used to compare different control parameter settings, for example in Czarn et al. (2004), Castillo-Valdivieso et al. (2002) and Xu et al. (1998). In this approach, evidence is gathered on the basis of experimental runs to test whether the considered parameter settings show a difference in performance that is significant enough to be distinguishable from stochastic noise. Nevertheless, the number of experimental runs that is required for sufficient statistical accuracy is often quite large (Yuan and Gallagher, 2007). Birattari et al. (2002) presented an efficient alternative with the F-Race algorithm, which progressively tests a number of (predefined) parameter configurations on one or more benchmark problems and eliminates inferior ones as soon as significance arises. Balaprakash et al. (2007) presented two improvement strategies for the F-Race algorithm, called sampling design and iterative refinement, and Yuan and Gallagher (2007) proposed a combination of F-Race and metaoptimization. More recently, López-Ibáñez et al. (2016) presented the irace package, offering iterated racing with a restart mechanism to prevent premature convergence and elitist racing to ensure that the best parameter configurations are tested most thoroughly. Li et al. (2017) introduced Hyperband, formulating the parameter tuning problem as an infinite-armed bandit problem, adaptively allocating more resources to promising parameter configurations in order to enhance random search. Falkner et al. (2018) combined Hyperband with Bayesian optimization in the BOHB (Bayesian optimization and Hyperband) method.

Parameter Tuning in Structural Optimization
In a structural optimization context, research into parameter tuning has received little attention so far. However, there have been several studies on parameter control, where parameter values are updated during the optimization process. For a general overview of parameter control methods, the reader is referred to Karafotias et al. (2015). Examples of parameter control applied to structural optimization include Hasançebi (2008) for evolutionary strategies, Hasançebi et al. (2009) for harmony search, Nickabadi et al. (2011) for particle swarm optimization and Kaveh and Farhoudi (2011) for metaheuristic algorithms in general.
The present authors have recently proposed a new parameter tuning method for structural optimization (Dillen et al., 2018). The method uses Monte Carlo simulation to estimate the distribution of performance of a metaheuristic algorithm under the condition that control parameter x has value y. Although this method is able to correctly reflect the effect of the dominant control parameters, it has difficulty in coping with effects that are due to parameter interactions: if a particular value of one control parameter only performs well in combination with a particular value of another control parameter, this will not be detected by the method. Table 1 presents an overview of the symbols that are used in the text. By convention, vectors are represented by bold characters. A function X(θ) of θ that is denoted by an upper-case letter represents a random variable. The lower-case equivalent x represents a specific value of X. A hat means that the variable x is an estimate of x. Finally, the subscript k or superscript (k) means that the variable is computed on the basis of the samples from the kth stage in the simulation.

General Approach
The following paragraphs establish a solid mathematical basis for the performance assessment problem and continue with the derivation of a Monte Carlo estimator and an importance sampling estimator for the statistic of interest. Next, the relevant properties of ratio estimators are briefly discussed. After that, the characteristics of a good importance sampling distribution are described. Finally, a method is given to iteratively update the importance sampling distribution based on adaptive importance sampling.

Problem Definition
To model the uncertainty in the simulation, we introduce a probability space (Ω, S, Pr), where Ω represents the sample space, S the set of events, and Pr a probability measure that assigns probabilities to the events (Kolmogorov, 1956). The elements θ of Ω are considered to be selected by the true randomness that underlies all natural processes. Each element θ corresponds to a unique state of the random problem.
Suppose we want to assess the performance of a metaheuristic algorithm A on an optimization problem P. The outcome of a single run of the algorithm depends on the parameter configuration p (p 1 , p 2 , . . . , p n ) T , and a random component H(θ) that is inherent to the class of metaheuristic algorithms 1 . Each component of the parameter configuration vector p represents the value of a single control parameter. In this study, the value of p is considered to be a realization of an n-variate random vector P(θ), where is a function that maps the sample space Ω to the set of n-dimensional real vectors R n . Each component P i (θ) of P(θ) is a univariate discrete random variable, whose range R Pi {p i,1 , p i,2 , . . . , p i,n i } is the set of n i possible values that the corresponding control parameter can take. If the parameter is continuous, it is assumed to be discretized in a finite number of representative values. 2 For the sake of readability, the event that Adaptive importance sampling weight factor k Stage in the simulation K Number of stages in the simulation 1 In a computer implementation, the random behavior of the algorithm will be determined by the seed that initializes the pseudorandom number generator. In this case, H(θ) is the function that maps each elementary event θ to a specific seed number h. Depending on the software, the number of possibilities for the random seed can be finite or infinite. Throughout this study, we assume that the number of possible seeds is either finite or countably infinite, in which case H(θ) is a discrete random variable. The case where H(θ) is continuous is similar, but it requires integrals instead of sums. 2 The current methodology is based on discrete random variables. The use of continuous variables in combination with kernel density estimation may be interesting for further research, but it is beyond the scope of this article.
Frontiers in Built Environment | www.frontiersin.org March 2021 | Volume 7 | Article 618851 P i (θ) p i,j will be referred to as S ij ∈ S in the sections that follow, for all i 1, 2, . . . , n and j 1, 2, . . . , n i . In order to proceed with the assessment, an appropriate performance measure is required. This is typically a scalar that is a function of either the objective function value, the number of function evaluations, or a combination of both, see e.g. Eiben and Smit (2011). Let "perf(·)" be the function that determines the performance measure, and write M perf (A(P, H), P), where a low value of M corresponds to good performance. In the case where both the algorithm A and the optimization problem P are given, the performance measure is denoted more briefly as M perf (P, H).
We proceed by defining the statistic of interest, which we call r ij . The objective is to perform multiple independent runs of the algorithm with random values for the control parameters, and to study the distribution of parameter values in the subset of results where performance is good. Let m * denote the threshold for "good" performance. The statistic of interest r ij is formally defined as the probability that parameter P i has value p i,j in the case where performance is better than the threshold value m * . It is computed using Bayes' theorem: This equation can be interpreted as a Bayesian inference problem, where an initial guess for good control parameter values Pr(S ij ) (the prior) is updated on the basis of data from a simulation Pr(M ≤ m * S ij ) (the likelihood) to obtain a better understanding of the distribution of parameter values in case of good performance Pr(S ij M ≤ m * ) (the posterior). If there would be any prior knowledge on good parameter values (e.g., from previous experiments) it can be included in the prior distribution. In the absence of such knowledge, one can simply use a uniform prior distribution that makes all parameter values equally likely, following the principle of indifference (Jaynes, 2003). This is equivalent to simple random sampling.

Monte Carlo Estimator
We proceed by deriving a Monte Carlo estimator for the denominator Pr(M ≤ m * ) in Eq. 3. Let f P (p) and f H (h) denote the probability mass functions of P and H, respectively. Assuming both variables are statistically independent, their joint probability mass function f (p, h) is given by with R H and R P denoting the range of H and P, the denominator in Eq. 3 can be written as where m perf (p, h) and 1 m ≤ m * (m) is the indicator function that equals 1 when m ≤ m * and 0 otherwise. The right-hand side of Eq. 5 corresponds to the definition of the expected value of 1 m ≤ m * (M), making it equivalent to where E f denotes mathematical expectation with respect to f (p, h). It is not possible to evaluate this expression directly, but it is possible to construct a Monte Carlo estimate of the form where M l perf (P l , H l ) and {(P l , H l ) : l 1, 2, . . . , N} is a set of N independent and identically distributed (i.i.d.) samples drawn from f (Hammersley and Handscomb, 1964). The Monte Carlo estimator is unbiased and asymptotically normally distributed, The relative accuracy of the estimator is given by its coefficient of variation (c.o.v.), which is defined as the standard deviation relative to the mean: For different sets of samples, the value of the Monte Carlo estimate μ den will fluctuate around its true value μ den , and its accuracy will increase as the size of the sample set increases. However, when μ den is small (which is the case when the performance threshold m * is strict) the c.o.v. will be large, requiring a large number of samples to obtain sufficient accuracy.

Importance Sampling
The accuracy of the estimator can be improved by importance sampling, a variance reduction technique in which samples are drawn from a probability distribution other than the original distribution. The term "importance sampling" derives from the idea that it is better to concentrate the samples where they are most important to the target function, rather than spreading them out evenly (Hammersley and Handscomb, 1964). The mismatch between the original and the new sampling distribution is corrected by assigning importance weights to the individual samples, in order to avoid bias in inferences based on the simulation.
Let g(p, h) denote the new sampling distribution (usually called the proposal distribution) and w(p, h) f (p, h)/g(p, h) the importance weights. Eq. 5 can be modified in the following way: as long as g(p, h) > 0 when 1 m ≤ m * (m) f (p, h) ≠ 0. The associated importance sampling estimator is: with (P l , H l ) this time drawn i.i.d. from g (Owen and Zhou, 2000). As before, the estimator is unbiased and asymptotically normally distributed, with the same mean value μ IS,den μ den as the regular Monte Carlo estimator, and variance 1 where the expectation E is first taken with respect to g and then with respect to f. The expressions for the variance of the regular and the importance sampling estimator are identical except for the factor w in Eq. 16. In order for the variance of the importance sampling estimator to be lower, the proposal distribution g should be chosen such that w is on average smaller than one whenever 1 m ≤ m * (M) 1. The problem of choosing a good proposal distribution is not trivial and will be addressed in Section 3.7.

Ratio Estimator
From the previous Section we have obtained an unbiased estimator for the denominator of the statistic of interest in Eq. 3. The estimator for the numerator Pr(M ≤ m * S ij )Pr(S ij ) can be derived analogously. Using the definition of conditional probability, it is reformulated as This is further elaborated as follows: where 1 S ij (p) equals 1 when P i p i,j and 0 otherwise. The associated importance sampling estimator is computed as with (P l , H l ) again drawn i.i.d. from g. Combining the estimators for the numerator and denominator yields estimators r ij for the statistic of interest r ij , for all i 1, 2, . . . , n and j 1, 2, . . . n i . As these new estimators are computed as the ratio of the means of two random variables, they are usually called ratio estimators. The ratio estimators for the statistic of interest are defined as: The estimators are well-defined everywhere except when μ IS,den 0, in which case no samples fall in the wellperforming region, meaning that either the sample size N should be increased or the performance threshold m * should be relaxed. Two areas of concern remain: 1) ratio estimators are biased, and 2) there is no exact formula to compute the variance of a ratio estimator, which we would like to know in order to formulate confidence bounds.

Bias Correction
Ratio estimators are biased, even if the estimators in the numerator and denominator are not. Methods that correct for this bias are readily available, examples of which include Beale's estimator (Beale, 1962) and the jackknife estimator (Choquet et al., 1999). However, the bias will asymptotically approach 0 as N → ∞, making ratio estimators approximately unbiased for large sample sizes (Ogliore et al., 2011). Based on the numerical experiments in this study, it was concluded that the bias is not decisive, motivating the use of uncorrected ratio estimators in the following.

Variance Approximation
There is no exact expression for the variance of a ratio estimator. The following approximation is commonly used (Stuart and Ord, 1994): where s 2 num and s 2 den denote the sample variance of the data points in the numerator and the denominator, respectively, and s 2 num,den is the sample covariance. The sample variance and covariance are computed as and s 2 num,den where X ij,l 1 m ≤ m * (M l ) 1 Sij (P l ) w(P l , H l ) and Y l 1 m ≤ m * (M l ) w(P l , H l ) denote the sample values in the numerator and denominator, for all sample points l 1, 2, . . . , N.

Choosing an Appropriate Proposal Distribution
The effectiveness of the importance sampling approach depends directly on the choice of proposal distribution. The proposal distribution will be chosen to minimize the variance of the estimator in the denominator, as this is the same for all parameter/value combinations. Let the true value of the denominator be denoted by μ. The theoretical optimum for the proposal distribution is which gives σ 2 IS 0 and makes the importance sampling estimator in Eq. 14 exact (Hammersley and Handscomb, 1964). When restated as it becomes clear that the optimal proposal distribution has mass only at those values (p, h) for which the performance measure m perf (p, h) falls below the threshold m * . Unfortunately, this equation can not be used in practice, since it is not known in advance which parameter configurations p give rise to good performance. The equation does however state that a good proposal distribution must be similar to 1 m ≤ m * (m) f (p, h), which equals the integrand in Eq. 5.
In other words, the samples that are generated early on in the simulation can be used to construct increasingly better approximations to the optimal proposal distribution, which is the core concept of adaptive importance sampling (Cornuet et al., 2012). The optimal proposal distribution g * (p, h) has mass only in the region of good performance. Good approximations to g * (p, h) will favor good control parameter values over bad ones. Consequently, the proposal distribution g(p, h) should be a probability mass function that is uniform in the dimension of h and has relatively more mass near good parameter configurations p. Since the dataset from the Monte Carlo simulation will in general not contain a sample point for all possible vectors in R P , a way is needed to fill in the absent data. One way would be to compute g(p, h) through multivariate kernel density estimation, but the resulting distribution would depend heavily on the chosen bandwidth. Instead, g(p, h) is approximated as the product of the (univariate) marginal distributions of its components, an approach due to Lepage (1978): In this equation, g Pi (p) is the probability of the value p being sampled for control parameter P i (θ), which is computed as g Pi p max r ij , g min for p p i,j ∈ R Pi 0 elsewhere , with r ij the estimator from Eq. 20, calculated from previous samples, and g min a small lower bound that prevents g Pi (p) from becoming zero in the range of P i . A high value of g min makes the proposal distribution deviate more from the optimum, slowing down convergence, but a low value of g min might cause disproportionately high importance weights when a parameter value that is considered bad performs much better than suggested by the proposal distribution (which is an estimate), affecting the variance estimate. In this study, a lower bound of g min 0.03 is used. Due to the introduction of g min it is possible that p ∈ R P i g P i (p) becomes greater than 1, in which case g P i (p) should be properly normalized.
The proposal distribution is used to compute the importance weights for the individual samples. The value of the importance weight w(p, h) represents the probability of the observed control parameter configuration p being sampled from f, relative to the same configuration being sampled from g: The factors f H (h) in the numerator and the denominator cancel each other out. If a given parameter value is twice as likely to be sampled from g as compared to f, the sample's contribution to the estimator will be halved. As a result, the mean value of the estimator remains the same, but its variance decreases.

Updating the Proposal Distribution
Several updating schemes may be adopted for the proposal distribution g. It can be updated every time a sample is drawn, but then the samples will not be i.i.d., which means that no variance estimates can be computed. Alternatively, the simulation can be subdivided into multiple stages, where g is updated at the end of each stage. In the initial stage, the proposal distribution g is equal to the original distribution f. From then on, g is computed based on the samples in all previous stages. Within each stage k, the samples are i.i.d., and can be used to calculate in-stage estimates r (k) ij for the statistic of interest, as well as estimates for the variance V( r (k) ij ). The accuracy of the in-stage estimates grows with k, due to the proposal distribution becoming increasingly better.
This article uses a variant of the latter approach, as it allows the performance threshold value m * to be updated according to a continuation scheme. In the first stage of the simulation, there is no threshold on performance. In the stages that follow, the threshold is gradually tightened, making the results more and more relevant as the simulation progresses. Although this approach causes the statistic of interest (and thus the optimal proposal distribution) to vary from stage to stage, the samples from the initial stages will still provide a solid basis for calculating g later in the simulation, as long as m* does not change too abruptly.
The method proceeds as follows. First, let r (k) ij,m * denote the estimator corresponding to the performance threshold value m * , computed from the samples in stage k. The kth value in the continuation scheme for m * is indicated by m k . In the first stage of the simulation, samples are drawn from f, which are used to compute the estimates r ij,m 1 (for all parameter/value combinations) along with estimates for the variance. Samples continue to be drawn until all estimators have a variance lower than some predefined threshold: after which the first stage is completed. Next, the samples from stage 1 are used to compute a rough estimate r ij,m 2 , which is plugged into Eq. 29 to obtain the proposal distribution for the second stage. In the second stage, additional samples are drawn from the new proposal distribution to improve the estimate r ij,m2 , which is now computed as the weighted sum of the in-stage estimates in stages 1 and 2: where K 2 denotes the index of the current stage, and W k are weight factors that satisfy K k 1 W k 1. The variance of the resulting estimator is approximated by  (Owen and Zhou, 1999). As soon as this estimate is sufficiently accurate, a rough estimate r ij,m 3 is computed from the samples in stages 1 and 2, which in turn is used to obtain the proposal distribution for stage 3, and so on, until all stages have been completed. The continuation scheme for the performance threshold is based on percentile values, where m k equals the value of the performance measure that corresponds to the α k -th performance percentile (of the initial probability distribution f), with α k the kth element from the set {100, 50, 20, 10, 5, 2, 1}. Following this approach, in the first stage all algorithm runs are considered, in the second stage only the 50% best-performing runs are considered, then the 20% bestperforming runs, and so on. The performance percentile values are computed from the samples in the simulation as is the cumulative distribution function of M, which is calculated according to Eq. 14, and F −1 M (α) is the generalized inverse distribution function of M for any α ∈ [0, 1].
The optimal way to combine the in-stage estimates is to use weight factors W k equal to which will minimize the variance of the final estimator (Owen and Zhou 1999). However, the true variance of the in-stage estimates is unknown, and plugging in approximations might introduce bias. Instead, Owen and Zhou (1999) recommend to use the square root rule, a deterministic weighting scheme with weight factors proportional to k √ , for which performance is nearly optimal when the number of samples is the same in all stages. For the case where the number of samples is different in each stage, we propose the following variant: in which N k represents the number of samples in stage k. The efficiency of this weighting scheme is comparable to the original square root rule, as is shown in Supplementary Appendix SA. When the estimators r ij are undefined, the corresponding stage is disregarded. This typically occurs at the beginning of a stage, where the denominator in Eq. 20 is zero. Additionally, the variance approximation of the estimator V( r ij ) might be undefined for poorly performing parameter values, as these will have little or no samples in the well-performing region. Stages that end prematurely due to inaccurate variance approximations are avoided by the stopping criterion in Eq. 31, which uses the maximum variance across all parameter/value combinations, including the well-performing ones where variance approximations are more important and where they will be accurate.

NUMERICAL EXPERIMENTS AND DISCUSSION
As a test case, the new method is used to assess the performance of the genetic algorithm (GA) that comes with matlab R2018b. We consider six control parameters in total: four parameters specific to the GA (PopulationSize, EliteCountFraction, CrossoverFraction, and FitnessScalingFcn), one penalty parameter κ that determines the degree to which infeasible designs are penalized (following the approach by Rajeev and Krishnamoorthy (1992), see Supplementary Appendix SB)) and one Dummy parameter that has no effect and serves as a reference. Note that it is not required to know how the control parameters work in order to interpret the results, which is one of the strengths of the method. The GA starts by randomly creating an initial population (which is not necessarily feasible) and stops when the relative change of the objective function value over the last 50 generations becomes smaller than the tolerance value of 10 −6 (both are default values), or when a user-defined limit on number of function evaluations is exceeded. If the final solution is feasible, the performance measure is taken as the objective function value reported by the algorithm. If the final solution is infeasible, the performance measure is set to infinity, preventing the run from being included among the wellperforming results.
As test cases, we consider seven discrete truss sizing optimization problems for which the objective is to minimize structural weight. These are the 10-, 15-, 25-, 47-, 52-, 72-, and 200-bar truss problem, all of which have been used as test cases in the literature before, e.g. in Rajeev and Krishnamoorthy (1992), Camp and Bichon (2004), Lee et al. (2005), Ho-Huu et al. (2016, and Degertekin et al. (2019). Their respective problem formulations are given in Supplementary Appendix SB. All problems are subjected to constraints on member stresses and nodal displacements.
In the following subsections, three numerical experiments are discussed. The first experiment investigates whether wellperforming values for the control parameters depend on the computational budget that is given to the GA, which corresponds to the maximum number of function evaluations (structural analyses) it is allowed to perform within a single run. The second experiment compares the best-performing control parameter values for similar optimization problems of different sizes, and investigates whether any trends can be observed. In both experiments, the estimators r ij are accurate up to σ max 0.025. The third and final experiment compares the performance of the GA with default and optimized parameter values.

Impact of the Computational Budget
The first experiment investigates the impact of the computational budget of the GA when applied to the 10-bar truss problem. Three Monte Carlo simulations are performed, in which the algorithm is given a computational budget of 10 4 , 10 5 , and 10 6 function evaluations. The results are presented as heat maps in Figures 1-3. Each heat map represents the sequence of probability distributions, corresponding to the different stages in the simulation, for a single control parameter. The rows of the heat map show the distribution of control parameter values within a single stage, where the top 1% of results (highlighted in red) will be most representative of the optimum. The mode of this distribution (darkest in color) is the parameter value that occurs most often in this subset. The parameter value that is most likely to give good performance has the highest probability of occurring in the well-performing subset, the value that is second most likely has the second highest probability, and so on. When applicable, the matlab default values are shown in bold. 3 The width of the 95% confidence interval equals four times the standard deviation σ max 0.025 of the estimators-assuming that the sample is large enough for asymptotic normality to hold. If the confidence intervals of two bars are non-overlapping, the difference is statistically significant, but the converse is not necessarily true (McGill et al., 1978).
The results for the low-budget case (10 4 function evaluations) are shown in Figure 1. The GA finds the best known solution to the problem in 2% of runs, making the results for the top 1 and 2% identical. The best-performing value for the PopulationSize is 50, which corresponds to a maximum of 200 generations. The best-performing EliteCountFraction (the fraction of good designs that is maintained between generations) is somewhere in the range of 0.1-0.2, and the best-performing value for the CrossoverFraction is 0.6. All probability distributions in the top 1% appear to be unimodal, with the mode of the distribution for the PopulationSize being the most distinct. The distribution of values for the penalty parameter has a sharp peak near κ 1.1, which is in line with findings by e.g. Richardson et al. (1989) who argue that penalties that are too low unjustly promote infeasible designs, whereas penalties that are too high discard useful information from designs that are only slightly infeasible. No significant conclusions can be drawn for FitnessScalingFcn, although the second and fourth options seem to perform slightly better than the others, or for the Dummy parameter, which is to expected as it has no effect on the GA. The variability still present here shows the extent to which the algorithm has converged, illustrating the difference between control parameters that do have an effect and those that do not.
For the medium-budget case (10 5 function evaluations) in Figure 2, the best-performing value for the PopulationSize is 500, which again corresponds to a maximum of 200 generations. The mode of the other parameter value distributions is similar to the low-budget case, but the variance is generally larger: the bestperforming values are the same, but the GA is less sensitive to the control parameters. The same goes for the high-budget case (10 6 function evaluations) in Figure 3, where it is better to use a PopulationSize of 1,000 (and maybe larger), which still allows the algorithm to run for up to 1,000 generations. Unsurprisingly, a clear trade-off is observed between the size of the population on the one hand and the maximum number of generations on the other. Furthermore, the results confirm that the impact of the control parameters is largest when the computation budget is small and less pronounced otherwise, in line with classic (theoretical) proofs that many metaheuristic algorithms are guaranteed to find the global optimum if one lets them run indefinitely (Rudolph, 1994).
It should be noted that parameter dependencies cannot be read directly from the figures. However, the relevant dependencies (those that give rise to good performance) are implicitly included in the well-performing subset of results from the Monte Carlo simulation. Limiting inferences to this subset prevents unimportant parameter interactions from obscuring the conclusions.
The performance threshold values m k in each stage k of the low-, medium-and high-budget case are shown in Table 2. The best design obtained during the experiment has an objective function value of 5490.74 lb, equaling the best known solution to this problem in the literature. For the low-budget case, the GA will achieve this objective function value in 2% of runs, when parameter values are randomly drawn from f. For the mediumbudget case, the GA will achieve this value in 10% of runs, and for the high-budget case even in 20% of runs, showing that the GA is quite robust for its control parameters (for this problem) if the computational budget is large enough. The results in Table 2 can be used to compare the GA with other algorithms once a set of plausible parameter values is defined, which of course is still somewhat subjective.

Comparison Between Multiple Test Problems
The second experiment compares the best-performing control parameter values for all seven test problems and aims to determine whether any general conclusions can be drawn. In all test problems, the GA is limited to 10 5 function evaluations per run. The results for the 10-bar truss problem have already been presented in  is omitted. At the end of the section, the results for the complete set of test problems are analyzed for trends regarding wellperforming parameter values. The best-performing value for the PopulationSize varies between 200 and 1,000 for the different test problems. The best-performing value for the EliteCountFraction ranges from 0.1 to 0.5. The best-performing value for the CrossoverFraction ranges from 0.4 to 1. The results for the FitnessScalingFcn are inconclusive, although the third option ("fitscalingtop") consistently performs worse. For the small-scale problems (25bar truss and 52-bar truss) the performance is higher with high values for the penalty parameter κ, while for the large-scale   With enhanced control parameter values, the GA has succeeded in improving the best known solution to the last two (well-studied) test problems, providing the community with new benchmark results while demonstrating the strength of the proposed method.
The number of sample runs and objective function evaluations performed in this experiment are listed in Table 3. They are of the same order of magnitude as reported for existing parameter tuning methods (Montero et al., 2014), depending on the accuracy of the estimates and the computational budget of the algorithm. Nevertheless, the computing power required to make the assessment is high. Investing such effort may be particularly relevant for optimizing parameters for a set of similar problems, for finding new benchmark results (optimal solutions) for the test problems, or for looking for general trends in important parameters and well-performing parameter values for specific types of optimization problems, such as discrete truss sizing optimization problems. The latter is investigated in the next paragraph.  Frontiers in Built Environment | www.frontiersin.org March 2021 | Volume 7 | Article 618851 Figure 8 summarizes the top 1% parameter value distributions of all problems in the test set, shown in order of increasing problem size (number of discrete solutions in the design space). Some trends do seem to emerge. The best-performing value for the PopulationSize decreases with the problem size, while the function evaluation limit remains constant, corresponding to an increasing number of allowable generations. The best-performing values for the EliteCountFraction and the CrossoverFraction generally increase with the problem size, although often multiple options perform similarly (e.g. CrossoverFraction equal to 0.6, 0.8 or 1 for the 52-bar truss problem or EliteCountFraction equal to 0.2 or 0.5 for the 200-bar truss problem). The results for FitnessScalingFcn are not statistically significant, but the second option ("fitscalingprop") consistently performs well and the third option ("fitscalingtop") always performs poorly. Good values for κ cannot be predicted based on the size of the optimization problem alone, although its effect is less important when the problem size is large. Robust parameter values (i.e. values that perform well for all test problems) are 200 for the PopulationSize, 0.2 for the EliteCountFraction and 0.6 or 0.8 for the CrossoverFraction. Robust values for FitnessScalingFcn are "fitscalingrank" (the default value) and "fitscalingprop." Finally, a high value for κ is always a safe choice.
The conclusions for the second experiment are as follows. For all test problems, the impact of the PopulationSize, the EliteCountFraction and the CrossoverFraction is significant, and the mode of the corresponding parameter value distribution becomes clearer as the problem size increases (with a fixed computational budget). Similar to the first experiment, the impact of the control parameters is larger when the computational budget is small. The best-performing value for the EliteCountFraction is consistently higher than the default value in MATLAB. The best-performing value for the CrossoverFraction varies, and seems to be proportional to the problem size. The bestperforming value for the PopulationSize depends on the problem size and on the computational budget of the algorithm. The bestperforming control parameter values found by the proposed method do not always match the default values in matlab.

Default vs. Optimized Control Parameter Values
The best-performing control parameter values found in the previous subsection are generally not the default values in matlab. Hence, the performance of the GA with default parameter values (listed in Table 4) is compared to the performance of the algorithm with optimized control parameter values (well-performing values from the previous subsection, listed in Table 5).
The results of the comparison are summarized in Figure 9, using (custom) box plots. Each box plot represents 100 independent runs of the GA, with random starting points and a computational budget of 10 5 function evaluations. Infeasible solutions are given an objective function value of infinity. If (part of) a box plot is cut off at the top of the figure, it means that the corresponding objective function value goes to infinity.
The figure shows that the performance of the GA with optimized control parameter values is better than the performance with default values in almost all possible areas. For the 10-, 15-, 25-52-, and 72bar truss problems, the GA matches the best known solution in the literate in at least 90% of runs, and even in 100% of runs for the 15and 25-bar truss problems, performing significantly better than with default control parameter values. For the 47-bar truss problem, the best result in the simulation (2377.5 lb) is slightly less good than the best result in the literature (2372.2 lb in this article, 2376.0 lb otherwise), but again the performance of the algorithm is considerably better than with default parameter values. For the 200-bar truss, the extremes of performance are amplified: the minimum, 10-th percentile and median objective function values are a lot better with optimized parameter values, although the algorithm more often converges to an infeasible design. With default control parameter values, at least one infeasible design is present among the results for 6 out of 7 test problems, compared to 3 out of 7 test problems in the case of optimized control parameter values.

CONCLUSIONS
A method has been developed to assess the performance of a metaheuristic algorithm in relation to the values of the control parameters. It is based on Monte Carlo sampling of independent algorithm runs and uses importance sampling as a variance reduction technique. The method is demonstrated using the genetic algorithm (GA) built into MATLAB for seven representative discrete truss sizing optimization problems that are commonly used in structural optimization literature. The method successfully captured the sensitivity of the GA with respect to the control parameters, identifying relevant parameters whose values need careful tuning as well as parameters that have little impact for the problems at hand. For the important control parameters, wellperforming values are identified that consistently outperform the default matlab values for the benchmark problems considered.
The results in this study indicate that the impact of the control parameters is largest when the limit on the number of function evaluations is low. When the limit on the number of function evaluations is high, the algorithm becomes more robust to the values of its control parameters, affirming that longer computing times can (partially) compensate for poor parameter value choices.
The method described in this study can be used in a variety of ways. Information on the sensitivity of an optimization algorithm with respect to its control parameters can help developers to focus on the relevant part of their algorithm. In addition, the method can be used to select appropriate default values for the control parameters based on a representative set of test problems. The performance percentiles resulting from the simulation can be used to compare the algorithm with other metaheuristic algorithms. Finally, the Monte Carlo simulation may provide the academic community with new benchmark results, as illustrated by the improvements over the best designs reported so far in the literature for the 47-bar truss and the 200-bar truss test problems (found in Section 4 of this article).
Using the method to find good control parameter values is computationally expensive, as it requires several optimization runs to be performed. Therefore, it may be worthwhile to look for general guidelines regarding control parameter values that perform well. The results of all seven test problems have been compared to look for dependencies between well-performing parameter values and the number of discrete solutions in the design space. Robust parameter values (values that perform reasonably well for all problems in the test set) have been identified. For the test problems considered, it is found that the best-performing value for the PopulationSize of the GA heavily depends on the size of the optimization problem, and on the maximum number of function evaluations in a single run of the algorithm. Furthermore, it is found that the best-performing value for the EliteCountFraction is consistently higher than the default setting, and that the best-performing value for the CrossoverFraction is roughly proportional to the size of the design space.

DATA AVAILABILITY STATEMENT
The source code of the example problems presented in the study are included in the article/Supplementary Material, further inquiries can be directed to the corresponding author.