The Influence of Sample Size on Parameter Estimates in Three-Level Random-Effects Models

In educational psychology, observational units are oftentimes nested within superordinate groups. Researchers need to account for hierarchy in the data by means of multilevel modeling, but especially in three-level longitudinal models, it is often unclear which sample size is necessary for reliable parameter estimation. To address this question, we generated a population dataset based on a study in the field of educational psychology, consisting of 3000 classrooms (level-3) with 55000 students (level-2) measured at 5 occasions (level-1), including predictors on each level and interaction effects. Drawing from this data, we realized 1000 random samples each for various sample and missing value conditions and compared analysis results with the true population parameters. We found that sampling at least 15 level-2 units each in 35 level-3 units results in unbiased fixed effects estimates, whereas higher-level random effects variance estimates require larger samples. Overall, increasing the level-2 sample size most strongly improves estimation soundness. We further discuss how data characteristics influence parameter estimation and provide specific sample size recommendations.


INTRODUCTION
In educational research and the field of psychology in general, researchers oftentimes face the statistical problem of nested data structures. Such structures occur if measured entities belong to superordinate groups. For example, researchers might examine children (lowest level) nested within classrooms (medium level) and schools (highest level). In statistical models, this nested structure has to be respected. One prominent approach is multilevel analysis (MLA). The fundamental assumption of MLA is "that there is a hierarchical data set, [. . .] with one single outcome or response variable [. . .], and explanatory variables at all existing levels" (Hox et al., 2017, p. 8). Hence, a system of regression equations at different hierarchical levels describes the influence of the explanatory variables. MLA is also useful for repeated measures of the same variable over a period of time within the same entities (e.g., diary studies).
Abbreviations: COM, simulated condition indicating complete data without missing values; DrOP2, simulation condition indicating data with a dropout pattern such that 20% of level-2 units do not produce data for the last two measurement occasions; DrOP3, simulation condition indicating data with a dropout pattern such that 20% of level-2 units do not produce data for the last three measurement occasions; ES, effect size; ICC, intraclass correlation coefficient; MCAR, simulation condition indicating data with 20% of level-1 values (measurement occasions) missing completely at random; MLA, multilevel analysis; Peb, parameter estimation bias.
In general, disregarding hierarchical data structures leads to biased fixed effects estimates and standard errors (McNeish and Wentzel, 2017), which in turn influence the accuracy of significance tests (e.g., Chen, 2012) and statistical power calculations (e.g., Konstantopoulos, 2008). It can also lead to incorrect conclusions: the ecological fallacy describes the incorrect inference that effects at the group level are the same for individual members of the group (Piantadosi et al., 1988). The atomistic fallacy describes the opposite case, where conclusions based on results for a subgroup are erroneously generalized (e.g., Hox, 1995). Applying MLA reduces the chances of drawing these false conclusions based on inaccurate analysis results.
Besides these general advantages of MLA, the quality of estimation results in multilevel models is influenced by different characteristics (e.g., the number of units at the different levels, scale-level of variables, presence or absence of interaction effects, violation of assumptions, or estimator properties). Comprehensive analytical solutions to evaluate specific estimation results are, however, rare. In the past years, authors have derived formulae for power or sample size calculations for different three-level models (Heo and Leon, 2008;Konstantopoulos, 2008;Cunningham and Johnson, 2016;Hooper et al., 2016), but in comparison, empirical power has been found to be slightly lower than theoretical power (Heo and Leon, 2008). As an alternative to analytical solutions, simulation studies examine the estimation quality under specific sampling and data conditions, especially varying sample sizes. While the majority of such studies is concerned with twolevel models (e.g., Hox, 2004, 2005), there are many research settings where multiple nesting structures frequently result in at least three-level data, such as the Program for International Student Assessment (PISA) or the Trends in International Mathematics and Science Study (TIMSS), which have been analyzed using two-level (Caponera and Losito, 2016) and three-level models (Webster and Fisher, 2000;Lamb and Fullarton, 2002). Alternatively, complex data structures easily require three-level MLA if a longitudinal component is added, such as schools (level 3) with students (level 2), which are studied on different measurement occasions (level 1). Notably, a distinction is to be made between the three-level nature of a dataset and the appropriate statistical model. Longitudinal dyadic data with distinguishable members, where members (level-2) of dyads (level-3) are measured over time (level-1), for example, have to be modeled as two-level data rather than three-level data (Laurenceau and Bolger, 2005;Kenny and Kashy, 2011;Planalp et al., 2017).
With the present simulation study, we highlight and address specific issues researchers in educational psychology or a setting with comparable samples, models, and effects face when considering MLA for three-level data. Specifically, we simulate and analyze prototypical data based on results of an empirical study in the field of educational psychology as a starting point to address the following questions: What is the overall required sample size and what is the optimal allocation of observational units across levels in order to obtain valid and reliable results? In particular, this study utilizes practice-relevant effects by generating nested data based upon the prototypical three-level regression model from an empirical study. Estimation quality of fixed effects and random effects variances (r.e. variances in the remainder) is assessed and compared by simulating various sample size conditions with different numbers of units at each data level, and analyzing these samples using MLA. By using data reflecting typical MLA-results and sample characteristics of an illustrative psychological study, these analyses are a practical approach to identify important factors in similar research settings when deciding on the appropriate overall sample size and allocation at the different levels.

THE MULTILEVEL MODEL
Formulae (1) to (3) below provide the notation for the most general three-level regression model with one predictor per level as well as cross-level interactions, random slopes, and intercepts. Depending on the data, the hypothesized effects, and their interactions, models may in practice contain less (e.g., no random effects) or more (e.g., additionally predictors) components on each level. Given k = 1,. . .,K level-3 groups, j = 1,. . .,J level-2 subgroups, and i = 1,. . .,n level-1 units, this model contains the dependent variable Y ijk , and predictors X ijk , Z jk , and W k .
Level 2 Level 3 Regression equation (1) corresponds to the first level of the ML-model, with intercept β 0jk , explanatory variable X ijk multiplied with the regression coefficient β 1jk , and residuals e ijk . Its components are composed of subsequent higher-level equations. The next two formulae represent the second data level. The first part of (2) depicts how the intercept β 0jk is decomposed into the level-2 intercept γ 00k , slope γ 01k times explanatory variable Z jk , and level-2 residual u 0jk . The second part shows how the level-1 slope β 1jk is decomposed into the level-2 intercept γ 10k , slope γ 11k times Z jk , and the residual u 1jk . Analogously, formula (3) builds the third data level with level-3 predictor W k , relative intercepts δ 000 , δ 100 , δ 010 , and δ 110 , slopes δ 001 , δ 101 , δ 011 , and δ 111 , and residuals v 00k , v 10k , v 01k , and v 11k .
The dependency in the data resulting from such a multilevel structure is measured by the intraclass correlation coefficient (ICC), indicating how strongly units of the same cluster are more similar to each other than units of different clusters. The ICC values express the variability on one data level in relation to the overall variability in the data, and are hence calculated as in (4) to (6), with ε 2 e (level-1 variance), τ 2 u0 (level-2 intercept variance), and σ 2 v0 (level-3 intercept variance) of the unconditional model (i.e., a model including no predictor variables).
(5) and (6) both present versions for the level-2 ICC, with (5) including the level-3 r.e. variance component as additional variability. As Hox et al. (2017, p. 21) argue, both equations can be used to evaluate dependencies in the data: While (5) evaluates the correlation between two level-1 units of the same level-2 unit, (6) is advantageous if the aim is to measure the proportion of the total unexplained variance that is situated at level-2. The MLA approach to clustered data is very flexible, yet these complex models require an adequate amount of observations at each level in order to obtain sound estimates of fixed effects and r.e. variances. In the remainder, we will first present coefficients representing the soundness of model-parameter estimates, review the literature on samplesize requirements in two-and three-level models, and finally describe the simulation study in detail. Throughout, we refer to the number of units on each level as follows: N 3 = level-3 sample size, N 2 = level-2 sample size, and N 1 = level-1 sample size or number of measurement occasions in longitudinal assessments.

ASSESSING QUALITY OF PARAMETER ESTIMATION
A common measure for assessing the accuracy of parameter estimates is the parameter estimation bias (peb) which calculates the fraction of under-or overestimation of a true value θ by its estimates θ k * in n samples using formula (7) (adapted from Bandalos and Gagné, 2014). Muthén and Muthén (2002) suggest that the bias should not exceed 0.10.
The confidence interval is commonly used to assess the accuracy of the estimation. The specific formula used to calculate the confidence interval depends on the estimation method and the (assumed) distribution of the parameter. Equation (8) expresses the general formula for regression coefficients with estimate θ * and its standard error s θ , assuming that the standard normal distribution with α/2-quantile z α/2 can be used for a specified type I error probability α (Davidson and MacKinnon, 1999): The coverage calculates the percentage of confidence intervals across analysis runs, which contain the true parameter. If the calculated coverage is lower (higher) than the proposed level, statistical tests will be conservative (liberal). For a proposed confidence level of 95%, Muthén and Muthén (2002) recommend a coverage rate between 91 and 98%.
Statistical power is "the likelihood that a researcher will be able to reject a specific null hypothesis when it is in fact false" (Murphy, 2011(Murphy, , p. 1082. In simulation studies with one sample generated and analyzed in each run, estimated power for a specific effect equals the percentage of runs yielding statistically significant estimates of this effect. As an alternative, the percentage of confidence intervals for this effect excluding zero can be used as a measure to indicate a significant effect. A power of at least 80% is regarded as acceptable for significance testing (e.g., Muthén and Muthén, 2002).

RESEARCH ON SAMPLE SIZE IN MULTILEVEL MODELS
Research on sample size requirements mainly focus on twolevel models, with fewer studies investigating general threelevel models. Therefore, we outline main findings with respect to two-and three-level models. Tables 1-4 summarize the simulation conditions or parameters under investigation (twolevel models: Table 1; three-level models: Table 2) and findings and recommendations (two-level models: Table 3; three-level models: Table 4) of previous studies.
three-way interactions, and random intercepts (level-1 residual variance ES = 0.5). Coverage rates for fixed effects were generally acceptable for at least 50 dyads. Moreover Spain et al. (2012) investigated dyadic data with a binary outcome based on existing data from the health sector (five level-1 predictors with true values between −0.05 and 0.19, τ 2 u0 = 0.43, no level-2 predictors, no interactions, no random slopes), and found considerable bias for fixed effects estimates for less than 100 dyads, and severe under-or overestimation of standard errors.
For three-level models, bias of level-1 and level-2 predictors (continuous and discrete predictors with ES = 0.1 or ES = 0.2, ICC 3 = 0.05 or 0.15, ICC 2 = 0.2) has been found to fall below peb = 0.10, even if the level-3 sample size is smaller than 10 units (McNeish and Wentzel, 2017).

Sample Size and Estimation Quality of Random Effects Coefficients
Previous simulation studies show that r.e. variance estimates are generally more severely biased than estimates for fixed effects. Most importantly, the severity of the bias depends on the sample sizes, the statistical model, and the level of the effect. For example, Maas and Hox (2005, see Table 1 for conditions) found a bias of peb < 0.001 for level-1 r.e. variance estimates (true value ε 2 e = 0.5) in samples with 5 to 50 level-1 units each in 30 to 100 level-2 V on level-1: power ≥ 0.9 for N 3 > 2-12 -V on level-2: power ≥ 0.9 for N 3 > 4-15 V on level-3: power ≥ 0.9 for N 3 > 20-30 de Jong et al. (2010) c V on level-2, N 1 = 11: Sample sizes should be large at the level where randomization takes place (i.e., level-2 randomization requires larger N 2 , level-3 randomization requires larger N 3 ) power ≥ 80% for N 3 = 17/N 2 = 8; N 3 = 33/N 2 = 4; N 3 = 66/N 2 = 2.
Numbers of measurements, adding a significant slope predictor or 25% dropout have negligible effect on power.
units, whereas Moineddin et al. (2007) used a two-level logistic regression model with one medium sized (ES = 0.3), normally distributed predictor at each level (ICC = 0.04 to 0.38), and found a bias of peb ≥ 1.00 for the level-2 random intercept variance estimate in small samples (30 to 50 groups with 5 units each). As a general rule, level-1 residual variance estimates are less biased than higher-level r.e. variance components, as shown by Maas and Hox (2004, see Table 1), who report sufficient coverage for level-1 residual variance estimates, but coverage below 80% for level-2 r.e. variance estimates in samples with at least 5 level-1 units each in 30 level-2 units. For analyses of dyadic data (Spain et al., 2012;Du and Wang, 2016; see Tables 1, 3), the inclusion of random effects is limited, since there need to be more units per cluster in the data than specified random variables in order for the model to be identified (Kenny et al., 2006;Du and Wang, 2016). In three-level models, McNeish and Wentzel (2017) found the level-2 random intercept variance estimate (true value τ 2 u0 = 5.5) to be unbiased (peb < 0.10), even in samples with less than ten level-3 units, while the level-3 random intercept variance estimate (true value σ 2 v0 = 0.5 for ICC 3 = 0.05, or σ 2 v0 = 2.5 for ICC 3 = 0.15) in such samples is more heavily biased (see Table 2 for further conditions).

Sample Size Allocation Across Levels
In addition to the overall sample size, the allocation of units across levels also influences the estimation results. Research on the sample size allocation has shown that generally, increasing the higher-level sample size leads to improved estimation accuracy in terms of both fixed effects and r.e. variance estimates on all levels (e.g., Hox, 2004, 2005, for two-level models; Konstantopoulos, 2008 for three-level models). If randomization is used to investigate a treatment effect, the total required sample size depends on the level of randomization. Generally, overall larger samples are required for higher-level randomization (Cunningham and Johnson, 2016;Hooper et al., 2016), and increasing the sample size at the level of randomization most strongly improves power (de Jong et al., 2010; longitudinal model with log-time at level-1, a level-2 covariate, ICC 3 = 0.18, and ICC 2 = 0.75, see Table 2 for further conditions). Heo and Leon (2008) derived formulae for power calculations to investigate necessary sample sizes to detect a level-1 effect in models with randomization at level three. They found that the required level-3 sample size is strongly (negatively) related to the size of the effect (ES = 0.3, 0.4, or 0.5), and showed that increasing the level-3 sample size is much more efficient to reliably (power ≥80%) detect the level-1 effect than increasing the level-2 sample size. Dong et al. (2018) further showed that power to detect a medium sized moderation effect in three-level models with a treatment-effect variable on level-3 (varying level of the moderator, ICC 3 = 0.15, ICC 2 = 0.08, including covariates on each level) generally depends on the level-3 sample size, although the level-1 sample size is also important for detecting lower level moderator effects. Aside from these findings, Snijders (2005) states, as a general rule, that the sample size of the level at which a particular variable is measured is most important for estimation accuracy of that variable's effect. Regarding missing values, Hox (2013) further states that missing data at level one poses no problem for the analysis results, but missing higher-level data may require additional steps, such as multiple imputation, for reliable inference.

Available Sample Size Recommendations
As multilevel analysis results are influenced by various factors, recommendations with regards to sample size strongly depend on the data, especially the sizes of the effects and model characteristics. Therefore, recommendations summarized in Table 3 (two-level models) and Table 4 (three-level models) relate to the simulation conditions in Tables 1, 2. In most simulation studies concerned with three-level models, findings were compared to existing recommendations for two-level models. Therefore, we briefly outline important results for two-level models, followed by results for three-level models. In summary, the most important findings are: (i) In two-level models: Maas and Hox (2005) suggest a higherlevel sample size of at least 100 for accurate estimation of higher-level r.e. variance components and their standard errors (see Table 1 for conditions). However, they argue that smaller level-2 sample sizes might be sufficient in practice, as they showed that sampling 50 level-2 units provides acceptable results for higher-level r.e. variance estimates, with a non-coverage rate of 7.3% as opposed to the nominal 5%. Mathieu et al. (2012) investigated crosslevel-interactions in models with one normally distributed predictor per level (ICC: 0.15 or 0.30; various ES ≤ 0.75), and recommend sampling more thoroughly on level-1, rather than on level-2, especially if lower level effects are of interest in addition to the cross-level-effect. For dyadic data, Du and Wang (2016) recommend at least 50 dyads in order to reliably estimate fixed effects and r.e. variance components in case of non-missing data (see Table 1 for conditions). In case of missing data (10 to 50% of singletons), they recommend at least 100 dyads. They point out, however, that such sample sizes might not be sufficient for adequate power to detect effects. Recently, Lane and Hennes (2018) developed and demonstrated a guide to determine power for a user specified dyadic data model. (ii) In three-level models: de Jong et al. (2010) showed that sufficient power for large fixed effects can be reached with relatively small level-2 sample sizes given that the level-3 sample size is large, such as 17 level-3 units with 8 level-2 units each, or 66 level-3 units with 2 level-2 units each (see Table 2 for conditions). However, they argue that estimates for both coefficients and their standard errors could be strongly biased for these small level-2 sample sizes. McNeish and Wentzel (2017) show that if the third level is not of research interest, but solely included due to the sampling procedure, four level-3 units can ensure sound estimation results and sufficient coverage rates at level-2 and level-1 (see Table 2 for conditions). To detect a level-1 effect of a dichotomous variable in a three-level model with randomization at level three (ICC 3 = 0.01, 0.1, ICC 2 = 0.4, 0.5, 0.6), Heo and Leon (2008) found that effects of size ES = 0.3 with fewer than 10 level-1 and level-2 units require more than 30 level-3 units for sufficient power. For larger level-2 sample sizes (N 2 = 25), 6 level-3 units can be sufficient. To find larger effects of size 0.5, overall small sample sizes can be sufficient (e.g., N 1 = 6, N 2 = 5, N 3 > 9). Cunningham and Johnson (2016) furthermore found that for an effect of ES = 1.8, a level-2 sample size of 4 to 8, and a level-1 sample size of 10 to 30, a power of 90% can be reached with less than 12 level-3 units when randomization takes place at level-1, and a few more required units in case of level-2 randomization, while between 20 and 30 units are required if randomization takes place at the highest level.

Aim of This Study and Research Questions
In summary, previous studies have focused mainly on estimation quality in two-level models, and recent findings on threelevel models reveal a complex interplay of factors influencing estimation quality. Hence, the soundness of fixed effects and r.e. variance estimates in three-level models is not sufficiently well known. In particular, it is not clear if there are differences in the estimation quality between level-2 and level-3 model parameters depending on the overall sample size and allocation of observational units to the different levels. Yet, this information is of high importance, since sample size recommendations for three-level models with repeated measures are sparse, with even fewer studies including a longitudinal trend (e.g., linear or quadratic trajectories) and higher order clustering. Moreover, most simulation studies have generated data following theoretical considerations rather than empirical findings. As a consequence, researchers who analyze empirical three-level data lack information on how large their sample sizes need to be for each data level, and how sample size recommendations from artificially created datasets apply to their research setting. The present simulation study helps researchers in planning their study with respect to the sampling procedure by comparing estimates obtained under various sample size conditions to known true values of population data. We use data from a typical research setting in developmental psychology, yielding practice-relevant samples, effects, and models. We examine how sample sizes and missing values affect estimation quality of the MLA for fixed effects (including cross-level-interactions) and r.e. variance estimates across the three data levels. By using data reflecting typical MLA-results and sample characteristics of an illustrative psychological study, we aim to help researchers in similar research settings decide on the appropriate sample sizes. As a consequence, results and their derived recommendations are intended for models and data which are comparable to the conditions in these analyses.

The Illustrative Study
The study published by Maulana et al. (2013) serves as a suitable basis for our simulation study for several reasons: (i) the research setting -observing children clustered within classes over a period of time -is common in educational research (ii) the study provides sufficient methodological information in order to recreate comparable data (iii) the multilevel model of the study uses three-level-data including predictors on each level, a continuous dependent variable, random slopes, and intercepts on levels two and three, and repeated measures at the first level (iv) the study depicts a realistic sampling process with typical samples and analysis results, which are transferable to similar research contexts In the study, the authors investigated the link between student-teacher interpersonal relationship, academic motivation, and its development over time by surveying 504 Indonesian first-grade secondary students (level 2) from 16 classes (level 3) about their academic motivation and perceived influence and proximity of their teachers over the course of one school-year with five measurement occasions (level 1). Our analyses are based on Model 2 in Maulana et al. (2013, p. 472), which assesses how motivation develops over time, and how it is affected by perceived teacher influence at each measurement occasion (level-1), gender (level-2), subject taught (level-3), and classtype (level-3). For our population data and model, we removed the predictor subject taught in order to obtain a realistic, yet concise model depicting a growth process with one covariate at every level. Table 5 depicts the analysis results as reported in Table 4 of Maulana et al. (2013) and indicates which variables are used for the population data and analyses. In line with the original study, we included random intercepts and random slopes for the linear effect of time at both higher levels. Equations (9) to (11) depict the resulting three-level model: Level 1: Level 2: β 0jk = γ 00k + γ 01k Gender jk + u 0jk β 1jk = γ 10k + γ 11k Gender jk + u 1jk Level 3: The Experimental Design In order to investigate how the parameter estimates in MLA are affected by the sampling plan and missing data, we vary three factors: The level-3 sample size (number of classrooms: N 3 = 15, 35, 55), the level-2 sample size (number of students per class: N 2 = 5, 15, 35), and the level-1 sample size by means of missing value patterns: (i) no missing values (complete data: COM), (ii) 20% of students missing level-1 reports completely at random at arbitrary measurement occasions (this pattern is referred to as: MCAR), and (iii) 20% of students dropping out and thus not reporting level-1 variable values for the last two measurement occasions (DrOP2), or (iv) the last three measurement occasions (DrOP3). The 20% of missing data in the drop-out conditions DrOP2 and DrOP3 are also missing completely at random. To assess how our sample size conditions compare to sample sizes of recent studies in the educational setting, we conducted a non-exhaustive literature search of relevant studies. We limited our search to include studies which (1) were published in academic journals in 2013 or more recently, (2) are listed in PsychINFO, (3) use either three-level multilevel analysis or multilevel growth curves, and (4) analyze data in the school context. We identified 20 studies which sampled and analyzed data according to our criteria and found median sample sizes of 49 schools (min = 10, max = 933), 68 classes (min = 16, max = 565), 1943 children or adolescents (min = 55, max = 29153), and four measurement occasions (min = 2, max = 14). We further identified 30 studies analyzing pre-existing, large-scale data (e.g., PISA, TIMSS) and found median sample sizes of 259 schools (min = 44, max = 11075), 594.5 classes (min = 103, max = 18761), 7718 children or adolescents (min = 1698, max = 470000), and four measurement occasions (min = 3, max = 5). Hence, our sample size conditions, ranging from 75 students in 15 classes to 1925 students in 55 schools, lie within the range of recently analyzed sample sizes in the field. The percentage of missing values was chosen based on typically reported attrition rates in research concerning the educational system (e.g., Wagner et al., 1997;Miles and Stipek, 2006;Zvoch and Stevens, 2006;Konstantopoulos and Chung, 2011). Table 6 describes the missing value patterns by depicting the percentage of data used for analysis at each measurement occasion. In total, the simulation study encompasses 3 × 3 × 4 = 36 conditions.

The Data Generating Process
Firstly, we generated the population data using version 14.1. of Stata (StataCorp, 2015) according to the information provided in the illustrative study. In total, we generated the population data to consist of 3000 classrooms, with 1000 classrooms for each of the three N 2 (5, 15, or 35) conditions. We used the variable means, standard deviations, and distributions reported in the illustrative study to generate values for the independent variables, and used the reported MLA results (i.e., fixed effects estimates and r.e. variance estimates) to generate values for the dependent variable (see Equations 9 to 11). The independent variables are gender (56% girls; binary), classtype (52% heterogeneous-ability classes; binary), time passed since baseline measure in months (0, 1.5, 4, 7, and 10), and interaction terms. Since the illustrative study did not report non-normal data, the metric variable influence was modeled as normally distributed with M = 0.42 and SD = 0.36. Random effects are also modeled as normally distributed variables with mean zero and variances equal to those reported in the illustrative study. T0, baseline measurement, T1 to T4 denote successive measurements. COM to DrOP3 correspond to missing value patterns described in the section "Materials and Methods." R = 20% of overall data is missing completely at random.
In a next step, we realized samples from the population data according to the N 2 (number of students per class) and N 3 (number of classrooms) conditions with full data (missing value pattern: COM). For each N 3 /N 2 combination, we drew the desired number of classes (i.e., 15, 35, or 55) from the 1000 classes in the population data that contained the desired number of students (i.e., 5, 15, or 35). The classes for each realized sample were drawn without replacement, such that no particular simulated sample contains duplicates of level-3 units. We repeatedly drew samples to obtain 1000 datasets for each of the 3 × 3 × 1 = 9 conditions (N 2 × N 3 conditions) with complete data. For the missing value patterns MCAR, DrOP2, and DrOP3, we repeated the data generation process described above separately for each of the three patterns, and in an additional step deleted values according to the desired condition (see section "The Experimental Design"). As a result, we obtained 1000 samples for each of the 3 × 3 × 3 = 27 conditions with missing values.

Analysis Procedure
All analyses were carried out using R (R Development Core Team, 2016). In a first step, the true parameters from the population were obtained by applying the lmer()-function of the LME4-package . To assess variability due to clustering in the population data, ICC values for levels two and three were computed as in (4) to (6). Notably, since the population r.e. variances are modeled according to the values reported in the illustrative study, the ICC values are evaluated but not varied as an additional analysis condition in this study. In order to more easily interpret the original study and evaluate its results, effect size values for fixed effects and r.e. variances are advantageous. However, the reduction of variability on each level as a measure of effect size is not applicable for longitudinal data (Hox et al., 2017), and available effect sizes in multilevel models have been developed for measures of mean differences between groups (e.g., Tymms, 2004;Hedges, 2011) only. In order to nonetheless provide a general understanding for the magnitude of effects in our analyses, we used the approach by Tymms (2004) to calculate effect sizes as the regression weight relative to the remaining residual variance of the full model. Hence, the resulting values are not exact effect sizes, and should therefore only serve as an approximate indicator of the magnitude of effects within this study. (12) and (13) provide formulae for dichotomous and continuous variables, respectively, for a given regression coefficient β, the predictor variable standard deviation SD Predictor , and the level-1 residual variance ε 2 e .
ES dichotomous = β ε 2 e (12) For the r.e. variances, Tymms (2004) recommends using their close relationship to the ICC and calculate the level-wise effect size for the random part of the model as in (14). This formula is applicable for ICC values (4) to (6) presented above.
After evaluating the population data, the samples were analyzed using the same model and lmer()-function. For each condition, the peb over all 1000 simulated samples was then calculated for all fixed effects and r.e. variance estimates. Since the LME4-package does not provide p-values for multilevel analysis results 1 , the 95%-confidence intervals were constructed (see Wald-method in Bates et al., 2016) for fixed effects. For each parameter of fixed effects individually, coverage was evaluated by whether it ranges between 91% and 98%, and power was assessed for each predictor in each condition by the percentage of simulated samples with the confidence interval not including the value zero.

Population Model
The data generation process yielded a population that corresponds very closely to the sample reported by Maulana et al. (2013) with respect to descriptive statistics (see Table 7) and 1 Following arguments by Bates (2006), the reliability of significance tests for coefficients of the MLA-model is controversial, as degrees of freedom cannot be conclusively determined. Hence, the lmer()-function does not calculate p-values. In order to provide an approximate measure for statistical significance, we calculated confidence intervals using the confint()-function for the lmer() model parameter estimates. Coverage and power results in this contribution hold only under the assumption that degrees of freedom are reliably estimated. model parameters of the MLA using the full population data (see Table 8). Perceived influence scores across measurements are equal to the values reported in the illustrative study (M = 0.42, SD = 0.36), and motivation scores differ only by 0.01 points in mean (M = 3.54) and 0.03 points in standard deviation (SD = 0.73). 55.67% of the population consists of girls (56% in the illustrative study), and 50.70% of the classrooms are heterogeneous (52% in the illustrative study). Table 8 presents the results of the MLA analysis of the whole population, which serve as true values for subsequent analyses. The largest deviations from the original study are 0.0101 points for fixed effect estimates (classtype: population coefficient = 0.2845) and 0.0004 points for r.e. variance estimates (level-3 intercept: population variance = 0.0045). Table 9 presents the r.e. variance estimates of the population empty model and the resulting ICC values. As can be seen, stochastic dependency in the data due to clustering mainly   occurs with regards to the level-2 intercepts (using equation (15): ICC = 0.44; using equation (16): ICC = 0.39). Table 10 presents the estimated effect sizes for each predictor following Tymms (2004). In a simulation study, Tymms (2004) showed that large effect sizes, such as for time, time × time, and classtype, are slightly underestimated (below 10% of underestimation). The results in Table 10 show a range of small and large effects for both main effects (e.g., ES gender = 0.02; ES time = 0.75) and interaction effects (e.g., ES gender × time = 0.08; ES time × time = −0.55). As the coefficients of this case study are subject of evaluation and the effect sizes cannot be calculated accurately, further analysis results will refer to regression coefficients rather than effect sizes.

Estimation Bias: Random Effects
For samples including missing values, most pebs do not change by more than 0.01, except for the level-2 random slope variance in samples with N 2 = 5 and MCAR (N 3 = 15: peb = 0.20; N 3 = 35: peb = 0.09; N 3 = 55: peb = 0.02) showing an increase of up to 0.09. Results for the level-3 r.e. variances are mixed. For the r.e. variance estimate of the level-3 intercept, mean bias decreases for N 2 = 35/N 3 = 55 and MCAR (peb = −0.07) with peb below 0.10. In most other conditions, peb of the r.e. variance estimate of the level-3 intercept becomes larger, with increases of more than 1.00 in peb for MCAR. Increases in peb for the r.e. variance estimate of the level-3 slope mainly concern sample sizes with N 2 = 5 and additionally N 2 = 15 in samples with MCAR. For N 2 = 5/N 3 = 15, peb reaches 0.69. In total, missing values in N 2 = 35 do not lead to considerable changes in the r.e. variance estimates, while missing values in N 2 = 15 mainly affect level-3 effects. Missing values in N 2 = 5 lead to stronger estimation bias in all conditions except for the level-1 residual variance estimates.

Estimation Fluctuations
The association of estimation fluctuations, measured by the parameters' standard deviation over all replications, and peb expresses if stronger biases are related to less stable estimates. Figure 1 depicts these associations for gender, classtype × time, and the level-3 random intercept variance across all conditions (for all other effects, which show consistently comparable findings, see Supplementary Figures A1-3). Increasing the number of students per classroom generally leads to substantial improvements in estimation quality in terms of less estimation fluctuations (more efficiency) and peb (less bias). Moreover, for the smallest sample size at level-2, we find the largest peb and the largest standard deviations across the replications.

Coverage
Coverage rates for all fixed effects fall within the acceptable range of 91% to 98%, with the lowest being cov = 91.5% for the level-3 variable classtype in N 2 = 5/N 3 = 55 with MCAR, and the highest being cov = 97.3% for the level-1 variable time in N 2 = 35/N 3 = 55 with DrOP2. Complete tabulated coverage values can be found in the Supplementary Tables A7-10.

Statistical Power
Overall, power for fixed effects is greater than 80% in all conditions for the level-1 regression intercept, but it is below 80% in any condition for the smallest effects: gender, gender × time  In complete samples (COM) with N 2 = 35, effects are most reliably detected, with power ≥80% for all remaining effects except influence × classtype in N 2 = 35/N 3 = 15 (power = 58.7%). Sampling either N 2 = 35/N 3 = 35 or N 2 = 35/N 3 = 55 results in power >90% for all remaining effects. Additionally, sampling N 2 = 15/N 3 = 35 or N 2 = 15/N 3 = 55 provides sufficient power for the linear and quadratic time effect, classtype, and influence, whereas the cross-level interaction influence × classtype reaches power >80% for N 2 = 15/N 3 = 55 only. In order to statistically detect the effect of classtype, influence, and the quadratic time effect, N 2 = 15/N 3 = 35 is required, and time requires at least N 2 = 5/N 3 = 55, or alternatively N 2 = 15 regardless of the N 3 condition.
In conditions with missing values, power generally decreases with the most pronounced decrease for the MCAR-condition. DrOP2 results in the smallest decrease, where only one additional effect (the linear time effect in N 2 = 15/N 3 = 15) has a power below 80% (complete samples: power = 83.8%, DrOP2: power = 79.8%). As in samples without missing values, conditions N 2 = 15/N 3 = 35, 15/55, and samples with N 2 = 35, provide sufficient power for the effects time, time × time, and classtype. This also holds true for the effect of influence, with the exception of N 2 = 35/N 3 = 15, where power = 79.3%. For classtype, however, power drops below 80% for MCAR in N 2 = 5/N 2 = 55 (power = 77.4%) and N 2 = 15/N 3 = 15 FIGURE 2 | Statistical power for the time effect, level-1 predictor influence, level-3 predictor classtype, and the cross-level interaction of influence × classtype. The sample size condition is displayed as "level-2 sample size/level-3 sample size," e.g., 5/15 corresponds to "5 students each in 15 sampled classrooms." The missing value conditions (see section "Materials and Methods" and Table 6) are displayed by the type and shade of lines. The light dotted line marks the minimum sufficient power of 80%.

SUMMARY AND DISCUSSION
This simulation study sheds light on the quality of parameter estimates in MLA for data situations that are comparable to the one encountered by Maulana et al. (2013). The data was hierarchically structured on three-levels, with random effects on levels two and three, five measurement occasions, four main effects (Level-1 β time = 0.05, ES = 0.75; Level-1 β influence = 0.15, ES = 0.21; Level-2 β gender = 0.01, ES = 0.02; Level-3 β classtype = 0.28, ES = 0.56), four interaction effects (β time × time < 0.01, ES = −0.55; β classtype × time = 0.01, ES = 0.13; β gender × time = 0.01, ES = 0.08; β influence × classtype = −0.14, ES = −0.18), and a continuous response variable. We realized conditions differing in overall sample size, unit allocation, and missing value patterns, and investigated differences in estimation quality using MLA. When comparing analysis results across conditions, it has become apparent that in this setting, the quality of the analysis results strongly depends on the sample size, particularly the unit allocation to level-2, which describes the number of students. In the following subsections, we summarize the required sample sizes for sound estimation results, derive sampling recommendations, and discuss limitations and future prospects. In this regard, the recommendations presented are based on the specific simulation conditions in this study.

Required Sample Sizes
The simulation study showed that all model parameters can be reliably estimated except for the small effect of gender, which requires sample sizes greater than those examined. This is most plausibly due to its standard error being equally large as the model parameter. Results for interaction effects are mixed, with some cross-level-interactions requiring at least 15 level-2 units (gender × time, classtype × time), and both time × time and influence × classtype being unbiased even in small samples with 5 level-2 and 15 level-3 units. The results for the r.e. variance estimates are mostly consistent with previous studies (e.g., Hox, 2004, 2005;Moineddin et al., 2007;McNeish and Wentzel, 2017). Bias is very high for level-3 r.e. variance estimates, even for large sample sizes, but smaller for level-2 r.e. variance estimates. Bias of level-1 residual variances is negligible with pebs < 0.015. Heavy fluctuations in the level-3 r.e. variance estimates might also be due to their small size.
The results show that there is a minimum number of classrooms (level-3 sample size) needed for stable results, as results for samples with only 15 classrooms are most strongly biased due to strong fluctuations in estimates. Notably, studies on different three-level models report far smaller level-3 sample sizes to be sufficient in terms of peb (Heo and Leon, 2008;Cunningham and Johnson, 2016;McNeish and Wentzel, 2017). Apart from that, the number of students per class (level-2 sample size) is consistently more important than adding more classrooms to the sample, with especially low estimation quality in samples with only five students per class. Specifically, the highest-level sample size in this study does not impact results most, as the majority of previous studies suggests (e.g., Hox, 2004, 2005;Konstantopoulos, 2008;Heo and Leon, 2008;Scherbaum and Ferreter, 2009;Dong et al., 2018). This conflicting result might be due to a less pronounced multilevel-structure, as the r.e. variance components on classroom-level and, consequently, the ICC, are very small. Furthermore, for the level-3 predictor, the number of classrooms seems to be as important as the number of students. This is at least in parts in line with previous findings stating that a parameter is estimated most accurately if the sample size at the corresponding level is sufficiently large (Snijders, 2005). In addition to the peb as a measure of bias, we evaluated the strength of the fluctuation in parameter estimates as a measure of statistical efficiency. Results show that efficiency and unbiasedness are closely related, such that in small samples, obtained estimation results might not just be biased, but also potentially different in size if the analysis is repeated with another sample from the same population. As the abovementioned studies did not evaluate the strength of estimation fluctuation, the ability to compare our results in this respect is limited.

Recommendations for Researchers
This simulation study is meant to help researchers decide on sufficient sample sizes for a three-level longitudinal study with (assumed) population and sample characteristics, effects and models similar to the simulated conditions.
Given that the proposed model and data are comparable to the one examined in this simulation, researchers should consider at least 15 students each in 35 classes for stable fixed effects results, and more students when the analysis of small effects is considered important. If r.e. variances are of interest, additional level-3 units are required, particularly for higher-level effects, where more than 55 classrooms are needed.
If primarily estimation bias is of concern, researchers should consider sampling at least 15 students per classroom regardless of coefficient size, in order to equally decrease bias and fluctuation in estimation results. If significance tests or confidence intervals are used for interpretation, sufficient power for relatively large effects is reliably reached by sampling at least 15 classes with 15 students each. Smaller effects, however, require larger samples with 35 students in each class, which is a difficult task given that such large classrooms are the exception (see OECD, 2013 for worldwide average class sizes).

Limitations and Future Prospects
Results of this case-study are naturally limited by the study design and sample characteristics of the illustrative study. Particularly, results and recommendations are specific to settings where researchers can assume comparable population and sample characteristics (e.g., distribution of variables, ICC values) and use comparable predictor and outcome variables with similar expected effect sizes. Additionally, recommendations are a result of the model fitted to the data [i.e., as in (9) to (11), including predictors on all levels, cross-level-interactions, random intercepts, and random slopes]. As a consequence, sample size recommendations are not ubiquitously generalizable. Especially if the model or data give reason to suspect lower statistical power due to smaller effect sizes or a more complex model than investigated in this study, the recommended samples sizes will most likely be too small. On the other hand, if the assumed power is higher, e.g., due to larger expected effect sizes, the recommendations presented might be conservative, and thus serve as an approximate upper bound for required sample sizes, as it is not necessary to increase sample sizes above the provided recommendations. In addition to that, we limited our analyses to data with balanced sample sizes, since in each sample, there is a fixed number of students per class, and students attend or fail to attend at a fixed number of measurement occasions. We therefore did not investigate how estimation quality might be different for unbalanced designs, for example in samples where classes differ with respect to the number of students. In practice, these unbalanced designs can occur either naturally due to the sampling process, or deliberately, e.g., by oversampling students in classes with higher expected attrition over time. Potentially, oversampling might prevent bias or loss of statistical power due to expected missingness, and the strength of the imbalance might further influence estimation quality on all levels. To assess the potential benefits and consequences of oversampling, future research might focus on comparing estimation quality of balanced samples to samples with unequal cluster sizes, but on average the same sample sizes per level as the balanced data.
Furthermore, we did not reduce the level-2 sample size to two students per class (dyadic data) due to the restricted number of admissible random effects at the higher level: data have to contain more units per cluster than random effects at this level (Kenny et al., 2006;Du and Wang, 2016). In order to scrutinize the extent of necessary simplification, we conducted additional analyses by drawing samples of two students per classroom (100 replications for each N 3 condition without missing values) for reduced models, i.e., (a) a three-level model with random intercepts at levels two and three, but not random slopes, and (b) a two-level model with a random intercept, but no random slope and no consideration of the classroom level. For both reduced models, more than a third of analysis runs resulted in r.e. variance components being estimated as close or equal to zero, leading to a singular fit. In conclusion, sampling only two students per class would require a much simpler analysis model for successful estimation, which would make evaluations of estimation quality incomparable to the initial sample size conditions, and inferred sample size recommendations would not be based on the same underlying model specifications.
While the complexity of the statistical model and the specificity of the data and analysis conditions are meant to ensure practical relevance for the field, there are a variety of additional potential factors which influence the required sample size but were not investigated in this study. For example, since we based our parameter values on one illustrative study, the ICC values were not varied as an additional simulation condition. As the ICC is a direct representation of the (unexplained) pronouncedness of the multilevel structure and hence relevant if the estimation of random effects is of interest, varying ICC values in three-level models relevant for applied psychological research can provide a better understanding of the importance of each data level for accurate estimation results. To ensure practical relevance, future studies might generate data according to reported ICC values in the field, or alternatively base their simulations on a variety of sample studies depicting different estimated r.e. variance components. For longitudinal studies, measurement occasions can be varied to more clearly examine the importance of the level-1 sample size, i.e., the longitudinal component. In this regard, future studies should consider examining systematic dropout in order to provide recommendations for samples with systematic missing values, as researchers oftentimes are unsure how to handle such missing data (cf. Jeličić et al., 2009, for an overview). Other characteristics, such as multicollinearity and heteroscedasticity, should be further examined, as such characteristics are not uncommon in applied research, but not comprehensively studied in three-level models (in twolevel models: cf. Korendijk et al., 2008, for heteroscedasticity;Shieh and Fouladi, 2003, for multicollinearity). In conclusion, analyses of a representative choice of study designs and settings might assist researchers in choosing the adequate sample sizes, sample size allocations, and variables, for a variety of research questions.

AUTHOR CONTRIBUTIONS
DK contributed by conducting extensive research on the topic of the manuscript, conducting statistical analyses, and writing the manuscript. FN contributed by providing expertise, supervision with regards to statistical analyses, and proofreading.

FUNDING
We acknowledge support for the Article Processing Charge by the Deutsche Forschungsgemeinschaft and the Open Access Publication Fund of Bielefeld University.