Edited by: Giada Pietrabissa, Università Cattolica del Sacro Cuore, Italy
Reviewed by: Dexin Shi, University of South Carolina, United States; Christian T. K.-H. Stadtlander, Independent researcher, United States; Hexuan Liu, University of North Carolina at Chapel Hill, United States
This article was submitted to Clinical and Health Psychology, a section of the journal Frontiers in Psychology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
It is practically impossible to avoid losing data in the course of an investigation, and it has been proven that the consequences can be severe enough to invalidate the results of the study. This paper describes some of the most likely causes of missing data in clinical psychology research and the consequences they may have for statistical and substantive inferences. When it is necessary to recover the missing information, analyzing the data can become extremely complex. We summarize the experts' recommendations regarding the most powerful procedures for performing this task, the advantages each one has over the others, the elements that can or should influence our choice, and the procedures that are not recommended except in very exceptional cases. We conclude by offering four pieces of advice, on which all the experts agree and which we must heed at all times in order to proceed with the greatest possible success. Finally, we show the pernicious effects of missing data on the statistical results and on the substantive or clinical conclusions. For this purpose, we deliberately deleted data at different percentage rates, under two data loss mechanisms, MCAR and MAR, from the complete data sets of two very different real studies, and then analyzed the available data (listwise deletion). One study used a quasi-experimental non-equivalent control group design, and the other a completely randomized experimental design.
Evaluating the efficacy of a clinical treatment, or of a component of one, whether to resolve a physical or psychological health problem or a behavioral dysfunction, often involves recording different variables that indicate the treatment effect at every moment involved in its administration. It also involves recording them in different follow-up phases to examine whether the results achieved are maintained over time. However, it is not always possible to obtain all of the measures (e.g., López et al.,
Missing data can occur at any time in any empirical research and, to a large extent, the more subjects we have, the longer the investigation lasts, the more variables we record at each moment throughout its entire duration, and the more widely spaced the records are, the less control we can exercise over the losses (the aforementioned investigations attest to this). Missing data are
It is possible that for some subjects we only know their identification, or we lack the record of some variables simply because they did not occur naturally, or because the measuring instrument was too insensitive to capture them, or due to the poor formulation of variables, or because the researcher removed data for some reason, neglected the record, recorded it wrongly, or sampled unsatisfactorily. All of these situations are specific cases, with specific solutions, that we are not going to cover here. Thus, we feel it should be made clear that we start from the assumption that the person solely responsible for the answer not being recorded is the subject, who may have withheld it voluntarily or involuntarily.
When the missing data do not have any relation to the actual or potential study variables (e.g., when they are due to a transfer of residence, to the forgetting of an appointment, or any other unforeseeable cause outside of the study), the losses are considered to be
The interest in missing data is old (Wilks,
The only way of knowing the consequences of data loss and their severity is through controlled experiments manipulating the data loss in the form of the mechanisms described by Rubin. This has been the means, by computer simulation, or using entire databases from real investigations, that has allowed us to verify that data loss is rarely innocent, and that the cost is both statistical and substantive. To sum up, the following are the five main consequences:
First: the representativeness of the population in the sample will disappear and the transferability of the results will be limited.
Second: selection bias. The data we lose contain important information and will inevitably lead to selection bias (Meng,
Third: statistical analysis techniques lose their effectiveness (multivariate techniques that require complete data could not even be applied). In particular, the normal distribution of the data may not be maintained, nor the homogeneity of variances. The variability of the data will increase or decrease, and with it the standard error, seriously damaging the estimation of the parameters, which will sometimes be overestimated and sometimes underestimated.
Fourth: loss of data means loss of sample, and consequently involves the loss of power of the test.
Fifth: not all missing data carry the same load and quality of information, and analyzing data with poor or impoverished information will lead the researcher to choose an estimation model that does not correspond to reality because it omits relevant variables or includes irrelevant ones.
These five powerful reasons have concerned scientific organizations and drug regulatory agencies worldwide. In fact, since the 1990s, these entities have not ceased to give warnings or offer recommendations for how to manage the problem of missing data in research. For example, the U.S. Food and Drug Administration (FDA) called on the National Research Council (NRC) to bring together the most outstanding statisticians and mathematicians on this subject, with the mission of providing guidelines for researchers both to limit data loss and to address the problem once it has occurred (National Research Council,
We believe it is important to highlight, firstly, that it is not always appropriate to intervene to address data loss. If the loss is very small or very large, one should not intervene: in the first case it is unnecessary, and in the second it would be reckless. It has been shown that if the data loss is no more than 5%, any technique we use, whether simple or sophisticated, will lead to the same conclusions as those found if we simply exclude the subjects whose data vector is incomplete (Little and Rubin,
No one will dare to state when a rate of data loss is sufficiently small that the consequences will not be felt (Schafer,
There are two practices most commonly used to treat missing data: eliminating the subjects with incomplete data (listwise deletion), and imputing (assigning a value to) each missing datum with the mean of all of the observed values or with the prediction derived from a regression analysis. However, these are almost never appropriate ways of proceeding (Rubin,
It is absolutely necessary to examine how the missing data are distributed in the database in order to figure out which mechanism was responsible for the loss and how it occurred. The purpose is to determine whether the data loss is ignorable (MCAR and MAR) or non-ignorable (MNAR), and what the resulting pattern looks like, whether monotone or arbitrary. Based on these two aspects we must choose the most appropriate technique for dealing with the loss of information. Other aspects, such as the number of variables affected and their types, are always subordinate to these two considerations. The most appropriate option will be one of the following:
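This examination of the missingness pattern can be sketched in a few lines. The data below are hypothetical, and the monotonicity check assumes the columns are in temporal order:

```python
import numpy as np

# Hypothetical data matrix: rows = subjects; columns = pre, post and
# follow-up records, in temporal order. np.nan marks a missing value.
X = np.array([
    [5.1,    6.0,    6.2],
    [4.8,    np.nan, np.nan],  # dropped out after the pre-test (monotone)
    [5.5,    6.1,    np.nan],  # dropped out after the post-test (monotone)
    [np.nan, 5.9,    6.0],     # missed only the pre-test (arbitrary)
    [5.0,    5.8,    6.1],
])

miss = np.isnan(X).astype(int)  # 1 = missing, 0 = observed
patterns, counts = np.unique(miss, axis=0, return_counts=True)
for pat, cnt in zip(patterns, counts):
    print(pat, cnt)

# With columns in temporal order, the loss is monotone if no pattern has
# an observed value after a missing one (0 1 1 is monotone, 1 0 0 is not).
monotone = all(np.all(np.diff(pat) >= 0) for pat in patterns)
print("monotone:", monotone)    # False, because of the fourth subject
```

Inspecting the table of patterns and their frequencies is usually the first step in deciding between the regression-based and MCMC-based imputation routes discussed below.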
A. When we can determine that the data loss is MAR, we are assuming that it is possible to recover the missing information by taking advantage of all of the information contained in the complete cases. We can do this in two main ways: through methods based on direct maximum likelihood estimation and through multiple imputation techniques. Let's take a look.
A.1. — Methods that work directly by maximizing the likelihood function (ML). These methods do not fill the gaps in our database; instead, they calculate the parameter values that make the data most credible by maximizing the likelihood function of the complete data.
Many studies testify that longitudinal studies are the most common when the purpose is causal, and that repeated measures designs are most commonly used to collect the data (Fernández et al.,
MLM and SEM models perform the maximum likelihood estimates of the parameters using iterative algorithms. The most efficient, and the most popular too, are the EM (Expectation-Maximization) and FIML (Full Information Maximum Likelihood) algorithms in the MLM and SEM models respectively. They take into account all of the information contained in the complete data, so if the data loss is MAR, incorporating the cause of the loss into the analysis model provides unbiased estimates of the parameters (Little and Rubin,
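As an illustration of the underlying idea, not of the exact algorithms implemented in MLM/SEM software, a minimal EM estimator of the mean vector and covariance matrix of a multivariate normal sample with missing entries might look like the following sketch (synthetic data, fixed number of iterations instead of a convergence test):

```python
import numpy as np

def em_mvn(X, n_iter=200):
    """EM estimates of the mean vector and covariance matrix of a
    multivariate normal sample containing np.nan entries."""
    X = X.copy()
    n, p = X.shape
    miss = np.isnan(X)
    mu = np.nanmean(X, axis=0)
    for j in range(p):                 # crude start: fill with column means
        X[miss[:, j], j] = mu[j]
    Sigma = np.cov(X, rowvar=False, bias=True)
    for _ in range(n_iter):
        C = np.zeros((p, p))
        for i in range(n):             # E-step: conditional expectations
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            B = Sigma[np.ix_(m, o)] @ np.linalg.inv(Sigma[np.ix_(o, o)])
            X[i, m] = mu[m] + B @ (X[i, o] - mu[o])       # conditional mean
            C[np.ix_(m, m)] += Sigma[np.ix_(m, m)] - B @ Sigma[np.ix_(o, m)]
        mu = X.mean(axis=0)            # M-step: update mean and covariance
        d = X - mu
        Sigma = (d.T @ d + C) / n
    return mu, Sigma

# Demo on synthetic data: bivariate normal, 30% MCAR loss in column 1
rng = np.random.default_rng(0)
Z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=500)
Xmis = Z.copy()
Xmis[rng.random(500) < 0.3, 1] = np.nan
mu, Sigma = em_mvn(Xmis)
print(mu.round(2), round(Sigma[0, 1], 2))   # both estimates near the truth
```

Note that no gap is ever "filled in" as a final value: the conditional expectations serve only to update the parameter estimates, which is exactly the point made above.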
A.2. — Multiple imputation methods. In variables that are recorded on several occasions there is, obviously, a greater chance of losing data, and the missing data can often be arranged into a monotone pattern of loss. In addition to the main variable of the study, in repeated measures designs it is common to record many other variables that may moderate, mediate or confound the relationship studied, for example, quality of life, motivation, self-regulation, resilience, etc. (MacKinnon and Luecken,
Multiple imputation (MI) deals with missing data in three steps. First several imputations (plausible values) are performed for each missing datum in order to have as many replications of the data sample as the number of imputations we carry out. Then each data set is analyzed in the necessary way in order to answer the research hypotheses. Finally all of the results are combined into one using the formulas developed by Rubin (
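The third step, combining the results, uses Rubin's well-known pooling formulas: the pooled estimate is the average of the per-imputation estimates, and its total variance adds the within-imputation and between-imputation components. A minimal sketch (the numbers are hypothetical):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Rubin's rules: combine m per-imputation point estimates and their
    squared standard errors into one estimate and one standard error."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                  # pooled point estimate
    w = u.mean()                      # within-imputation variance
    b = q.var(ddof=1)                 # between-imputation variance
    t = w + (1 + 1 / m) * b           # total variance
    return q_bar, np.sqrt(t)

# Hypothetical mean-difference estimates from m = 5 imputed data sets
q_bar, se = pool_rubin([2.1, 2.4, 1.9, 2.2, 2.0],
                       [0.09, 0.10, 0.08, 0.11, 0.09])
print(round(q_bar, 2), round(se, 3))   # 2.12 0.372
```

The (1 + 1/m) factor is what penalizes a small number of imputations: the larger m is, the closer the total variance comes to the simple sum of the within and between components.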
Of the three steps, the most delicate is the first, the generation of the imputations. Two things matter most here: the quantity and the quality of the imputations. As for the quantity, not so long ago it was considered that between three and five, or no more than 10, imputations (Schafer,
The imputed values are of quality if they are consistent with the values of the variable on which they are performed (the original distribution is not altered) and also with the other variables (the correlations between them are not altered). This can only be achieved if the imputation model is able to capture the true structure of the data, and for this it must necessarily contain the following. Firstly, in addition to the variables of theoretical interest, it must include auxiliary variables, that is, variables associated with the loss mechanism, and all those that are correlated with the previous variables (Carpenter and Kenward,
There are several ways to perform MI, depending on whether the loss pattern is monotone or arbitrary. When it is monotone, imputation is done using regression techniques. When the loss pattern is arbitrary, MI is performed with the Markov Chain Monte Carlo (MCMC) procedure. This procedure has two great strengths. First, it uses the power of Bayesian inference to estimate the posterior probabilities of the responses; it is powerful for two reasons: it works on the likelihood function, albeit indirectly, by imposing a prior distribution on the variables, and, in addition to the information contained in the sample, it can incorporate external information. Second, once the most plausible distribution has been obtained from the complete data, all of the desired imputations for each missing datum are drawn by Monte Carlo random sampling.
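A highly simplified illustration of the regression route for a monotone pattern is sketched below on synthetic data. It is "improper" MI, since, unlike the MCMC-based procedures just described, it does not draw the regression parameters from their posterior before each imputation:

```python
import numpy as np

def impute_monotone(X, m=5, rng=None):
    """Stochastic regression imputation for a monotone loss pattern
    confined to the last column of X. This is "improper" MI: a proper
    (MCMC/Bayesian) procedure would also draw the regression parameters
    from their posterior before each imputation."""
    if rng is None:
        rng = np.random.default_rng()
    obs = ~np.isnan(X[:, -1])
    y, Z = X[obs, -1], X[obs, :-1]
    Z1 = np.column_stack([np.ones(Z.shape[0]), Z])
    beta, *_ = np.linalg.lstsq(Z1, y, rcond=None)   # complete-case fit
    sigma = (y - Z1 @ beta).std(ddof=Z1.shape[1])   # residual sd
    Zmis = np.column_stack([np.ones((~obs).sum()), X[~obs, :-1]])
    imputed = []
    for _ in range(m):
        Xi = X.copy()
        # prediction plus random residual noise, so that the imputed
        # values preserve the variability of the observed ones
        Xi[~obs, -1] = Zmis @ beta + rng.normal(0, sigma, (~obs).sum())
        imputed.append(Xi)
    return imputed

# Demo: y depends linearly on x; the first 40 y values are missing
rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)
X = np.column_stack([x, y])
X[:40, 1] = np.nan
sets = impute_monotone(X, m=5, rng=rng)
print(len(sets), np.isnan(sets[0]).any())   # 5 False
```

Adding residual noise, rather than imputing the bare regression prediction, is what prevents the artificial shrinkage of variances and correlations criticized earlier.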
The high power of the techniques outlined in paragraphs A.1 and A.2 lies in two aspects that they have in common. First, they assume that the probability of complete data can be estimated from the observed data by controlling the effect of the missing data. This assumption is valid when the loss mechanism is MAR (Rubin,
However, these are different techniques with different characteristics and they obviously have their own properties which represent advantages over one another in certain circumstances. At some point, the researcher needs to decide which technique is more prudent to choose, regardless of the custom or expertise he or she may have. The following techniques are noteworthy:
Multiple imputation techniques: these allow the imputation and analysis models to differ; thus, once the imputations have been made, it is possible to carry out as many statistical analyses as deemed appropriate to test the relevant hypotheses without repeating the imputations for each analysis. Moreover, performing the imputations with Bayesian techniques adds further value for three reasons: it is easy to introduce as many auxiliary variables as we consider appropriate; it is possible to incorporate external information (both are notable for gaining efficiency and accuracy); and, because of the mechanism that approximates the posterior distribution, these techniques are robust to violations of the normality assumption.
Techniques that work directly on the likelihood function: these allow the comparison, using the likelihood ratio and/or information criteria, of as many models as we consider reasonable. Their advantage in fitting random effects models, mixed models, and hierarchical linear models, whether the designs are repeated measures or not, balanced or unbalanced, with missing or complete data, is noteworthy (Fernández et al.,
Despite the differences, it has been demonstrated empirically that the two procedures are equivalent in resolution when the sample size is large, the data loss is moderate, and the distribution is normal (Vallejo et al.,
B. — If, after a detailed examination of the data matrix, we conclude that the mechanism responsible for the loss of our data
To try to overcome this problem, it is recommended to use, together with the abovementioned procedures that assume MAR models, one or more of the MNAR models available, such as
Such an analysis should always be based on adequately supported (clinically plausible) substantive hypotheses which faithfully reflect the pattern produced by the loss mechanism that supports these hypotheses. Once we have obtained the results of all of the models we have tested, we must compare them primarily based on the bias, accuracy and coverage, and conclude with the best. For example, there is currently a broad consensus that it is most appropriate that the primary analysis of longitudinal data in clinical settings is effected with methods which assume MAR missing data, and that the robustness of the results obtained in this way is evaluated by sensitivity analysis using methods which assume MNAR missing data (National Research Council,
Because of the complexity of research projects today, researchers should listen to the advice of experts, which we summarize below:
Missing data greatly complicate the task of analyzing the data for two main reasons. One is that the most robust techniques mentioned above are complex, and carrying them out correctly involves great difficulty. The other is that there is no universally valid approach for all situations, and choosing the most appropriate procedure to retrieve the information in our data often requires the skill of an expert. Both are compelling reasons that help us understand the insistence of the mathematicians and statisticians (experts in the development of techniques to deal with the major problem of missing data) on the following four points:
First: the best solution is not technical or analytical, but tactical. We are referring to prevention. It is mandatory to take care in every aspect of the research and throughout the entire process, aiming, through the use of design strategies and with great persistence, to minimize every possible chance of losing data (National Research Council,
Second: if we take care with the design, we extend the window of opportunity to ensure the soundness of the inferences, because we can see which variables determine the subjects' responses and which variables determine their absence of response and we can include them as auxiliary variables in the models of imputation and/or analysis, making a MAR model more plausible. This renders the treatment of the data more successful (Little et al.,
Third: the process we use to solve the problem is not always a guarantee of a successful outcome. The solution will always be uncertain. We must not make the mistake of thinking that we have adequately addressed the problem, not even using the most sophisticated procedure possible. To date, there is no foolproof way that allows us to discern with absolute certainty whether the loss mechanism is MAR or MNAR, nor is there a method that allows us to reproduce the original data, nor can we be sure that the model we propose allows us to capture all the fine details that underlie the loss mechanism. For all these reasons it is highly recommended always to perform a sensitivity analysis (Enders,
Fourth: although we have decided to eliminate subjects who have empty records because we believe it is the best justifiable choice, we must always recognize the problem, communicate it and discuss the reasoning of our decision (Lang and Little,
In this section we show the pernicious effects of data loss on the statistical results and on the substantive or clinical conclusions. For this purpose we deliberately deleted data from the complete data sets of two very different real studies. The two studies have two things in common: both have treatment group(s) and a control group, and in both the treatment was applied with extreme care to guarantee treatment integrity and internal validity, while the data were recorded carefully so as not to lose any. For this reason, both have complete data sets (with some nuances that we will detail later). We present each example in the following sections: description of the research and its objective; data analysis, results and conclusions with the complete data set; the conditions under which data loss was generated; and the analysis, results and conclusions obtained with the available data (listwise deletion) under the different data loss conditions. Finally, we present some overall conclusions and particular nuances about the empirical and substantive results derived from the loss conditions manipulated in the two studies. Both the data analysis and the data deletion were carried out with the statistical package SPSS 25. Due to space limitations, the results, the tables referred to in the text, and additional explanations of some paragraphs (indicated in the text as Addenda 1*, 2*, etc.) appear in an attached file: Addenda.
A non-equivalent control group quasi-experimental research was carried out with 3rd and 4th year primary school children to evaluate the effectiveness of an intervention to enable or reinforce self-regulation strategies in learning. This research was presented at the
Both experimental conditions, control group (CG) and treatment group (experimental group, EG), were randomized. A total of 925 children from 14 schools in Oviedo participated. Before the application of the treatment, different tests were administered in collective sessions to assess the students' initial state in the different competences, abilities and attitudes on which the treatment should have a positive effect. The treatment program involved 12 intervention sessions, one 60 min session per week. After the intervention, 915 students were evaluated again with the same initial instruments. The data of 10 students were lost in a completely random way due to the change of residence of their families.
The PROLEC-R Battery was one of the instruments used to evaluate the effectiveness of the intervention (Addenda 1*). Some of the results obtained with the analysis of the pre- and post-measures, hereafter pre-PR and post-PR, were presented in the aforementioned communications, and are the only ones to which we are going to refer. The full results of the research are in the process of being published in different works.
The analysis of the pre-PR measure showed no statistically significant differences between the experimental groups. The "gross" effect of the treatment on the post-PR measure was tested with an analysis of variance, ANOVA (2 × 2 × 2) [EG, CG; boys, girls; 3°P, 4°P]. The change between the post-PR and pre-PR measures and the maturation effect were analyzed with an ANOVA on the change scores (post-PR − pre-PR), hereafter ANOVAChS (2 × 2 × 2). When there are no initial differences between the EG and CG groups, the ANOVA on the post measure and the ANOVAChS on the change scores are both useful and valid analyses in the non-equivalent control group quasi-experimental design (see Fernández et al.,
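The logic of the ANOVAChS can be sketched with simulated data (not the study's): compute the change scores and submit them to an ANOVA, here reduced to a one-way comparison of EG and CG:

```python
import numpy as np

def oneway_F(groups):
    """One-way ANOVA F statistic (and MSe) for a list of 1-D arrays."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = np.concatenate(groups).mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_between = ss_between / (k - 1)
    mse = ss_within / (n - k)
    return ms_between / mse, mse

# Simulated pre/post scores: the treated group gains 5 points on average,
# the control group gains 1 point through maturation alone.
rng = np.random.default_rng(7)
pre_eg = rng.normal(50, 10, 60)
pre_cg = rng.normal(50, 10, 60)
post_eg = pre_eg + 5 + rng.normal(0, 4, 60)
post_cg = pre_cg + 1 + rng.normal(0, 4, 60)

F, mse = oneway_F([post_eg - pre_eg, post_cg - pre_cg])
print(round(F, 2))
```

Working on the change scores removes the stable between-subject differences, which is why the maturation gain common to both groups does not inflate the treatment effect.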
See Tables
The ANOVAChS (2 × 2 × 2) showed that the non-additive model Treatment × Course [T + C + (T × C)] explains best the change experienced between the post-PR and pre-PR measures. Interaction (T × C) [FT×C = 7.85;
The ANOVA results for the post-PR variable show that the Treatment was effective with a moderate effect size, and that the Sex and Course variables explain part of the variance observed in the post-PR measure, although their effect sizes are small, that of Sex being the smaller. The ANOVAChS shows that only the EG experienced a significant change. The 3°P students benefited most from the treatment. The change observed in the CG is a product of maturation and was not statistically significant.
We planned to lose data under five different conditions and two data loss mechanisms (hereafter McL), MCAR and MAR. In each condition we planned to lose data at four percentage rates of data loss (hereafter PdL): 10, 20, 30, and 40% (Addenda 3*).
We have manipulated the following conditions:
MCAR: completely random loss was caused over the total sample, without taking into account the EG and CG groups, Sex, or Academic Course.
MAR1a: we used the variable Sex to cause the data loss. For each PdL, 80% of the lost cases were boys and 20% were girls.
MAR1b: we used the pre-PR measure to cause the data loss. We calculated the P25 and P75 percentiles of the pre-PR variable and segmented it into three categories: below P25, above P75, and between the two percentiles. Subsequently, for each PdL, 75% of the lost cases had a measure below P25, 23% had a measure in the middle segment, and 2% had a measure above P75.
MAR2a and MAR2b: the data loss was the inverse of that manipulated in conditions MAR1a and MAR1b.
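The MCAR and MAR1a deletion schemes described above can be sketched as follows (the scores and the 0/1 coding of Sex are hypothetical; the study itself performed the deletion in SPSS 25):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 925                                  # sample size of the first study
sex = rng.integers(0, 2, n)              # hypothetical coding: 0 girl, 1 boy
post_pr = rng.normal(50, 10, n)          # hypothetical post-PR scores

def drop_mcar(y, rate, rng):
    """Delete a completely random fraction of the records."""
    y = y.copy()
    y[rng.choice(len(y), int(rate * len(y)), replace=False)] = np.nan
    return y

def drop_mar_sex(y, sex, rate, rng):
    """MAR1a-style loss: 80% of the deleted records are boys, 20% girls."""
    y = y.copy()
    n_out = int(rate * len(y))
    n_boys = int(0.8 * n_out)
    y[rng.choice(np.flatnonzero(sex == 1), n_boys, replace=False)] = np.nan
    y[rng.choice(np.flatnonzero(sex == 0), n_out - n_boys,
                 replace=False)] = np.nan
    return y

y10 = drop_mar_sex(post_pr, sex, 0.10, rng)
print(np.isnan(y10).sum())               # 92 records deleted
```

The loss is MAR rather than MNAR because it depends only on a fully observed variable (Sex), never on the value being deleted.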
It is logical to think that students with worse initial performance will respond more poorly to treatment. It is also logical to think that the response to treatment will be similar in boys and girls. Thus, the MAR1b and MAR2b data losses will have a more pernicious effect on the results than the MAR1a and MAR2a losses (Addenda 4*).
Taking into account the McL at the four PdL, we examined the consequences of data loss on four statistics given by the two analysis models, ANOVA and ANOVAChS, using the PR variable: in the ANOVA, only the main effect of the Treatment; in the ANOVAChS, only the simple effect T × C [C] in the 3°P group. The statistics observed are the mean square error (MSe), F, η2, and MD.
First we will observe what happens under each McL as a function of the PdL, through the empirical value of the statistics and the percentage of bias (Addenda 5*) with respect to the empirical value obtained with the complete data set (top left and top right of Tables
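The percentage of bias used throughout this section is simply the relative deviation of the estimate obtained from the available data with respect to the estimate from the complete data, expressed as a percentage (the numbers below are hypothetical):

```python
def pct_bias(est_available, est_complete):
    """Percentage of bias of an estimate obtained from the available data
    with respect to the estimate from the complete data set."""
    return 100 * (est_available - est_complete) / est_complete

# Hypothetical example: MSe = 80.4 after 30% MAR loss vs. 95.0 with the
# complete data; the negative sign indicates underestimation.
print(round(pct_bias(80.4, 95.0), 1))    # -15.4
```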
MSe: when the McL is MCAR, the MSe remains close to the MSe obtained with the CDs, and stable (very small SD), for all PdL. However, it undergoes a progressive reduction with respect to the estimate with CD (hereafter, w.r.e CD) as the PdL increases when the McL is MAR1a or MAR1b; in the latter the reduction is greater. The percentage of bias shows this behavior more clearly. In this particular case, the more severe the McL, the smaller the MSe w.r.e CD and the more vulnerable its estimate is to the PdL.
η2 and MD: when the McL is MCAR, both statistics stay close to the values obtained with the CDs. They show no trend as a function of the loss rate. Averaged over the set of loss rates, both η2 and MD match the values obtained with the CDs. When the McL is MAR1a, both statistics are reduced in value and, although they show no clear trend as a function of the loss rate, they experience greater variability than under MCAR. When the mechanism is MAR1b, the result is even more sensitive to the loss rate (higher CV than in MAR1a), but similarly so for both statistics. However, although both η2 and MD have a lower average estimate w.r.e CD, MD experiences a greater reduction for PdL ≥ 30% (see percentage of bias).
F: under all McL, the F value undergoes a progressive reduction as the PdL increases. The reduction with respect to the CDs is greater when the McL is MAR1a or MAR1b at the rates of 10, 20, and 30%. When the PdL is 40%, the percentage of bias is the same under all McL. The CV is high under all three McL.
If we now focus on the McL MAR2a and MAR2b, the empirical results are the inverse of those observed under MAR1a and MAR1b. The effect exerted by the PdL under each McL is the same, but empirically the MSe increases as the PdL increases, more markedly in MAR2b. In the same way, both η2 and MD are greater.
The results for MSe and F are similar to those obtained in the ANOVA described above. However, the results for η2 and MD present some nuances with respect to the ANOVA results. The detailed results are shown in Addenda 7*.
The substantive conclusions about the effect of the Treatment when the McL is MAR1b or MAR2b differ in many nuances from the conclusions derived from the analysis with the CDs, but in no case would they lead to conclusions opposite to those obtained with the CDs. The last section elaborates on this point.
A completely randomized experiment was carried out to study the efficacy of two psychological therapies in the treatment of substance use disorder (SUD): acceptance and commitment therapy (ACT) and cognitive-behavioral therapy (CBT) (Villagrá et al.,
The first step to test the working hypothesis was to examine the possible existence of selection bias. Next, assuming that the causal model is a model of change (Judd and Kenny,
The results of the research show that the women who received treatment benefited by the interventions. At post-treatment, CBT was more effective than ACT in reducing anxiety sensitivity, however, at follow-up, ACT was more effective than CBT in reducing drug use (43.8 vs. 26.7%) and improving mental health (26.4 vs. 19.4%).
In the cited work, all of the details about the investigation and its results can be found. In this case, to show how the statistical results and substantive conclusions are altered when data loss occurs, we will analyze only three measures, ASI
With respect to the variables ASI
Results ASI
Results ASI
Results AAQ-II. See Addenda 9*.
The results obtained with the complete data on these three variables match those of the original research, and therefore it can be concluded that ACT may be an alternative to CBT for the treatment of drug abuse and associated mental disorders. In fact, in the long term, ACT may be more appropriate than CBT for women in prison with severe problems.
With such a small sample, it is illogical to plan a loss rate of 40%; if that occurred, the researcher would probably be doing things wrong (see Addenda 10*). Because the previous example showed clear differences between the PdL of 10 and 30%, we decided to manipulate only these two PdL. We therefore planned to lose data under three different conditions, under the McL MCAR and MAR.
MCAR: completely random loss has been caused in the total set of the sample without taking into account the ACT, CBT, and CG groups.
As discussed above, 49.9% of the people who used drugs refused to participate in the research. Most of them had been using drugs for many years, as was the case of the person who abandoned the research. For this reason we planned to lose data based on the variable "years of dependence." This variable is normally distributed (its percentile values are P25 = 10, P50 = 16.5, and P75 = 20.25).
We planned two PdL MAR, 10 and 30%. In both conditions, the loss in each group was made according to the percentage of sample that each group represented of the total sample. Details are shown in the lower part of Table
MAR 10%: the 10% loss occurs only among subjects who had been consuming more than 20 years (above P75).
MAR 30%: loss occurs in the full range of the variable “years of dependence,” but the greatest amount was lost in those who had a longer period of dependence.
We examined the consequences of the data loss, according to the McL and both PdL, on the empirical results of the four statistics provided by the ANOVAChS model, MSe, F, η2, and MD, in the same way as in the first research.
The results will be shown differently from the first example. First we will comment on the empirical results and the statistical conclusions, and then, as a block, on the bias that occurs in the estimation of MSe, F, η2, and MD. The small sample size forced us to focus attention, more intensively than in the previous example, on the substantive consequences of the statistical reading and on the variation in the magnitude of the means.
Variable ASI
However, it should be added that only under MCAR with a PdL of 10% is the substantive conclusion the same as with the CDs (observe the change rates, hereafter ChR). Although under MAR with a PdL of 10% we arrive at the same statistical conclusion, we should admit, reading the ChR, that both ACT and CG worsen in the same way.
The conclusions we reached were the same for both analyses: the one with complete data and the 6-month follow-up under the manipulated data loss conditions. There is no statistically significant change in any of the groups. However, reading the CDs, it is necessary to note the tendency toward improvement in ACT and the tendency to lose the effect gained with the treatment in CBT. Again, this holds only for MCAR with a PdL of 10%. Under MCAR with a PdL of 30% we would conclude that the CG remains almost as at the beginning; under MAR with a PdL of 10% we would conclude that ACT worsens; and under MAR with a PdL of 30% we would conclude that the three groups behave in practically the same way.
Variable ASI Cognitive: at post-treatment (see Table
At the 6-month follow-up under the manipulated conditions of data loss, when McL is MCAR we arrive at the same conclusions as with CDs. When McL is MAR and PdL is 10%, although we conclude that there are differences between ACT and CG, as with CDs, we also conclude that there are differences between CBT and CG, and this is so because CG continues to experience a progressive deterioration after 6 months (which is not appreciated in CDs).
Variable AAQ-II. See Addenda 11*.
Observing the percentage of bias that occurs in the ANOVAChS (post-pre) estimates for ASI
Observing the percentage of bias that occurs in the ANOVAChS estimates (6m-pre) in the ASI
With regard to the AAQ-II variable see Table
It should be pointed out that:
1.—We have verified that MSe varies as a function of both PdL and McL: the bias is greater the higher the PdL, and it is also greater when the McL is more severe. When both PdL and McL are more aggressive, the bias suffered by MSe is even greater. This happens in both models of data analysis, ANOVA and ANOVAChS, in the first research, and in the ANOVAChS in the second research.
2.—We have verified that for the ANOVA model the bias experienced by η2 and MD follows the same tendency as the bias experienced by MSe, although to a lesser extent. The variability suffered by η2 and MD in ANOVAChS is milder.
3.—In both investigations we have verified that the MCAR McL is the least aggressive, even innocuous when PdL is ≤ 30% in the first research (see some exceptions in Addenda 13*). However, in the second research, under MCAR, a PdL of 30% affects the result in a very important way. This inevitably highlights the importance of the sample's size and composition: the first research has a very large and very homogeneous sample, and the second a very small and very heterogeneous one (see some reasons in Addenda 14*). This is why, even when McL is MCAR, a loss of 10% is sometimes not innocuous for all variables (see results ASI
4.—In both investigations we have verified that the MAR mechanism is more aggressive than MCAR. We have also verified the following:
—Their aggressiveness is directly related to PdL; that is, the results are more affected as the loss rate rises. We have seen that the model that best fits the data is different from the one obtained with CDs, sometimes in both investigations (see Addenda 15*). Again, we have to consider the importance of sample size and composition. While in the research carried out in schools neither the effect of the treatment variable nor the substantive or clinical conclusions are modified when the model changes (we have obviously seen that there are different nuances), in the research done with drug-dependent women both the statistical result and the substantive conclusions are radically modified.
—The pernicious effects of data loss are greater when McL is MAR2 than when McL is MAR1. The reason is simple. In MAR1 the loss of data is conditioned on the Sex variable, and although there are differences between boys and girls in the response to the dependent variable (the response is better and more homogeneous in girls), the response to Treatment is the same in boys and girls (see Addenda 16*). However, in MAR2 in the first research, and in MAR in the second research, something very different happens. The loss of data is conditioned on the initial PR measurement in the first research and, in the second research, on the number of years that the women in prison had been consuming; both variables are capable of determining the decision to abandon the investigation and thereby provoke a selection bias that exerts a very negative effect on both the statistical results and the substantive or clinical results (see Addenda 17*). In MAR1b the available subjects are the students with the best initial PR measurements; their response to the treatment is expected to be more homogeneous, and that is what happens, which is why the MSe is very small with respect to CDs. Because the most advantaged subjects remain in both groups, EG and CG, η2 and MD are lower than those found with the CDs. In MAR2b the available subjects are the students with the worst initial PR measurements; their response to the treatment is more heterogeneous, and that is why the MSe is greater with respect to CDs. Because the subjects remaining in both groups, EG and CG, are less advantaged, η2 and MD are greater than those found with CDs. In MAR in the second research this is appreciated at both PdL, but much more clearly at PdL 30%.
The available subjects have been using drugs for less time and have a more homogeneous response to treatment, which is reflected in an MSe smaller than that found with CDs, and in clinical conclusions that are absolutely contradictory to those emerging from the results found with the complete data. This is shown, in a very aggressive way, in the results and conclusions of ASI
5.—Points 3 and 4 above show the serious effects that the loss of data can have on statistical results, and the distorted conclusions we arrive at when the loss of data is MNAR. When we manipulated the conditions of data loss in the two previous examples, we labeled the MAR conditions as MAR because we know on which variable the losses are conditioned. If such losses occur in our research and we do not know the causes that determine them, the losses MAR1a and MAR2a are MAR losses, and it is possible to extract the information they provide from the complete data set. But the losses MAR1b and MAR2b in the first research, and the MAR losses in the second one, are MNAR losses, and the data analysis becomes highly complicated. Success will depend on our skill in choosing the variables that determine the losses, and on having kept a record of them that allows us to introduce them into the imputation or maximum likelihood models (in this way we convert an MNAR loss into a MAR loss); in addition, the loss of data in other variables may depend on different causes (see Addenda 18*). The sample size is very important too (these techniques behave better when the sample size is large), as is the sensitivity of the dependent variables to other variables also related to the treatment (if the sample is homogeneous, the analysis will be easier and the results will be better). In any case, we are obliged to perform complex sensitivity analyses. And even then, the result will always be uncertain.
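The strategy described in point 5 — recording the variable that drives the loss and conditioning on it in the analysis — can be illustrated with a minimal sketch. The data and names here are invented; only the logic (loss conditioned on a recorded auxiliary variable, as with years of consumption in the second research) follows the text. A real analysis would use multiple imputation or maximum likelihood plus the sensitivity analyses recommended above, since a single regression imputation understates uncertainty.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300

# Invented data: the outcome depends on an auxiliary variable ("years")
# that also drives the drop-out, as in the prison study described above.
years = rng.gamma(4.0, 4.0, n)
outcome = 50 - 0.8 * years + rng.normal(0, 5, n)

# MAR loss: longer-dependence cases are more likely to be missing.
miss = rng.random(n) < np.clip(0.3 * years / years.mean(), 0.0, 1.0)

# Listwise deletion: the retained cases are mostly shorter-term users,
# so the complete-case mean is a biased estimate of the full-data mean.
cc_mean = outcome[~miss].mean()

# Regression imputation using the recorded cause of the loss: because
# "years" is observed for everyone, conditioning on it recovers the
# information the MAR mechanism left in the complete cases.
X = np.column_stack([np.ones(n), years])
beta, *_ = np.linalg.lstsq(X[~miss], outcome[~miss], rcond=None)
filled = outcome.copy()
filled[miss] = X[miss] @ beta
print(outcome.mean(), cc_mean, filled.mean())
```

The same conditioning is what fails under MNAR: if the cause of the loss was never recorded, there is no observed column to put into `X`, and only sensitivity analyses can bound the damage.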
Where there should be data and there are none…there is only uncertainty. We now suggest that you read the experts' recommendations and advice again, and take them into account the next time you have to investigate.
The universe of missing data is large and complex. We have attempted to provide a simple, coherent, and reasoned presentation of this enormous problem so that it will be useful to applied researchers. With our paper, we have attended to one of the requests presented in the manual The Prevention and Treatment of Missing Data in Clinical Trials (National Research Council,
Although we have avoided all mathematical formulation, we could not avoid the appearance of terms with which the applied researcher may not be too familiar (
We have noted that there is no perfect or infallible way to deal with the problem once the research has already been carried out. Without a doubt, the best way to approach the objectives of our study and to test our hypotheses is to combine three things: the humility to recognize the problem, the time to study our database in depth, and the decisiveness to seek out a methodological expert in order to attempt to solve the problem together.
MF-G and GV-S developed the initial idea and design of the work and wrote the article. PL-R and ET-H were in charge of drafting the manuscript and revised it critically for important intellectual content. All four authors provided final approval of the version to be published and agreed to be accountable for all aspects of the work in ensuring that questions related to the integrity of any part of the work were appropriately resolved. The four authors have read and followed the Frontiers in Psychology Instructions for Authors, and the paper has been seen and approved by all authors.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
The Supplementary Material for this article can be found online at: