The (Ir)Responsibility of (Under)Estimating Missing Data

It is practically impossible to avoid losing data in the course of an investigation, and it has been shown that the consequences can be severe enough to invalidate the results of a study. This paper describes some of the most likely causes of missing data in clinical psychology research and the consequences they may have for statistical and substantive inferences. When the missing information must be recovered, data analysis can become extremely complex. We summarize the experts' recommendations regarding the most powerful procedures for this task, the advantages each has over the others, the elements that can or should influence our choice, and the procedures that are not recommended except in very exceptional cases. We conclude by offering four pieces of advice on which all the experts agree and which must be kept in mind at all times in order to proceed with the greatest possible success. Finally, we show the pernicious effects of missing data on the statistical results and on the substantive or clinical conclusions. For this purpose, we deliberately removed data at different percentage rates under two mechanisms of data loss, MCAR and MAR, from the complete data sets of two very different real studies, and then analyzed the available data using listwise deletion. One study used a quasi-experimental non-equivalent control group design, and the other a completely randomized experimental design.
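The planned-loss procedure summarized above can be illustrated with a minimal sketch. The function names (`mcar_mask`, `mar_mask`, `listwise_delete`) and the rank-weighted MAR rule are assumptions made for illustration only; they are not the procedure used in the two studies, which is not fully specified here.

```python
import numpy as np

def mcar_mask(n, rate, rng):
    """MCAR: mark round(rate * n) cases as lost, chosen uniformly at random."""
    n_lost = int(round(rate * n))
    lost = np.zeros(n, dtype=bool)
    lost[rng.choice(n, size=n_lost, replace=False)] = True
    return lost

def mar_mask(covariate, rate, rng, low_values_at_risk=True):
    """MAR (illustrative rule): the chance of loss depends on an OBSERVED
    covariate, e.g. a pre-treatment score. Cases are drawn without
    replacement with weights that decrease linearly with the covariate rank.
    This weighting scheme is an assumption, not the authors' mechanism."""
    covariate = np.asarray(covariate, dtype=float)
    n = covariate.size
    n_lost = int(round(rate * n))
    ranks = np.argsort(np.argsort(covariate))  # 0 = smallest value
    weights = (n - ranks) if low_values_at_risk else (ranks + 1)
    p = weights / weights.sum()
    lost = np.zeros(n, dtype=bool)
    lost[rng.choice(n, size=n_lost, replace=False, p=p)] = True
    return lost

def listwise_delete(data, lost):
    """Keep only the cases (rows) that were not marked as lost."""
    return np.asarray(data)[~lost]

# Hypothetical usage with simulated pre and post scores.
rng = np.random.default_rng(0)
pre = rng.normal(50, 10, size=200)
post = pre + rng.normal(5, 5, size=200)
data = np.column_stack([pre, post])
available_mcar = listwise_delete(data, mcar_mask(200, 0.20, rng))
available_mar = listwise_delete(data, mar_mask(pre, 0.30, rng))
```

Under the MCAR mask the available cases remain a random subsample of the complete data, whereas under the MAR mask cases with low pre scores are under-represented among the available data.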


6*
The average value allows us to compare the deviation with respect to the empirical value obtained with the complete data. The standard deviation allows us to examine how sensitive a given McL is to the loss rate. The coefficient of variation allows us to examine in which McL, for which observed statistic, and under which analysis model (ANOVA or ANOVAChS) the vulnerability or sensitivity to the PdL is highest. Because the research was carried out very carefully and because we know the results obtained with the complete data, a logical and coherent explanation can be given for all the results found. They fit well with both the methodological and the substantive literature, as will be shown later in the text. To make this point clearer, we first focus on the results found in the MCAR, MAR1a and MAR1b conditions, and then on the MAR2a and MAR2b results.

7* With respect to the ANOVA on the change scores (see Table 3).
MSe: when the McL is MCAR or MAR1a, the empirical estimate of the MSe remains close to the MSe obtained with the CD, and it can be considered relatively stable for all PdL. However, when the McL is MAR1b, the MSe undergoes a progressive reduction with respect to the CD as the PdL increases. The percentage of bias highlights this behavior more clearly. In this particular case, the estimate of the MSe moves away from the MSe obtained with the CD and depends on the PdL (higher CV) only when the McL is MAR1b.
η2 and the MD: when the McL is MCAR, both statistics stay close to the value obtained with the CD and show no trend as a function of the PdL. For both η2 and the MD, the mean across loss rates practically coincides with the value obtained with the CD. When the McL is MAR1a, both statistics show an increase in their average value and, although they do not follow a clear trend as a function of the loss rate, they vary more with it than under MCAR.
When the mechanism is MAR1b, the result is even more sensitive to the loss rate (higher CV than in MAR1a). This happens to a greater extent in η2 than in the MD (there is a large difference in CV). Although both have a lower average estimate than with the CD, they move away from that value to the same extent as under the McL MAR1a (compare the percentages of bias).
F: in all McL the F value undergoes a progressive reduction as the PdL increases. The reduction with respect to the CD is greater when the loss mechanism is MAR1a or MAR1b at the 10%, 20% and 30% rates. When the loss rate is 40%, the percentage of bias is the same under all loss mechanisms. The CV is high in all three McL.
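As a reminder of how the four statistics discussed above relate to one another, the following is a minimal sketch of a one-way ANOVA on change scores. It is a simplification and an assumption: the models in the first study include further factors (Treatment, Sex and Course), and the function name `anova_change_scores` and the definition of the MD as the difference between two group means are hypothetical choices made for this example.

```python
import numpy as np

def anova_change_scores(pre, post, group):
    """One-way ANOVA on change scores (post - pre).

    Returns MSe (error mean square), F, eta squared (SS_between / SS_total)
    and, when there are exactly two groups, the mean difference MD between
    the second and the first group label."""
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    group = np.asarray(group)

    change = post - pre
    groups = [change[group == g] for g in np.unique(group)]
    grand_mean = change.mean()

    # Partition of the total sum of squares.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between = len(groups) - 1
    df_within = change.size - len(groups)

    mse = ss_within / df_within
    f_value = (ss_between / df_between) / mse
    eta2 = ss_between / (ss_between + ss_within)
    md = groups[1].mean() - groups[0].mean() if len(groups) == 2 else None
    return {"MSe": mse, "F": f_value, "eta2": eta2, "MD": md}

# Hypothetical toy data: control group (0) and treatment group (1).
result = anova_change_scores(
    pre=[0, 0, 0, 0, 0, 0],
    post=[1, 2, 3, 4, 5, 6],
    group=[0, 0, 0, 1, 1, 1],
)
```

Because F is the ratio of the between-group mean square to the MSe and its error degrees of freedom shrink with the sample, it can fall as cases are deleted even when the MSe, η2 and the MD change little.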
If we focus on the McL MAR2a and MAR2b, we observe that the average value of the MSe increases with respect to the analysis of the CD as the PdL increases, with a greater increase in MAR2b. However, the values of η2 and the MD vary very little.

8* ASI, Anxiety Sensitivity Index (Peterson & Reiss, 1992) and AAQ-II, Acceptance and Action Questionnaire II (Hayes, Follette, and Linehan, 2004).

10* It was possible to test the treatment effect on the 50 subjects who participated in the research, but six months later 9 subjects had dropped out. Only one of these was a non-random loss. The effort and care taken in recording the measurements prevented a greater non-random loss, which was a difficult task.
11* Variable AAQ-II: Once the treatment is finished (

13* See Table 3: in the first study, the ANOVAChS concludes that the best-fitting model for a 40% PdL is the additive model.
14* The first study works with a very large and homogeneous sample: students of the same age, without serious learning difficulties, who study and live in the same city. All of them receive treatment at school with the informed consent of parents and teachers. The second study works with a very small and heterogeneous sample. The women who participated did so voluntarily. It should be borne in mind that the available prison population was 98 women, and half of them refused to participate. The variability of this sample can be seen in many variables, among them the years of drug use, the type of drugs consumed, the living conditions they have had, and the comorbidity with other physical and mental pathologies.

15*
In the first study, the results of the ANOVA conclude that the best-fitting model when the PdL is 40% is not the same as the model supported by the CD in MAR1a, MAR2a and MAR2b. The results of the ANOVAPC in MAR2a and MAR2b when the PdL is 30% lead to the same conclusion. The same happens in the second study in the ANOVAPC with the results for the ASI total and ASI Cognitive variables.

16*
Many studies have shown that boys and girls have the same ability to learn. However, boys are less disciplined than girls at the age studied. For this reason, the response to treatment is expected to be more homogeneous in MAR1a, where the available sample consisted mostly of girls, than in MAR2a, where it consisted mostly of boys. This is what the MSe reflects in both the ANOVA and the ANOVAChS: in the first case (MAR1a) the MSe is slightly lower than that found with the CD, and in the second case (MAR2a) it is slightly higher. This also explains why η2 and the MD show very little bias with respect to the CD in the ANOVA.
17* A low pre-PR measure may lead students to conclude that the treatment is not worthwhile because they will not benefit from it, and they may therefore become discouraged and leave the study. A person who has been addicted to drugs for many years may think that at this point in life there is no use in changing because there is nothing to gain; motivation is therefore very low and the chances of dropping out are very high. Or a person who has been using drugs for many years may simply fail to keep the commitment "not to use drugs while the research goes on" and, to avoid being caught, skip the blood test, so that we lose data (and may therefore have an intermittent loss of data). All of this causes selection bias, which has a very negative effect on both the statistical and the substantive or clinical results. This is what happens in MAR1b and MAR2b in the first study, and in MAR in the second.

18*
We have seen that the causes of data loss need not be the same for all the variables with lost data. This means that, if imputation is the chosen approach, estimating the lost data will probably require a different imputation model for each variable.

Note. MmD = mechanism of missing data; N = total sample size; n = sample size of each group; np = number of students lost; 10%, 20%, 30% and 40% = planned missing rates; 1 = actual overall missing rates; P25 and P75 = percentile values of the pre variable (pre Prolec); * = the losses MAR2a and MNAR2b occur inversely to the losses MAR1 and MNAR1; 2 = random loss based on N without taking into account the groups GE and CC; 3 = loss in each group according to the percentage it represents of N.

Note. Mr = planned missing rates; CD = complete data set (N = 915); CME, F, η2 and DM = statistics defined in the text of the article; 1 = the model that best fits the data is not the Additive Model T+S+C (Treatment, Sex and Course) but the Additive Model T+C; 2 = percentage of bias, calculated as the difference between the incomplete data estimate and the complete data estimate divided by the complete data estimate; M = mean; SD = standard deviation; CV = coefficient of variation.

Note. ACT = acceptance and commitment therapy; CBT = cognitive-behavioral therapy; CG = control group. TC1 and TC2 = change rate between the post and pre measurements, and between the measurement recorded 6 months after the treatment and the pre measurement, respectively.
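The percentage of bias defined in the table note, together with the mean, SD and CV taken across the planned loss rates, can be computed as in the following minimal sketch. The function names are hypothetical, and two details are assumptions not stated in the text: the SD is taken as the sample SD (n - 1 in the denominator), and the CV as the SD divided by the absolute value of the mean.

```python
import numpy as np

def percent_bias(incomplete_estimate, complete_estimate):
    """Percentage of bias: (incomplete - complete) / complete, times 100."""
    return 100.0 * (incomplete_estimate - complete_estimate) / complete_estimate

def summarise_across_rates(estimates):
    """Mean, sample SD and CV of one statistic across the planned loss rates."""
    est = np.asarray(estimates, dtype=float)
    mean = est.mean()
    sd = est.std(ddof=1)   # sample SD (assumption)
    cv = sd / abs(mean)    # CV as SD / |mean| (assumption)
    return mean, sd, cv

# Hypothetical estimates of one statistic at three loss rates.
mean, sd, cv = summarise_across_rates([8.0, 10.0, 12.0])
bias = percent_bias(90.0, 100.0)
```

A negative percentage of bias indicates that the estimate from the incomplete data falls below the complete data estimate, and a larger CV indicates greater sensitivity of the statistic to the loss rate.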