A Machine Learning Approach to Personalize Computerized Cognitive Training Interventions

Executive functions are a class of cognitive processes critical for purposeful, goal-directed behavior. Cognitive training, the targeted stimulation of executive functions, has been extensively studied and applied for more than 20 years. However, there is still no solid consensus in the scientific community about its potential to elicit consistent improvements in untrained domains. Individual differences are considered one of the most important factors behind inconsistent reports on cognitive training benefits, as differences in cognitive functioning are both genetic and context-dependent, and might be affected by age and socioeconomic status. Here we present a proof of concept based on the hypothesis that baseline individual differences among subjects provide valuable information to predict the individual effectiveness of a cognitive training intervention. With a dataset from an investigation in which 73 6-year-olds trained their executive functions with a fixed protocol using online software freely available at www.matemarote.org.ar, we trained a support vector classifier that successfully predicted (average accuracy = 0.67, AUC = 0.707) whether a child would improve, or not, after the cognitive stimulation, using baseline individual differences as features. We also performed a permutation feature importance analysis, which suggested that all features contribute equally to the model's performance. In the long term, these results might allow us to design better training strategies for those players who are less likely to benefit from the current training protocols, in order to maximize the stimulation for each child.


INTRODUCTION
If you encounter an ad claiming "Do you want to improve your memory? With these brain exercises you will see changes in less than x time! Scientifically tested method!", what would you think? Unfortunately, to date, there is still no well-described and thoroughly tested method that consistently improves cognitive processes (Dorbath et al., 2011; Au et al., 2014; Buttelmann and Karbach, 2017). Even though over the last 25 years many cognitive or brain training protocols have been put to the test and shown positive outcomes (Hsu et al., 2014; Diamond and Ling, 2016; Klingberg, 2016; Buttelmann and Karbach, 2017), many others show the opposite results (Melby-Lervåg et al., 2016; Simons et al., 2016; Sala et al., 2019). Hence, a consensus on a "brain training recipe" seems improbable, especially considering the plethora of divergent results (Schwaighofer et al., 2015; Aksayli et al., 2019; Vladisauskas and Goldin, 2021).
Many of the successful examples of how cognitive training can benefit cognition show that stimulation can have a positive impact on Executive Functions (EF) (Anguera et al., 2014; Goldin et al., 2014; Karbach and Unger, 2014; Segretin et al., 2014; Klingberg, 2016; Spencer-Smith and Klingberg, 2017; Wiemers et al., 2019). EF are a group of cognitive processes critical for purposeful, goal-directed behavior, such as the ability to set a goal, to make a plan and stick to it, and to have the flexibility to change that plan, or even the original goal, if priorities change.
EF mature with the great variety of stimuli and experiences that we undergo from birth, and continue to develop throughout life (Colé et al., 2014; Delalande et al., 2020; Johann and Karbach, 2020). While this implies that some aspects of life might act as negative modulators of EF development, such as early vulnerability and prenatal malnutrition (McDermott et al., 2012; McGaughy et al., 2014; Deater-Deckard et al., 2019; Howard et al., 2020), it also means that proper life experiences can have positive effects on EF and, consequently, improve academic performance and other general life outcomes. In fact, several studies show that EF development predicts not only school performance, but also a broad array of life outcomes such as mental and physical health (McDermott et al., 2012; Miyake and Friedman, 2013; Diamond, 2020). One of the most frequent cognitive training strategies is to specifically and progressively challenge EF through games; it has proven to be a powerful positive modulator, particularly relevant during childhood, when behavioral and neural plasticity are intense (Sigman et al., 2014; Steinbeis and McCrory, 2020).
Mate Marote is a free, open-access cognitive training software aimed at children between 4 and 8 years old. It consists of a set of computerized games specifically tailored to train and evaluate EF. Over the last 13 years, several supervised interventions have been performed with this software in schools. The training has been shown to improve EF (e.g., Goldin et al., 2013; Nin et al., 2019) and to elicit transfer to real-world measures of school performance.
The training process involves games developed to target individual EF. In each intervention, always conducted in educational settings, a particular set of cognitive skills is trained for 10 to 15 min, one to three times a week, over several weeks. Performance on these and similar domains is measured before and after the training to test for cognitive changes and to evaluate the effectiveness of the training process. Cognition is also evaluated with games, which are adaptations of standardized cognitive tests that have been used extensively in the literature, such as a version of the Stroop test to assess inhibitory control and cognitive flexibility (Davidson et al., 2006), or the Child-ANT task to measure attentional networks (Rueda et al., 2004).
Our aspiration is that every child gets the most out of their cognitive training time, and, even though many training games have proven effective, the question of who benefits the most, and why, remains open (Albert et al., 2020; Steinbeis and McCrory, 2020). Recent research suggests that the conflicting reports on cognitive training could be caused by individual differences among the subjects who take part in each intervention. For instance, developmental age or the state of cognition prior to the stimulation could be key to understanding why cognitive training does not always work for everyone (Guye et al., 2017). It is therefore intuitive to consider these potential modulators when building training protocols (Green et al., 2019). In this line of research, we wonder: how can those differences be taken into account to elicit better stimulation for every brain? (Jaeggi et al., 2013; Karbach et al., 2017; Shani et al., 2019; Rennie et al., 2020).
In this study, we propose an initial approach toward cognitive training personalization, using machine learning algorithms to identify subjects who will (or will not) benefit from a certain protocol of cognitive stimulation. In other words, can the baseline individual state of cognition predict how much a participant will benefit from a certain intervention?

METHODS
We aimed to take a first step toward personalizing interventions by predicting the potential benefits of a cognitive training protocol, taking into account baseline individual qualities of the participants (Shani et al., 2019, 2021; Rennie et al., 2020).

DATA
We trained and tested classifiers to predict whether a child would benefit, or not, from a fixed cognitive stimulation strategy. To build these algorithms, we used a small dataset from a past intervention performed with Mate Marote's online platform. This dataset includes the performance of 73 typically developing 6-to-7-year-old children (33 girls) in one cognitive training intervention (the same for all children). The intervention involved three sequential stages (Figure 1): (A) a Pretest (or baseline), in which children's EF and attentional capabilities were measured with a battery of standardized tests; (B) a Training stage, in which children played several games designed to challenge their EF (referred to as "the cognitive training protocol"); and, finally, (C) a Posttest stage, in which children's cognition was evaluated again.
During the Training stage, participants played three adaptive computer games aimed at training EF (specifically, working memory, planning, and inhibitory control skills). Children played at their own schools only one game in each 15-min session, and a total of no more than three sessions per week. The three games alternated for all children throughout the intervention. More details of the intervention, together with precise descriptions of the training and evaluation games, are available in Goldin et al. (2014) and Nin et al. (2019).
A week before the beginning of the training and 1 week after the last playing session, all children took a battery of standard tests (Pre- and Posttest, respectively). The included tests evaluated: attentional networks (Child-ANT task; Rueda et al., 2004), inhibitory control and cognitive flexibility (the Hearts and Flowers task; Davidson et al., 2006), planning (Tower of London task; Phillips et al., 2001), and spatial working memory (Corsi Block Tapping Task; Kessels et al., 2000; Fischer, 2001).

MODEL

Features
Performance in each task depends on the prior state of cognition. As mentioned earlier, baseline individual differences might carry information on how the subsequent cognitive training will work, which turns each Pretest measure into a potentially useful feature. We obtained a total of 12 Pretest measures from every participant and used those values as features to train multiple classifiers (see Supplementary Table S1 for a detailed description of every individual feature, covering attentional resources, inhibition, cognitive flexibility, and planning). The selected features represent different dimensions of each participant's baseline cognition (i.e., obtained during the Pretest); each evaluation task yielded more than one measure.
Prior to training the classifiers, every individual feature was rescaled using Sklearn's RobustScaler (which uses the interquartile range, a statistic robust to outliers) and then normalized into the range 0-1.
Pairwise comparisons between feature average values at Pretest were made using the non-parametric Mann-Whitney U test (McKnight and Najab, 2009).
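A minimal sketch of such a comparison with SciPy; the group sizes mirror the study's class counts, but the values are synthetic.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Illustrative Pretest values for one feature, split by final class
# (29 "Improved" vs. 44 "Not improved" participants).
improved = rng.normal(50, 10, size=29)
not_improved = rng.normal(50, 10, size=44)

# Two-sided Mann-Whitney U test: does this feature's distribution
# differ between the two groups?
u_stat, p_value = mannwhitneyu(improved, not_improved, alternative="two-sided")
```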

Classes
We constructed two classes, "Improved" and "Not improved," aiming to show whether participants improved, or not, after cognitive training. To assess the existence of improvement after cognitive training, for each feature we calculated a Reliable Change Index (RCI), as proposed by Jacobson and Truax (1991). The threshold for reliable change is 1.96 times the standard error of the difference between scores of a measure administered before and after the cognitive training (Pre- and Posttest, respectively). Of the many versions available, we used the method proposed in Estrada et al. (2019), specified as:

RCI_i = D_i / sqrt(S_pre^2 + S_post^2 − 2 · S_pre · S_post · R_PrePost)

where D_i is the individual pre-post difference; S_pre and S_post are the standard deviations at Pretest and Posttest, respectively; and R_PrePost is the internal consistency of the measure. The latter was obtained by calculating Cronbach's alpha following the procedure described in Cronbach (1951). As every participant completed the same standardized cognitive tests twice, at Pre- and Posttest, by comparing the performance metrics between both stages we could evaluate whether there were changes after the cognitive training. Hence, for every pair of Pre- and Posttest values, we calculated the RCI and concluded whether there had been an improvement (if the result was higher than 1.96), a deterioration (if the result was lower than −1.96), or no reliable change between the measures (a result between −1.96 and 1.96).
To obtain the final class for every participant, we counted the number of improvements and compared it to the number of measures in which a deterioration was observed. If more variables showed improved performance, the subject was labeled "Improved." If the number of deteriorated variables was equal to or greater than the number of improvements, the subject was labeled "Not improved."
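A minimal sketch of this labeling step, assuming the difference-score denominator sqrt(S_pre^2 + S_post^2 − 2·S_pre·S_post·R) for the RCI (one of the variants discussed by Estrada et al., 2019); data and reliability values below are synthetic.

```python
import numpy as np

def rci(pre, post, r_prepost):
    """Reliable Change Index for one measure, across all participants.

    Assumes the denominator sqrt(S_pre^2 + S_post^2 - 2*S_pre*S_post*r),
    where r is the measure's internal consistency (Cronbach's alpha).
    """
    d = post - pre  # individual pre-post differences (D_i)
    s_pre, s_post = pre.std(ddof=1), post.std(ddof=1)
    s_diff = np.sqrt(s_pre**2 + s_post**2 - 2 * s_pre * s_post * r_prepost)
    return d / s_diff

def assign_class(rci_matrix):
    """Label a participant 'Improved' if reliable improvements (> 1.96)
    outnumber reliable deteriorations (< -1.96), else 'Not improved'."""
    improvements = (rci_matrix > 1.96).sum(axis=1)
    deteriorations = (rci_matrix < -1.96).sum(axis=1)
    return np.where(improvements > deteriorations, "Improved", "Not improved")
```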

The Supervised Algorithms
We performed hyperparameter tuning with Sklearn's GridSearchCV tool to select the optimal hyperparameter values for a set of six classification algorithms. In alphabetical order, the trained algorithms were: Gradient Boosting (Natekin and Knoll, 2013), K Nearest Neighbors (Laaksonen and Oja, 1996), Multi Layer Perceptron (Suykens and Vandewalle, 1999), Perceptron (Raudys, 1998), Random Forest (Breiman, 2001), and Support Vector Classifier (Lau and Wu, 2003). Afterwards, we compared accuracy among the models, calculated within the grid search using the optimal hyperparameter values.
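The tuning step can be sketched as follows for one of the six classifiers; the data are synthetic stand-ins with the study's dimensions, and the parameter grid is illustrative (the actual grids and optimal values are in Table 1 and Supplementary Table S2).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data: 73 participants, 12 baseline features.
X, y = make_classification(n_samples=73, n_features=12, random_state=0)

# Exhaustive search over an illustrative hyperparameter grid,
# scoring each combination by cross-validated accuracy.
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X, y)
best_model, best_score = grid.best_estimator_, grid.best_score_
```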

Validation
To obtain a robust accuracy estimate on the test set, for all six algorithms we repeated the training-testing process with the optimal hyperparameter values using Repeated Stratified K-Fold Cross-Validation (Refaeilzadeh et al., 2016). Compared to a single train-test split, a cross-validation strategy yields more robust results with a small dataset like ours (in which the variance of the data is more evident). Because cross-validation tests the model's performance on different train-test splits, the result does not depend strongly on which instances fall into any particular split.
In regular Stratified K-Fold Cross-Validation, the sample is divided into k equally sized subsamples, each containing roughly the same proportion of the two class labels. Of the k subsamples (in this case, k = 10), a single one is retained as the test set and the remaining ones are used to train the model (the "training-testing process"). This process is repeated k times with different subsamples, so that each subsample is used exactly once as the test set.
Repeated Stratified K-Fold Cross-Validation adds a further step: after all k folds have been run with one set of randomized subsamples, the process restarts with a new random partition, and this is repeated n times (in our case, n = 20). Hence, we obtained a total of 200 scores (10 folds repeated 20 times), which were averaged to produce a single accuracy estimate (i.e., a macro average score).
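A sketch of this validation scheme with scikit-learn, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data: 73 participants, 12 baseline features.
X, y = make_classification(n_samples=73, n_features=12, random_state=0)

# 10 stratified folds repeated 20 times = 200 train/test scores,
# averaged into a single accuracy estimate, as described in the text.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
scores = cross_val_score(SVC(), X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()
```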
The performance of each model was evaluated using this accuracy score (Trappenberg, 2020). We also obtained recall and precision values to better characterize the model's learning. In our problem, recall is defined as the proportion of real positive cases (the "Improved" label) that were detected, and precision as the proportion of cases predicted positive that were truly positive. A good algorithm for our problem should perform similarly well on both metrics. We also constructed a receiver operating characteristic (ROC) curve and calculated the area under the curve (AUC; Marzban, 2004) to quantify the model's ability to predict the "Not improved" class.
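These metrics can all be collected in a single cross-validation pass; the sketch below uses synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in data: 73 participants, 12 baseline features.
X, y = make_classification(n_samples=73, n_features=12, random_state=0)

# Accuracy, precision, recall, and ROC AUC per fold; the roc_auc scorer
# uses SVC's decision_function as the continuous score for the curve.
results = cross_validate(
    SVC(), X, y, cv=5,
    scoring=["accuracy", "precision", "recall", "roc_auc"],
)
```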
As a final validation step, we performed a permutation test for every algorithm. This test is designed to evaluate the significance of a cross-validated score (Venkatraman, 2000). It permutes the targets to generate randomized data (i.e., with no relation between features and targets) and calculates the p-value against the null hypothesis that features and targets are independent. If p < 0.05, we can reject the null hypothesis and assume that they are related and that the model captures that relationship. To evaluate the significance of each model, we compared the cross-validated score of the algorithm on the randomized data to the score on the original data.
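scikit-learn implements this procedure as permutation_test_score; a sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import permutation_test_score
from sklearn.svm import SVC

# Synthetic stand-in data: 73 participants, 12 baseline features.
X, y = make_classification(n_samples=73, n_features=12, random_state=0)

# The targets are shuffled n_permutations times; the p-value is the
# fraction of permuted-label scores that reach the real-label score.
score, perm_scores, p_value = permutation_test_score(
    SVC(), X, y, cv=5, n_permutations=100, random_state=0
)
```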
Finally, we selected the algorithm with the highest average accuracy and performed a permutation feature importance analysis. This analysis quantifies every feature's contribution to the model's prediction. The final weight for every feature is calculated by averaging the decrease in the model's accuracy after randomly permuting that feature's values within a test set. When an important feature is permuted, the score should drop markedly, whereas permuting an unimportant feature should barely change it. To obtain robust results with our small dataset, the train-test split was performed with repeated stratified K-fold cross-validation, as described earlier in this section.
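The study used the eli5 library for this analysis; the same idea can be sketched with scikit-learn's permutation_importance, here on synthetic stand-in data and a single train-test split for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: 73 participants, 12 baseline features.
X, y = make_classification(n_samples=73, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = SVC().fit(X_tr, y_tr)
# Each feature's column in the test set is shuffled n_repeats times;
# the importance is the mean drop in accuracy versus the intact score.
result = permutation_importance(model, X_te, y_te, n_repeats=30, random_state=0)
importances, stds = result.importances_mean, result.importances_std
```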

Data Analysis
All the previous steps were performed in Python 3 using the Jupyter Lab interface. The Sklearn library was used to build the machine learning algorithms, and the eli5 library to perform the permutation feature importance analysis. Seaborn and Matplotlib were used for data visualization and plotting, and Scipy for statistical analysis.

RESULTS
After assigning a class to every subject, we obtained a reasonably balanced dataset with 73 instances (N improved = 29). We trained six classifiers and retrained them with the optimal parameters (individual accuracy values in Supplementary Table S2). The algorithm that best fit the data was a support vector classifier (SVC; Lau and Wu, 2003). The SVC model showed an average accuracy score of 0.67 and an AUC of 0.707, both better than random baseline performance (final hyperparameter values in Table 1; permutation importance test result in Supplementary Figure S1). The ROC curve is shown in Figure 2. Precision and recall values were similar between classes, as expected from a binary classification on a balanced dataset (Supplementary Table S3). The permutation test result suggests that the final SVC model captures a dependency between the features and the classes (p = 0.01 against the null hypothesis that features and targets are independent; see Supplementary Figure S1). The permutation test also suggested that the KNN model performed well at the task (accuracy = 0.65, p < 0.03), although not as well as the SVC model (Supplementary Figure S2).
To better understand the model's predictions, we obtained the permutation feature importance values (Table 2). This exploratory analysis describes which features are most relevant to the model's predictions. The standard deviation is large for every feature, suggesting that there are no differences among the features' contributions to the SVC. The cognitive training literature presents inconsistent findings on who benefits the most from an intervention, and part of it suggests that subjects with the worst Pretest measures are the ones who ultimately obtain the highest cognitive gains (Wang et al., 2019). If features differed between classes at Pretest, they could provide information to approach that open question. The permutation feature importance analysis suggested that all features contribute similarly to the model, but this could be caused by the small size of our dataset. We therefore tried to understand the direction of the variance explained by each feature; in other words, we wanted to know whether, for each variable, the players who showed better performance at Pretest were ultimately classified as "Improved" or as "Not improved." To analyze this, we compared performance on each of the 12 baseline variables between the "Improved" and "Not improved" classes. No difference was significant (Supplementary Table S4).

DISCUSSION
Our study was meant as a proof of concept in the use of machine learning tools to personalize cognitive training interventions. Focusing specifically on the first two steps suggested by Shani et al. (2019), the final purpose of our research line is to design better, more personalized training protocols and to help settle the debate on cognitive training efficacy. Following previous studies showing that some people benefit from a given training protocol more than others (Titz and Karbach, 2014; Karbach et al., 2017), we wanted to know whether a training gain could be predicted based solely on previous cognitive traits. Confirmation of this relationship could allow us to prevent some participants from completing a cognitive training protocol that will most likely not improve their EF.
With a small dataset (N = 73) from a past intervention performed with a free cognitive training software designed by our group (Goldin et al., 2013), we aimed to train a set of binary classifiers to predict the outcome of a particular stimulation protocol (i.e., whether a participant would benefit from it or not). We trained six machine learning algorithms, and two of them captured the dependency between features and targets: a k nearest neighbors classifier and a support vector classifier (SVC). The latter showed the best performance predicting the binary classes ("Improved" or "Not improved") based on previous individual cognitive traits. The results of the permutation test indicate that it captures at least a portion of the dependency between the features (individual differences measured in the Pretest stage) and the targets (whether the cognitive training was effective). The accuracy obtained is moderate (0.67) and can still be improved, but it is a promising result considering the high variability between features (i.e., the individual differences observed in the Pretest).
The AUC value reflects how well the model distinguishes between classes. The AUC obtained (0.707) indicates that our model is able to differentiate players who did not show improvements after cognitive training from those who did (the "Not improved" and "Improved" classes, respectively).
For this study we built a simple model to understand if we could predict the efficacy of a cognitive training protocol. Although we succeeded, the main limitation of our study is the small dataset. In the future, with more data, we might be able to dissect the "Not improved" class in two subgroups: those who mostly deteriorated (very few participants) and those who remained stable. Furthermore, it would also be interesting to differentiate, within the stable subgroup, those participants who really showed no differences between pre and posttest from those who, on the contrary, improved and deteriorated equally.
To understand the algorithm's prediction mechanism, we obtained the permutation feature importance ranking, which showed that no features should be discarded from the SVC model, because there is not enough evidence that some of them contribute more to the model's accuracy than the rest. This might be due to the high variability found in the features' importances, which in turn might stem from our small dataset, whose train-test splits change in every iteration and cause the large observed error. Although we cannot discard this explanation, the results suggest that all Pretest measures are valuable for predicting the efficacy of a cognitive training protocol and, until we can add more data, all cognitive tests prove informative and should be administered.
Although, as already mentioned, we still need to include more data to obtain a model general enough to be implemented in future interventions, we were able to build a comprehensive baseline model, and the results described here have implications for the design of personalized cognitive training protocols that take their expected effectiveness into account. For example, the performance of the model predicting the negative class (0.75, Supplementary Table S3) is particularly relevant because it will allow us to identify the subjects who most likely will not improve with one specific protocol. With this information, in the future we could individually target all cognitive training protocols.
Our main priority for the near future is to evaluate whether the model generalizes to data from other cognitive training interventions with similar pre/post tests. We would thus not only gain insight into the relationship between previous cognitive traits and training gains, but could also prevent a participant from completing a protocol that will not benefit them. Our ultimate goal is to ensure that each child benefits in the best possible way from their playing time with Mate Marote.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
Children's caregivers gave written consent to participate in the study, which was authorized by an institutional Ethical Committee (Centro de Educación Médica e Investigaciones Clínicas, Consejo Nacional de Investigaciones Científicas y Técnicas, protocol no. 486).

AUTHOR CONTRIBUTIONS
The planning of this analysis was done by MV together with AG. The data analysis was done by MV, with LB's careful supervision and assistance. The manuscript was written by MV, revised several times by AG and commented thoroughly by LB and DF. All authors contributed to the article and approved the submitted version.

FUNDING
This research was supported by Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET). The original data was obtained thanks, also, to the University of Buenos Aires, Human Frontiers, Ministry of Science of Argentina, Centro de Educación Médica e Investigaciones Clínicas, and Fundación Conectar.