Limited Effects of Set Shifting Training in Healthy Older Adults

Our ability to flexibly shift between tasks or task sets declines in older age. As this decline may have adverse effects on everyday life of elderly people, it is of interest to study whether set shifting ability can be trained, and if training effects generalize to other cognitive tasks. Here, we report a randomized controlled trial where healthy older adults trained set shifting with three different set shifting tasks. The training group (n = 17) performed adaptive set shifting training for 5 weeks with three training sessions a week (45 min/session), while the active control group (n = 16) played three different computer games for the same period. Both groups underwent extensive pre- and post-testing and a 1-year follow-up. Compared to the controls, the training group showed significant improvements on the trained tasks. Evidence for near transfer in the training group was very limited, as it was seen only on overall accuracy on an untrained computerized set shifting task. No far transfer to other cognitive functions was observed. One year later, the training group was still better on the trained tasks but the single near transfer effect had vanished. The results suggest that computerized set shifting training in the elderly shows long-lasting effects on the trained tasks but very little benefit in terms of generalization.

INTRODUCTION
Executive functions represent higher-level cognitive control processes that are crucial for everyday activities. Different models of the mental architecture of executive functions have been put forth, but a particularly influential model by Miyake et al. (2000) that is based on data from young adults postulates three major executive functions that are separable but strongly interrelated. These functions are (1) working memory updating, (2) inhibition of task-irrelevant responses, and (3) shifting between tasks and mental sets. A later study gave support for the tripartite model of executive functions also in older adults (Vaughan and Giovanello, 2010). All three functions, including the third one that is the focus of the present study, have been found to decline with older age (Cepeda et al., 2001; Kray et al., 2004; Zelazo et al., 2004). Little research interest has been directed to the trainability of set shifting in late adulthood, despite the fact that the ability to switch sets or tasks quickly is important in our everyday life (Monsell, 2003; Vaughan and Giovanello, 2010). Moreover, as the risk of cognitive impairment is enhanced in late adulthood due to, for example, dementing disorders, there is a need for finding suitable compensatory interventions for older adults. Therefore, we set out to study the effects of set shifting training in older adults with a 5-week adaptive training regime.
Although the generalizability of set shifting training in healthy elderly adults has been scarcely studied, there is an increasing number of studies on the effects of computerized working memory and multidomain training in healthy elderly (Buschkuehl et al., 2008; Dahlin et al., 2008b; Borella et al., 2010; Brehmer et al., 2011, 2012; Barnes et al., 2013; Zinke et al., 2013; Sandberg et al., 2014). In addition to improvement on the trained task itself, many of these recent training studies have shown that training can lead to near transfer, that is, improvement on tasks that are closely related to the intervention (e.g., working memory training leading to improved performance on another working memory task). Some findings suggest that training may even show far transfer, that is, generalization to other cognitive domains (e.g., working memory training leading to improved performance on a task measuring fluid intelligence). The results from a recent meta-analysis by Karbach and Verhaeghen (2014) indicated that training of working memory and executive functions was effective in older persons both with regard to near and far transfer, albeit the latter transfer effect was more modest. However, a re-analysis by  found no convincing support for far transfer following working memory training in older age.
The results from the few existing set shifting training studies investigating transfer effects have varied, but most of them have found near transfer effects (Minear and Shah, 2008; Karbach and Kray, 2009; Pereg et al., 2013; Soveri et al., 2013). To our knowledge, only Karbach and Kray (2009) have included elderly adults in their set shifting training study. They found near transfer effects in reaction times to a set shifting task structurally similar to the trained task for children, young adults, and older adults both with regard to switching cost and mixing cost when compared with the respective active control groups. Switching cost refers to mean reaction times (RTs) of switch trials minus mean RTs of non-switch trials within a mixed block, i.e., within the task block where switching takes place. Mixing cost refers to mean RTs of non-switch trials in a mixed block minus mean RTs of single task trials where no switching takes place (see also the next paragraph for more information about switching and mixing cost). The effects were most pronounced in children and older participants on the mixing cost. Additionally, far transfer was found to inhibition, verbal and spatial working memory, and fluid intelligence in all age groups. The training tasks of Karbach and Kray (2009) were later used by Zinke et al. (2012), who studied transfer effects of set shifting training in adolescents. They found that compared to controls, set shifting training resulted in transfer to the mixing cost in a similar but untrained set shifting task, but far transfer was limited to a speed task and a tendency toward faster performance in an updating task. Thus, their transfer results were more limited than those of Karbach and Kray (2009). Pereg et al. (2013), studying set shifting training in young adults, also used the same paradigm as Karbach and Kray (2009) and found only limited transfer effects.
One could also note that the results from a recent multidomain (updating, shifting, and inhibition) training study conducted with young and old adults showed only near transfer effects (Sandberg et al., 2014). Soveri et al. (2013) studied set shifting training with young adults and found no significant transfer effects. As regards performance on the trained tasks following set shifting training, only Soveri et al. (2013) reported these effects, finding that the training group outperformed the control group on an accuracy measure.
Set shifting represents a rather well-studied construct in cognitive psychology. In set shifting experiments, participants are first asked to perform simpler tasks (single tasks) with just one instruction in mind (for example, determining if the number in a number-letter pair is even or odd). In addition, the task includes a mixed task block where the participants have to perform different tasks depending on different properties of the stimuli. For example, they may need to determine if the number in a number-letter pair is even or odd when the pair is presented in a certain location, or to determine if the letter in the pair is a vowel or consonant when the pair is presented in another location. Key measures of set shifting ability include the switching cost and mixing cost measures, both in RTs and in accuracy, that were defined in the previous paragraph. A switching cost reflects the generally longer RTs and higher error rates on switching trials compared with repetition trials within the mixed block. In turn, the repetition trials of the mixed block tend to elicit slower and more error-prone responses than the single block trials. This effect is termed the mixing cost, and it is thought to reflect increased monitoring demands in the mixed block (Monsell, 2003). All in all, set shifting calls for several executive processes, such as shifting attention between different aspects of the stimulus, shifting between instructions, retrieving instructions from long-term memory and acting upon them, inhibiting the previous instruction or task set, and overall monitoring (Monsell, 2003). There is also a growing number of neuroimaging studies on set shifting (for a review, see e.g., Ruge et al., 2013). These studies have used different procedures that require participants to shift between varying stimulus-response mappings, spatial locations, abstract goals, etc.
Recent neuroimaging studies employing multiple types of shifts within a paradigm have revealed both domain-independent as well as domain-specific neural correlates of set shifting (Ravizza and Carter, 2008; Chiu and Yantis, 2009; Muhle-Karbe et al., 2014). One further theoretical division in set shifting tasks is the separation into perceptual vs. rule-based switching. Perceptual switching tasks require reorienting of visuospatial attention, that is, "what/where one should address one's attention," whereas rule-based switching tasks call for changing goal-directed information (rules), that is, "what one should do" (Ravizza and Carter, 2008). There is evidence that these two aspects of switching differ in terms of behavioral effects and neural recruitment, meaning that one cannot draw general conclusions only on the basis of a single type of a set shifting task.
As mentioned above, set shifting ability declines with age, but there are differences as to which types of switching cost are most affected by age (Verhaeghen and Cerella, 2002; Wasylyshyn et al., 2011). In their meta-analysis, Wasylyshyn et al. (2011) investigated the relationships between aging and switching and mixing costs (labeled as local vs. global switch cost in their paper). They found that in general, the switching cost does not seem to be affected by age. In other words, selective attention processes needed for the deactivation and activation of cognitive processes in order to perform switches do not seem to be age-sensitive. However, Wasylyshyn et al. (2011) found that the mixing cost that reflects the ability to maintain two task sets was enhanced in older age, and the effect was not explained by general age-related slowing. Wasylyshyn et al. (2011) speculated that the larger mixing cost in late adulthood could be related to impaired working memory, as previous studies have shown a strong relationship between age-related cognitive deficits and working memory processes. In other words, working memory demands might adversely affect set shifting performance in older adults.
The aim of the present study was to investigate transfer effects of set shifting training in older adults, as only one previous set shifting training study reviewed above has included elderly subjects (Karbach and Kray, 2009), and the extent of the generalization effects of set shifting training is controversial. First, we presumed that the training group would outperform the control group on the trained tasks. Second, in the light of previous studies, we expected to find near transfer effects to untrained set shifting tasks. Here, we also wanted to explore if the expected near transfer effects would show differential results regarding perceptual vs. rule-based set shifting. Third, far transfer effects were expected to be less plausible but possible. Measures of inhibition and working memory updating were included as far transfer measures, as these executive functions are related to set shifting (Miyake et al., 2000). Also, Karbach and Kray (2009) found transfer to these domains in their set shifting training study. Cognitive training studies often include measures of fluid intelligence as far transfer measures, as working memory updating and fluid intelligence are strongly correlated (Engle, 2002). In fact, Karbach and Kray (2009) reported that set shifting training generalized to fluid intelligence. Therefore, we also included measures of fluid intelligence among our far transfer measures. In addition, verbal fluency was included as a far transfer measure, as set shifting, working memory updating and response inhibition, in addition to lexical retrieval ability, are important components for optimal performance on verbal fluency tasks (Henry and Crawford, 2004;Flanagan et al., 2014;Shao et al., 2014), and therefore set shifting training might have an effect even on verbal fluency performance. 
In addition, memory measures were included as transfer measures because in aging research, one has argued for an interplay between executive and memory functions (Bisiacchi et al., 2008). Finally, the visuomotor speed measure was included as a measure of processing speed. In order to explore how long-lasting the possible training-induced effects were, a one-year follow-up was included.
In the present randomized controlled trial, we used a 15-session long adaptive training regime, and included an active control group. We also wanted to look more closely at perceptual vs. rule-based switching (cf. Ravizza and Carter, 2008), as that has not been investigated in previous set shifting training studies. Therefore, the set shifting measures in our pre-post test battery included both a perceptual part, where responses were given according to location of target, and a rule-based part, where responses required the retrieval of appropriate stimulus-response mappings. Untrained tasks tapping set shifting served as near transfer measures. Far transfer tasks included measures of inhibition, working memory updating, fluid intelligence, verbal fluency, episodic memory, and visuomotor speed. We included at least two tests per cognitive domain (apart from visuomotor speed) to ensure that possible transfer effects are not task-specific (see Shipstead et al., 2012).

Participants
Thirty-six healthy Finnish-speaking older adults recruited from various sources (an adult education center, a sports club for seniors, on bulletin boards etc.) volunteered for the experiment. Initial screening of potential participants was conducted over the phone to exclude those with self-reported neurological or psychiatric diseases. Thereafter, a short neuropsychological assessment was conducted, consisting of a semi-structured interview probing the participants' education, occupation, vision, hearing, possible illnesses, traumatic brain injuries, medication, alcohol and/or drug abuse, and possible alcohol intake during the 24-h period preceding the testing, as well as the Finnish version of Consortium to Establish a Registry for Alzheimer's Disease (CERAD; Welsh et al., 1994;Hänninen et al., 2010), the Logical Memory immediate and Logical Memory delayed subtests of Wechsler Memory Scale-Revised (WMS-R; Wechsler, 1996), the Similarities subtest of Wechsler Adult Intelligence Scale-Revised (WAIS-R; Wechsler, 1992), and Memo-Boston Naming Test (Memo-BNT; Karrasch et al., 2010). Before the neuropsychological assessment, the participants were asked to give their written informed consent. After the assessment, they filled in a Finnish translation of the Edinburgh Handedness Inventory (Oldfield, 1971), the Godin Leisure-Time Exercise Questionnaire 1 (Godin and Shephard, 1997), and Behavior Rating Inventory of Executive Function-Adult Version (BRIEF-A) 2 (Roth et al., 2005). They also filled in the Beck Depression Inventory-II (BDI-II; Beck et al., 2004) at home before the pretesting in order to rule out major depressive symptoms, as well as the PK-5 Personality test 3 (Psykologien Kustannus Oy, 2007). Two participants were excluded after the neuropsychological assessment as they performed below cutoff on several memory measures, and one participant dropped out during the training period, bringing the final number of participants to 33 (19 females and 14 males). 
The study was approved by the Ethics Committee of the Hospital District of Southwest Finland. The follow-up part of the study was approved by the Ethics Committee of the Departments of Psychology and Logopedics at the Åbo Akademi University. The participants did not receive monetary compensation for their participation.
The participants were first matched in pairs and then randomly allotted to the training group (n = 17; 10 women/7 men) or to the active control group (n = 16; 9 women/7 men). Variables that were taken into account during matching were education, WAIS-R (Wechsler, 1992) Similarities score 4 , age, and gender. The participants were not aware of their group membership. The groups were comparable in terms of years of education t (
1 The results of this questionnaire will be reported elsewhere.
2 BRIEF-A was also filled in at the end of the post testing.
3 The results of this test will be reported elsewhere.

Procedure
The experimental procedure including the tasks that were administered is depicted in Figure 1. A randomized controlled trial with a pretest-posttest design was used. Both the training group and the active control group participated in 15 training sessions, 45-60 min/session, three times a week for 5 weeks. The training took place at the university in groups with maximally four people, or individually when needed. All participants underwent the individually administered extensive pre-posttest battery. The posttest was performed maximally 11 days after training, and there was no group difference with regard to the number of days between the last training (or "pseudo-training") session and posttest t (31) = 0.535, p = 0.596. The training tasks were adaptive for both the training group and the control group, with the tasks becoming more difficult as the participants advanced. At pretest, at every training session, and at posttest, all participants rated their level of motivation (on a scale 1-5, where 1 = not at all motivated; 5 = very motivated) and fatigue/alertness (on a scale 1-5, where 1 = very tired; 5 = very alert).

Procedure and Training Tasks for the Training Group
Three computerized set shifting training tasks were used: (1) a Categorization Task (CT) that was a modified version of the Wisconsin Card Sorting Test (Berg, 1948;Soveri et al., 2013), (2) a Number-Letter (NL) task (Soveri et al., 2013), adapted from Rogers and Monsell (1995), and (3) a Dot-Figure (DF) task that was a modified non-verbal version of the Number-letter task.
All training tasks included four difficulty levels. To advance to the next difficulty level, the participants had to pass a level test. The level test was a version of the CT that was at the same difficulty level as the previous week's training task, and it was performed after the last training session of the week. The participants who made <20% errors advanced to the next difficulty level. Exceeding this error criterion would have implied staying at the same level for at least 1 week, but all participants advanced after each level test. As there were four difficulty levels, there were three level tests. After the participants had reached the highest difficulty level (level 4), they stayed on that level for the remaining training sessions. The participants were asked to perform as fast and as accurately as possible throughout training. The order of trials in the training tasks and the level tests was randomized.
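The weekly advancement rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `next_level` and the choice to express the level test as a raw error count over a trial count are hypothetical, and the text does not say whether missed responses were counted as errors.

```python
ERROR_CRITERION = 0.20  # from the text: fewer than 20% errors are needed to advance
MAX_LEVEL = 4           # from the text: four difficulty levels


def next_level(current_level: int, n_errors: int, n_trials: int) -> int:
    """Return the difficulty level for the following week (hypothetical helper).

    A participant advances by one level when the error rate on the level
    test is strictly below the criterion; otherwise the level is kept.
    Level 4 is the ceiling.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    error_rate = n_errors / n_trials
    if error_rate < ERROR_CRITERION and current_level < MAX_LEVEL:
        return current_level + 1
    return current_level
```

With four levels and one level test per week, three successful level tests take a participant from level 1 to level 4, which matches the three level tests mentioned in the text.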

The Categorization Task (CT) in Training
In this task, four stimulus cards appeared in a horizontal line at the top of the computer screen. The task was to match response cards, appearing one at a time, with the stimulus cards, based on different sorting rules that were given. At levels 1 and 2, the four stimulus cards included different shapes (cross, circle, triangle, or square), colors (red, blue, yellow, or black), and quantities (one, two, three, or four figures), and the figures were placed at the center of the cards. The task was to sort the response cards according to these features by deciding which stimulus card had figures of the same shape, color, or number, as the figures on the response cards, based on the sorting rule that was shown underneath each response card. At levels 3 and 4, location (upper left, upper right, lower left, or lower right corner) was added as a fourth sorting rule and feature on the stimulus cards. Thus, at levels 3 and 4, the figure was always placed in one of the four corners of the card (Figure 2A). At levels 1 and 3, the sorting rule changed randomly after four to six response cards and at levels 2 and 4 after one to three response cards, regardless of whether the responses were correct or incorrect. Level 1 employed three sorting rules with less frequent shifts (after 4-6 trials with altogether 270 trials), level 2 three sorting rules with more frequent shifts (after 1-3 trials with altogether 270 trials), level 3 four sorting rules with less frequent shifts (after 4-6 trials with altogether 300 trials) and level 4 four sorting rules with more frequent shifts (after 1-3 trials with altogether 288 trials). Task completion took about 15 min. Each difficulty level was preceded by a short practice sequence including all relevant sorting rules. The four response keys, 1, 2, 3, and 4 on the keyboard corresponded spatially to the stimulus cards. 
The sorting rule was presented for 1,000 ms at the beginning of each trial, and the response card was presented simultaneously until a response was given, or maximally for 3,000 ms. Before moving on to the next response card, audio-visual feedback was given for 1,500 ms. A correct response elicited a high pitch tone and a bright screen, while an incorrect response or no response elicited a low pitch tone and a dark screen. Feedback was given at all difficulty levels. After the task, the number of correct responses, incorrect responses and missed responses were shown on the computer screen. The task included two 1-min pauses, which the participants could end sooner by pressing the Enter key.
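The rule-change schedule of the CT (three or four sorting rules, with the rule changing after 4-6 or 1-3 response cards depending on level) could be generated along the following lines. This is a minimal sketch, not the authors' software: the function name `rule_schedule`, the uniform choice of run length, and the truncation of the final run to hit the stated trial count are all assumptions.

```python
import random

BASE_RULES = ["shape", "color", "quantity"]
# Levels 3 and 4 add location as a fourth sorting rule (from the text).
RULES_BY_LEVEL = {1: BASE_RULES, 2: BASE_RULES,
                  3: BASE_RULES + ["location"], 4: BASE_RULES + ["location"]}
# Number of response cards before the rule changes (from the text).
RUN_LENGTHS = {1: (4, 6), 2: (1, 3), 3: (4, 6), 4: (1, 3)}
# Total trials per level (from the text).
N_TRIALS = {1: 270, 2: 270, 3: 300, 4: 288}


def rule_schedule(level, seed=None):
    """Return the sorting rule for every trial of one CT training run."""
    rng = random.Random(seed)
    rules = RULES_BY_LEVEL[level]
    lo, hi = RUN_LENGTHS[level]
    schedule = []
    current = rng.choice(rules)
    while len(schedule) < N_TRIALS[level]:
        run = rng.randint(lo, hi)  # assumption: run lengths drawn uniformly
        schedule.extend([current] * run)
        # The next rule always differs from the current one, so each run
        # boundary is a genuine switch.
        current = rng.choice([r for r in rules if r != current])
    return schedule[:N_TRIALS[level]]  # assumption: the last run may be cut short
```

Because the rule changes regardless of response correctness, the schedule can be fixed before the session starts and does not depend on the participant's performance.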

The Number-Letter (NL) Task in Training
At levels 1 and 2 in this task, black number-letter pairs on white background were presented in one of two squares on the computer screen, one square above the other (Figure 2B). When the number-letter pair was presented in the upper square, the participant had to determine if the number was even or odd, and when it was presented in the lower square, the task was to determine if the letter was a vowel or a consonant. Thus, the location of the number-letter pair served as a cue for which task to perform. Number-letter pairs were constructed by combining the vowels A, E, I, U and the consonants G, K, M, R, with the even numbers 2, 4, 6, 8 and the odd numbers 3, 5, 7, 9. The participants could not anticipate when a number-letter pair shifted from one square to another (switching trial), or when it was shown in the same square as the previous pair (repetition trial, Figure 2B). At levels 3 and 4, a third square was added that was placed underneath the upper two squares, and the number-letter pairs appeared in red or blue. If the pair was presented in the lowest square, the participant had to decide whether the color of the pair was red or blue. Two response keys on the computer keyboard were used, with one response key for vowels, even numbers and red color, and the other for consonants, odd numbers, and blue color. As in the CT, switching trials occurred less frequently at levels 1 and 3 (after 3-5 trials) and more frequently at levels 2 and 4 (after 1-3 trials), thus yielding two squares/less frequent shifts at level 1, two squares/more frequent shifts at level 2, three squares/less frequent shifts at level 3, and three squares/more frequent shifts at level 4. The number of trials at each level was 288, and it took ∼15 min to complete the task. Each difficulty level was preceded by a short practice sequence. Every trial began with a blank screen. After 150 ms, a fixation cross appeared in the middle of the screen, being replaced by two or three squares (one of which contained a number-letter pair) after 300 ms. The squares remained on the screen until a response had been given or 3,000 ms had passed. Audiovisual feedback and information about the responses were given in the same manner as in the CT task, and two 1-min pauses were included that could be cut short by pressing Enter.
FIGURE 1 | The experimental procedure including tasks that were administered. For more detailed information on the procedure and the tasks, see Section Material and Methods.
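The stimulus set and response mapping of the NL task can be made concrete with a short Python sketch. Several details here are assumptions made for illustration only: the square labels `"top"`, `"middle"`, and `"bottom"`, the key labels `KEY_A` and `KEY_B`, the number-before-letter string format, and the helper names `correct_key` and `trial_types`.

```python
VOWELS, CONSONANTS = "AEIU", "GKMR"   # letters listed in the text
EVEN, ODD = "2468", "3579"            # numbers listed in the text

# Every number combined with every letter: 8 x 8 = 64 possible pairs.
PAIRS = [n + l for n in EVEN + ODD for l in VOWELS + CONSONANTS]

KEY_A = "A"  # hypothetical label: vowels, even numbers, red
KEY_B = "B"  # hypothetical label: consonants, odd numbers, blue

# The square in which the pair appears cues the task; the bottom
# square exists only at levels 3 and 4.
TASK_BY_SQUARE = {"top": "number", "middle": "letter", "bottom": "color"}


def correct_key(pair, square, color=None):
    """Return the correct response key for one trial."""
    number, letter = pair[0], pair[1]
    task = TASK_BY_SQUARE[square]
    if task == "number":
        return KEY_A if number in EVEN else KEY_B
    if task == "letter":
        return KEY_A if letter in VOWELS else KEY_B
    return KEY_A if color == "red" else KEY_B


def trial_types(squares):
    """Label each trial as switch or repetition from the square sequence."""
    labels = ["first"]  # the first trial has no predecessor
    for prev, cur in zip(squares, squares[1:]):
        labels.append("repetition" if cur == prev else "switch")
    return labels
```

For example, `correct_key("4K", "top")` returns `KEY_A` (even number), while the same pair in the middle square requires `KEY_B` (consonant).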

The Dot-Figure (DF) Task in Training
This task was identical to the NL task, except that instead of number-letter pairs, dot-figure pairs were used (Figure 2C). At levels 1 and 2, a dot-figure pair presented in the upper square prompted the participant to decide whether the number of dots (that varied between 1 and 4 dots) was even or odd. When the dot-figure pair was presented in the lower square, the task was to decide whether the figure (a triangle, square, circle, or oval) had an angular or round shape. At levels 3 and 4, a third square was added under the upper two squares (Figure 2C), and at these levels the dot-figure pairs appeared in red or blue. If the pair was presented in the lowest square, the participants had to decide whether the pair was red or blue. Two response keys on the computer keyboard were used: one for even number of dots/angular shape/red color, and the other for odd number of dots/round shape/blue color. The four difficulty levels followed the same logic as in the NL task.

Pseudo-Training Procedure & Computer Games for the Active Control Group
Three puzzle computer games were used: (1) Tetris (Tetris Worlds, THQ), (2) Bejeweled (Bejeweled 2, PopCap Games), and (3) Angry Birds (version 3.0.0, Rovio Entertainment Ltd). Each game was played for 15 min per session. The games were selected based on their limited demands on set shifting and other executive functions, as well as their appeal to a wide audience. There were 3 difficulty levels in Tetris and Angry Birds. Bejeweled did not have separate difficulty levels, but the game became more difficult due to time pressure, so that the participants had to respond faster as they advanced in the game. Tetris served as a criterion task, in other words, when the participants advanced in Tetris, they could move to the next difficulty level in Angry Birds as well. The participants were asked to perform as fast and as accurately as possible throughout training.
FIGURE 2 | (A) The Categorization Task. The more difficult version of the task (level 3 and 4). The stimulus cards are at the top of the screen and the response cards appear at the bottom half of the screen, and the written cue is given underneath the response card at each trial. In (I) the given sorting rule is "quantity," that is, the participant should press "3." In the repetition trial (II), the participant should press "1," and in the switching trial (III) the participant is given a new cue "location" and should press "3." (B) The Number-Letter task. The easier version of the task (level 1 and 2). Two squares are placed vertically on the screen and the number-letter pair is presented in either one of them. In (I) the participant's task is to decide whether the number is "even" (correct response) or "odd." In the repetition trial (II) the correct response is "odd," and in the shifting trial (III) the participant is to decide whether the letter is a "vowel" or a "consonant." (C) The Dot-Figure task. The more difficult version of the task (level 3 and 4). Three squares are placed vertically on the screen and the dot-figure pair is presented in one of them. In (I) the participant's task is to decide whether the number of dots is "even" or "odd" (correct). In the switching trial (II) the participant's task is to decide whether the figure is "angular" (correct) or "round," and in the switching trial (III) the participant is to decide whether the figure is "red" or "blue" (correct). (D) The perceptual part of the OMO task (mixed task), where the participant responds by pressing the key that corresponds to the spatial location of the odd stimulus. In trial (I), the correct choice is the middle key that corresponds to the cross. In the repetition trial (II) the correct response is the figure to the right, that is, the parallelogram. In the switching trial (III) the correct choice is the letter "v." (E) The rule-based part of the OMO task (mixed task), where the participant responds by pressing a previously memorized key for that letter or figure (1 = z and triangle, 2 = x and square, and 3 = c and circle). In trial (I), the correct choice is the square. In the repetition trial (II) the correct choice is again the square. In the switching trial (III) the correct response is the letter "x."

Tetris
In Tetris, geometric shapes composed of square blocks fall down a matrix, and the participant's task is to move these shapes with the aim of creating a horizontal line without gaps. When such a line is created, it disappears and the blocks above it fall. When enough lines are cleared, a new level is entered. Difficulty level 1 represented the easiest version of Tetris, and if the participant improved his/her performance in this version during sessions 1-3, the participant moved to the next difficulty level in session 4. Otherwise the participant stayed at the same level until his/her performance improved, whereafter the participant moved to the next level in either session 8 or 12. When the participant improved his/her performance on level 2, the participant moved to the most difficult level in either session 8 or 12, and played at this level for the remaining sessions.

Bejeweled
In this game, the participant was to swap one gem with an adjacent gem to form a chain of 3 or more gems either horizontally or vertically. Gems disappeared when chains were formed and gaps were filled by gems falling from the top. Bejeweled was played in a so-called action mode, with the game becoming gradually more difficult due to time pressure.

Angry Birds
Here the participant used a slingshot to launch birds at pigs in different environments, aiming to destroy all the pigs. As the participant advanced, new sorts of birds became available that had special abilities, which the participant could activate. This game had three difficulty levels. If the participants advanced to the next difficulty level in the criterion task, namely Tetris, they moved to the next difficulty level in Angry Birds as well.

Pre/Post Testing
We employed an extensive cognitive test battery including pre/posttest versions of all three training tasks, and tests measuring near and far transfer. Near transfer effects were measured by two set shifting tasks: a modified version of a set shifting test ("odd-man-out" test) previously used by Ravizza and Carter (2008), and the Trail Making Test (A&B; Tombaugh, 2004) 5. Based on the model of Miyake et al. (2000), tasks measuring inhibition and working memory updating were regarded as far transfer tasks, as were tasks measuring fluid intelligence, verbal fluency and visuomotor speed. Far transfer to inhibition was measured by the Simon task (Simon and Rudell, 1967) and the Stroop task (Lezak et al., 2012). Working memory updating was tapped by the visual n-back task (Cohen et al., 1994) and the WAIS-R (Wechsler, 1992) Digit span subtest (only the backward span is reported here). Fluid intelligence was assessed by the Culture Fair Intelligence Test (CFIT, 1973) and the WAIS-R Block design subtest. Visuomotor speed was measured by the WAIS-R Digit symbol subtest. Furthermore, verbal fluency that taps executive functioning was tested by phonological fluency and semantic fluency tasks. Two episodic memory tests (CERAD wordlist learning/delayed recall and WMS-R Logical Memory immediate/delayed recall) were also performed. Semantic fluency and the memory measures were included in the neuropsychological assessment that was performed already before the pretest session. At posttest, the CERAD wordlist learning and the WMS-R immediate recall were always administered first due to the delayed recall, but the remaining pre/posttests were administered in a random order, both at pre- and posttest 6. The participants were asked to perform as fast and as accurately as possible when the task at hand required it.

Training Tasks at Pre/Posttest
The pre/posttest version of the Categorization Task (CT) represented the most difficult level (level 4) of the training task. Four single tasks were always performed first (20 trials each). The sorting rule (shape, color, quantity, or location) was always the same within a single task. The single task was preceded by a practice sequence, in which all the four sorting rules were presented twice, and the practice sequence was presented until the participant made fewer than 25% errors. After the single tasks, the mixed task block including switching trials was administered, in which the sorting rule changed after 1-3 trials. The number of trials was 144 (72 switching trials, 72 repetition trials). The mixed task block was preceded by a short practice sequence that was repeated once if the participant made more than 20% errors. The order of trials was randomized, but each sorting category and each number of repetitions of the same sorting rule (one, two, or three trials) was presented the same number of times. Audiovisual feedback was also given in the pre/posttest version of the task. In order to control for situations where the participant might have made a perseveration error that by chance led to the correct answer, the cards were sorted so that the sorting rule could not match with both the previous and the present sorting category. The switching cost (the difference between switching trials and repetition trials within the mixed task block) and the mixing cost (the difference between repetition trials and single-task trials) in RTs and in the proportions of correct answers were calculated for the CT task. The Number-Letter (NL) task and the Dot-Figure (DF) task were administered as follows at pre/posttest. Both tasks started with three single task blocks (32 trials each). 
In the first single task, number-letter/dot-figure pairs were always shown in the uppermost square (even number/even number of dots or odd number/odd number of dots), in the second single task in the middle square (vowel/angular shape or consonant/round shape), and in the third single task in the lowest square (red or blue color). In all single task blocks, there was an equal number of trials for the two response options. Each single task was preceded by a short practice sequence. If the participant made more than 20% errors, the practice sequence was repeated once. After the single tasks, the mixed task block was performed with 72 switching trials and 72 repetition trials. The order of trials was randomized, and the sequences were balanced for the number of trials per square and for the number of occurrences for each response alternative. The mixed task block was also preceded by a practice sequence that the participant could perform at their own pace (max. 10 s per trial). This practice sequence was repeated until the participant made fewer than 20% errors, whereafter a practice sequence with the same ISI as in the actual task was administered once. Audiovisual feedback was given. Similar to the CT, the switching cost (the difference between switching trials and repetition trials) and the mixing cost (the difference between repetition trials and single-task trials) in RTs and in the proportions of correct answers were calculated for the NL and DF tasks. All RT measures for the training tasks were based on correct responses only.
To provide more global and possibly more reliable measures of the training tasks, the switching cost and the mixing cost in RTs and in the proportions of correct answers were averaged across the three training tasks. Combining tasks that differ in terms of paradigm and content but nevertheless aim to tap the same domain (here set shifting) has been argued to be a better strategy than combining only homogeneous tasks (Schmiedek et al., 2014). The following composites were constructed for the pre/post analyses: composite switching cost in RTs, composite mixing cost in RTs, composite switching cost of proportions of correct answers, and composite mixing cost of proportions of correct answers.
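The cost definitions above can be illustrated with a minimal sketch. The trial format (a dict with `type`, `rt`, and `correct`), the function names, and the sign convention for accuracy costs (switching minus repetition, repetition minus single) are assumptions made for illustration; the text does not specify the direction of the accuracy differences or any data format.

```python
from statistics import mean

def task_costs(trials):
    """Switching and mixing costs for one task.

    trials: list of dicts with hypothetical keys
      'type'    -> 'single', 'repetition', or 'switch'
      'rt'      -> reaction time in ms
      'correct' -> bool
    """
    def mean_rt(kind):
        # RT measures use correct responses only, as stated in the text.
        return mean(t["rt"] for t in trials if t["type"] == kind and t["correct"])

    def accuracy(kind):
        selected = [t["correct"] for t in trials if t["type"] == kind]
        return sum(selected) / len(selected)

    return {
        "switch_cost_rt": mean_rt("switch") - mean_rt("repetition"),
        "mixing_cost_rt": mean_rt("repetition") - mean_rt("single"),
        # Direction of the accuracy differences is an assumption.
        "switch_cost_acc": accuracy("switch") - accuracy("repetition"),
        "mixing_cost_acc": accuracy("repetition") - accuracy("single"),
    }

def composite_costs(per_task):
    """Average each cost measure across tasks (e.g., CT, NL, and DF)."""
    return {key: mean(c[key] for c in per_task) for key in per_task[0]}
```

With this convention, a positive RT cost means slower responses on the more demanding trial type, and a negative accuracy cost means fewer correct answers on it.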

Near Transfer Measures (Set Shifting)
An "odd-man-out" (OMO) task (adapted from Ravizza and Carter, 2008) was used in this study as a near transfer measure of set shifting. The task taps both perceptual and rule-based set shifting, which is of interest here concerning the nature of possible near transfer, as our training tasks required both visuospatial attention and the use of contextual rules. Sets of letters and shapes served as stimuli in the OMO task. The order of trials was randomized. In the perceptual part of the task, letters were presented inside figures, three in a row (Figure 2D). The letters used in this task were B, N, and V, and the figures used were a circle, a cross, and a parallelogram. The participant was to identify which letter or figure did not match with the other letters or shapes. In a switching trial, the odd stimulus shifted from letter to figure or vice versa. When the odd stimulus was a letter, all the shapes were different and vice versa. Responses in the perceptual task corresponded to the spatial location of the odd stimulus. For example, if the letter or shape in the middle was the odd one, the participant responded by pressing the middle key of the three response keys (1, 2, and 3 on the keyboard). The perceptual task started with two single task blocks (32 trials each). In the first single task block it was always a letter that was the odd one, while in the second single task block it was always a shape. The mixed task block in the perceptual task consisted of 144 trials (72 switching trials, 72 repetition trials). In 72 trials the odd stimulus was a letter while in 72 trials it was a shape. Shifts occurred after one to three trials. Both the single tasks and the mixed task were preceded by a short practice sequence that was repeated once if the participant made more than 20% errors. 
In the rule-based part, the participant's task was to press a key that had previously been memorized for that letter or shape (1 = z and triangle, 2 = x and square, and 3 = c and circle; Figure 2E). In this part of the task, only one feature set was present, i.e., three letters in a row or three figures in a row were shown at a time, and the letters were thus not inside the figure as in the perceptual part of the task. Right before performing the rule-based part of the task, a practice task was performed, requiring the participant to memorize the stimulus-response mappings. In the practice task, the participant received one stimulus at a time, either a letter or a shape, and the task was to respond according to the correct response mapping for that stimulus. First the participant was allowed to perform at their own pace (maximum 10 s per stimulus), whereafter the practice task was given with the same ISI as the actual task. This was repeated until the participant made fewer than 20% errors. The participant received auditory and visual feedback in the practice task: a correct response elicited a high pitch tone and a bright screen, while an incorrect response or no response elicited a low pitch tone and a dark screen. When the participant had memorized the response mappings, the rule-based part with two single task blocks and a mixed task block was performed. Both the single task and the mixed task blocks were preceded by a short practice sequence that was repeated once if the participant made more than 20% errors. In the first single task block, only letters were shown, and the participant had to identify the odd stimulus and respond according to the memorized keys (1 = z, 2 = x, and 3 = c). In the second single task block, only figures were presented, and the response was given according to the memorized keys (1 = triangle, 2 = square, and 3 = circle). 
In the mixed task block, feature sets (either letters or shapes) alternated (144 trials, of which 72 were switching trials and 72 repetition trials), with shifts occurring after 1-3 trials. All task blocks, both perceptual and rule-based, began with a blank screen. After 150 ms, a fixation cross appeared in the middle of the screen. The fixation cross was replaced by the feature set after 300 ms. The feature set remained on the screen until a response had been given or until 3,000 ms had passed. In the OMO task, the dependent variables were the switching and mixing cost in RTs and the proportion of correct responses in the perceptual and rule-based task, respectively. We also explored possible overall task effects by including the average RTs and accuracy across all task blocks as dependent measures in the analyses. All RT measures were based on correct responses only.
The second set shifting measure was the Trail Making Test A&B (Tombaugh, 2004). In part A of the test, the participant's task was to draw in ascending order a line as quickly as possible between numbers 1 and 25 that were placed inside circles on a paper sheet. In part B, the circles contained either numbers or letters, and the task was to draw the line alternating between numbers and letters in the sequence, 1-A-2-B etc., as quickly as possible. The processing cost caused by the alternating between numbers and letters, that is, the total completion time of part B (in seconds) minus total completion time of part A, was analyzed.

Far Transfer Measures

Inhibition
The computerized Simon task (Simon and Rudell, 1967) and a paper version of the Stroop Test (Lezak et al., 2012) were used as far transfer measures of inhibition. In the Simon task, a red or a blue square was presented on either side of the computer screen, and the task was to respond according to the color of the square, irrespective of its position, which either matched or did not match the position of the correct response key. The task was performed by pressing the left key with the left index finger when the square was blue and the right key with the right index finger when the square was red. The task included both congruent trials (square on the same side as the relevant response key, e.g., red square on the right side) and incongruent trials (square on the opposite side of the relevant response key, e.g., blue square on the right side). Out of the 100 trials, half were congruent and half incongruent. The order of the trials was randomized. The trials were divided into four equally long blocks with a 5-s break in-between. A practice sequence (eight trials) was administered before starting the actual task. A fixation cross was presented at the beginning of each trial. The cross disappeared after 800 ms, replaced by a blank screen for 250 ms. After this, a blue or red square was presented on either the left or the right side of the screen. The stimulus remained on the screen until a response key was pressed or until 1,000 ms had passed. Then the screen went blank for 500 ms before moving on to the next trial. The dependent variables were the Simon effect in RTs and in proportion of correct responses. The Simon effect is the difference between incongruent and congruent trials, and taps the processing cost related to the incompatible location of the stimulus. 
In the Stroop task, the dependent variable was the Stroop effect, that is, the difference in completion time between naming ink color of conflicting color words (100 trials on a paper sheet) and naming the ink color of sequences of the letter "x" (90 trials on a paper sheet).
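As an illustration of how the Simon effect in RTs follows from the color-to-key mapping described above, consider the following sketch. The tuple layout of a trial and the function names are hypothetical, and restricting RTs to correct responses is an assumption here: the text states that rule explicitly for the training tasks and the OMO task, but not for the Simon task.

```python
from statistics import mean

# Color-to-key mapping taken from the task description.
KEY_FOR_COLOR = {"blue": "left", "red": "right"}

def is_congruent(color, side):
    """A trial is congruent when the square appears on the side of its response key."""
    return KEY_FOR_COLOR[color] == side

def simon_effect_rt(trials):
    """Mean RT on incongruent trials minus mean RT on congruent trials.

    trials: iterable of (color, side, rt_ms, correct) tuples (hypothetical layout).
    Only correct responses are used (an assumption for this task).
    """
    congruent = [rt for color, side, rt, ok in trials if ok and is_congruent(color, side)]
    incongruent = [rt for color, side, rt, ok in trials if ok and not is_congruent(color, side)]
    return mean(incongruent) - mean(congruent)
```

A larger positive value indicates a greater processing cost from the incompatible stimulus location.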

Working memory updating
Possible transfer effects to working memory updating were measured by the computerized n-back task (Cohen et al., 1994) and the Digit span backward subtest of the WAIS-R (Wechsler, 1992). In the n-back task, numbers from one to nine were presented one at a time at the center of the screen. The task was to remember the previous number (1-back) or the one presented two trials back (2-back). Two response keys were used: the left key for a target, that is, the number was the same as the previous number (1-back) or the one two trials back (2-back), and the right key for a non-target, that is, the number did not match. The total number of trials was 240 (120 1-back trials, 120 2-back trials). The numbers were divided into 12 blocks of 20 trials each, so that six blocks were 1-back blocks and six were 2-back blocks. The presentation order of the stimuli was pseudorandomized. The 1-back blocks consisted of nine targets and 11 non-targets, and the 2-back blocks included six targets and 14 non-targets. Before each block, a written prompt informing whether the following block was a 1-back or a 2-back block appeared on the screen together with a picture of a hand indicating the corresponding response keys. After 5,000 ms, the first number was shown, remaining on the screen for 1,500 ms. After this, the number was replaced by a fixation cross for 450 ms. The fixation cross was then followed by the next number. On each trial, the response had to be given within 2,000 ms. The first trial in the 1-back condition and the first two trials in the 2-back condition were excluded from the analysis. The differences in RTs and in the proportion of correct responses between the 2-back and the 1-back conditions were used as the dependent variables for this task. These measures reflect the processing cost caused by the demands on working memory updating in the 2-back condition. In the Digit span backward test, the task was to orally repeat sequences of digits in reversed order. The total score for backward span was analyzed.
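The n-back scoring rule can be sketched as follows. The function names are hypothetical, and the sketch assumes that the exclusion of initial trials applies to every block, which is one plausible reading of the description rather than something the text states outright.

```python
def nback_targets(stimuli, n):
    """Target status for each scored trial in one block.

    The first n trials have no stimulus n positions back and are skipped,
    mirroring the exclusion of the first trial (1-back) or the first two
    trials (2-back) from the analysis.
    """
    return [stimuli[i] == stimuli[i - n] for i in range(n, len(stimuli))]

def scored_trials_per_condition(blocks_per_condition=6, block_length=20):
    """Number of analyzed trials per condition, assuming per-block exclusion."""
    return {n: blocks_per_condition * (block_length - n) for n in (1, 2)}
```

Under that assumption, 6 × 19 = 114 trials would enter the 1-back analysis and 6 × 18 = 108 trials the 2-back analysis.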

Fluid intelligence
Fluid intelligence was measured using the Culture Fair Intelligence Test (CFIT, 1973) scales 2 and 4, and the WAIS-R subtest Block design. In CFIT, the participant's task was to find logical relationships between different shapes and figures that were presented on paper. Performance time was limited to 240 s for scale 2 and 180 s for scale 4. Each scale had two equivalent versions, A and B. At pretest, version A of scales 2 and 4 was administered to 17 of the participants (roughly the same number of participants from both groups) and version B to the remaining 16 participants, and vice versa at posttest. The dependent variable was the sum of correct responses (scale 2 + scale 4). In the Block design test, the total score of the 9 trials of advancing difficulty (maximum score 51) was analyzed.

Verbal fluency
Semantic fluency (producing as many animal names as possible within 60 s) that was included in the neuropsychological screening was performed at posttest as well, and thus also used as a transfer measure. Also phonological fluency (producing words beginning with the phoneme "s" within 60 s) was used as a transfer measure. For both fluency tasks, the number of correct responses was used as the dependent variable.

Episodic memory
The CERAD (Welsh et al., 1994;Hänninen et al., 2010) wordlist learning and delayed recall and the WMS-R (Wechsler, 1992) Logical Memory immediate and delayed recall, which were included in the neuropsychological screening, were performed also at posttest and thus used as transfer measures. For the CERAD, we analyzed wordlist learning sum score, delayed recall raw score, and savings score in percent computed by dividing the number of words retrieved on delayed recall by the number of words recalled on the third learning trial (x 100). For the WMS-R Logical Memory, we analyzed immediate and delayed recall.
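The savings score described above is a simple percentage. A minimal sketch, with a hypothetical function name and an assumed guard against a zero denominator (the text does not say how such a case was handled):

```python
def cerad_savings_percent(delayed_recall, third_trial_recall):
    """Words retrieved at delayed recall as a percentage of words
    recalled on the third learning trial."""
    if third_trial_recall == 0:
        # Handling of this case is an assumption; the score is undefined here.
        raise ValueError("savings score is undefined when third-trial recall is 0")
    return 100 * delayed_recall / third_trial_recall
```

For example, a participant who recalled 8 words on the third learning trial and 7 words after the delay would have a savings score of 87.5%.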

Follow-Up
The follow-up was conducted 1 year after posttest (±3 weeks). One participant from the control group declined to participate, and thus 32 out of 33 participants were tested in the follow-up. The follow-up was otherwise similar to the posttest, but the following tests were not included: the Simon task, the visual n-back task, the WAIS-R Digit span, Block design and Digit symbol subtests, and the Culture Fair Intelligence Test. The whole CERAD (Welsh et al., 1994; Hänninen et al., 2010) was conducted at follow-up in order to control for possible memory deterioration. Before the follow-up session, the participants were asked to give their written informed consent and after the assessment, they filled in the Godin Leisure-Time Exercise Questionnaire (Godin and Shephard, 1997), and BRIEF-A (Roth et al., 2005), as well as BDI-II (Beck et al., 2004). Motivation and alertness were also surveyed, and some questions concerning the participants' gaming/computer habits were included as well.

Statistical Analyses
ANCOVAs with posttest performance as the dependent variable, pretest performance as the covariate, and group as the between-subjects factor were run on all dependent measures (see Dimitrov and Rumrill, 2003; Senn, 2006). Effect sizes reported as adjusted Cohen's d were calculated using estimated values from the ANCOVA model. For the follow-up, ANCOVAs with follow-up performance as the dependent variable, pretest performance as the covariate, and group as the between-subjects factor were run only on the dependent measures of the training tasks and the OMO task that had been statistically significant or had an F > 2 at posttest. Each task was reviewed independently regarding possible exclusion of individual cases. In all tests, the exclusion criteria were being an extreme outlier in accuracy or RTs at pretest or showing evidence of misunderstanding test instructions. Concerning accuracy in the computerized tests, outliers were defined as chance level performance. Regarding RTs in the computerized tests and performance in the paper and pencil tests, outliers were defined as performance lying more than three times the interquartile range below the 1st or above the 3rd quartile, respectively. There were no outliers regarding RTs.
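The interquartile-range criterion for extreme outliers can be sketched as below. The function name is hypothetical, and the quartile method (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption: statistical packages compute quartiles in slightly different ways, and the text does not say which definition was used.

```python
from statistics import quantiles

def extreme_outliers(values, k=3.0):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    # Quartile definition is an assumption; other methods shift the fences slightly.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]
```

With k = 3, only very extreme scores are flagged, which is more conservative than the common 1.5 × IQR rule.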

Training Results
The means and standard deviations for the composite training scores at pre/posttest and at the follow-up are presented in Table 1, and they are also shown separately for each training task in

Tasks Measuring Near Transfer (Set Shifting)
With regard to the near transfer tasks, we corrected for multiple comparisons by setting the alpha level to 0.05/13 = 0.0038
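The Bonferroni-corrected threshold is the nominal alpha divided by the number of comparisons. The sketch below uses hypothetical function names and, as inputs, p-values that are reported in this article.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison alpha level under Bonferroni correction."""
    return alpha / n_comparisons

def survives_correction(p_value, alpha=0.05, n_comparisons=1):
    """True if a p-value falls below the corrected alpha level."""
    return p_value < bonferroni_alpha(alpha, n_comparisons)

# Near transfer at posttest: 13 comparisons.
near_alpha = bonferroni_alpha(0.05, 13)  # approximately 0.0038
# Overall OMO p-values reported at posttest (0.111, 0.068, 0.128) do not pass.
posttest_flags = [survives_correction(p, n_comparisons=13) for p in (0.111, 0.068, 0.128)]
```

The same calculation with four comparisons gives the 0.0125 threshold used for the follow-up analyses.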

The OMO Task: Perceptual Subtest
No significant near transfer effects were seen in the perceptual subtest of the OMO task. Neither the switching cost nor the mixing cost in RTs or accuracy showed significant training-related group differences (Fs < 1). ANCOVAs were also performed on overall performance regarding both RTs and accuracy. The control group performed somewhat faster than the training group in absolute terms, but this difference was not significant, F (1, 30) = 2.696, p = 0.111, d = 0.59, 95% CI [−0.14, 1.33]. The training group performed somewhat better regarding overall accuracy compared with the control group, but this difference did not reach statistical significance, F (1, 30) = 3.596, p = 0.068, d = 0.67, 95% CI [−0.05, 1.38] (Table 5).

The OMO Task: Rule-Based Subtest
The switching cost and the mixing cost in RTs or accuracy did not differ between groups at posttest as analyzed by ANCOVAs (all Fs < 2). As above, ANCOVAs were performed also on overall performance, both for RTs and accuracy. No significant group difference in overall RTs was found, F (1, 30) = 2.444, p = 0.128, d = 0.56, 95% CI [−0.17, 1.29]. Regarding overall accuracy, the training group outperformed the control group (Figure 3), and this finding also survived the Bonferroni correction. In other words, near transfer effects were seen on the rule-based part of the OMO task regarding accuracy.

Trail Making Test A & B
No significant group difference on the switching effect (TMT B minus A) at posttest was seen (F < 1; Table 6).

Inhibition
No generalization of training gains was observed on the Simon task, as the main effect of group was non-significant at posttest both with regard to the Simon effect in RTs (F < 1) and accuracy (F < 2). Nor did the main effect of group on the incongruency effect (Table 8).
Visuomotor speed was measured with the WAIS-R Digit Symbol subtest that did not show any group difference at posttest (F < 1; Table 8).

Motivation, Alertness and Subjective Set Shifting Ability
In order to investigate possible changes in motivation or alertness across the intervention, the relevant survey responses were analyzed with a mixed model ANOVA with motivation/alertness (3 levels: motivation/alertness at pretest, across training sessions, and at posttest) as within-subjects factors and group as a between-subjects factor. A significant main effect of motivation was found, F (2, 62) = 5.265, p = 0.008, as the participants were more motivated at pretest compared with the training sessions/posttest. The motivation x group interaction was non-significant (F < 1). The main effect of alertness was not significant (F < 2), but the alertness x group interaction was statistically significant, F (2, 62) = 7.191, p = 0.002. Subsequent one-way ANOVAs showed that there were no group differences concerning alertness at pretest or across training sessions (both Fs < 1), but at posttest a significant group difference was found, F (1, 31) = 6.308, p = 0.017, with the training group reporting a higher degree of alertness (M = 4.32, SD = 0.68) compared with the controls (M = 3.50, SD = 1.15). The set shifting index (raw score) of the BRIEF-A self-report form, which was analyzed with an ANCOVA, did not show a statistically significant group

Training Results
The same analyses were run for the follow-up as for the posttest, using pretest as a covariate. The main effect of group on the composite switching cost in RTs did not reach the level of significance at follow-up, F (1, 29) = 2.825, p = 0.104, d = −0.61, 95% CI [−1.36, 0.13], but there was a significant group difference with regard to the composite mixing cost in RTs, F (1, 29) = 10.900, p = 0.003, d = −1.21, 95% CI [−1.95, −0.46], due to the smaller mixing cost of the training group at follow-up compared with the control group. Regarding accuracy, a significant group difference for the composite switching cost was seen at follow-up, F (1, 29) = 7.292, p = 0.011, d = 0.99, 95% CI [0.24, 1.73], with the cost being relatively smaller for the training group compared to the control group, but the group effect of the composite mixing cost of accuracy was non-significant (F < 2; Table 1). Both statistically significant findings survived Bonferroni correction (0.05/4 = 0.0125).
10 One participant from the control group was excluded from the analysis due to a highly inconsistent score.

The OMO Task
ANCOVAs were run for the overall RTs and overall accuracy at follow-up for both subtests using pretest as a covariate. Perceptual subtest. The control group was somewhat faster than the training group at follow-up, F (1, 29) = 5.778, p = 0.023, d = 0.87, 95% CI [0.13, 1.61], but this difference did not survive Bonferroni correction (0.05/4 = 0.0125). The main effect of group regarding overall accuracy did not reach statistical significance (F < 1; Table 5). Rule-based subtest. No significant group difference was found on either overall RTs (F < 1) or overall accuracy, F (1, 29) = 2.892 (Table 5).

Motivation and Alertness
At the follow-up, one-way ANOVAs showed no group differences on motivation or alertness ratings (both Fs < 1; Table 5).

DISCUSSION
The present study addressed a potentially important but only scarcely studied area, namely the effects of set shifting training in healthy elderly. In the light of previous training studies, we expected to find improvement on the trained tasks and near transfer effects. Nevertheless, we also wanted to explore whether far transfer effects could be observed. In brief, what we found were strong and long-lasting training effects on the trained tasks, very limited evidence for near transfer, and no far transfer. These results are summarized and discussed in detail below. Concerning the trained tasks, the training group showed the expected improvement compared to the controls. Our training group outperformed the control group at posttest regarding both switching as well as mixing costs in reaction times. The corresponding posttest effects on accuracy were not statistically significant, although the switching cost accuracy showed a trend for significance in favor of the training group. The analyses on the follow-up performances showed that the training group outperformed the control group on the mixing cost in reaction times and switching cost in accuracy even after 1 year. Thus, the follow-up findings confirmed that the training regime worked, and a 5-week set shifting training can create long-lasting training effects on the practiced tasks. With regard to near transfer, no statistically significant effects were observed on the switching cost or mixing cost in reaction times or accuracy in either part of the odd-man-out task. The switching cost in reaction times in the rule-based part was very small, and the mixing cost was negative. Concerning the overall accuracy and reaction time measures across all task blocks of the odd-man-out task, the rule-based part showed a statistically significant training effect on overall accuracy with a very large effect size (d = 1.35). The corresponding overall reaction time measures did not yield a group difference at posttest. 
Furthermore, no near transfer effects were observed on the Trail Making Test. To sum up, only one measure, overall accuracy on the rule-based part, showed near transfer, indicating a very limited transfer effect. We found no evidence for far transfer on the extensive test battery tapping other executive domains, fluid intelligence, episodic memory, verbal fluency, or visuomotor speed. Only one of the earlier set shifting training studies (Soveri et al., 2013) has addressed both transfer effects and training effects on the training tasks themselves. Naturally enough, the goal of cognitive training is to obtain improvement on untrained tasks, but verification of training effects on the trained tasks serves as a proof that the training program as such works. In the present study, these effects were verified. In general, improvements on the trained tasks have been the most robust finding in brain training studies (for reviews, see Simons et al., 2016). This is also true for cognitive training studies that have specifically addressed elderly individuals (Karbach and Verhaeghen, 2014). The fact that these effects were maintained in the follow-up concurs with the largest cognitive training study conducted thus far: the Advanced Cognitive Training for Independent and Vital Elderly study found long-lasting effects on the trained tasks 2 years, 5 years, and even 10 years after training (Ball et al., 2002;Willis et al., 2006;Rebok et al., 2014).
FIGURE 3 | Number of correct responses (%) across single tasks, switching trials and repetition trials of the rule-based part of the "odd-man-out" near transfer task for the training group and the control group, including standard errors (scale 90-100%).
Between-group differences at posttest and follow-up were analyzed with ANCOVA. a One participant in the control group declined to participate in the follow-up, leading to a sample size of 15 at that measurement point. *Alpha level is Bonferroni corrected (p = 0.0038 at posttest); ns, not significant.
Frontiers in Aging Neuroscience | www.frontiersin.org
The present results concerning near transfer are broadly in line with most previous set shifting training studies (Minear and Shah, 2008; Karbach and Kray, 2009; Zinke et al., 2012; Pereg et al., 2013) insofar as they have also reported selective near transfer effects. The results also fit well with recent results from an executive process training study including set shifting training that also found limited near transfer effects (Sandberg et al., 2014). A possible reason for the observed near transfer to the rule-based odd-man-out task is that the training tasks may have recruited similar cognitive resources. In general, it has been argued that transfer can take place only if the training and transfer tasks depend upon partly the same cognitive processes and neural systems (e.g., Dahlin et al., 2008a; Waris et al., 2015), and in most executive training studies conducted with older participants that have reported transfer, the transfer has been seen on tasks that are very similar to the trained tasks (Morrison and Chein, 2011; Buitenweg et al., 2012). The Number-Letter and Dot-Figure tasks employed arbitrary cues (placement of number-letter/dot-figure pairs), and the participants had to learn and update the response rules during task performance. Also the Categorization Task and the rule-based odd-man-out task may share some underlying cognitive mechanism(s) as they are both complex in nature, and require several executive processes (Ravizza and Carter, 2008; Naglieri and Otero, 2014). Ravizza and Carter (2008) found that rule-switching in their odd-man-out task, which was similar to ours, was related to greater activity in the dorsolateral prefrontal cortex, which in turn has been linked to rule-guided behavior and to context maintenance. 
Also the Wisconsin Card Sorting Test that the Categorization Task is based on has been linked to dorsolateral prefrontal cortex activation (Nyhus and Barceló, 2009), and the lateral prefrontal findings in relation to the Wisconsin Card Sorting Test are thought to reflect maintenance of task-set units (e.g., "color") in working memory (Miller and Cohen, 2001). It might thus be that the near transfer finding in the present study reflects active maintenance of task-relevant information (cf. also Pereg et al., 2013) rather than set shifting. In other words, it is possible that the shared cognitive component between the three training tasks and the rule-based odd-man-out task is working memory updating, an executive process that is required to a higher degree by the rule-based than the perceptual odd-man-out task. This would also be in line with the study by Pereg et al. (2013), as the results from their study suggested that what had been trained as a "set shifting ability" in the study by Karbach and Kray (2009) was not a broad ability, but rather a specific skill related to the unique working memory updating requirements of the training tasks. It is of interest to note that set shifting deficits in late adulthood are usually found when participants have to maintain and coordinate two task sets in working memory (Wasylyshyn et al., 2011). The transfer effect found in the rule-based part of the odd-man-out task was no longer significant at the 1-year follow-up. The fact that no near transfer effects were found on the Trail Making Test may have been due to the fact that this paper-and-pencil test is a rough measure compared with computerized tests that can reveal more subtle performance changes. The results regarding far transfer effects have been mixed in previous set shifting studies. Minear and Shah (2008) did not include far transfer measures in their study at all, Zinke et al. (2012) found only modest far transfer effects, and Soveri et al. (2013) did not find any far transfer effects. However, Karbach and Kray (2009) found transfer effects to tasks measuring working memory updating, inhibition, and fluid intelligence. Our study differs from the study by Karbach and Kray (2009) in that we used an adaptive training paradigm, and we had a higher number of switches that were distributed somewhat differently in the training. Additionally, Pereg et al. (2013) used the same protocol as Karbach and Kray (2009), but were not able to replicate the far transfer findings of Karbach and Kray (2009). 
It has recently been argued that especially the far transfer effects seen in executive training studies are not consistent (Shipstead et al., 2010, 2012; Morrison and Chein, 2011; Buitenweg et al., 2012; Hulme, 2013, 2016), and several previous executive training studies have suffered from methodological shortcomings (e.g., not including an active control group, not using an adaptive training regime, or not including enough transfer measures). We tried to take these criticisms into account by employing an active control group, using an adaptive and long enough training paradigm, and by including at least two transfer tasks per cognitive domain. Still we found only very limited statistically significant near transfer effects.
Some limitations of the present study should be pointed out. First, the sample size was small, which decreases the statistical power and increases the risk of Type II errors. Given the practical challenges of cognitive training studies, Internet-based training experiments that enable larger sample sizes offer one promising way to further study the transfer effects of executive training in the future (e.g., Ngandu et al., 2015). Second, some of the participants performed at ceiling in parts of the training tasks and the odd-man-out task (mainly in the single task blocks), limiting the sensitivity of these tasks in showing possible training-related effects. Third, to rule out possible expectancy effects, future studies should also examine whether the active control group has the same expectations of improvement on the pre/post tasks as the experimental group, as only then can we more confidently attribute differential improvements to the shifting training (Boot et al., 2013). Fourth, while no motivational differences were found between the groups, the control group reported being less alert than the training group at posttest. However, as the groups were equally alert during pretest and training, alertness was not a confounding factor for the training period. It is unlikely that the higher subjective alertness level of the training group at posttest reflected a general training effect, as in that case one might have expected more widespread transfer. One possibility is that it was an after-effect of the posttest, where the training group performed tasks that were very familiar to them and where they could excel (i.e., the training tasks), whereas the control group had no similar tasks to perform, as the computer games were not included in the pre-post test battery. Nevertheless, in absolute terms, both groups displayed adequate levels of subjective alertness at posttest.
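To illustrate the first limitation, the following minimal sketch approximates the statistical power of a simple two-group comparison with the present group sizes (n = 17 and n = 16). The effect sizes used (Cohen's d = 0.5 and 0.8, the conventional "medium" and "large" values) are illustrative assumptions and not estimates from the present data, and the function names are hypothetical. The calculation relies on a normal approximation and ignores the pretest covariate adjustment of the ANCOVA, which would typically increase power somewhat, so the values should be read as rough orientation only.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def approx_power(d, n1, n2, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    d is an ASSUMED standardized effect size (Cohen's d), not a value
    taken from the study. The normal approximation is slightly
    optimistic compared with an exact t-test calculation.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative hypothesis
    shift = d * sqrt(n1 * n2 / (n1 + n2))
    # Probability of landing in either rejection region
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate equal group size needed to reach the target power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    for d in (0.5, 0.8):  # illustrative effect sizes only
        print(f"d = {d}: power ~ {approx_power(d, 17, 16):.2f}, "
              f"n per group for 80% power ~ {n_per_group(d)}")
```

Under these assumptions, the approximate power is roughly 0.3 for a medium effect and roughly 0.6 for a large effect, which is consistent with the concern that small or moderate transfer effects could have gone undetected.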
In conclusion, we found that set shifting training in the elderly yielded reliable and long-lasting effects on the trained tasks. However, the near transfer effects from this training were very limited.

ETHICS STATEMENT
All participants gave their written informed consent to participate in the study. The consent forms were also reviewed by the Ethics Committee of the Hospital District of Southwest Finland; among other things, the subjects were informed that they could withdraw from the study at any time without giving a reason, and that the collected data were confidential and kept in a safe place. The follow-up part of the study was approved by the Ethics Committee of the Departments of Psychology and Logopedics at Åbo Akademi University. No additional considerations applied, as the subjects were healthy.