Lack of Cross-Modal Effects in Dual-Modality Implicit Statistical Learning

Li, Xiujun; Zhao, Xudong; Shi, Wendian; Lu, Yang; Conway, Christopher M.

doi:10.3389/fpsyg.2018.00146

ORIGINAL RESEARCH article

Front. Psychol., 27 February 2018

Sec. Cognition

Volume 9 - 2018 | https://doi.org/10.3389/fpsyg.2018.00146

Lack of Cross-Modal Effects in Dual-Modality Implicit Statistical Learning

$\r\nXiujun Li,&#x;$ Xiujun Li^1,2†

Xudong Zhao^2†

Wendian Shi^2*

Yang Lu¹

Christopher M. Conway^3,4*

¹School of Psychology and Cognitive Science, East China Normal University, Shanghai, China
²Department of Psychology, School of Education, Shanghai Normal University, Shanghai, China
³NeuroLearn Lab, Department of Psychology, Georgia State University, Atlanta, GA, United States
⁴Neuroscience Institute, Georgia State University, Atlanta, GA, United States

A current controversy in the area of implicit statistical learning (ISL) is whether this process consists of a single, central mechanism or multiple modality-specific ones. To provide insight into this question, the current study involved three ISL experiments to explore whether multimodal input sources are processed separately in each modality or are integrated together across modalities. In Experiment 1, visual and auditory ISL were measured under unimodal conditions, with the results providing a baseline level of learning for subsequent experiments. Visual and auditory sequences were presented separately, and the underlying grammar used for both modalities was the same. In Experiment 2, visual and auditory sequences were presented simultaneously with each modality using the same artificial grammar to investigate whether redundant multisensory information would result in a facilitative effect (i.e., increased learning) compared to the baseline. In Experiment 3, visual and auditory sequences were again presented simultaneously but this time with each modality employing different artificial grammars to investigate whether an interference effect (i.e., decreased learning) would be observed compared to the baseline. Results showed that there was neither a facilitative learning effect in Experiment 2 nor an interference effect in Experiment 3. These findings suggest that participants were able to track simultaneously and independently two sets of sequential regularities under dual-modality conditions. These findings are consistent with the theories that posit the existence of multiple, modality-specific ISL mechanisms rather than a single central one.

Introduction

Human learners show sensitivity to environmental regularities across multiple perceptual modalities and domains even without being aware of what is learned (Aslin and Newport, 2009; Emberson and Rubinstein, 2016). This ability, referred to as implicit statistical learning (ISL), is a ubiquitous foundational cognitive ability thought to support diverse complex functions (Guo et al., 2011; Thiessen and Erickson, 2015).

A current debate in this area of research concerns the mental representations resulting from ISL. The nature of these mental representations is important for revealing the characteristics of the mechanisms underlying ISL (Cleeremans and Jiménez, 2002; Fu and Fu, 2006; Li and Shi, 2016). In the classic study of implicit learning, Reber (1967) demonstrated ISL in participants who were exposed to letter strings generated from an artificial grammar. In these experiments, letter strings obeyed the overall rule structure of the grammar, being constrained in terms of which letters could follow which other letters. Participants not only showed evidence of learning this structure implicitly, but also could apparently transfer their knowledge of the legal regularities from one letter vocabulary (e.g., M, R, T, V, X) to another (e.g., N, P, S, W, Z) as long as the underlying grammar used for both was the same. This effect has been replicated many times, with transfer being demonstrated not just across letter sets (Shanks et al., 1997), but also across perceptual modalities (Tunney and Altmann, 2001). The transfer effects in artificial grammar learning (AGL) are usually explained by proposing that the learning is based on abstract knowledge, that is, knowledge that is not directly tied to the surface features or sensory input (Reber, 1989; Altmann et al., 1995; Shanks et al., 1997; Peña et al., 2002). An additional characteristic of ISL is that it occurs with perceptually diverse input, including linguistic stimuli, tone stimuli, visual scenes, geometric shapes, color stimuli, and motor responses (Saffran et al., 1999; Fiser and Aslin, 2002; Kemény and Lukács, 2011; Durrant et al., 2013; Goujon and Fagot, 2013; Guo et al., 2013). Importantly, the same ISL phenomenon appears to be observed regardless of the nature of the input patterns. Given that ISL occurs with perceptually diverse input, it is possible that what underlies ISL is a single, central mechanism that treats all types of input stimuli (e.g., tones, shapes, and syllables) as equivalent beyond the statistical structure of the input itself.

However, there is evidence contrary to this view, suggesting that ISL is not neutral to the input modality but rather is rooted in modality-specific, sensorimotor systems. First, demonstrations of transfer of knowledge does not necessarily mean that the acquired knowledge is amodal. What is learned may be the surface characteristics of the stimuli but a separate, higher-level process may form mappings between the different types of input, allowing above-chance performance with the new input (Redington and Chater, 1996). Consistent with this view, a recent study showed that transferring knowledge to a new stimulus set in an AGL paradigm required working memory resources; when memory resources were depleted using a dual-task manipulation, no transfer effects were observed despite learning of the regularities occurring (Hendricks et al., 2013). Second, although ISL can occur with different types of stimuli, this does not necessarily indicate that ISL is subserved by a single, domain-general mechanism that applies across a wide range of tasks, inputs, and domains. Instead, it is just as possible that there may exist multiple parallel subsystems, each relying on similar computational algorithms, which can process and learn the underlying structure in various stimuli, (e.g., Chang and Knowlton, 2004; Conway and Christiansen, 2005; Conway and Pisoni, 2008; Goujon and Fagot, 2013; Frost et al., 2015). For example, Chang and Knowlton (2004) found that ISL was sensitive to stimuli features, and the changes in fonts could affect the ISL performance of the letter strings. This finding appears to indicate that at least some of the learned knowledge is modality or stimulus specific. In addition, using vibration pulses, pictures, and pure tones as experimental materials, Conway and Christiansen (2005) compared tactile, visual, and auditory ISL and found modality constraints affecting ISL across the senses, with auditory ISL showing better performance than both tactile and visual learning (see also Conway and Christiansen, 2006, 2009). Similarly, Emberson et al. (2011) presented visual and auditory input streams under different timing conditions (fast or slow presentation rates). The results showed that auditory ISL was superior to visual learning at fast rates, but the opposite was true at slower presentation rates, suggesting the existence of modality constraints affecting learning.

The studies reviewed to this point relied on comparisons across individual modalities. However, the perceptual environment is rarely limited to one modality or a single information stream (Stein and Stanford, 2008), and learners often face multiple potential regularities across modalities at the same time. Research in the area of multisensory integration suggests that to some extent, information can be processed separately and in parallel across different perceptual modalities. For instance, participants can monitor simultaneously visual and auditory inputs in different spatial locations without a behavioral deficit under conditions of divided attention (Santangelo et al., 2010). In addition, findings from working memory research suggests that when information is presented in both visual and auditory-verbal formats, the information is encoded separately and yet a facilitative effect is also observed in bimodal formats (e.g., audiovisual stimuli), leading to improved memory (Mastroberardino et al., 2008). Although these studies demonstrate the manner in which multimodal input streams are processed by attentional, perceptual, and memory mechanisms, it is currently unclear to what extent ISL can support such processing demands.

It is important therefore is to explore the degree to which multimodal information streams are processed independently or are integrated together to support implicit learning (e.g., Sell and Kaschak, 2009; Cunillera et al., 2010; Mitchel and Weiss, 2010, 2011; Thiessen, 2010; Shi et al., 2013; Mitchel et al., 2014; Walk and Conway, 2016). The extent to which simultaneous multisensory input are processed independently rather than being integrated together provides perhaps the strongest support for the existence of multiple, modality-specific mechanisms of ISL. That is, learning of multiple input streams in parallel does not seem feasible for a single central learning mechanism; only if multiple learning mechanisms exist could parallel input streams be learned and represented independently of one another.

Conway and Christiansen (2006) assessed multistream learning in a series of three experiments. In Experiment 1, participants were exposed alternately with auditory sequences produced from one artificial grammar and visual sequences generated by a second grammar. In the test phase, new sequences were generated from each grammar; crucially, for each participant, all sequences from both grammars were instantiated only visually or auditorily. The results revealed that participants only endorsed a sequence as “grammatical” if the sensory modality matched the grammar that it was paired with during the learning phase. These findings suggest that ISL is closely bound to the input modality in which the regularities are presented, rather than operating at an abstract level. Johansson (2009) extended these findings by increasing the amount of exposure during the learning phase, believing that this would be more likely to result in formation of abstract representations. The results were still consistent with stimulus-specific, not abstract, representations.

However, due to the crossover design used (Conway and Christiansen, 2006; Johansson, 2009), these two studies were not able to examine the learning of cross-modal sequences at the same time in a strict sense. That is, the visual sequences and auditory sequences were interleaved and alternated with one another rather than being presented concurrently. Therefore, in order to provide evidence that multiple sensory modalities can be used to learn sequential regularities simultaneously and independently, a different type of design is necessary.

In an initial study using a dual-modality design, Shi et al. (2013) presented participants with visual and auditory sequences simultaneously, examining the degree to which multimodal input sources are processed independently. They found that the participants could acquire the regularities presented simultaneously regardless of the grammatical rules being the same or different, and there were no significant differences between unisensory and multisensory conditions. That is to say, audiovisual presentation of the same regularities did not show an enhanced ISL effect one might expect if there was a single learning system integrating the perceptual inputs. Similarly, audiovisual presentation of different grammars at the same time also did not show an interference effect, suggesting that the learning of two sets of regularities presented in two different perceptual modalities can occur independently and in parallel.

In the Shi et al. (2013) experiments, the audiovisual sequences consisted of nonsense syllables and color pictures paired together. The absence of enhanced or decreased ISL during the cross-modal presentation conditions may be due to the nature of the stimuli. Specifically, Conway and Christiansen (2006) found that when two input streams following different grammatical rules were presented in separate perceptual dimensions (e.g., unfamiliar shapes and nonsense syllables), participants were able to demonstrate learning of the two grammars. However, when the two input streams containing different grammatical rules were presented in the same perceptual dimension (e.g., two different sets of nonsense syllables or two different sets of unfamiliar shapes), participants were not able to learn both sets of regularities (Conway and Christiansen, 2006). Therefore, the perceptual characteristics of the stimuli appear to play an important role in cross-modal ISL.

Based on these findings, we conjectured that cross-modal effects would be more likely to occur if the two input streams contained high perceptual overlap, namely, visual and auditory stimuli that referred to the same objects. Specifically, in this study, we used animal pictures as the visual stimuli and their names as the auditory stimuli. Using pictures of animals and their names presented concurrently should provide the strongest test of parallel learning mechanisms. That is, if learning of the grammatical patterns operates independently over visual and auditory input, then concurrent presentation of animal pictures with their auditory names should show no behavioral facilitation. On the other hand, if learners are representing the visual pictures and auditory names as a single perceptual event (e.g., a multimodal percept that includes the picture of the animal tied to its name), then we might expect to observe better learning under concurrent compared to unimodal presentation. Similarly, if the animal pictures and animal names are presented in sequences generated from two different artificial grammars at the same time, this would be expected to result in an interference effect if learners are trying to integrate the pictures with the names. But if no interference is observed under such seemingly difficult learning conditions, this would be strong evidence for parallel, independent learning mechanisms.

The objective of this study therefore is twofold. First, we tested whether multistream, cross-modal ISL results in increased learning when the auditory and visual stimuli provide redundant information. Second, we explored whether multistream cross-modal ISL results in decreased learning when the auditory and visual stimuli provide conflicting information. The answers to these questions will ultimately allow us to determine whether the cross-modal patterns were processed independently within each modality or whether the perceptual modalities are integrated together during the learning process.

Experiment 1A: Unimodal Visual ISL

Visual ISL of a single modality was tested, and used to establish a baseline level of performance for comparison with dual-modality ISL in subsequent experiments.

Method

Subjects

Twenty-two Chinese graduate students with normal hearing and normal or corrected-to-normal vision were recruited from Shanghai Normal University via an advertisement (Age range = 23–29; Mean age = 25.75; Females = 13). The decision of the sample size of 22 was predetermined by priori power analysis, based on the G^∗Power package version 3.1.9.2, which indicated the sample size > 19 adequately makes a moderate experimental effect (Cohen’s d = 0.8) being detectable with power = 0.9 at alpha = 0.05. None of them had ever participated in any type of cognitive experiments. This study was carried out in accordance with the recommendations of the Ethics committee of the Shanghai Psychological Society with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki, and they were paid for their participation. The protocol was approved by the Ethics committee of the Shanghai Psychological Society.

Materials

An artificial grammar (see Figure 1) (Christiansen et al., 2012) was used to produce a set of sequences containing between five to seven elements. Each letter of the grammar was mapped onto an animal vocabulary including tiger, lion, elephant, horse and goat. The grammar determined the order of sequence elements drawn from five different categories of stimulus tokens. Two categories, A and B, each contained a single token, a tiger (A) and a lion (B), respectively. The C category consisted of two tokens, a black elephant (C₁) and a gray elephant (C₂). The D and E categories each contained three tokens, a white horse (D₁), a black horse (D₂), and a gray horse (D₃); and a white goat (E₁), a black goat (E₂), and a gray goat (E₃), respectively. There were a total of 10 tokens distributed over the ten stimuli.

FIGURE 1

FIGURE 1. Artificial grammar 1.

A large number of sequences can be generated according to this artificial grammar. We used 48 legal sequences that were generated from the grammar for the acquisition phase (see Appendix A, Grammar 1). The test set (see Appendix B) consists of 80 novel legal and 80 illegal sequences. The illegal sequences each began with a legal element, followed by several illegal transitions and ending with a legal element. For example, the illegal sequence B–C₁–E₂–A–D₁–D₁ begins and ends with legal elements (B and D, respectively) but contains two illegal interior transitions.

A possible token sequence resulting from this artificial grammar could be A-D₁-E₂-B-C₁-D₂.which corresponds to a tiger, white horse, black goat, lion, black elephant, and black horse (see Figure 2).

FIGURE 2

FIGURE 2. Example of sequence.

Procedure

The experiment consisted of two phases: a learning phase and a testing phase. At the beginning of the acquisition phase, participants were told a story adapted from Rosas et al. (2010) but presented in Chinese: “Thomas travels along his country in a train carrying his circus. They arrived in a total of 144 cities. Whenever they arrive in a city, Thomas makes his animals perform. Next, you’ll see sequences of animals on the computer screen. They are the appearance order of animals performing at each city. Please remember these sequences and afterward you will take a test.”

The participants were not told that the sequences had been produced according to an artificial grammar. The sequences were presented one at a time to each participant. Each animal picture in the sequences lasted for 1000 ms with 300 ms inter-animal intervals. A 2000 ms pause occurred between the first and second sequences. Each of the 48 learning phase sequences was presented three times in random order for a total of 144 trials.

After the familiarization phase, participants were presented with the testing phase. The testing phase started with the following story: “The animals appearing in the orders you have observed were generated according to a set of complex rules that determined the order of the animals within each city. Thomas will travel along another country, so animals appearing in new orders have been produced. Some of these new orders conform to the same rules as before, the others are different. Only the orders conforming to the same rules would be allowed to perform on stage. Next you will see the animals appearing in new orders. Your task is to determine if the orders conform to the same rules as before by pressing one of two buttons marked YES and NO without feedback. You need to respond quickly and accurately.” Then, the 160 test sequences were presented in random order to the participants. The presentation time of the test sequences was the same as that used during the acquisition phase.

After completing the experiment, the participants were debriefed about their explicit knowledge of the rules as well as any particular strategies that they might have used during both the training and testing phases.

Results and Discussion

The mean test accuracy in Experiment 1A was 122 out of 160 (76%), with a standard deviation of 24. A one-sample t-test indicated that performance for the unimodal visual ISL task was significantly above chance, t(21) = 8.26, p < 0.001, d = 1.73.

Experiment 1B: Unimodal Auditory ISL

Auditory ISL of a single modality was tested and used to establish a baseline level of performance for comparisons with dual-modality ISL in subsequent experiments.