Cross-Representational Signaling and Cohesion Support Inferential Comprehension of Text–Picture Documents

Désiron, Juliette C.; Bétrancourt, Mireille; de Vries, Erica

doi:10.3389/fpsyg.2020.592509

ORIGINAL RESEARCH article

Front. Psychol., 18 January 2021

Sec. Educational Psychology

Volume 11 - 2020 | https://doi.org/10.3389/fpsyg.2020.592509

1. Technologies de Formation et Apprentissage (TECFA), Faculty of Psychology and Education, University of Geneva, Geneva, Switzerland
2. LaRAC, Univ. Grenoble Alpes, Grenoble, France

Abstract

Learning from a text–picture multimedia document is particularly effective if learners can link information within the text and across the verbal and the pictorial representations. The ability to create a mental model successfully and include those implicit links is related to the ability to generate inferences. Text processing research has found that text cohesion facilitates the generation of inferences, and thus text comprehension for learners with poor prior knowledge or reading abilities, but is detrimental for learners with good prior knowledge or reading abilities. Moreover, multimedia research has found a positive effect from adding visual representations to text information, particularly when implementing signaling, which consists of verbal or visual cues designed to guide attention to the pictorial representation of relevant information. We expected that, as with text-only documents, struggling readers would benefit from high text cohesion (Hypothesis 1) and that signaling would foster inference generation as well (Hypothesis 2). Further, we hypothesized that better learning outcomes would be observed when text cohesion was low and signaling was present (Hypothesis 3). Our first experimental study investigated the effect of those two factors (cohesion and signaling) on three levels of comprehension (text based, local inferences, global inferences). Participants were adolescents in prevocational schools (n = 95), where some of the students are struggling readers. The results showed a trend in favor of high cohesion that did not reach significance, a significant positive effect of cross-representational signaling (CRS) on comprehension requiring global inferences, and no interaction effect. A second experiment focused on signaling only and attention toward the picture, with collection of eye-tracking data in addition to measures of offline comprehension.
As this study was conducted with university students (n = 47), who are expected to have higher reading abilities and thus are less likely to benefit from high cohesion, the material was presented in its low cohesive version. The results showed no effect of conditions on comprehension performances but confirmed differences in processing behaviors. Participants allocated more attention to the pictorial representation in the CRS condition than in the no signaling condition.

Introduction

The use of text–picture combinations has become increasingly common in instructional documents in school and everyday life. However, learners with low reading abilities struggle to comprehend instructional texts (e.g., Cain and Oakhill, 2014). Text processing research has defined struggling readers as learners who have trouble both decoding and comprehending a text (Hoover and Gough, 1990; Kendeou et al., 2010; Florit and Cain, 2011). A previous study by Désiron (unpublished) showed that for young adults, language comprehension abilities (vocabulary and verbal reasoning) were predictors of multimedia comprehension, but decoding abilities were not. These results were in line with text processing research that confirmed Kintsch’s construction–integration theory (Kintsch and van Dijk, 1978; Kintsch, 1980) in that manipulating text cohesion to support the generation of inferences positively affected comprehension of a text-only document (e.g., McNamara et al., 1996; Ozuru et al., 2009). In a text–picture, or multimedia, document, the need to generate inferences also occurs between the text and the picture (Holmes, 1987) and has been theorized as the need to create links between the text and the picture to integrate them and form a coherent mental model (Mayer, 2014; Schnotz, 2014). Similar to text cohesion, the signaling principle of multimedia learning considers that text–picture integration and thus comprehension can be facilitated by visually elucidating the link between both representations (van Gog, 2014). Numerous studies have shown the positive effect of signaling on comprehension, particularly for learners with little prior knowledge (see Richter et al., 2017 for a review). Hence, changes in the cohesion of the text and the use of signaling between the text and the picture are potentially helpful to struggling readers.
However, text processing and multimedia learning research has rarely examined learners with low reading abilities (Ozuru et al., 2013) but has focused more often on learners with low prior knowledge (e.g., McNamara et al., 1996; Florax and Ploetzner, 2010). Further, text processing research has distinguished text-based comprehension from inferential comprehension based on their performance measures. Multimedia research has rarely considered this distinction and has often assessed comprehension as a whole or in contrast to knowledge transfer, an approach that was derived from research on problem solving. The purpose of this research was to investigate the effects of text cohesion and cross-representational signaling on the comprehension of an instructional text–picture document.

Learning From Text

Reading with comprehension is not a straightforward process, as the comprehension of even the shortest text may require the generation of inferences. This critical ability is well described in text comprehension research, particularly in the construction–integration model from Kintsch (1998), the original model of which, from Kintsch and van Dijk (1978), was a schema theory. This means that learners have an a priori general idea of what they will read about and that the reading will provide them with new information that will feed their schema of the situation. The model then evolved to consider not only instructional texts but narratives as well (Kintsch, 1980), with an emphasis on text structure and its correspondence with the construction of a mental model. In this updated model, a learner integrates the following into a coherent mental model: the elements from the text-based representation, knowledge from long-term memory, and any inferences they make from the text. The ability to generate inferences depends on reasoning processes both to establish the implicit connection between two (or more) pieces of information distributed in the text (bridging inferences) and to build on previous knowledge of the world in order to understand the global situation (elaborative inferences). Therefore, the generation of inferences allows the reader to link elements and build new knowledge. With bridging inferences, learners link information provided in neighboring sentences (local bridging inferences) or distributed further apart in the text (global bridging inferences). With elaborative inferences, learners retrieve details from prior knowledge and integrate them in their mental model.

A large body of text comprehension research has demonstrated that learners’ abilities to generate inferences from text were a strong predictor of their success or failure in comprehending the text (Cain and Oakhill, 1999, 2014), independent of individual factors such as word decoding skills, working memory capacity, and domain knowledge. Other research on the influence of the generation of inferences investigated the effects of varying the level of cohesion in a text (McNamara et al., 1996; McNamara, 2001, 2011; Ozuru et al., 2009). McNamara et al. (1996) found that high school students (11–15 years) with low prior knowledge better comprehended a text at high cohesion than at low cohesion and that a reversed effect was observed for high prior knowledge students. Further, this effect was observed for the inferential but not the text-based level of processing. In a study with college students, Ozuru et al. (2009) investigated the interactions between cohesion and prior knowledge, and cohesion and reading abilities. They found that while cohesion in itself did not affect comprehension, it interacted with reading abilities depending on the level of comprehension considered. Regression analyses indicated that the contribution of prior knowledge increased when comprehension required more integration, while an opposite pattern was observed for reading abilities. In other words, low prior knowledge learners particularly benefit from high cohesion for the generation of global inferences, while learners with low reading abilities particularly benefit from high cohesion for the retention of text-based information and the generation of local inferences. Learners with high reading abilities or high prior knowledge are able to generate inferences without support. 
However, further analyses of variance showed that learners with low reading abilities and low prior knowledge did not benefit from high cohesion, stressing the importance of taking both reading abilities and prior knowledge into consideration. Additionally, this line of research (e.g., Ozuru et al., 2009; McNamara, 2011) underlined the importance of distinguishing between the comprehension of elements extracted directly from the text (assessed with text-based questions) and the comprehension of elements requiring the generation of inferences (assessed with local-bridging, global-bridging, and elaborative questions). This body of research found that, overall, facilitating the generation of inferences through manipulations of text cohesion affected only inference questions and particularly so for global-bridging questions.

Learning From Text and Picture

Current models of multimedia learning (Mayer, 2014; Schnotz, 2014) are based on the dual coding theory (Paivio, 1971; Clark and Paivio, 1991), which predicts, and has found, better memory after a presentation using both a picture and a verbal label of an object than after one presenting the same information twice in one medium. Having information anchored in two representation channels rather than in one results in the construction of a stronger mental model. Both the cognitive theory of multimedia learning from Mayer (2005, 2014) and the latest version of the integrated model of text and picture comprehension (ITPC model) from Schnotz (2014) assert that information from verbal and pictorial representations is first processed through different sensory modalities before being encoded in a coherent model of the situation that relies on both those representations and previous knowledge. According to the cognitive theory of multimedia learning, the multimedia effect (that is, text–picture material improves learning more than text alone) occurs because complementary information provided by the pictorial representation supports the construction of the mental model of the situation, which is required for deep understanding and thus learning (for a more extensive explanation, see for example Schüler et al., 2013). The ITPC model includes a coherence principle, which suggests that “students learn better from words and pictures than from words alone if the words and pictures are semantically related to each other” (Schnotz, 2014, p. 23), especially students with poor reading skills or little prior knowledge. This assertion is based on research showing that learning with multiple representations (particularly written text and pictures) can be beneficial for comprehension, provided that learners can identify the links between representations through cross-references.

Whereas the literature has repeatedly reported that adding pictures to text improves comprehension (e.g., Mayer and Gallini, 1990; Hegarty and Just, 1993; Schnotz and Bannert, 2003; Mason et al., 2013c), there is also evidence that the integration process can be challenging for students (Ainsworth et al., 2002). According to the Design, Functions, Tasks (DeFT) framework (Ainsworth, 1999, 2006; Ainsworth and Van Labeke, 2002), multiple representations can primarily be used to complement one another, constrain the interpretation of each other, or allow learners to construct a deeper understanding of a given topic. Based on the coherence principle from the ITPC model, written text and pictures should be considered as complementary multiple representations, for which “a single representation would be insufficient to carry all the information” (Ainsworth, 1999, p. 137). In addition to the overlap of information across representations, the DeFT framework includes the idea that representations can bear different computational weights (Larkin and Simon, 1987). In short, using computationally unequal representations can be beneficial to learners because they will infer some information more easily from one type of representation than from the other. Using multiple representations with written text and visual pictures, Larkin and Simon (1987) considered that a picture can represent linked information spatially closer together than a written text and thus be more facilitative of the inference generation process.

To guide the integration process of learning material employing multiple representations, multimedia research has investigated the effect of the insertion of visual or verbal cues in either verbal or pictorial representations or both (for reviews, see van Gog, 2014 and Richter et al., 2016). Whereas van Gog (2014) distinguished signals according to their implementation—text based, picture based, or used across representations—Richter et al. (2016) classified signals according to their nature—with verbal signals opposed to visual signals. Verbal signals are deictic references that correspond to an explicit reference to the pictorial representation in the text or verbal labels inserted in the pictorial representation. Visual signals refer to the use of a single color for a word in the text and its visual counterpart in the pictorial representation or to the use of color or a spotlight in the picture. Therefore, we consider that the taxonomies of Richter et al. and van Gog should be used concurrently to define or design multiple representations. Results from the meta-analysis by Richter et al. showed an overall significant beneficial effect of signaling in text–picture relations that was more profitable to learners with low and medium prior knowledge than those with high prior knowledge, in line with the ITPC model’s predictions (Schnotz, 2014).

Multimedia learning research has investigated not only learners’ ability to construct a coherent mental model but other characteristics as well. Thus, similar to the results of the text research, multimedia learning research pointed out an expertise reversal effect (Sweller et al., 2003; Kalyuga, 2014), which states that learners with low prior knowledge are more likely to benefit from the text’s adjunct visual representation than learners with high prior knowledge are. Notably, this expertise reversal effect was found in studies on signaling (Spanjers et al., 2011; Richter et al., 2017; Richter and Scheiter, 2019). Moreover, the cognitive–affective theory of multimedia learning (Moreno, 2005, 2006) addressed the question of a link between affect and comprehension. Based on the cognitive–affective theory of multimedia learning, recent work on implementing emotional design (Um et al., 2012; Mayer, 2014; Mayer and Estrella, 2014) focused on the influence of motivation, of which interest is a component (Hidi, 2006). As an example, Um et al. (2012) investigated the effect of adding emotionally positive graphic design elements, such as colors and faces, and found that they increased comprehension, self-rated motivation, and satisfaction. Using a similar manipulation, Mayer and Estrella (2014) found increased comprehension but no effect on motivation and difficulty ratings. Focusing on the type of picture adjunct to the text, Lenzner et al. (2013, Study 3) found that higher interest was reported when the text was presented with an instructional picture than with a decorative picture. The meta-analysis from Schneider et al. (2018) found an overall effect of signaling on motivation and cognitive load. The inclusion of signaling in single or multiple media was associated with more motivation and less cognitive load when learning with signaled material.
Désiron (unpublished) investigated learner characteristics predicting multimedia learning when distinguishing text-based from bridging inference questions (local and global). Findings from this study indicated that different reading abilities (vocabulary and verbal reasoning) affect multimedia comprehension depending on the type of question (text based, bridging inference) asked. Learners’ situational interest was investigated as well and was found not to be a predictor of multimedia comprehension, either for text-based or bridging-inference questions.

Following the theory and research on signaling in multimedia learning, this study investigated the effect of cross-representational signaling (CRS) on learning from a multimedia document. We define CRS as signals supporting a high semantic overlap between written text and visual pictures, following Schnotz’s (2014) ITPC model coherence principle. Following the taxonomy from van Gog (2014), these signals are used across representations, and they are verbal (in both the text and picture) or visual (in the picture), according to the taxonomy from Richter et al. (2016). Indeed, the use of verbal signals in the picture successfully improved comprehension in previous research, particularly in an eye-tracking study by Mason et al. (2013a). Adding color to a pictorial representation was recurrently found to benefit learner comprehension (Jamet et al., 2008; Boucheix et al., 2013), as was using color across text and pictorial representations (Kalyuga et al., 1999). More recently, Richter and Scheiter (2019) compared the effect of signaling within the verbal representation with that of signaling across verbal and pictorial representations when learning from a digital chemistry textbook. The authors found that adolescents (13–17 years old) with low prior knowledge recalled more information when signaling was used across verbal and pictorial representations but that it did not affect learners with high prior knowledge. However, the manipulation failed to influence the outcome regarding learner comprehension. Based on predictions from the ITPC model (Schnotz, 2014) and the results of previous research, CRS should positively support comprehension when learners have low prior knowledge or reading abilities and the text is difficult to comprehend (low cohesion).

Although multimedia learning research has investigated comprehension, it has not often, or not clearly, distinguished between comprehension of the text base and comprehension requiring the generation of inferences. Rather, it often has focused on the distinction between text-based and transfer questions (e.g., Mayer et al., 2004; de Koning et al., 2011; Mason et al., 2013b). Butcher (2006) investigated learning outcomes with different measures, when the learner was reading a text only, a text with a simplified diagram, or a text with a detailed diagram on the heart and circulatory system. In Experiment 1, learning was measured by means of drawings, memory questions (similar to retention), and inferences (elaborative). These elaborative inferences were close to a transfer task, as learners were asked to transfer knowledge acquired from the instructional document to novel situations. No significant differences between groups were observed for inference questions, but learning with diagrams did lead to the generation of significantly more correct inferences in self-explanation. The generation of inferences is still rarely investigated in multimedia research with text and pictures. However, we believe that an assessment distinguishing between text-based and inferential comprehension would greatly benefit the field.

Research Aim and Hypotheses

The aforementioned literature showed that the comprehension performance of learners with low reading abilities or low prior knowledge was improved by increasing text cohesion or by signaling links between verbal and pictorial information through visual cues in the text and the pictures. However, no study has investigated how these two factors would interact. As we aimed to focus on learners with low prior knowledge and/or low reading abilities, bridging inferences were studied, but elaborative inferences were not.

Hypothesis 1

Our first prediction was based on results from text processing research (e.g., McNamara et al., 1996; Cain and Oakhill, 1999, 2014; McNamara, 2001; Ozuru et al., 2009) on the effect of cohesion on text comprehension. Thus, we expected that learners with low prior knowledge reading a highly cohesive text would obtain better scores on inference generation questions than learners with a low cohesive text would obtain.

Hypothesis 2

According to the ITPC model (Schnotz, 2014), pictures can be used as guides to comprehend a text. Further, research has shown that the multimedia effect is more salient in learners with low prior knowledge, especially when signaling is used (Richter et al., 2017). Thus, our main prediction was that learners who studied the multimedia material with CRS would perform better than learners who studied the multimedia material without signaling. As pictures should support inference generation, differences were particularly expected in answers to questions requiring the generation of inferences.

Hypothesis 3

Finally, we expected that text cohesion would interact with signaling. The highest comprehension performance should be found when the multimedia material is written at a low level of cohesion and includes CRS. The positive effect of CRS should be less pronounced with high text cohesion. Indeed, the guidance from CRS probably does not improve the generation of inferences beyond the benefits of high text cohesion. Previous research has shown that the generation of inferences is supported by both verbal references in the text and signaled visual references in the picture, both of which were found to support comprehension individually (e.g., McNamara, 2001; Richter et al., 2017). Therefore, we expected the interaction to be observed only when the generation of inferences was required and not for text-based questions.

Experiment 1

Materials and Methods

Participants and Design

Six classes of students (n = 95) in the first year of one Ecole de Culture Générale (prevocational track) took part in the experiment as part of a class activity proposed by their teacher. The number of participants was determined from an a priori power calculation using the software G*Power 3.1 (Faul et al., 2007) for a multivariate analysis of variance (effect size r = 0.29, derived from Seufert, 2003; α = 0.05; power of 0.80), with a recommended sample size of 89. Students who did not give their informed consent were given a silent reading task by their teacher. Four participants did not complete all tasks in time and were excluded from the data analyses. The data from 91 participants (51 female) with a mean age of 16.8 years (SD = 11 months) were analyzed. This study was approved by the university’s ethics committee and by the school research committee.

Participants were randomly assigned to one of four experimental between-subjects conditions resulting from a two-cohesion (low vs. high) by two-signaling (no signal, with CRS) factorial design.
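
The article does not describe the randomization procedure itself. As a minimal sketch only, one common way to obtain near-equal cell sizes in a 2 × 2 between-subjects design is to shuffle the participant list and deal participants into the four cells in turn. The function name, the seed, and the participant identifiers below are hypothetical.

```python
import random
from itertools import product


def assign_conditions(participant_ids, seed=None):
    """Shuffle participants, then deal them into the four 2 x 2 cells in turn.

    Assumption: balanced assignment; the study's actual procedure is not reported.
    """
    rng = random.Random(seed)
    cells = list(product(("low cohesion", "high cohesion"),
                         ("no signal", "with CRS")))
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    # Cycling through the cells keeps group sizes within one of each other.
    return {pid: cells[i % len(cells)] for i, pid in enumerate(shuffled)}


# Hypothetical identifiers for the 91 participants retained in the analyses.
assignment = assign_conditions(range(1, 92), seed=2020)
```

With 91 participants, this scheme yields three cells of 23 and one cell of 22.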

Material

The experimental material was a five-page-long multimedia document on river sailing and how to escape the Maytag effect when caught in a rapid. It was presented in landscape format with the text on the left side and a picture on the right side.

Cohesion

The text was written from multiple sources and manipulated to obtain low and high cohesive versions. Changes were implemented following the recommendations from Ozuru et al. (2009), which impacted both local and causal inference generation:

1. Replacing ambiguous pronouns with nouns;
2. Adding descriptive elaborations that link unfamiliar concepts with familiar concepts;
3. Adding connectives to specify the relationships between sentences or ideas;
4. Replacing or inserting words to increase the conceptual overlap between adjacent sentences;
5. Adding topic headers;
6. Adding thematic sentences that serve to link each paragraph to the rest of the text and overall topic; and
7. Changing sentence structures to incorporate the additions and modifications (p. 232).

The low cohesive version was 500 words long while the high cohesive version was 706 words long. Supplementary Appendix A contains an example of the low and high cohesive versions of the text in which the specific changes are indicated.

Signaling

The implementation of signaling in the form of CRS consisted of the insertion of captions and arrows in the picture as well as the use of the same color coding in both the text and the picture. Figure 1 shows an example of a page without signaling and with CRS.

FIGURE 1

Measures

Prior knowledge

To control for participants’ knowledge on the topic of river sailing, we used a questionnaire with six statements rated on a five-point self-rating scale ranging from “do not know” to “know very well” (e.g., I […] the dangers of river sailing) or from “cannot explain” to “can explain very well” (e.g., I […] what whitewater is).

Reading abilities

To control for participants’ reading abilities, we used two tests of reading abilities found to be good predictors of multimedia comprehension (Désiron et al., 2018). The vocabulary test was a French version of the Mill Hill assessment, which asks participants to identify 44 synonyms with six options each within an 8 min time limit (Deltour, 1993), for a maximum possible score of 44 points. The verbal reasoning test was a translation of the test designed by Meteyard et al. (2015) that assesses participants’ ability to generate inferences by means of short texts followed by four open-ended questions, for a maximum possible score of 12 points.

Comprehension

In accordance with research on text processing (e.g., Ozuru et al., 2009), comprehension was assessed at three different levels, with short open-ended questions. The findings of Ozuru et al. (2013) indicated that open-ended questions are a more sensitive measure of inference generation than multiple-choice questions. Five text-based questions measured participants’ retention of elements clearly stated in the text (e.g., “Which watercraft[s] use[s] a single-paddle?”) and could be answered with single words. Four local inference questions measured participants’ comprehension of elements that required the generation of bridging inferences from elements no more than a sentence apart (e.g., “Does the Maytag whirlpool form upstream or downstream from the boiling?”). Four global inference questions measured participants’ comprehension of elements that required the generation of bridging inferences from elements dispersed in the text (e.g., “According to the document, why is it only after the boiling that one should resurface?”). The local and the global inference questions needed to be answered with one or two sentences. Therefore, the answers that were expected for comprehension questions ranged from one word to two sentences, depending on the level of comprehension. Utilizing an analysis grid taking into consideration idea units from the text and pictures, each question could score 1 point, with the value of the idea unit ranging from 0.50 to 1, depending on the number of ideas expected. Thus, the score per level of comprehension ranged between 0 and 4 (inference questions) or 0 and 5 (text-based questions). The answers were evaluated by the first author, and a second rating was done by the second author on a random subset of 20% of the comprehension questions (n = 20). Interrater reliability was determined by intraclass correlation coefficients (ICCs), which were ICC (3, 1) = 0.93 for the comprehension questions.
The raters jointly settled their few differences in the two ratings.
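
As an illustration of the reliability index reported above, the following is a minimal sketch of how an ICC (3, 1) (two-way mixed model, consistency, single rater) can be computed from an answers-by-raters score matrix with the usual mean-square formulation. The function name and the example ratings are hypothetical and are not the study's data; the article does not report which software was used.

```python
from typing import Sequence


def icc_3_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(3, 1): two-way mixed model, consistency, single rater.

    `ratings` holds one row per rated answer and one column per rater.
    """
    n = len(ratings)        # number of rated answers (targets)
    k = len(ratings[0])     # number of raters
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for answers (rows), raters (columns), and the total.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    # Undefined (division by zero) if every score in the matrix is identical.
    return (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)


# Hypothetical scores (0, 0.5, or 1 point) given by two raters to six answers.
example_ratings = [
    [1.0, 1.0],
    [0.5, 0.5],
    [0.0, 0.5],
    [1.0, 1.0],
    [0.5, 0.0],
    [0.0, 0.0],
]
print(f"ICC(3, 1) = {icc_3_1(example_ratings):.2f}")
```

Because this form of the ICC measures consistency rather than absolute agreement, a constant difference between the two raters would not lower the coefficient.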

Procedure

This experiment was conducted in school, during 45 min classes, with up to 10 participants using 9.7-in tablets. The participants first completed the self-assessed knowledge questionnaire, before reading the experimental material in one of the four experimental conditions (low cohesion no signal, n = 23; low cohesion with CRS, n = 23; high cohesion no signal, n = 22; high cohesion with CRS, n = 23) without a time limit. The participants were then prompted to answer the comprehension questions, presented following the order in which the elements requiring an answer occurred in the text. Finally, the participants completed the reading ability tests, presented in a random order. At the end of the allocated time, the participants were debriefed with regard to the research hypotheses corresponding to the different experimental conditions.

Results

We used a 2 × 2 factorial design with cohesion (low vs. high) and signaling (none, CRS) as the between-subjects factors and scores on the three types of comprehension questions (text based, local inferences, global inferences) as the dependent variables.

Learner Characteristics

The sample as a whole had little prior knowledge (M = 5.53 out of 24, SD = 3.89). Overall, the participants scored just above half of the maximum possible points on the vocabulary test (M = 23.84, SD = 4.25), which was below the expected score for their age range (Deltour, 1993). Regarding the test of verbal reasoning, the participants scored about half of the possible points (M = 6.70, SD = 2.18). Therefore, this sample corresponded to the conditions deemed more likely to benefit from multimedia documents, according to the ITPC model (Schnotz, 2014).

To control for an effect of participants’ characteristics, we ran a correlation analysis between these characteristics and the three levels of comprehension. Prior knowledge did not correlate with any level of comprehension, vocabulary correlated with all levels (p < 0.001), and verbal reasoning correlated with text-based and local inferences comprehension questions (p = 0.002 and p = 0.032, respectively). The statistical analyses also indicated that covariates were not significantly different across groups (prior knowledge, p = 0.155; vocabulary, p = 0.806; and verbal reasoning, p = 0.135).

Effects of Cohesion and Signaling on Comprehension

A multivariate analysis of covariance (MANCOVA) was performed on comprehension scores with cohesion level (low, high) and signaling (no signal, with CRS) as the between-subjects independent variables and scores on the comprehension questions (text based, local inferences, global inferences) as the dependent variables. Following correlation analysis (see the previous section for details), reading ability tests for vocabulary and verbal reasoning were included as covariates. As shown in Table 1, there was no significant multivariate advantage of high cohesion, V = 0.084, F(3, 83) = 2.53, p = 0.063, η²_p = 0.084, but there was a significant univariate effect on local inferences questions in favor of those who learned with a highly cohesive text (M = 1.36, SD = 0.91) compared to those who learned with a low cohesive text (M = 1.02, SD = 0.75), F(1, 85) = 6.29, p = 0.014, η²_p = 0.069.

TABLE 1

	No signal		With CRS
	EMM	SE	EMM	SE
Low cohesion
Text-based questions (max score 5)	1.96	0.18	2.08	0.18
Local inferences questions (max score 4)	0.89	0.16	1.04	0.16
Global inferences questions (max score 4)	0.66	0.15	1.06	0.15
High cohesion
Text-based questions (max score 5)	2.22	0.18	2.15	0.18
Local inferences questions (max score 4)	1.15	0.16	1.16	0.16
Global inferences questions (max score 4)	0.65	0.15	0.94	0.14

Estimated marginal means and standard errors for the outcome measures in Experiment 1 (n = 95).

Maximum score indicated in parentheses.
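As a reading aid for Table 1, the short Python sketch below stores the cell EMMs and averages them over cohesion level to approximate the signaling contrast on each question type. The dictionary layout and the function name `signaling_means` are illustrative assumptions, not part of the study's analysis, and the unweighted averages need not match the descriptive means reported in the text exactly.

```python
# Cell EMMs copied from Table 1: (cohesion, signaling) -> {question type: EMM}.
TABLE_1_EMM = {
    ("low", "no_signal"): {"text_based": 1.96, "local": 0.89, "global": 0.66},
    ("low", "crs"): {"text_based": 2.08, "local": 1.04, "global": 1.06},
    ("high", "no_signal"): {"text_based": 2.22, "local": 1.15, "global": 0.65},
    ("high", "crs"): {"text_based": 2.15, "local": 1.16, "global": 0.94},
}


def signaling_means(question_type):
    """Average the cell EMMs over cohesion level (unweighted) per signaling condition."""
    means = {}
    for signaling in ("no_signal", "crs"):
        cells = [
            TABLE_1_EMM[(cohesion, signaling)][question_type]
            for cohesion in ("low", "high")
        ]
        means[signaling] = sum(cells) / len(cells)
    return means


for question_type in ("text_based", "local", "global"):
    m = signaling_means(question_type)
    print(question_type, round(m["no_signal"], 3), round(m["crs"], 3))
```

In this sketch, the gap between the two signaling conditions is smallest for text-based questions and largest for global inferences questions.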

Consistent with Hypothesis 2, there was a significant effect of signaling, V = 0.092, F(3, 83) = 2.79, p = 0.046, η²_p = 0.092, and learners with CRS scored higher than learners without signaling on local (M = 1.04, SD = 0.74 no signal; M = 1.33, SD = 0.92 with CRS) and global (M = 0.67, SD = 0.56 no signal; M = 1.01, SD = 0.88 with CRS) inferences questions. Separate two-factorial analyses of covariance on the outcome variables revealed a significant effect for global inferences questions, F(1, 85) = 2.84, p = 0.017, η²_p = 0.065. The difference for local inferences questions was only a marginally significant trend, F(1, 85) = 1.92, p = 0.072, η²_p = 0.038. As expected, there was no significant effect of signaling for text-based questions (p = 0.898).

There was no significant multivariate effect of the interaction between cohesion and signaling, V = 0.019, F(3, 83) = 0.55, p = 0.649, η²_p = 0.019. The covariate vocabulary was significantly related to comprehension for text-based, F(1, 78) = 29.79, p < 0.001, local inferences, F(1, 78) = 15.17, p < 0.001, and global inferences questions, F(1, 78) = 17.08, p < 0.001. The covariate verbal reasoning was significantly related to comprehension for text-based questions, F(1, 78) = 8.11, p = 0.006, but not for local inferences, p = 0.126, or global inferences questions, p = 0.479.
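The effect sizes reported for the multivariate tests can be related to their F statistics through the standard identity η²_p = (F × df_effect) / (F × df_effect + df_error). The Python sketch below applies it to the three multivariate tests; the function name `partial_eta_squared` is an illustrative choice, and the check is a reader's aid rather than part of the original analysis.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Multivariate tests reported in the text: effect -> (F, df_effect, df_error).
reported = {
    "cohesion": (2.53, 3, 83),
    "signaling": (2.79, 3, 83),
    "cohesion x signaling": (0.55, 3, 83),
}

for effect, (f_value, df_effect, df_error) in reported.items():
    print(effect, round(partial_eta_squared(f_value, df_effect, df_error), 3))
```

Rounded to three decimals, these values reproduce the reported η²_p of 0.084, 0.092, and 0.019.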

Discussion of Experiment 1

This first experiment tested three hypotheses on the effect of text cohesion and CRS on multimedia comprehension. In line with text processing research, three levels of comprehension (text based, local inference, and global inference) were assessed. In addition, some learners’ characteristics that were found to affect text comprehension were measured. In this regard, the results concurred with previous findings on the role of language skills (vocabulary and to a lesser extent verbal reasoning) in the comprehension score for multimedia learning (Désiron, unpublished). There was no effect of prior knowledge on the comprehension score, probably because the knowledge level overall was very low. Regarding the effect of the two independent variables, the results hardly supported Hypothesis 1, as there was no effect of cohesion on comprehension scores overall. Still, an effect of cohesion for the generation of local inferences was observed. The findings partially supported Hypothesis 2 because a significant positive effect of CRS was found for global inferences and a positive marginal trend for local inferences. No difference was found for text-based scores, as expected. Finally, Hypothesis 3 was not supported, with no significant interaction between the two independent variables.

The beneficial role of CRS is in line with previous research (Richter et al., 2016) that found a positive effect of signaling on multimedia comprehension. This is consistent with the assumption of the ITPC model (Schnotz, 2014), which posits that students with low prior knowledge and reading abilities, as were those in the sample of this study, benefit from support to connect the corresponding verbal and pictorial information. However, the underlying mechanisms and processes explaining this effect are still speculative and need further investigation, which is the focus of Experiment 2. Regarding cohesion, previous research has provided mixed results (Ozuru et al., 2009; McNamara et al., 2011), which have been explained by variability across studies in the samples under consideration (in particular their reading abilities and prior knowledge) and in the ways cohesion was concretely implemented in the material. Indeed, increasing cohesion also affects text length, which may act as another difficulty for learners with very low reading abilities. For these reasons, the cohesion factor was not varied in the next experiment. Only the low cohesion version was kept because, according to the ITPC model, the effect of CRS on comprehension is more likely to appear when the text is difficult, and because most available texts are written at this level (Graesser et al., 2004).

Experiment 2

In the second experiment, offline outcome measures of multimedia comprehension were combined with online measures, using eye tracking, and with measures of learners’ subjective evaluations (motivation and cognitive effort), as is often done in multimedia learning research (Paas et al., 1994; Mayer and Estrella, 2014; Huang and Mayer, 2016). In addition, comprehension was assessed with a drawing task, which is, to our knowledge, rarely used to assess inference generation in multimedia research. For example, Butcher (2006) asked learners to draw the heart and circulatory system as a pre- and posttest assessment of their mental models.

Previous multimedia research (see Richter et al., 2016 for a review) and the results from our first experiment showed that signaling successfully supported the integration of verbal and pictorial information for struggling readers. However, this integration process is reflected not only by offline measures of inferential comprehension but also by online measures such as gaze data (e.g., Mason et al., 2013a). The seminal eye-tracking study from Yarbus (1967), who compared free observation of a picture with observation under instructions, demonstrated that eye movements reflected attention to visual material and thus top–down or bottom–up processing. The aim of inserting signaling devices in multimedia documents is to guide learners’ attention by prompting top–down processing. Signals are assumed to support learners’ selection of information, particularly when they have little prior knowledge of the content (Ozcelik et al., 2010; van Gog, 2014). For example, Mason et al. (2013a) examined the use of labels in a picture accompanying a text on atmospheric pressure in a sample of sixth graders and found that the presence of signals effectively supported text–picture integration. Further, eye-tracking data revealed longer fixations on signaled elements (labeled in the picture) of both the text and picture during text rereading or picture reinspection (second pass) when labels were included. Scheiter and Eitel (2015) studied the effect of colored labels in both the text and picture on learning about the heart and circulatory system. Their results indicated that learners directed their attention toward the information more frequently when it was signaled than when it was not. Previous research has also shown that multimedia manipulations affect not only learning outcomes but also the learners themselves, as hypothesized by the cognitive–affective theory of multimedia learning (Moreno, 2005, 2006).
The implementation of multimedia principles was found to affect motivation (Mayer and Estrella, 2014; Dousay, 2016; Schneider et al., 2018) and cognitive effort (Huang and Mayer, 2016). The meta-analysis from Schneider et al. (2018), in particular, showed that signaling positively affected motivation, although the effect was small. Based on Experiment 1 and the literature, we expected to confirm Hypothesis 2, which posits that CRS is beneficial for comprehension at the inference level. Moreover, eye-tracking measures were used to determine whether the beneficial effect of CRS was linked to the fostering of attention toward the pictorial information and of text–picture integration, following the procedure used in Mason et al. (2013a) and Scheiter and Eitel (2015). This second experiment also explored the effect of CRS on motivation and cognitive load as outcome variables. Prior knowledge of the topic and reading abilities were still used as control measures.
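
To make the eye-tracking indicators concrete, the Python sketch below computes two quantities often derived from a fixation sequence: total fixation time per area of interest (AOI) and the number of gaze transitions between text and picture. The record layout, the AOI labels, and the sample values are hypothetical, and the sketch does not reproduce the procedures of Mason et al. (2013a) or Scheiter and Eitel (2015).

```python
from collections import defaultdict


def total_fixation_time(fixations):
    """Sum fixation durations (ms) per AOI."""
    totals = defaultdict(int)
    for aoi, duration_ms in fixations:
        totals[aoi] += duration_ms
    return dict(totals)


def text_picture_transitions(fixations):
    """Count consecutive fixation pairs that switch between the text and picture AOIs."""
    aois = [aoi for aoi, _ in fixations]
    return sum(
        1 for prev, curr in zip(aois, aois[1:]) if {prev, curr} == {"text", "picture"}
    )


# Hypothetical fixation sequence for one learner: (AOI label, duration in ms).
sample = [
    ("text", 220), ("text", 180), ("picture", 300),
    ("text", 210), ("picture", 260), ("picture", 190),
]

print(total_fixation_time(sample))
print(text_picture_transitions(sample))
```

Under this sketch, a longer total fixation time on the picture AOI would be read as more attention to the pictorial information, and a higher transition count as more frequent switching between the two representations, one possible indicator of text–picture integration.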