Front. Commun., 16 May 2024
Sec. Visual Communication
Volume 9 - 2024

Catch me if you can: how episodic and thematic multimodal news frames shape policy support by stimulating visual attention and responsibility attributions

  ZeMKI, Centre for Media, Communication and Information Research, University of Bremen, Bremen, Germany

Using media coverage of animal welfare as an example, this study examines how the perception of multimodal news frames shapes recipients’ visual attention, attributions of responsibility, emotions, and policy support. To investigate the mechanisms of multimodal-episodic versus thematic framing, we combined eye-tracking measurements with a pre-post survey experiment in which 143 participants were randomly assigned to an episodic or a thematic multimodal framing condition. The results show that episodic multimodal frames are viewed longer than thematic frames, elicit stronger individual and political responsibility attributions, and increase political support for stricter animal-welfare laws. Understanding multimodal framing as a multistep process, a serial mediation model reveals that episodic frames affect viewing time, which leads to stronger attributions of political responsibility and, in turn, stronger policy support. Our results support the idea of a complex interplay between subsequent stages of information perception and processing within a multimodal framing process.


Framing is considered one of the most important theoretical perspectives for analyzing the influence of media information on the perception and evaluation of political content (Tewksbury and Scheufele, 2009). The basic assumption is that the selective accentuation of distinct aspects of reality in media contexts provides recipients with interpretive patterns that can simplify but also significantly shape their information processing. Frames, as a set of interpretative information units, provide a “central organizing idea” that offers an interpretation of this information, “weaving a connection among them” (Gamson and Modigliani, 1987, p. 143). Depending on the recipient’s individual characteristics, this can affect how citizens think about political issues and, in turn, sway their judgments and actions (Scheufele and Tewksbury, 2007). The example of media framing on animal welfare, which we use as a subject of investigation in this study, is a good illustration of these mechanisms: The topic is highly relevant to society and therefore regularly covered in the news (Arpan et al., 2006; Freeman, 2014; Buddle and Bray, 2019), with recurring patterns in coverage, including moral implications, responsibility attributions (Buddle and Bray, 2019) and the use of emotionalizing imagery (Evans, 2016). However, as the average citizen generally has little direct experience of animal welfare, they obtain their information primarily from media coverage, which makes framing effects very likely (Gitlin, 2003; Freeman, 2014). Previous studies have accordingly found that media coverage on animal welfare efficiently impacts attitudes, responsibility attributions and support for policies (Tiplady et al., 2013; Buddle and Bray, 2019), and can even influence recipients’ actions (e.g., a temporary reduction in meat consumption; Tonsor and Olynk, 2011).
By stimulating active selection and meaning construction, framing addresses fundamental principles of complexity reduction during the process of perception and processing that unfold during almost every communication process (Geise and Baden, 2015).

As images have become an elemental part of these communication processes, the study of their influence on viewers has become increasingly relevant (Smith et al., 2020). This is all the more true because of the analogical property of images, their “true-to-life quality,” and their lack of an explicit propositional syntax (Messaris and Abraham, 2001, p. 217). Images in multimodal contexts are particularly attention-catching (Garcia and Stark, 1991; Bucher and Schumacher, 2006), perceived quickly and effortlessly by their recipients (Messaris and Abraham, 2001; Dahmen, 2012), and can further excite their viewers by appealing to their emotions (Barry, 1997). Text, in contrast, is less salient; however, its analytical and syntactic structure lends itself to the cognitive elaboration of a story’s content and thus to a more prescribed construction of meaning (Messaris and Abraham, 2001; Geise and Baden, 2015; Powell et al., 2019). Therefore, textual frames are prone “to promote a particular problem definition, causal interpretation, moral evaluation and/or treatment recommendation” (Entman, 1993, p. 52). In this study, we focus particularly on the attribution of responsibility for three main reasons: First, responsibility attribution is conceptually one of the most important factors in the framing of social problems (Kim, 2015). Second, responsibility attributions are therefore also particularly relevant for our research topic of animal welfare (Buddle and Bray, 2019). Third, research considers the attribution of responsibility a key driver of heightened emotions (Kuehne et al., 2015) as well as of increased policy support (Iyengar, 1991; Bouman et al., 2020), two framing effects we focus on in our study.
Considering that visuals and text, therefore, present information differently, are received differently, and are associated with different effects on feelings, thoughts, and actions, the framing approach seems particularly suitable, as it allows parallel consideration of the specific aspects of both visual and textual communication in a multimodal framing process (Coleman, 2010; Geise and Baden, 2015; Geise, 2017).

While the majority of framing studies have concentrated on the mechanisms of visual or verbal effects, examining their contribution in isolation (Coleman, 2010), the interplay of words and pictures in multimodal framing processes is still relatively understudied (Powell et al., 2015). Particularly, we know little about how multimodal news content, as typically represented via news frames, catches the recipients’ attention and activates their emotional reactions toward and cognitive evaluations of news content. Yet, given that people primarily receive political information through the news, which is inherently multimodal, this knowledge seems key to understanding how people’s reception of political news contributes to public information and discourse (Graber, 1996; Dixon et al., 2015).

In this study, we therefore examine how the perception of multimodal news frames that carry a textual responsibility frame accompanied by a congruent (episodic versus thematic) news image shapes recipients’ cognitive evaluation. In doing so, we inspect how the perception of a certain news image intermingles with the perception of a news text, and how these two modality-distinct framing ‘devices’ differentially capture visual attention, elicit emotional responses, and, finally, influence the public’s support of a policy. To this end, we conceptualize our methodological setting along with the underlying framing steps through which citizens construct coherent meaning from complex multimodal frames, ranging from the first stimulus exposure to its deeper elaboration. Understanding framing theory as a general framework for analyzing subsequent stages of perceiving and processing information (Baden, 2010; Geise and Baden, 2015), we rely on experimental data integrating eye tracking measurements with a pre- and post-survey design to assess the effects of multimodal-episodic versus thematic framing on citizens’ visual attention, responsibility attributions, emotions, and policy support.

As multimodal media frames and their interpretation patterns seem “particularly relevant when the way an issue is presented has potential social consequences” (Hardin et al., 2002, p. 344), we use the example of news articles covering the issue of animal welfare. Animal welfare is not only an issue with potential social consequences that is frequently featured in the media (Arpan et al., 2006; Buddle and Bray, 2019); emotionalizing visuals are also often used in reporting on it (Evans, 2016). The issue, therefore, seems particularly suited for studying the effects of appealing multimodal frames on media recipients’ attention, emotions, and cognitions. Our research design, which contrasts episodic and thematic frames in a mono-thematic setting using the issue of animal welfare, is inspired by existing studies which aim for a comparative analysis (e.g., Gross, 2008; Aarøe, 2011; Hart, 2011; Boukes, 2021).

Our results show that, within the framework of multimodal news frames that encompass press photographs and article texts, the observation of press photographs in particular shapes cognitive and affective framing effects. Concerning the different frame types, episodic multimodal frames are viewed for longer than thematic ones. Episodic frames also elicit stronger individual and political attributions of responsibility and increase policy support for stricter animal protection laws. In contrast, thematic frames generate stronger emotional responses.

Understanding multimodal framing as a multistep process, a serial mediation model shows that episodic frames affect viewing time, which leads to a stronger attribution of political responsibility. This, in turn, leads to stronger policy support. Our results, therefore, support the idea of a complex interplay between the subsequent stages of information perception and processing within a multimodal framing process, which results in corresponding cognitive evaluations.

News framing as a multistage process

Many studies have shown that the different framing of a topic, actor, or issue in media presentation leads to correspondingly different perceptions and classifications of the information among recipients (see, e.g., Borah, 2011, for an overview). While framing research has long concentrated on the analysis of textual frames (Coleman, 2010), some scholars have also demonstrated that the unique qualities of each framing modality–visual versus textual–become apparent at different levels in the multistage framing process (Geise and Baden, 2015; Powell et al., 2015).

A wide range of studies has revealed that different frame types are associated with different effects on cognition (Iyengar, 1991; Gross, 2008; Kepplinger et al., 2012) and emotion (Brantner et al., 2011; Lecheler et al., 2015a; Nabi et al., 2020). These observations resonate well with findings from media psychology that acknowledge the mediating role of emotion during information processing. Accordingly, there are already some theoretical inputs (see, e.g., Nabi, 2003; Lecheler et al., 2013; Kuehne, 2014; De Los Santos and Nabi, 2019) and empirical contributions (see, e.g., Major, 2011; Kuehne et al., 2015; Kuehne and Schemer, 2015; Schuck and Feinholdt, 2015; Lecheler and De Vreese, 2019) that consider the interplay of the affective and cognitive dimensions in the framing process. In these conceptions, which are mostly guided by appraisal theories (see, e.g., Scherer et al., 2001), emotions are understood as mental states that arise from cognitive evaluations and judgments, whereby recipients constantly assess incoming information and evaluation patterns that lead to certain emotions (Scheufele and Gasteiger, 2007; Iyer et al., 2014; Kuehne, 2014).

Building on this work seems particularly insightful, as the modality-specific properties that drive the mechanisms of multimodal framing (Geise and Baden, 2015) have not been in focus here, so their specific impact on emotion and cognition still seems relatively unexplored. Correspondingly, most previous studies on affective framing have implicitly assumed monomodal textual framing processes, which are then empirically examined in model testing (Kuehne, 2014; Kuehne et al., 2015). Aiming to bring both research strands together (an exploration of affective as well as multimodal framing), in this study, we examine how visuals contribute to the power of episodic versus thematic multimodal frames.

As suggested by framing research and theory (Dahmen, 2012; Geise and Baden, 2015), the multistep framing process is initiated by the reception of a certain news frame that stimulates the recipient’s attention. The person guides his/her eyes toward the multimodal media information (P1) and starts to process it. This attention shift is instantaneously accompanied by two interconnected cognitive appraisal steps (Lazarus, 1991). The first appraisal is a reflection of the personal relevance of the given situation as it is mirrored, among other measures, by the allocation of visual attention (Dahmen, 2012; Geise and Baden, 2015; Keib et al., 2018; Smith et al., 2020). While news frames, therefore, are predicted to cause a change in the allocation of visual attention, different types of frames are expected to drive visual attention differently (Smith et al., 2020). Very few studies, however, have examined the physical perception of news frames.

Dahmen (2012) found that divergent frames of press photographs lead to divergent sensory responses. For example, recipients viewed emotionalizing visual frames for longer than neutral, visual (or textual) ones. Applying eye tracking in the context of visual sports coverage, Smith et al. (2020) also found that different frame types (body-oriented vs. face-oriented) guided visual attention, measured by the time to first fixation as well as by the total observation duration, differently toward selected elements of depicted athletes. According to perception theory, this is also momentous for further information processing, because the intensity with which visual attention is drawn to a multimodal news frame influences the intensity with which recipients further engage with the news content in subsequent stages of processing (Dahmen, 2012).

Likewise, further frame processing is accompanied by a second cognitive appraisal, which is mirrored, among others, by responsibility attributions (P2). This pertains to how individuals assign “blame or credit and whether it is directed at oneself or another, coping potential, and future expectations” based on the perceived treatment (Lazarus, 1991, p. 827). Because the effects of news frames are often studied in terms of their effect on “beliefs and attributions” (Gross and Brewer, 2007, p. 122), some scholars have shown that news frames indeed can efficiently impact the attribution of responsibility. Iyengar (1991), for example, demonstrated that different frame types carried by the news led to correspondingly different responsibility attributions. While frames that focused on issues nurtured societal responsibility attributions, recipients were more likely to hold individuals accountable when they had seen narrative-episodic frames that feature single cases.

In many cases, such responsibility attributions are complemented by emotional reactions (Smith and Ellsworth, 1985). Emotions, therefore, were found not only to play an important role in news processing in general (Kim and Cameron, 2011; Nabi et al., 2020) but also in the perception and processing of multimodal news frames in particular. For instance, Brantner et al. (2011) revealed that divergent visual frame types (e.g., press photos of victims vs. politicians) led to divergent, affective evaluations of the identical article. Iyer et al. (2014) showed that frames can also trigger specific emotions, such as fear or anger. While, according to cognitive appraisal theory and framing research (Kuehne, 2014), cognitive appraisals lead to emotional responses, such as the activation of anger (P3), these emotional responses can then shape further cognitive judgments, such as support for political actions (P4).

Consistent with this idea, Scheufele and Gasteiger (2007) found that press photographs in multimodal news reinforced cognitive framing effects by eliciting emotions. A press photo of war-torn children, for example, reinforced cognitive framing effects toward increased policy support for military intervention. During these processes, a certain news frame can unfold a direct cognitive effect on individuals’ attitudes and behaviors; yet, it can also exhibit an effect on cognitive appraisals, which elicit emotions and then unfold an indirect (mediating) effect on attitudes and behaviors (Kuehne, 2014). These theoretical considerations can be illustrated by the following model of suspected multimodal framing effects (Figure 1).

Figure 1. Theoretical model of suspected framing effects.

Since these suggested effects, however, depend on the specific characteristics of the applied frame, storytelling episodic news frames should affect the underlying processing steps that drive multimodal framing effects differently than issue-focused thematic frames (Iyengar, 1991; Gross, 2008; compare Table 1). More precisely, because episodic frames display information in an illustrative, narrative, and event-oriented manner, often making use of attention-attracting “action-oriented images” (Iyengar, 1991, p. 7), they can be expected to be particularly attention-arousing.

Table 1. Peculiarities of episodic versus thematic framing.

Thematic frames, on the other hand, present information in an issue-based context as a general policy problem, which often materializes as a “background report” that requires deeper cognitive elaboration (Iyengar, 1991, p. 7) but is potentially not as likely to attract attention ‘at first sight.’ In this respect, Dahmen (2012) found that people devoted less observation time to photographs that encompass thematic political frames (e.g., showing a politician signing a piece of paper), while illustrative, emotionally laden news visuals (e.g., depicting protesters praying and holding anti–stem cell research signs) garnered a longer observation duration. In a follow-up study, Dahmen (2015) found that participants focused on narrative depictions for longer periods.

Regarding the differential framing effects within the first cognitive appraisal, we thus assume the following:

H1: Episodic multimodal frames attract more visual attention than thematic multimodal frames, which results in a longer observation duration than thematic frames.

As outlined above, in the course of the framing process (see Figure 1), the second cognitive appraisal follows. Episodic and thematic frames are expected to affect political evaluations differently, and research has shown this, especially for attributions of responsibility (see, e.g., Iyengar, 1991; Gross, 2008; Hart, 2011). The emphasis on individual cases, which is typical of episodic frames, favors responsibility attributions at the individual level (Iyengar, 1991). Thematic frames, on the other hand, which focus more on the (political) consequences of a certain problem, promote a social-societal attribution of responsibility. As a consequence, after receiving thematic frames, not the individual citizen but rather society and the political actors are considered responsible for dealing with the respective situation (Iyengar, 1991). Studying the effects of the media portrayal of obesity, Major (2009) accordingly found that thematic frames stimulate higher societal attributions of responsibility than episodic frames. Considering the second cognitive appraisal’s differential framing effects, we expect the following:

H2: Episodic multimodal frames lead to a higher level of individual responsibility attribution (H2a); thematic multimodal frames lead to a higher level of political responsibility attribution (H2b).

While responsibility attributions are closely linked to emotions such as anger, guilt, and satisfaction (Smith and Ellsworth, 1985), episodic and thematic frames are expected to also have differential effects on the arousal of emotions. Episodic frames, according to their special ‘storytelling’ nature, should elicit stronger emotional reactions than thematic frames (Gross, 2008). From a theoretical perspective, this can be explained by three mechanisms: (1) by the narrative rhetoric of an episodic presentation, which reduces reactance and counterarguing (Iyengar, 1991; Niederdeppe et al., 2011); (2) by the often-embedded human-interest details that foster a personal connection and put a ‘real face’ on the portrayal of the problem, toward which the receivers can direct their emotional reactions (Gross, 2008; Aarøe, 2011; Grabe et al., 2017); and (3) by the relatively ‘consumable’ nature of the content, which eases information processing and evaluation (Iyengar, 1991).

Correspondingly, Aarøe (2011) found that episodic frames elicited stronger emotional reactions than thematic frames. Gross (2008) also showed that citizens who perceived episodic frames expressed more intense emotional reactions (e.g., aversion or empathy) than participants who saw thematic frames. In recent research work, Ciuk and Rottman (2020) found that respondents who had been exposed to episodic frames reported greater emotional reactions (study one) as well as a greater sense of sadness (measured discretely in study two) compared to those exposed to the thematic frame. Regarding the arousal of emotion, we thus assume the following:

H3: Episodic multimodal frames elicit stronger emotional responses, such as anger, satisfaction, and guilt, than thematic multimodal frames.

Analyzing how covering climate change with an episodic or thematic frame differently affects one’s predispositions for individual behavior change and support for policies to address climate change, Hart (2011) showed that participants exposed to a thematic frame developed more support for policies than participants exposed to an episodic frame. Inversely, in the context of coverage of the Social Security repeal, Springer and Harwood (2015) found that the reception of episodic frames led to stronger endorsement of the policy decision than the reception of thematic frames did.

Not directly referring to episodic and thematic framing, McGlynn and McGlone (2018) examined how different types of framing regarding obesity influenced policy support. The authors found that a ‘human agency version,’ comparable to an episodic frame, prompted stronger support for upstream public policies designed to impede obesity (e.g., a snack tax and warning labels on junk food, or eliminating fast food concessions in public schools) than a thematic condition that required the further elaboration of its recipients. Aarøe (2011) accordingly suggested that episodic frames have a greater capacity to direct the effect of the receivers’ emotional reactions into support for the policy position argued by the frame. We thus assume:

H4: Episodic multimodal frames lead to higher policy support than thematic multimodal frames.

In line with our model assumptions (P1-P4), however, we do not expect that the effects (H1-H4) assumed here occur in isolation. Rather, in line with coherent findings from other authors (e.g., Iyer et al., 2014; Powell et al., 2015), we assume that multimodal frames shape the cognitive evaluation of how strongly recipients agree with a certain political decision (e.g., policy support); and we expect that this general pattern is engendered by a series of underlying process steps in which citizens’ visual attention, responsibility attributions, and emotions mediate the multimodal framing process.

H5: Episodic multimodal frames result in higher policy support than thematic multimodal frames. This effect is serially mediated, first, by the allocation of visual attention; second, by responsibility attributions; and third, by the activation of discrete emotions. Compared with a thematic frame, an episodic frame will lead to a higher allocation of visual attention; the higher allocation of visual attention will lead to a higher attribution of political responsibility; the higher attribution of political responsibility will lead to a stronger emotional reaction; and this emotional reaction will, in turn, increase policy support.
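
The serial mediation logic of H5 can be sketched numerically. The following Python snippet is only an illustration under stated assumptions, not the analysis reported in this study: the data are simulated, the path coefficients are arbitrary, and the helper names (`ols_coefs`, `serial_indirect_effect`, `simulate`) are hypothetical. It estimates the indirect effect of a binary frame condition on policy support through three mediators in sequence (visual attention, responsibility attribution, emotion) as the product of the four path coefficients obtained from ordinary least squares regressions.

```python
import numpy as np


def ols_coefs(y, predictors):
    """Least-squares coefficients of y on an intercept plus the given predictors."""
    design = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta  # beta[0] is the intercept


def serial_indirect_effect(x, m1, m2, m3, y):
    """Product-of-coefficients estimate for the path X -> M1 -> M2 -> M3 -> Y."""
    a1 = ols_coefs(m1, [x])[1]              # frame -> visual attention
    d21 = ols_coefs(m2, [x, m1])[2]         # attention -> responsibility attribution
    d32 = ols_coefs(m3, [x, m1, m2])[3]     # attribution -> emotion
    b3 = ols_coefs(y, [x, m1, m2, m3])[4]   # emotion -> policy support
    return a1 * d21 * d32 * b3


def simulate(n=5000, seed=0):
    """Simulated data with arbitrary, assumed path coefficients (not study data)."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=n).astype(float)   # 0 = thematic, 1 = episodic
    m1 = 1.0 * x + rng.normal(size=n)
    m2 = 0.8 * m1 + rng.normal(size=n)
    m3 = 0.7 * m2 + rng.normal(size=n)
    y = 0.6 * m3 + rng.normal(size=n)
    return x, m1, m2, m3, y


if __name__ == "__main__":
    effect = serial_indirect_effect(*simulate())
    print(f"estimated serial indirect effect: {effect:.3f}")
```

In the simulation the true indirect effect is the product of the assumed coefficients (1.0 × 0.8 × 0.7 × 0.6 = 0.336). In applied work, such a point estimate is usually accompanied by a bootstrap confidence interval, which is omitted here for brevity.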


To test our hypotheses, the research design was conceptualized along with the underlying framing steps in which citizens construct coherent meaning from complex multimodal frames, ranging from the first stimulus exposure to its deeper elaboration. Accordingly, we combined eye tracking measurements with a survey-embedded experiment to compare the effects of perceiving and processing episodic versus thematic multimodal news frames.

To this end, participants were randomly assigned to an episodic or a thematic multimodal framing condition featuring the news issue of animal welfare. Except for the measurement of emotional and cognitive reactions toward implemented stimuli (e.g., emotions such as anger or enjoyment as well as responsibility attributions based on framing conditions), measurements embedded in the questionnaire were taken before and after treatment exposure. We thus follow a pretest-posttest logic that allows us to draw causal inferences from treatment reception to the variables under investigation. While no control group was created, both groups in the applied between-subject design could be compared and thus control each other.


One hundred fifty-six participants were recruited from students at a midsized German university. Thirteen cases that did not fully cover the participant’s gaze (less than 90 percent of the eye tracking measures) were excluded from further analysis. After eliminating these cases, the final sample included 143 participants: 69 women, 72 men, and two neutral/other-gender individuals. The participants’ age ranged from 18 to 43 years (M = 23.76; SD = 3.84). Participants received small monetary compensation for their participation in the study. Manipulation checks suggested that the participants were attentive during treatment exposure.


Stimulus images and news texts that composed the multimodal frames under examination were selected from the current media coverage on animal rights and animal welfare. The issue was chosen as a recurring news topic with potential individual, social, and political consequences (Arpan et al., 2006; Buddle and Bray, 2019) but one that had received little media attention during the study period. While this, on the one hand, lessens the likelihood that intensive prior exposure would influence participants’ responses, we also considered animal welfare an important topic that allowed us to study the effects of appealing multimodal frames on the recipients’ attention, emotions, and cognitions (Evans, 2016).

Based on current news reports about animal welfare, the text was generated as a typical news article that carries a textual responsibility frame. Accordingly, the text presents the issue of animal welfare in a way that allows for the attribution of responsibility for causing and solving the problem to either individuals or the government (Valkenburg et al., 1999). To this end, the text first referred to meat consumption and the high economic importance of factory farming. Then, it explained that an increasing number of consumers are calling for action to support animal welfare and that consumer and animal welfare organizations are also pushing for stricter animal rights policies. This was followed by a short passage that addressed challenges to implementing animal welfare rules and programs before the last passage suggested options regarding a stronger commitment to animal welfare.

To construct the two multimodal frame conditions, the article text was combined with two different picture types. Within the episodic multimodal framing condition, the article text was accompanied by a press photograph depicting the issue of animal welfare in a storytelling, action-based manner that revealed concrete actions that support animal welfare. Within the thematic multimodal framing condition, the article text was complemented by an issue-focused depiction of farm animals in their actual living environments, which referred to questions about animal welfare.

The pictures used for the study were selected in a two-step process. First, we systematically searched contemporary media coverage for images that portrayed animal welfare/animal rights, from a more issue-oriented to a more action-based, storytelling angle. We then conducted a qualitative pre-study in which 75 students performed picture sorting tasks, categorizing the images and estimating their visual qualities, thematic focus, and emotional tone (i.e., positive vs. negative valence). Over the course of 90 to 110 min, the participants processed the image sample in various sorting and evaluation tasks, including sorting the motifs freely according to picture categories and picture types. Based on these pretest perceptions, the images could be categorized into two overarching “motif types”: a group of images that portray animal welfare/animal rights by focusing on the issue and its consequences, and a group that portrays animal welfare/animal rights through the lens of related actions in an illustrative, narrative and event-oriented manner, which largely coincides with the fundamental distinction between thematic and episodic frames.

Corresponding in their design to typical multimodal news articles, the final treatments contained an article’s text, a headline, a press photograph, a caption, the publication date, section information, and the author’s name (see Appendix Figures A3, A4). Further references to a news source were not provided; this was to avoid delivering extraneous cues.


Participants were informed about the study and its methodology, without revealing details regarding its purpose; they received a consent form and were randomly assigned to one of the stimulus conditions. According to the pretest-posttest logic, the study then started with the pre-survey, which captured sociodemographic variables (e.g., gender and age) and measures that, when pooled with the post-survey results, allow for the construction of the dependent variable (e.g., change in policy support).

After completing the pre-survey, participants were placed behind a desk with a monitor and a discreetly mounted “Tobii Pro” eye tracking system. To ensure measurement accuracy, each participant underwent an individual calibration using a standard 9-point calibration image; the eye tracking system adjusted to his/her physiognomy. After successful calibration, participants viewed the displayed news articles as they would in their daily routines. Based on experiences from earlier eye tracking studies conducted in a laboratory (see, e.g., Kruikemeier et al., 2018), we aimed to create a typical, daily news consumption scenario in which participants were randomly confronted with four multimodal news articles about three regular news issues (i.e., economy/wealth distribution, education, and homeland security) and one that featured the issue of animal welfare.1

During the news exposure, which lasted about 10 min per participant, eye movements were seamlessly recorded with a sampling rate of 120 Hz to capture the participants’ visual attention, guided by the different elements within the multimodal news frames. After perceiving the multimodal frames, participants performed a post-survey in which we captured their emotional and cognitive reactions toward the implemented treatments and measured possible changes that could be causally attributed to the observed multimodal frames (e.g., responsibility attributions and policy support).
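
The pretest-posttest logic described above can be illustrated with a small Python sketch. It is a hedged example only: the ratings are hypothetical, the function name `mean_change_by_condition` is an assumption introduced for illustration, and the snippet does not reproduce the statistical tests used in this study. It computes, per framing condition, the mean change score (post-survey rating minus pre-survey rating).

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# One record per participant: (condition, pre-survey rating, post-survey rating).
Record = Tuple[str, float, float]


def mean_change_by_condition(records: Iterable[Record]) -> Dict[str, float]:
    """Average post-minus-pre change score for each framing condition."""
    changes: Dict[str, List[float]] = {}
    for condition, pre, post in records:
        changes.setdefault(condition, []).append(post - pre)
    return {condition: mean(values) for condition, values in changes.items()}


if __name__ == "__main__":
    # Hypothetical 5-point policy support ratings, not data from this study.
    example = [
        ("episodic", 3, 4),
        ("episodic", 2, 4),
        ("thematic", 3, 3),
        ("thematic", 4, 5),
    ]
    print(mean_change_by_condition(example))
```

A positive mean change indicates that support increased after exposure; comparing the two conditions' means corresponds to the between-subject comparison in which both groups serve as each other's reference.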


Visual attention was captured through eye tracking, which measured the observation duration through foveal fixations on the treatment in seconds (rescaled from milliseconds). According to eye tracking research (see, e.g., Bucher and Schumacher, 2006), the observation duration or dwell time can be considered an established indicator of visual attention that is guided to media information. Moreover, considering the recipients’ allocation of visual attention as “a mental action tendency” (Scherer and Moors, 2019, pp. 724–725), the operationalization of visual attention as an observational indicator of relevance attribution regarding the first appraisal corresponds to the theoretical considerations of cognitive appraisal theory (CAT; Brosch et al., 2013).

To separate fixations from saccades, we applied the standard fixation algorithm (Tobii I-VT filter): The minimum duration required for a fixation to be registered as a data point was 60 milliseconds, which captured shorter and longer fixations, as are common during reading (Radach et al., 2008). The velocity threshold was set at 30 degrees per second, which is sufficient for recordings with various levels of noise (Chen et al., 2008).
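The velocity-threshold idea behind this classification can be illustrated with a minimal sketch. It is not the Tobii I-VT implementation, which includes further steps such as gap fill-in, noise reduction, and merging of adjacent fixations. Only the three parameter values (120 Hz, 30 degrees per second, 60 ms) come from the text; the function name `classify_fixations` and the input format (one gaze position per sample, in degrees of visual angle) are assumptions made for illustration.

```python
import math

SAMPLING_RATE_HZ = 120            # sampling rate reported in the text
VELOCITY_THRESHOLD_DEG_S = 30.0   # velocity threshold reported in the text
MIN_FIXATION_MS = 60.0            # minimum fixation duration reported in the text


def classify_fixations(gaze_deg, rate_hz=SAMPLING_RATE_HZ,
                       vel_thresh=VELOCITY_THRESHOLD_DEG_S,
                       min_dur_ms=MIN_FIXATION_MS):
    """Return fixations as (start_index, end_index, duration_ms) tuples.

    gaze_deg is a hypothetical input: a list of (x, y) gaze positions in
    degrees of visual angle, one entry per recorded sample.
    """
    dt = 1.0 / rate_hz
    # Sample-to-sample angular velocity; the first sample has no predecessor
    # and is treated as slow.
    slow = [True]
    for (x0, y0), (x1, y1) in zip(gaze_deg, gaze_deg[1:]):
        velocity = math.hypot(x1 - x0, y1 - y0) / dt
        slow.append(velocity < vel_thresh)

    fixations = []
    start = None
    # The trailing False acts as a sentinel that closes a run at the end.
    for i, is_slow in enumerate(slow + [False]):
        if is_slow and start is None:
            start = i
        elif not is_slow and start is not None:
            duration_ms = (i - start) * dt * 1000.0
            if duration_ms >= min_dur_ms:  # discard runs that are too short
                fixations.append((start, i - 1, duration_ms))
            start = None
    return fixations
```

In this sketch, a run of consecutive slow samples counts as a fixation only if it spans at least 60 ms, which at 120 Hz corresponds to eight or more samples.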

To extract the data, we created ‘areas of interest’ (AOIs) for each news frame, which covered its typical elements, such as the headline and article text. This allowed us to determine how long the participants guided their visual attention to specific parts of the multimodal news frame. The observation duration ranged from 3.134 to 45.040 s (M = 18.364, SD = 7.531) for the multimodal frame, from 0.101 to 24.677 s for the embedded news images (M = 4.905, SD = 3.838), and from 0.575 to 42.705 s (M = 13.459, SD = 6.046) for the accompanying textual elements.
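How fixation durations might be aggregated into AOI dwell times can be sketched as follows. The AOI rectangles, the fixation tuple format, and the function names are hypothetical and serve only to illustrate the principle; the actual AOIs were defined per news frame in the eye tracking software.

```python
from collections import defaultdict

# Hypothetical AOI rectangles in screen pixels: (left, top, right, bottom).
AOIS = {
    "image": (100, 100, 700, 500),
    "text": (100, 520, 700, 1000),
}


def aoi_of(x, y, aois=AOIS):
    """Return the name of the first AOI that contains the point, else None."""
    for name, (left, top, right, bottom) in aois.items():
        if left <= x <= right and top <= y <= bottom:
            return name
    return None


def dwell_times_s(fixations, aois=AOIS):
    """Sum fixation durations per AOI and for the frame as a whole.

    fixations is a hypothetical input: an iterable of (x, y, duration_ms).
    Durations are rescaled from milliseconds to seconds.
    """
    totals = defaultdict(float)
    for x, y, duration_ms in fixations:
        name = aoi_of(x, y, aois)
        if name is not None:  # fixations outside every AOI are ignored
            totals[name] += duration_ms / 1000.0
    result = {name: totals[name] for name in aois}
    result["frame"] = sum(result.values())  # frame total = image + text
    return result
```

Treating the frame total as the sum of its image and text AOIs is consistent with the reported means, since 4.905 s for the image and 13.459 s for the text add up to 18.364 s for the complete frame.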

Emotional reactions were measured directly after treatment exposure, applying a modified version of the standardized differential affect scale to capture emotions during media use (Renaud and Unz, 2006). Each emotion (e.g., anger, joy, affection, pleasure, and satisfaction) was measured separately after the multimodal frame by using a statement such as “I felt anger,” to which the participants could respond via a 5-point Likert scale, with five indicating the strongest agreement to the statement (for descriptive statistics, see Appendix Table A1).

Responsibility attribution was measured after exposure to the stimulus. In line with conceptions successfully applied in framing research (Valkenburg et al., 1999), we tested the different attributions of responsibility by requiring participants to rate on a 5-point Likert scale the extent to which they considered (a) the individual (“Each and every one of us does not care enough about the protection of animals and species”), (b) the economy (“The food industry does not care enough about the protection of animals and species”), (c) NGOs (“Non-profit organizations and protest movements do not care enough about the protection of animals and species”), and (d) politics (“Politics does not care enough about the protection of animals and species”) to be responsible. We focused on the attribution of responsibility as the second appraisal because according to CAT, the attribution of responsibility is central to eliciting specific emotions (Smith and Ellsworth, 1985; Lazarus, 1991). Furthermore, previous studies have already shown a connection between thematic and episodic framing and attribution of responsibility (Aarøe, 2011).

Policy support was measured after treatment exposure. We asked participants for different recommendations regarding their personal (“Each and every one of us should individually do more for animal and species protection so that something finally changes”), economic (“The food industry should finally realize that cruelty to animals is undignified so that something finally changes”) and political (“Politicians should enact stricter laws so that something finally changes”) actions. Similar to Gross’s (2008) measurement of policy views, participants were asked to rate each statement on a 5-point scale ranging from 1 (“I do not support the statement”) to 5 (“I fully support the statement”).

Manipulation checks

To check whether the framing manipulations worked, two questions assessed the extent to which recipients engaged with the multimodal news frames during stimulus exposure. First, we measured the perceived importance of animal welfare before and after the treatment exposure; second, we measured the participants’ perceived informedness regarding the issue of animal welfare after they had received the multimodal news articles. Regarding the former, a paired t-test revealed that the multimodal frames fostered the participants’ perception of the issue’s importance [t(142) = 0.47; p < 0.001]. Regarding the latter, reception of the multimodal frames also strengthened the participants’ feelings of being well-informed about animal welfare [t(142) = 5.23; p < 0.001]. These results indicate that our frame manipulations were effective and that the recipients were attentive to the inserted frames during treatment exposure.

To further ensure that any differences in our dependent variables were due to the different frame types, and not to other underlying factors, participants evaluated the frames’ specific characteristics based on the concept of “photo news factors” (e.g., Rössler et al., 2011). Participants perceived the multimodal episodic news frames to be as relevant [t(141) = 0.17, p = 0.86], negative [t(133) = 0.34, p = 0.73], and salient [t(141) = −1.10, p = 0.73] as the thematic news frames (see Appendix Figure A5). In both multimodal frame conditions, participants rated the embedded news visuals as having the same potential for appealing to their recipients on an emotional level [t(141) = −1.09, p = 0.277]. Participants also regarded the embedded news visuals in both framing conditions as typical press photographs [t(133) = −1.77, p = 0.078].

Data analysis and results

In this study, we examined the effects of perceiving multimodal frames that involve an article’s text and a congruent (episodic versus thematic) news image. By capturing the recipients’ observation of the news frames via eye tracking, we scrutinized the extent to which the different frame areas (text vs. image) caught the recipients’ visual attention and how they further influenced the subsequent frame processing, which can shape the recipients’ emotional reactions and cognitive evaluations.

Our data reveal that recipients devoted longer observation time to the textual frame content (M = 13.46, SD = 6.05) than to the visual element (M = 4.91, SD = 3.84) of the multimodal frame (i.e., total fixation duration on a complete news frame: M = 18.36, SD = 7.53).2 However, it was the visual elements that first attracted the recipient’s attention (photographs: M = 0.18, SD = 0.09; text: M = 0.20, SD = 0.12). Because the visual and textual components of a multimodal news frame are thus perceived differently, we tested whether they also contribute differently to framing effects on policy support (i.e., our dependent variable). A regression model across the different frame types (see Appendix Table A2) revealed that only the observation duration of the news visual embedded in the multimodal frame (not the textual frame part) had a significant influence on policy support (B = 0.06, p = 0.009). The finding that images can be particularly effective in the framing process is consistent with recent research on visual (see, e.g., Coleman, 2010; Dahmen, 2012) and multimodal framing (see, e.g., Powell et al., 2015). Based on these findings, we can assume for the analysis of the multistep framing process in the following section that a sizable part of the observed effects can be attributed to the visual components of the multimodal frames.

In the subsequent step, we examined whether the different frame types showed group differences regarding the variables of interest. We conducted t-tests for independent samples and robust Welch tests (when variance homogeneity was violated). With H1, we expected that episodic multimodal frames would attract higher visual attention, as measured via observation duration, than thematic ones. Our analysis supports this hypothesis. The Welch test showed that episodic multimodal frames are viewed for a longer dwell time: the observation time was 2.97 s longer for the episodic news frame (M = 6.36, SD = 4.27) than for the thematic frame condition (M = 3.39, SD = 2.59) [95% - CI(−4.13, −1.81)]. As this difference is statistically significant [t(120) = −5.05, p < 0.001, d = −0.84], H1 is supported by our data.
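For readers who want to retrace how a Welch test and Cohen's d follow from summary statistics, a minimal sketch is given here. The means and standard deviations are the values reported for H1. The group sizes (72 and 71) are an assumption, since they are not reported in this section, so the output only approximates the reported t, df, and d. The function name `welch_from_summary` is hypothetical.

```python
import math


def welch_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Return Welch's t, the Welch-Satterthwaite df, and Cohen's d.

    Cohen's d uses the pooled standard deviation of both groups.
    """
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2   # squared standard errors
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    pooled_sd = math.sqrt(
        ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    )
    d = (m1 - m2) / pooled_sd
    return t, df, d


# Thematic minus episodic, matching the sign convention of the reported values.
# The group sizes 72 and 71 are assumed, not taken from the article.
t, df, d = welch_from_summary(3.39, 2.59, 72, 6.36, 4.27, 71)
print(f"t = {t:.2f}, df = {df:.1f}, d = {d:.2f}")
```

The Welch variant is chosen over the pooled-variance t-test because the two standard deviations differ markedly (2.59 versus 4.27), which is the situation in which variance homogeneity is typically violated.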

Regarding the second processing step, we assumed that episodic and thematic frames affect further cognitive processing of information differently, thus leading to different attributions of responsibility. More precisely, with H2a, we expected that multimodal frames with an episodic appeal would lead to a higher level of individual responsibility attribution, while issue-based thematic frames should result in a higher level of political responsibility attributions.

Testing individual responsibility attribution using a t-test, we found a statistically significant difference between the thematic- and episodic-framed conditions. The mean for the individual responsibility attribution was around half a scale point [95% - CI(−0.76, −0.13)] lower in the thematic frame condition [t(141) = −2.81, p = 0.006, d = −0.47]. A similar pattern was observed for political responsibility attribution. The mean for the political responsibility attribution was around half a scale point [95% - CI(−0.77, −0.13)] lower in the thematic frame condition [t(141) = −2.78, p = 0.006, d = −0.46]. This only partially supports H2, as the episodic frame results in higher responsibility attributions on the individual (H2a) but also on the societal level, which we, in line with previous studies (Iyengar, 1991), expected to be the outcome of the thematic framing condition only. H2b, therefore, is not supported by our data.

In line with this finding, the reception of episodic news frames (but not thematic frames) promoting the topic of animal welfare increased the level of policy support concerning animal welfare (H4). Again applying a t-test, we found a statistically significant difference regarding policy support between both frame conditions: The mean value for policy support concerning animal welfare was around half a scale point [95% - CI(−0.77, −0.06)] lower in the thematic frame group [t(141) = −2.32, p = 0.02, d = −0.39]. However, applying a Welch t-test, our analysis interestingly revealed that the thematic frame caused its recipients to have stronger emotional reactions. The participants reported higher emotional responses in terms of anger, joy, affection, pleasure, and satisfaction (see Figure 2 and Appendix Table A1) after seeing thematically framed news articles than episodically framed ones. For example, the mean value for the activation of anger was around half a scale point [95% - CI (0.03, 0.93)] higher in the thematic frame condition. H3, therefore, has to be rejected.

Figure 2

Figure 2. Mean differences in emotions between the thematic multimodal and episodic multimodal frames. The differences for anger (p = 0.039), satisfaction (p < 0.001), joy (p = 0.007), affection (p = 0.002), and pleasure (p = 0.034) are significant. Means and standard errors are plotted. Note that the y-axis does not reflect the full range of the scale (5-point Likert scale).

Complementing the picture, we further scrutinized the suspected interplay of subsequent framing effects by applying an ordinary least squares path analysis. Understanding multimodal framing as a multistep process, we tested the hypotheses (compare H1 to H4) not independently but in their interaction within the multistep framing process (compare H5) by using a serial mediation model, in which we also controlled for age and gender (PROCESS model 6; Hayes, 2018). This allowed us to examine whether the frame type predicts the recipients’ support for animal welfare policy and whether this effect is mediated by, first, the total fixation duration; second, the responsibility attribution; and third, the activation of discrete emotions (such as anger),3 as was expected with H5. For the mediation model, the frame conditions were included as the two levels of the independent variable (0 = thematic frame, 1 = episodic frame). Ninety-five percent bias-corrected bootstrap confidence intervals, based on 10,000 bootstrap samples, were used for the statistical inference of indirect effects (Figure 3).
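The logic of bootstrapping a serial indirect effect can be sketched in a few lines. This is a deliberately simplified illustration and not the PROCESS macro: it uses two mediators instead of three, omits the covariates age and gender, and computes a plain percentile interval rather than a bias-corrected one. The data are synthetic and all function names are hypothetical.

```python
import random


def ols(rows, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(xi[i] * xi[j] for xi in X) for j in range(k)]
         + [sum(xi[i] * yi for xi, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting (assumes X'X is invertible).
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def serial_indirect(data):
    """Product a1 * d21 * b2 for the chain X -> M1 -> M2 -> Y."""
    x = [r[0] for r in data]
    m1 = [r[1] for r in data]
    m2 = [r[2] for r in data]
    y = [r[3] for r in data]
    a1 = ols([[xi] for xi in x], m1)[1]          # X -> M1
    d21 = ols(list(zip(x, m1)), m2)[2]           # M1 -> M2, controlling for X
    b2 = ols(list(zip(x, m1, m2)), y)[3]         # M2 -> Y, controlling for X, M1
    return a1 * d21 * b2


def percentile_bootstrap_ci(data, n_boot=1000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for the serial indirect effect."""
    rng = random.Random(seed)
    n = len(data)
    estimates = sorted(
        serial_indirect([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = estimates[round(alpha / 2 * n_boot)]
    hi = estimates[round((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def simulate(n=143, seed=0):
    """Synthetic data with illustrative coefficients, not the study's data."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.randint(0, 1)                              # 0 = thematic, 1 = episodic
        m1 = 3.4 + 2.9 * x + rng.gauss(0, 3.5)             # visual attention (s)
        m2 = 3.0 + 0.3 * x + 0.05 * m1 + rng.gauss(0, 0.9)  # responsibility
        y = 1.0 + 0.2 * x + 0.6 * m2 + rng.gauss(0, 0.7)   # policy support
        data.append((x, m1, m2, y))
    return data
```

Calling `percentile_bootstrap_ci(simulate())` returns the lower and upper bound of the interval; an interval that excludes zero is read as evidence of mediation along that specific path. Each bootstrap sample resamples whole participants with replacement, so the dependence between mediators and outcome within a participant is preserved.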

Figure 3

Figure 3. Mediation modelᵃ showing the direct effect of the frame type and the indirect effects of the fixation duration, responsibility attribution, and anger on policy support. We dummy coded the frame variable (0 = thematic, 1 = episodic); therefore, positive values indicate the influence of the episodic frame, while negative values indicate the influence of the thematic frame. Ninety-five percent bias-corrected bootstrap confidence intervals, based on 10,000 bootstrap samples, are shown for indirect effects. Solid lines indicate significant effects. Significance codes: *p < 0.05, **p < 0.01, ***p < 0.001. ᵃAccording to Hayes (2018), linearity, normal distribution of the residuals, homoscedasticity, independence of the measurements, and temporal precedence have to be tested as prerequisites. We graphically checked linearity, which can generally be assumed for the tested variables. The normal distribution of the residuals and homoscedasticity did not need to be tested separately, since we used a robust procedure (bootstrapping). The independence of the individual data points was ensured, since the study participants took part in the survey and the experiment independently of each other. Temporal precedence cannot be fully established with cross-sectional data. Based on the theoretical assumptions of CAT, we nevertheless assume temporal precedence here.

Our analysis shows the effect of the episodic multimodal news frame, as opposed to the thematic news frame, on the recipients’ support for animal welfare policy [c = 0.47, t(139) = 2.82, p = 0.006, R² = 0.20]. After entering the mediators into the model, the frame type predicted the allocation of visual attention [b = 2.90, t(139) = 4.83, p < 0.001, R² = 0.16]. In the subsequent processing of the perceived frame, political responsibility attribution (R² = 0.20) was predicted by the type of frame [b = 0.36, t(138) = 2.19, p = 0.03] and the allocation of visual attention [b = 0.05, t(138) = 2.17, p = 0.03]. In the next step of the framing process, the recipients’ emotional response (R² = 0.23) was predicted by the frame type [b = −0.74, t(137) = −3.21, p = 0.002] and the political responsibility attribution [b = 0.52, t(137) = 4.46, p = 0.001]. Policy support was then significantly predicted by the political responsibility attribution [b = 0.63, t(136) = 8.12, p < 0.001, R² = 0.52].

Moreover, after entering the mediators into the model, the direct effect of frame type on policy support was no longer significant [c = 0.16, t(136) = 1.11, p = 0.27]. The indirect path, therefore, showed that the effect of different frame types on policy support was mediated by the allocation of visual attention and political responsibility attribution [indirect = 0.09, SE = 0.04, 95% - CI (0.02, 0.18)]. As the differential framing effect was, however, not mediated by the activation of the recipients’ discrete emotions, our results only partially support H5. Nonetheless, our results support the idea of a complex interplay of subsequent stages of information perception and processing within a multimodal framing process, which then results in corresponding cognitive evaluations.
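As a plausibility check, the reported indirect effect can be approximated from the rounded path coefficients given in the text. The short calculation is only a sketch: because the inputs are rounded to two decimals, the product can deviate slightly from the value estimated on the raw data.

```python
# Rounded path coefficients as reported in the text.
a1 = 2.90    # frame type -> allocation of visual attention
d21 = 0.05   # visual attention -> political responsibility attribution
b2 = 0.63    # political responsibility attribution -> policy support

# A specific serial indirect effect is the product of the paths along the chain.
indirect = a1 * d21 * b2
print(round(indirect, 2))
```

The product of 2.90, 0.05, and 0.63 is 0.09135, which rounds to the reported indirect effect of 0.09.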


This study examined how the perception of multimodal news frames that carry a textual responsibility frame accompanied by a corresponding episodic or thematic news image shaped recipients’ cognitive evaluations of news content. Understanding framing theory as a general framework for analyzing subsequent stages of information perception and processing, we also scrutinized how the perception of an episodic versus thematic news image coalesces with the perception of a news text and how the two modality-distinct frame elements differentially capture visual attention, elicit emotional and cognitive responses, and, finally, influence policy support.

Utilizing eye tracking to capture the perception of multimodal frames on animal welfare, our results show that episodic multimodal frames are observed longer than thematic ones and thus attract higher visual attention. We further found that episodic multimodal frames resulted in a higher level of individual responsibility attributions, but also, challenging the common assumption (e.g., Iyengar, 1991), of societal responsibility attributions. Also, our findings demonstrate that the reception of episodic (but not thematic) multimodal frames that promoted the topic of animal welfare increased the level of policy support concerning animal welfare. We expected this general effect to be the result of a complex interplay between the allocation of visual attention and responsibility attributions and the activation of discrete emotions incited by the episodic frame condition.

While our analysis supports this idea in general, and particularly regarding the joint effect of visual attention and responsibility attributions on policy support, emotions did play a role but, interestingly, had no direct effect on policy support. Our results, therefore, indicate that assuming a complex interplay of subsequent stages in the process of perceiving and processing multimodal frames is a relevant and enlightening angle that should be further researched. This holds particularly true, as our results also suggest that the observed framing effects are largely attributable to the multimodal frames’ visual elements. While the finding that images can be particularly effective in the framing process is consistent with recent research on visual (e.g., Ben-Porath and Shaker, 2010; Coleman, 2010; Dahmen, 2012) and multimodal framing (e.g., Powell et al., 2015, 2019), future studies should further investigate the conditions under which visual frame components can outperform the influence of textual framing devices, which is well established by framing research focused on textual content.

This seems particularly insightful, as research that examines the effects of multimodal news frames in which press photographs and article texts are combined suggests a strong interaction of both modalities in which the effects of visuals possibly amplify the textual framing effects (Geise and Baden, 2015). Powell et al. (2015), for example, found nonsignificant effects in one text-alone condition but significant effects in an image-alone condition. However, examining the multimodal condition in the same study implied that the inclusion of an attention-grabbing image increased attention to the accompanying text as well, and its structure in turn guided participants’ interpretation and support for policy intervention (Powell et al., 2015).

While we arrived at similar results when analyzing multimodal framing effects in the area of animal welfare, further studies should examine the extent to which the effects regarding the underlying processing stages in the framing process depend on the different issue contexts of the frames examined. Although an additional analysis in our study of the topics of animal welfare, economy/wealth distribution, education, and homeland security revealed structurally identical effect relations (see Appendix Figure A1), further studies should take a closer look at the presumed interaction of process steps in the multimodal framing process and its dependence on context factors that pertain to divergent issues under examination.

At first sight, our finding that episodic multimodal frames resulted in a higher level of responsibility attributions on both the individual and societal levels seems challenging. However, more recent research suggests that the two effects should not necessarily be viewed as contradictory: if perceptions of political responsibility are influenced by framing, this does not necessarily mean that an opposite effect occurs for attributions of individual responsibility, or vice versa (Ben-Porath and Shaker, 2010; Boukes, 2021). In line with this idea, Boukes (2021) argued in a recent study that both framing types could, in principle, affect the attribution of individual responsibility and trigger political responsibility attributions, particularly when compared with a situation involving no exposure to prevailing frames about this topic. Contributing to a better understanding of episodic versus thematic framing effects, our results support this idea, showing that episodic frames can foster individual and political responsibility attributions simultaneously. This finding seems to be in line with Boukes’s (2021) conclusion that episodic framing may act as a counterbalance to the increasingly dominant neoliberal discourse of elite policymakers who blame individual citizens rather than admit failures in the political system (Guetzkow, 2010), whereas thematic framing itself does not automatically affect attributions of individual (or societal) responsibility.

Our finding that thematic, but not episodic, multimodal frames elicited stronger emotional responses in our recipients indicates that stronger emotional reactions, which then promote further cognitive evaluations, are not necessarily the result of experiencing episodic frames. This corresponds with prior research on episodic and thematic framing. Building on Iyengar’s (1991) seminal study, most researchers have assumed that episodic frames cause stronger effects on the recipients’ emotions (e.g., Gross, 2008; Aarøe, 2011). Yet, empirical findings have been contradictory (Kraemer and Peter, 2020), and some scholars have also shown that, under certain conditions, issue-based news frames can lead to strong emotional responses (see, e.g., Scheufele and Gasteiger, 2007; Kim and Cameron, 2011; Kuehne and Schemer, 2015). Here, it should also be acknowledged that most previous studies that compare the effects of episodic and thematic frames center on the analysis of textual news coverage, while there are hardly any studies devoted to the comparative analysis of different visual or multimodal frame types. Considering the latter, our findings align with Powell et al. (2015) and Scheufele and Gasteiger (2007), who demonstrated that multimodal framing effects and the mediating role of emotions therein seem to be dependent on the framing device, its modality, and the dependent variables under examination. In line with this idea, Ciuk and Rottman (2020) showed that the perceived salience of a topic can impact the effect potentials of different frame types.

Further taking into account that multimodal and visual frames are perceived as particularly salient (Powell et al., 2019), future studies should include measures of salience in their models to differentiate our knowledge of multimodal frame processing. Further advances should also integrate the findings of Boukes (2021) and Weikmann and Powell (2019), which suggest that audience characteristics (e.g., ideological orientation and personal characteristics) and cultural contexts are additional factors for explaining why framing effects might run in expected or contradictory directions and may even play a larger role in the processing of multimodal media information.

Nonetheless, by differentiating the ‘multimodal picture,’ we can expand our knowledge of the power of frames by clarifying the underlying psychological processes that involve both sensory data on visual attention and affective responses. As we have shown, the inclusion of eye tracking data and the incorporation of emotions both seem promising for understanding how multimodal frames are processed. More specifically, our findings extend the understanding of frame strength by demonstrating that the power of a news frame to ‘catch’ a recipient is not exclusively shaped by its effectiveness in changing the importance of cognitive evaluations; it is also shaped by the frame’s capacity to attract attention and direct cognitive appraisal steps into support for the policy position framed by the multimodal news article.


Although this study benefited from its innovative methodological design, which integrated eye tracking to examine visual perception and its impact on the processing of multimodal frames, this is also the cause of its main limitation. We applied a one-time laboratory experiment that focused on the short-term effects of episodic and thematic framing and thus was able to assign the observed effects to our treatment conditions of episodic versus thematic multimodal framing. Yet, advanced framing researchers have emphasized that framing effects are of an “accumulative” nature (Lecheler et al., 2015b, p. 339), unfold over time in a dynamic process (Aarøe and Petersen, 2018; Boukes, 2021), and often under conditions of frame competition (Sniderman and Theriault, 2004). Accounting for the resulting contexts and conditions under which cumulative multimodal framing effects can occur seems an important next step for further research.

A second limitation, and an important strand for further research, is connected with our decision to use the example of multimodal news articles that cover the issue of animal welfare. We opted for animal welfare because it is an issue with potential social consequences (Arpan et al., 2006), which seems particularly suited for studying the impact that appealing multimodal frames have on recipients’ attention, emotions, and cognitions (Evans, 2016). From a theoretical standpoint, our results refer to multimodal framing as a general multistep process that features a complex interplay of the subsequent stages of information perception and processing and results in corresponding cognitive evaluations; we therefore do not expect our results to be limited to certain issues. However, analyzing multimodal frame processes in different issue contexts seems to be an important next step that needs to be achieved through further investigation.

Another potential limitation is related to the selection of stimuli used in the study, namely the focus on animal welfare protests rather than distressed animals, which may elicit even stronger emotional responses from participants and be more directly related to the issue of animal welfare (e.g., Fernández, 2019). While our findings may not fully capture the nuanced emotional dynamics that may arise in situations involving images of animals in distress, future studies could further investigate how such multimodal news frames (differently) shape recipients’ visual attention, attributions of responsibility, emotions, and political support, potentially leading to stronger framing effects.

Altogether, this study uses a complex empirical design and provides new insights into the process of multimodal framing. Thus, this study fills a crucial gap in the literature by illustrating how the distinctive reception of multimodal frames can lead to responsibility attribution and policy support.

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement

The studies involving humans were approved by the Ethics Committee of the Institute for Communication Science, University of Muenster. The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study. All participants were debriefed appropriately at the end of the study.

Author contributions

SG: Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing, Formal analysis. KM: Data curation, Formal analysis, Methodology, Visualization, Writing – original draft, Writing – review & editing.


The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. The project was made possible by a research grant from the Friede-Springer Foundation. The Foundation had no influence on the research design, the conduct of the study or the interpretation of the findings. There were no conflicts of interest.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary material for this article can be found online at:


1. ^We focused on the subject of animal welfare in order to illustrate the presumed effect process. However, a model across all topics shows similar results (see Appendix Figure A1). Due to the structural similarity of the said model, we assume that the described effect path is of relevance for more than just the topic of animal welfare.

2. ^In terms of the multimodal frames’ observation duration, a t-test showed no significant group differences between the episodic and thematic multimodal frames [t(141) = −1.88, p = 0.06]. While not significant, the descriptive statistics reveal that the observation duration was lower for the thematic frames (M = 17.17, SD = 6.45) than for the episodic multimodal frames (M = 19.51, SD = 8.32).

3. ^We tested for the discrete emotion of anger, since this emotion occurs universally (i.e., regardless of cultural background) (Lazarus, 1991), is considered central to the attribution of responsibility (Smith and Ellsworth, 1985), and is ascribed great mobilizing potential (Casas and Williams, 2019; Clifford, 2019). However, when testing other discrete emotions (e.g., joy), we found comparable effects (see Appendix Figure A2).


