Adaptive Playback Control: A Framework for Cinematic VR Creators to Embrace Viewer Interaction

Content creators have been trying to produce engaging and enjoyable Cinematic Virtual Reality (CVR) experiences using immersive media such as 360-degree videos. However, a complete and flexible framework, like the filmmaking grammar toolbox for film directors, is missing for creators working on CVR, especially those working on CVR storytelling with viewer interactions. Researchers and creators widely acknowledge that a viewer-centered story design and a viewer's intention to interact are two intrinsic characteristics of CVR storytelling. In this paper, we stand on that common ground and propose Adaptive Playback Control (APC) as a set of guidelines to assist content creators in making design decisions about the story structure and viewer interaction implementation during production. Instead of looking at everything CVR covers, we set constraints to focus only on cultural heritage oriented content using a guided-tour style. We further choose two vital elements for interactive CVR as the main topics at this stage: the narrative progression (director vs. viewer control) and the visibility of viewer interaction (implicit vs. explicit). We conducted a user study to evaluate four variants by combining these two elements, and measured the levels of engagement, enjoyment, usability, and memory performance. One of our findings is that there were no significant differences in the objective results. Combining objective data with observations of the participants' behavior, we provide guidelines as a starting point for the application of the APC framework. Creators need to choose whether the viewer will have control over narrative progression, and how visible the interaction will be, based on whether the purpose of a piece is to invoke emotional resonance or promote efficient transfer of knowledge. Also, creators need to consider the viewer's natural tendency to explore and provide extra incentives to invoke exploratory behaviors in viewers when adding interactive elements.
We recommend more viewer control for projects aiming at viewer's participation and agency, but more director control for projects focusing on education and training. Explicit (vs. implicit) control will also yield higher levels of engagement and enjoyment if the viewer's uncertainty of interaction consequences can be relieved.


INTRODUCTION
As 360-degree videos become widely popular, content creators are now trying to produce engaging narratives using this immersive medium. They aim for content with a complete narrative arc that lets viewers feel immersed in the story world, rather than short and simple footage for brief excitement. We use the prefix "cinematic" to define those narrative-based VR experiences, including story-based dramas, documentaries, or hybrid productions, that feature a beginning, middle, and end. The term Cinematic Virtual Reality (CVR) then emerged and is defined as a type of experience where the viewer watches cinematic content as omnidirectional movies, using a head-mounted display (HMD) or other Virtual Reality (VR) devices (Mateer, 2017). The physical viewing experience may vary from simple 360-degree videos, where the viewer only has the freedom to look around, to a complex computer-generated experience that allows the viewer to walk around, interact with objects and characters in the scene, and even alter the narrative progression (Dooley, 2017). However, as we experienced a series of CVR works, we noticed the lack of a standard production framework and discovered that the ways of constructing a story and implementing viewer interaction vary widely between them, leading to high variability in the viewer experience. In filmmaking, directors rely on a series of cinematic techniques to compose the story, guide viewer attention, and invoke different emotional responses (Bordwell and Thompson, 2013). These cinematic techniques form a framework for staging, camera grammar, cutting, and editing. They work because film directors know that the cinema audience will be sitting in a fixed chair and looking directly towards the rectangular screen without any interaction with the onscreen story. In contrast, researchers and practitioners focusing on CVR have, for example, realized the necessity of treating the viewer as a character in the story scene, and that viewers will be expecting a certain amount of agency. They have been producing exploratory projects, gaining experience, and developing design principles ("rules of thumb") for supporting both the story and viewer interaction for CVR (Pillai and Verma, 2019; Yu, 2019; Pagano et al., 2020). However, a robust framework that is flexible and accessible for non-VR experts to create CVR stories is still lacking.
We are addressing this gap with our prototype framework Adaptive Playback Control (APC) presented here. First, instead of looking at everything CVR could potentially cover, we set constraints to narrow our focus, looking at cultural heritage oriented content using a guided-tour style. This focus allows us to illustrate our approach in a concrete way while also maintaining generalisability. We review previous work on the same theme, which provided frameworks in various media with viewer interaction, trying to find any typical patterns and necessary elements to be included. Then we present the workflow of creating a storytelling experience with viewer interaction by summarizing these works. The construction of the APC framework parallels this multi-step workflow and will produce guidelines at each step: story structure guidelines, content preparation guidelines, and content assembly guidelines with viewer interaction at their core.
Since there are various story structures and viewer interaction techniques to cover and evaluate, we also set constraints to limit the number of combinations we will look at for now, and extracted two key components essential to design decisions for CVR with viewer interaction: 1) narrative progression (director vs. viewer control) and 2) visibility of interaction (implicit viewer control vs. a control method with explicit input by the viewer). We report on a formal user study to evaluate four conditions combined from these key components: director control, viewer control, implicit control, and explicit control. We measured viewer engagement and enjoyment towards the content, their general user experience, and their performance on memory tests. A semi-structured interview was also added to collect subjective feedback from the immersive storytelling experience. Although the objective data did not reveal any significant differences between the conditions, the participant behaviors and responses to interview questions helped us derive several guidelines for creators who are working on CVR projects embedded with viewer interaction. We recommend that creators should choose between increasing the viewer's feelings of engagement and enjoyment, or enhancing the efficient transfer of knowledge, depending on the purpose of their projects. Also, creators need to consider the viewer's natural tendency to explore and provide extra incentives to invoke exploratory behaviors in viewers when adding interactive elements. We recommend more viewer control for projects aiming at viewer's participation and agency, but more director control for projects focusing on education and training. Explicit (vs. implicit) control will also yield higher levels of engagement and enjoyment if the viewer's uncertainty of interaction consequences can be relieved.
The remainder of this paper is organized as follows: we first introduce related background information to build a framework for creators of CVR experiences. Then we describe how we construct the APC framework and what key elements we have chosen to focus on with higher priority here. After that, we describe the preparation and execution of the user study, followed by the presentation of results, analysis, and discussion. The conclusion and future work are presented in the last section.

Interactive Cinematic Virtual Reality
Filmmakers have established practices and guidelines for effective storytelling with movies (Bordwell and Thompson, 2013), such as the "Mise-en-scène", which concerns using the set and arrangement of elements to present a proper perspective towards the story, and cinematography grammars, including various shots to compose the picture and direct the viewer's attention to specific elements. When 360-degree cameras became widely available, practitioners and researchers also explored and came up with guidelines for creators to capture engaging footage and effectively tell stories. Those guidelines include the principles of arranging story elements in the scene, the placement of the camera and characters, and the gestures and body language for human actors to use with the purpose of directing the viewer's attention in the immersive media without using non-diegetic objects (Pope et al., 2017; Syrett et al., 2017; Gödde et al., 2018; Bender, 2019; Tong et al., 2020). Later, as 360-degree video became more widely available and easily accessible for the general public, it was no longer a simple medium for short-term excitement. Creators started to treat it as a more serious form of media, and use it to create long and complete stories, aiming to immerse the viewer into the story scene and invoke more intensive emotional resonance with the characters and plot (Bevan et al., 2019; Hassan, 2020). At this stage, the term CVR emerged to define such experiences (Mateer, 2017). Thus, viewers could develop a feeling of "being there" within the scenes and could freely choose the viewing direction (Rothe et al., 2019a). Researchers also started to notice that in CVR, compared to traditional flat videos, both the viewer's role and expectation of interaction had changed; viewers were no longer passive spectators like in cinemas, but characters inside the story scene, expecting a certain amount of agency within the virtual world (Syrett et al., 2017; Bender, 2019).
Unlike the early days, when guidelines focused only on directors, new works have moved the viewers into the spotlight (Cavazza and Charles, 2005; Mateas and Stern, 2006; Sharaha and Al Dweik, 2016). Although the discussion is viewer-centered, those researchers were mainly thinking about migrating frameworks from highly interactive media, such as games, to narrative-oriented media such as 360-degree videos, because they saw that the viewer experience had been extensively studied in highly interactive media and regarded these findings as well constructed for CVR. Gradually, creators realized that a direct migration may not work, as both interactive freedom and the mechanisms for interacting with the narrative are different between those two media. On one side, we see creators producing CVR works in a trial-and-error mode, experimenting with prototypes, and gathering design references from filmmaking projects (Ibanez et al., 2003; Brewster, 2017). On the other side, the theories of "co-construction of the story" and "ludo-narrative" have emerged (Verdugo et al., 2011; Koenitz, 2015), highlighting the viewer's necessary contribution towards the progress of delivering a complete story and providing a circle of experience. These works put forward models of story construction that regard the viewer also as an author of the story (Roth et al., 2018). Nevertheless, they have not designed models specifically for immersive media such as 360-degree videos. We still lack a well-constructed set of guidelines for creators moving from filmmaking or regular 360-degree video to creating immersive storytelling with viewer interaction.
In reviewing this literature, we surveyed previous works that provided frameworks for other types of storytelling, such as tabletop games, video games, and interactive TV programs, trying to find common patterns across them to describe the necessary elements within a framework to support content creation. Carstensdottir et al. (2019) examined interaction design in interactive narrative games, specifically the structure and progression mechanisms, from the perspective of establishing common ground between the designer and player; Ursu et al. (2008) listed a series of existing programs and summarized a system structure for effective software to support both authoring and delivery of TV programs. This work has led us to an important conclusion: to assist in the creation of an effective storytelling experience for CVR, a framework should provide support for both how the story structure should be scripted for interaction and which mechanisms viewers (or players in this context) will use to interact and move the narrative forward. The following section describes our proposed framework, Adaptive Playback Control (APC), for interactive storytelling in immersive media. We start with its motivation, continue with the process we followed to build it, and then describe the evaluation we conducted.

The APC Framework
The motivation for creating the APC is to provide a framework for making CVR (e.g., 360-degree video, computer-generated immersive movies) a more interactive experience, while not pushing interactivity to the point where it becomes more of a gaming experience. Since consuming film is more of a "lean-back" experience for many, striking this balance is key to its appeal.
We expect that the APC will give creators a familiarity akin to scripting for conventional videos, and provide them with the confidence of authorial control, combined with satisfactory outcomes throughout production, delivery, and later consumption. Thus, resulting works will have a pre-scripted narrative as their backbone, but an interactive and immersive experience as their front face. This will ensure the narrative arc remains under the control of the director, while the freedom of interaction is placed in the viewers' hands. However, there are various common storytelling methods for different user scenarios, defined by factors including the content, the emotional consequences that the director wants to invoke in the audience, and the type of interaction viewers are going to use (Tong et al., 2021). To ensure the exploration will be effective and can yield practical results for APC users (creators and consumers), we set the first constraint: in this study, we apply the APC only to content that (1) is cultural heritage oriented; (2) uses a guided-tour style, meaning there will always be an embodied host in the scene, visible to the viewers, whether it is an actual human or a synthetic character; and (3) is prerecorded.
Within this realm, we review existing storytelling frameworks for various media, trying to locate a common pattern. We noticed a trend that viewer participation is regarded as a key component if immersion and story comprehension are the aim of the entire experience (Mulholland et al., 2013; Habgood et al., 2018; Lyk et al., 2020). Although covering different topics, these works all used immersive content as a base and constructed the storytelling experience on top of that, regarding the viewer as a part of the story and emphasizing the viewer's interaction (gazing, gesture, selection, etc.) as a tool to drive the narrative forward. The similarity we extract from those examples is their workflow: they usually start by laying out the non-linear structure, prepare the content for each node of the structure, then implement the system, filling in the content and adding viewer interactions. Since those works have been proven effective and others have been using them as references for design (Ferguson et al., 2020), we decided to construct the APC framework in a pattern matching this workflow, and divide it into three parts.
• Structure Guidelines for designing an appropriate nonlinear structure for the story for interactive CVR;
• Content Creation Guidelines for content preparation for immersive media (mainly 360-degree videos);
• Assembly Guidelines for assembling content by combining the structure with viewer-interaction design.
We expect that content creators will use this APC framework as a reference for their production process, choosing the story structure to support their purpose with storytelling, preparing content segments, and assembling them, with interactive elements designed for viewers, into an engaging and satisfying immersive storytelling experience with interactivity.

Foundation of the Framework
To build a prototype of the APC framework for evaluation, we constructed each of its components in different ways. For the Structure Guidelines, we looked at structures summarized by researchers from other immersive media, such as video games and VR games, and migrated them to narrative CVRs with proper modifications. The Content Creation Guidelines were distilled from previous projects and assembled from verified film techniques, including the Mise-en-scène, camera manipulation, and guidance techniques such as the Action Units (Tong et al., 2020), to cover the preparation and capturing stage. For the Assembly Guidelines and interaction design, we evaluate and choose interactive methods for each of the structures and see which one fits. Since the guidelines cover a series of non-linear structures and viewer progression mechanisms, the abundance of combined variations means evaluation work will be time and resource intensive. We do not cover all possibilities in one study; instead, we filter out impractical combinations and focus on those that are closely related to the content type we mentioned in the last section, mainly for reasons of required sample size and evaluation effort. First, instead of covering all structure types, we focus on one commonly used non-linear form, known as "hub-and-spoke," which is the base of other derived types (Carstensdottir et al., 2019) and has been widely used in both tours (Mulholland et al., 2013; Sharples et al., 2013) and games (Moser and Fang, 2015).
We then acknowledge that involving viewers in the control of narrative progression is an outstanding topic in interactive storytelling research (Verdugo et al., 2011). On the one hand, moving from traditional screen-based media to immersive media, a shift in the viewer's perspective naturally calls for giving the viewer the freedom of choice and the agency to impact narrative progression (Tong et al., 2021). On the other hand, researchers have also discovered that for culture- or education-related content, a clear structure helps audience understanding and retention (Lorch et al., 1993). Researchers who focus on games have also found that, in particular, explicit narrative progressions designated by the creators have a positive effect on declarative knowledge acquisition (Gustafsson et al., 2010).
Viewer awareness of interaction also has an impact on the user experience of storytelling (Rezk and Haahr, 2020). We know that if the viewer is regarded as a character in the story scene in immersive storytelling, they expect a certain level of interaction to be involved in narrative progression (Tong et al., 2021). A system designed to be responsive to viewer input will also increase the level of viewer involvement and immersion (Ryan, 2008; Roth et al., 2018). Some researchers have also raised concerns about the design around consequences of choices. Rezk and Haahr (2020) cautioned that if viewers are given explicit choices during storytelling, they are likely to hesitate when faced with too many options, as they will evaluate every potential consequence of each option and therefore be unable to make confident choices, thus shattering the feeling of "being there" in the story world. Realizing this, some creators turned to a new design style known as "invisible control," where the viewer still participates in the progression of the narrative, but is not explicitly aware of making choices. To achieve this, one project implemented a system that monitored the viewer's behavior at predefined spots during virtual museum tours and responded to it by making unannounced narrative choices over branches (Ibanez et al., 2003); another recorded the player's actions in the game and used the data to determine his/her overall contextual intention, then presented a matching ending from several parallel ones (Sengun, 2013).
Thus we narrow down the focus of this study to a two-element combination, the narrative progression (director vs. viewer control), and the visibility of interaction affordances (implicit vs. explicit interaction).We also explicitly prioritize them in the APC framework.

METHODS
The main focus of this study is set on how viewer interaction can be enabled in interactive cinematic virtual reality, such as 360-degree videos. The structure of the story we employ has a pattern of hub-and-spoke, i.e., the viewer starts from a central location (the hub) and all alternatives (the spokes) start from here. Since the prerecorded content of the story will not change, and essentially every participant will watch the same content, we conducted a between-subjects experiment to compare the user experiences between design variants. The content we used for this study was an in-house guided tour through a series of 360-degree videos. We will describe the material and production process in detail in section 3.3.
In the experiment, we set up four conditions of the interactive and immersive storytelling experience, labeled DC, DR, VI, and VE.
The structure of the content remained unchanged between conditions and will be illustrated in detail in section 3.3. However, the control models and viewer interaction methods varied. The exact choices of parameters, including the control of temporal sequence and the type of involvement, are listed in section 3.4. We set up the condition DR to observe, in this specific story, whether the viewer's experience would change when the order of the segments was different from what the director initially intended. We measured each condition's effect on the viewer's level of engagement with the content, enjoyment from the experience, and the general user experience. We also designed a series of content-related questions, some asking directly about one of the elements from the scene and others asking about conclusions derived from what the host introduced, combined with information visible in the scene. Those content-related questions were used to assess how the conditions affect the viewer's memory performance and the system's performance on the transfer of knowledge.

Research Questions and Hypotheses
In this study, the research questions we would like to ask are:
• RQ1: For the hub-and-spoke structure, which type of control pattern will bring a higher level of engagement and enjoyment, director control or viewer control?
• RQ2: When viewers have control over the order of the segments in a hub-and-spoke structure, which one will yield a better usability experience, implicit involvement or explicit control?
• RQ3: Between implicit involvement and explicit control, which one will lead to better memorization of the content?
We formulated hypotheses corresponding to the research questions based on previous research related to the viewer's role and behavior in narrative CVRs. Firstly, we expected that the viewer-controlled modes (VI and VE) would bring higher levels of engagement and enjoyment, as in immersive environments, agency increases the level of presence and brings a deeper feeling of being directly involved in the story scene, as well as stronger feelings of fun (Ferguson et al., 2020). As has already been verified by previous research, the viewer's role in immersive media is different from the one in a regular movie. Viewers feel they are a character in the scene and form the expectation that they have some influence over how the guided tour will progress (Ryan, 2008; Roth et al., 2018). Secondly, as mentioned in the previous section, explicit choices will make users think about the potential consequences of the choice, thus breaking immersion and deteriorating the general experience (Rezk and Haahr, 2020). We also assumed that the invisible control method (VI) would have less impact on the viewer's general experience of the system, because explicitly making selections adds extra workload during the experience. Thinking over the choices will also distract viewers from focusing on the host's narration. Thirdly, we expected that the condition VI would also lead to a better result in the viewer's memory test of the content, because in the viewer's perception, they are passively watching and more focused on the content. In summary, we formulated the following hypotheses:
• H1: compared to director control, viewer control will lead to higher levels of engagement and enjoyment with the experience;
• H2: compared to the explicit control-based method, the invisible involvement will lead to a more positive user experience;
• H3: the invisible involvement will lead to better performance on memory tests.

Apparatus
We used a computer running 64-bit Windows 10 Professional with a 3.2 GHz i7 processor and a GeForce RTX 2080 graphics card to implement the APC system, record viewer behavior data, and ensure the smooth playback of the 5.7 k videos. During the experiment, participants viewed the 360-degree videos wearing an Oculus Quest 1 HMD with or without using its controllers, depending on the condition, as shown in Figures 1A,B.

Materials
For this experiment, we captured a series of 360-degree videos in a laboratory room we selected for the content, then assembled them into a virtual guided tour.We first scouted the location and designed the story.In the space we chose, several large experiment installations are placed against each wall, as shown in the top-down view of the space in Figure 2. In the real world, to a newcomer to this space, a host would first introduce the purpose and daily activity in that room, then move on to each of the installations.We duplicated such a visit in this study by capturing 360-degree videos of it, as the spatial layout of the space and the installations matches the hub-and-spoke story structure.It is also noteworthy that in this space the viewer could see all the spokes as they were all "open" and "equivalent", imposing no hierarchical relationship between them.
We then captured the footage while the host was introducing the installations at each designated spot, using an Insta360 ONE R 360-degree camera at a resolution of 5.7 k (5760 × 2880). The camera was mounted on a tripod so the viewpoint was fixed in each of the segments. The positions of the camera and storyteller (host) for each segment were carefully chosen so that they did not block views, as shown in Figure 2. The distance from the host to the camera was kept the same (2 m) at each location, so the viewer would always perceive a similar distance between themselves and the host when watching. A wireless mic was also used and directly plugged into the camera, so the audio quality was maintained at the same satisfactory level no matter the distance between the camera and the host.
Following the script, we first captured the Introduction clip at the hub, where the host gave an overview of the lab room and introduced all the installations covered in the tour. Then at each spoke, the host talked about the experiment installation, including its features, applications, and the research projects running on it. The narrative structures of the segments at the spokes were also scripted to be similar in terms of running time. We captured a total of five clips: one Introduction clip at the hub, three major segments (one at each spoke), and one Ending clip, which contained no key information and only wrapped up the tour experience to avoid an abrupt cut at the end. The clips were then processed using Adobe Premiere Pro to add fade-to-black transitions at the beginning and end of each clip, and to adjust the volume levels and narration pace to minimize differences between the recordings.

Implementation of the APC System
In this study, the APC system consisted of two components: 1) a video player to present the 360-degree video clips we captured to the participants via the VR headset, and 2) a mediator component to deliver the designated control from the director (stored in advance) or to respond to viewer interaction during playback. Both components were implemented using Unity3D 2020.3.13f1. The implementation details of each condition are described below.
DC: in this condition, the five clips were loaded into the library of APC following a specific order (the Introduction, three major segments, then the Ending). The APC played those 360-degree video clips one after another. There was no viewer input in this condition.
DR: similar to DC, the APC played those 360-degree videos without viewer input involved.The only difference was that the system randomly rearranged the three major segments every time the experimenter initiated the experience for a new participant.The Introduction and the Ending were permanently fixed at the beginning and the end.
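The playlist construction for DC and DR can be sketched as follows. This is a minimal illustration in Python rather than the Unity implementation used in the study; the clip labels and the `build_playlist` function are hypothetical names introduced here, and only the ordering rules (fixed Introduction and Ending, shuffled major segments in DR) come from the description above.

```python
import random

# Hypothetical clip labels; the paper does not name the individual files.
INTRO, ENDING = "introduction", "ending"
SPOKES = ["spoke_1", "spoke_2", "spoke_3"]


def build_playlist(condition, rng=None):
    """Return the clip order for the DC or DR condition."""
    middle = list(SPOKES)
    if condition == "DR":
        # DR: shuffle only the three major segments for each new participant.
        (rng or random).shuffle(middle)
    elif condition != "DC":
        raise ValueError(f"unsupported condition: {condition}")
    # The Introduction and the Ending always stay first and last.
    return [INTRO, *middle, ENDING]
```

Passing a seeded `random.Random` instance as `rng` makes a DR order reproducible, which may help when logging which order each participant saw.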
VI: in this condition, the viewer's head orientation was monitored and recorded in real time by the moderator component as she was watching the 360-degree videos. A series of invisible gates were set up in the scene, overlapping with the area of subject-matter objects from the perspective of the viewer in the center of the spherical scene, as shown in Figure 3. Thus, when the viewer looked at one of the objects, her head orientation fell within the corresponding gate. The moderator then recorded and calculated how long the viewer had been dwelling on the object shown behind the gate. If the viewer dwelled on a gate longer than a threshold (set to 3 s in the system we used for the study), the moderator decided that the viewer might be interested in this object and pulled the corresponding segment to the top of the playlist as the next one to play after the current clip. The gate threshold could be triggered multiple times during playback until the current clip had played 95% of the way through. The gates were positioned manually in each segment scene to match the viewer's head location in that scene. The entire process was executed by the moderator in the background, and the viewers were not aware of it at any moment.
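The dwell-based gate logic can be illustrated with a short sketch. It is written in Python for readability and is not the study's Unity code. The 3 s threshold and the 95% cutoff are taken from the description above; the `Gate` class, its rectangular yaw/pitch bounds, the `update_gates` function, and the choice to reset the timer when the viewer looks away are all assumptions made for illustration.

```python
from dataclasses import dataclass

DWELL_THRESHOLD_S = 3.0   # dwell time reported in the text
CUTOFF_PROGRESS = 0.95    # gates stop responding after 95% of the clip


@dataclass
class Gate:
    segment: str        # segment this gate promotes
    yaw_min: float      # hypothetical angular bounds, in degrees
    yaw_max: float
    pitch_min: float
    pitch_max: float
    dwell_s: float = 0.0

    def contains(self, yaw, pitch):
        return (self.yaw_min <= yaw <= self.yaw_max
                and self.pitch_min <= pitch <= self.pitch_max)


def update_gates(gates, queue, yaw, pitch, dt, progress):
    """Advance dwell timers by dt seconds and reorder the upcoming queue."""
    if progress >= CUTOFF_PROGRESS:
        return queue
    for gate in gates:
        if not gate.contains(yaw, pitch):
            gate.dwell_s = 0.0          # assumption: looking away resets the timer
            continue
        gate.dwell_s += dt
        if gate.dwell_s >= DWELL_THRESHOLD_S and gate.segment in queue:
            # Pull the segment to the top of the remaining playlist.
            queue = [gate.segment] + [s for s in queue if s != gate.segment]
            gate.dwell_s = 0.0          # re-arm so the gate can trigger again
    return queue
```

In this sketch `queue` holds only the clips that have not yet played, and `update_gates` would be called once per frame with the current head orientation. Yaw wrap-around at ±180° is ignored for brevity.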
VE: we implemented a hand-held laser-pointer method in this condition to enable the viewer's explicit interaction. It used a simple point-and-activate mechanism (Rothe et al., 2019b) and helped reduce the unnecessary learning workload of the viewers. Based on the gate design from VI, we replaced those invisible gates with visible cards, as shown in the screenshot in plus the time length of that video segment as an extra aid of the viewer's decision-making process. Viewers who were holding the controller could then point at the cards with the laser pointer and pull the trigger to make a selection, as shown in Figure 5B. A popup message would temporarily appear in the viewer's Field of View (FOV) to confirm the choice. We set the cards to appear when the progress of the current video clip reached 70%, so that they were neither always shown, causing distractions, nor shown so late that viewers had only a very short time window to make selections. Since the interactable cards were overlaid on top of topic-related objects and were scattered around the scene, popup messages were programmed to appear in the center of the viewer's FOV to remind her when cards were available for selection, reducing the fear of missing out. If the viewer did not make any choice before the video segment ended, the default order from DC was used.
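The timing and fallback rules of VE can be summarized in a small sketch. As before, this is illustrative Python and not the study's implementation; `cards_visible` and `next_segment` are hypothetical helper names, and only the 70% appearance point and the fallback to the DC order are taken from the text.

```python
CARD_APPEAR_PROGRESS = 0.70  # cards become visible at 70% of the current clip


def cards_visible(progress):
    """Cards are hidden early in the clip and shown from 70% onward."""
    return progress >= CARD_APPEAR_PROGRESS


def next_segment(selection, remaining_default):
    """Return the viewer's pick, or fall back to the default (DC) order."""
    if not remaining_default:
        return None  # nothing left to play
    if selection in remaining_default:
        return selection
    # No (valid) choice was made before the clip ended.
    return remaining_default[0]
```

Here `remaining_default` is assumed to list the unplayed segments in their DC order, so the first entry is the default successor when `selection` is `None`.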
In all four conditions, a Helper Head-Up Display (HUD) was set up and constantly visible at the lower left corner of the viewer's FOV, as shown in Figure 5. It helped the viewer stay aware of the progress of the current segment and the progress of the entire tour. Thus, the viewers would not get lost in the tour or experience difficulty recalling the content of a segment. Instead, they maintained an awareness of the progress and pace.
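The two values such a HUD needs can be computed as in the sketch below. The paper does not state how tour progress was calculated, so the time-weighted definition used here is an assumption, as are the function name `hud_progress` and the example durations.

```python
def hud_progress(durations, index, elapsed_s):
    """Return (segment_progress, tour_progress) as fractions in [0, 1].

    durations: clip lengths in seconds, listed in playback order
    index:     position of the clip currently playing
    elapsed_s: seconds played in the current clip
    """
    # Clamp so that timing jitter never yields values outside [0, 1].
    elapsed_s = min(max(elapsed_s, 0.0), durations[index])
    segment = elapsed_s / durations[index]
    # Assumption: tour progress is weighted by playing time, not clip count.
    tour = (sum(durations[:index]) + elapsed_s) / sum(durations)
    return segment, tour


# Example with made-up durations: halfway through the second of five clips.
segment, tour = hud_progress([60, 120, 120, 120, 30], 1, 60)
```

Because the playback order can change in the viewer-controlled conditions, `durations` would need to be rebuilt whenever the queue is reordered.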

Measures
When designing the experiment, we noticed two important facts. First, in the VI condition, the viewer was not aware that the system was monitoring her head orientation and making alterations to the playback progress. Second, since each participant experienced only one of the four conditions, participants in DC, DR, and VI were also not aware that some conditions might affect the narrative progression while others would not. Taken together, we realized that the viewers' subjective experiences across the conditions would not be comparable unless they were told afterwards which condition they had experienced; we concluded that we could not assess and compare the user experience by explicitly asking about their preferences. Instead, we measured their subjective feelings by treating all conditions as generic immersive storytelling experiences. We then derived design references by comparing the participants' overall experience across the conditions.
For subjective measures, we assessed each participant's level of engagement with the content, enjoyment of the experience, and general usability experience. Engagement was measured with the widely applied User Engagement Scale Short Form (UES-SF) (O'Brien et al., 2018). We also followed the operational guidelines provided by Schmitz et al. (2020) when applying this measure. Enjoyment was measured using the 12-item scale provided by Ip et al. (2019), who used it to measure learner enjoyment after watching a series of Massive Open Online Courses (MOOCs) in the form of 360-degree videos, a setup similar to ours. General user experience with the system was measured with a simple three-part evaluation form provided by Shah et al. (2020) in their study evaluating a 360-degree video playback system. We removed one non-relevant item, then applied the form to see whether participants found the system easy to use, with both the invisible interaction and the traditional point-and-activate method.
Content-related questions were used to assess participants' memory performance in each condition. Three multiple-choice questions required the participant to combine and extract several elements, either directly from the host's speech or visually from the scene, all around one specific topic. Three single-choice questions then asked the participant to verify some facts visible in the scene. The seventh question asked the participant to make a design choice for a "simulated situation" based on the higher-level overview given by the host over the entire virtual tour. These questions provided insights into how much the participant remembered from the tour and understood from the introductions. The questions did not require any reasoning or deduction, and so did not require any background knowledge.
We also added a plugin to the APC system to automatically record the viewer's head orientation in real time and to generate a heat map of the viewer's attention and dwelling across the entire scene, accumulated for each segment. We then used the heat maps to identify whether the viewer showed relatively active exploratory behavior or mostly passive, static watching.
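As an illustration of how such a heat map can be accumulated, the following Python sketch bins head-orientation samples onto an equirectangular grid and reports the fraction of the grid the viewer visited. It is a simplified stand-in for the Unity plugin, not its actual code: the grid resolution, the function names, and the `scanned_fraction` measure are assumptions introduced here.

```python
def make_canvas(width: int = 72, height: int = 36):
    """Create an equirectangular dwell-time grid (rows cover pitch, columns cover yaw)."""
    return [[0.0] * width for _ in range(height)]


def accumulate(canvas, yaw_deg: float, pitch_deg: float, dt: float) -> None:
    """Add dt seconds of dwell at one head orientation.

    yaw_deg is measured in degrees with 0 at the scene center (wraps at +/-180);
    pitch_deg ranges from -90 (straight down) to +90 (straight up).
    """
    height, width = len(canvas), len(canvas[0])
    col = int((yaw_deg + 180.0) % 360.0 / 360.0 * width)
    row = int((90.0 - pitch_deg) / 180.0 * height)
    col = min(col, width - 1)
    row = min(max(row, 0), height - 1)
    canvas[row][col] += dt


def scanned_fraction(canvas, min_dwell: float = 0.0) -> float:
    """Fraction of grid cells with dwell time above min_dwell (a rough exploration measure)."""
    cells = [value for row in canvas for value in row]
    return sum(value > min_dwell for value in cells) / len(cells)
```

One canvas per participant and segment would be filled by calling `accumulate` for every tracked frame. Equal-angle cells over-represent the poles of the sphere, so a larger `scanned_fraction` should be read only as a coarse indicator of more active looking around.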

Experiment Procedure
We compared the four conditions (DC, DR, VI, and VE) by applying them to the same five-segment virtual guided tour we captured. For this study, we recruited 44 participants (22 females, 21 males, and 1 who chose not to specify) from the university. All were between 19 and 39 years old (M = 26.68, SD = 4.978). They self-reported having different levels of VR experience and 360-degree video experience. Among the 44 participants, 28 had never, or only a few times in the past year, tried a VR experience, 14 had used VR at least once per month, and two reported using VR on a daily basis. A total of 35 out of 44 reported never having watched a 360-degree video before, and nine had watched one a few times in the past few months.
Before the session started, we obtained consent from each participant. Participants were then introduced to the "virtual guided tour" and told how they would experience it with the HMD and the Swivel-Chair VR setup (Tong et al., 2020). Then, a sample image of the Helper HUD was shown to the participant to explain its purpose before the tour started. If the participant was assigned to the VE condition, a brief introduction to using the controller and interacting with the selection cards was also given. The participant used only the right-hand controller in this user study. When the session started, each participant went through five 360-degree video segments. The first and last segments were always fixed, while the three in the middle varied by condition and by each viewer's interaction or behavior during the experiment. When the tour ended, we asked the participant to remove the HMD and complete the questionnaires.
After the questionnaire, we conducted a semi-structured interview using the following prompts: (1) Please give a subjective description of the entire tour. (2) Please talk about the most impressive part of the tour. (3) Please talk about your preference between using a VR headset or a regular TV for this type of content in the future. (4) Do you have any other comments or thoughts? For VE, we added a question asking about participants' motivations for making choices using the controller via the interactive cards. The entire session took approximately 35 min for each participant.

FIGURE 6 | Boxplots summarizing the results of levels of engagement (converted from the UES-SF items), levels of enjoyment, general user experience scores towards the system, and the memory performance scores from the content-related questions, for each condition, in terms of medians, interquartile ranges, minimum and maximum ratings. Top row, from left to right: Engagement, Enjoyment. Bottom row: General user experience and Memory performance.

RESULTS
We analyzed the data participants reported in the questionnaires using a one-way ANOVA for fixed effects (α = 0.05), with Bonferroni and Tukey post hoc comparisons. The mean values for levels of engagement (converted from the UES-SF items), levels of enjoyment, general user experience scores towards the system, and the memory performance scores from the content-related questions, for each condition, are shown in Table 1. The highest values for each measure are also highlighted. The overall results are summarized and plotted in Figure 6. The levels of engagement, enjoyment, and general user experience scores were calculated following the operational guidelines provided by the original creators of the instruments. For the content-related questions, since there were both multiple-choice and single-choice questions, we calculated the score using the following rules: for a single-choice question, the participant scored one point if she chose the correct answer; otherwise, she scored zero points. For a multiple-choice question, the participant scored a full mark (five points) if she selected all the correct items and only those correct items, lost 0.5 points if she missed one of the correct items, and lost another point if one of the wrong items was selected.
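The scoring rules for the content-related questions can be written as a short Python sketch. The text does not state whether the deductions apply per item or whether a score can drop below zero, so the sketch assumes that each missed correct item costs 0.5 points, each selected wrong item costs 1 point, and the score is floored at zero. The function names are hypothetical.

```python
def score_single_choice(chosen, correct) -> float:
    """One point for the correct answer, zero otherwise."""
    return 1.0 if chosen == correct else 0.0


def score_multiple_choice(selected, correct, full_mark: float = 5.0) -> float:
    """Score one multiple-choice question.

    Assumptions (not stated in the text): deductions apply per item,
    and the score cannot go below zero.
    """
    selected, correct = set(selected), set(correct)
    missed = len(correct - selected)  # correct items the participant did not select
    wrong = len(selected - correct)   # incorrect items the participant selected
    return max(0.0, full_mark - 0.5 * missed - 1.0 * wrong)
```

Under these assumptions, selecting exactly the correct items yields 5 points, missing one correct item yields 4.5, and adding one wrong item to an otherwise complete answer yields 4.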

Level of Engagement with the Content
The levels of engagement with the content were higher on average among the participants who experienced the VI condition (M = 16.52, SD = 0.516), compared to the other three (DC: M = 15.97, SD = 0.327; DR: M = 16.30, SD = 0.608; VE: M = 15.58, SD = 0.777). However, the ANOVA test indicated no significant differences among the conditions (F = 0.344, p = 0.794).
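For readers unfamiliar with the test, the F statistic of a one-way ANOVA is the ratio of between-group to within-group variance. The following Python sketch computes it from raw per-condition scores; it is a generic textbook calculation, not the analysis script used in the study, and the function name is hypothetical. With four conditions of 11 participants each, the degrees of freedom are 3 and 40.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way fixed-effects ANOVA.

    groups: a list of lists, one list of scores per condition.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

The p-value is then obtained from the F distribution with these degrees of freedom, typically through a statistics package.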

Enjoyment of the Experience
Participants who experienced the VE condition reported the highest levels of enjoyment of the experience (VE: M = 11.89, SD = 1.684), slightly higher than the levels from participants in the other three conditions (DC: M = 10.78, SD = 1.813; DR: M = 11.59, SD = 1.211; VI: M = 11.79, SD = 1.774). However, the ANOVA test indicated no significant differences among the conditions (F = 0.973, p = 0.415).

General User Experience
The levels of user experience with the system indicate that participants who experienced the VI condition reported a more positive user experience (VI: M = 12.95, SD = 1.145) than those in the other three conditions (DC: M 11.

Memory Performance
We looked at the memory performance scores from participants in the four conditions and noticed that those who experienced the DC condition scored higher in the memory test (DC: M = 17.27, SD = 1.664), compared to the other three conditions (DR: M = 16.22, SD = 2.029; VI: M = 17.13, SD = 1.818; VE: M = 16.41, SD = 2.791). However, the ANOVA test indicated no significant differences among the conditions (F = 0.662, p = 0.580).

Behaviors of Attention and Focus
We also recorded each participant's attention over the entire scene by accumulating the dwelling time over the 360-degree sphere as a canvas, using a plugin installed in Unity. The final results were generated as heat maps. A total of 176 maps were collected and grouped into four sections corresponding to the four conditions, as shown in Figure 7.
Comparing across the conditions, we notice that in the heat maps from the VE group, the scanned area recorded on the canvas is larger than those of the other groups, indicating that participants who experienced the VE condition showed a higher tendency towards exploration (actively looking around).
Comparing across the segments under the same condition, we also notice that participants showed more exploratory behavior in the first segment (Introduction) than in the other three main segments, where the host introduced details of each installation. We will discuss the participant behaviors in section 5.2.2 by looking into the heat maps and participant answers during the interviews.

Subjective Feedback from the Interviews
We transcribed and analyzed the interview results using a thematic approach. Topics were identified and generalized into two high-level themes, with five sub-level topics. Below we present the themes from the interviews alongside our interpretation. Further analysis and discussion connecting the themes with the conditions participants experienced are presented in section 5.2.1.

Theme 1: The Impression of the Virtual Tour Experience
The first theme is about the participants' impression of the virtual tour as a general experience. We learned that during the virtual tour the viewers mainly paid attention to the experience itself, the content presented, or the system. Generally speaking, being immersed helped participants engage with the experience.

Being Immersed in the Virtual Environment
Many participants emphasized the general user experience itself and the novelty of the 360-degree videos, rather than the content (or story) of the tour, for example: "It feels like a 360-degree experience, an immersive tour . . . I can look around". The immersive quality was also preferred by participants: "I feel like as if being transported to the space and being there with the tour guide . . . I will prefer to use it."

Feel Like Having a Real Guided Tour
Instead of regarding it as 360-degree videos, other participants directly described it as a guided tour, emphasizing its content by mentioning the name of the space or specific installations they virtually visited (whether using the correct name or their own words): "It was a guided tour of the [name of the space] and I was introduced to three things, [names of the installations] . . ." We also noticed that participants recalled the host's name and title: "It was a guided tour given by [name]." All of their heat maps show that their attention mainly fell on the host. Participants also directly listed the names of the installations and their applications as their impression of the tour: "It was a tour about three installations . . .", indicating they were mainly attracted by the installations during the tour.

FIGURE 7 | A grid summarizing the generated heat maps, grouped by condition. The images are zoomed into the central part of the original equirectangular projections to clearly show the dwelling of gaze. Four samples are shown at the bottom. In each sector, there are 4 × 11 (or 44) heat maps collected from one condition across 4 segments. Each row contains four heat maps representing one participant's behavior. Each column represents the data from all 11 participants in this condition for one specific segment. In DR, VI, and VE, although participants watched segments 1, 2, and 3 in random orders, the images here are arranged in DC order for easy comparison.

Impression of the Technology
Participants also talked about the VR system itself. They pointed out the usability of the headset, its comfort, and how it blocks the peripheral view to help them feel immersed in the virtual scene. These points are also related to the other theme we will discuss later. A few also pointed out their expectations for design improvements: "I do not want to use the VR headset frequently . . . not or for long time use because it is uncomfortable . . . I was unable to walk around in it".

Theme 2: As a Medium for Transfer of Knowledge
The previous theme explained how the participants saw the virtual tour and the impression they had. With this theme, we examine whether the virtual tour helped the participants as an immersive tool for learning about the lab and the installations presented in the materials.

Actual Knowledge Gained
Participants did reply with essential information presented in the tour, such as "I felt like learning about the lab is very interesting", or with one specific item from the tour that the participant was personally interested in, such as "The [name of the installation] is quite impressive because I think it is interesting/cool . . .". Since the participants provided actual information gained from the virtual tour, instead of seeing it as a plain 360-degree video experience, the immersive experience did help the participants learn about the lab and the installations from the host's presentation.

Immersive as an Advantage for Learning
Instead of specific information, some participants focused their impression on the system itself, including its intrinsic characteristic that the viewer is immersed in the virtual scene and has the freedom to look around or choose what to watch next, and reported this as an advantage over other forms of learning: "I felt I was in the lab all the time and I did not get distracted by other things . . . Seeing myself actually standing in the lab is better than seeing it through a screen or as a video."

DISCUSSION
In this section we will discuss the main findings and implications from the results of the experiments.

Objective Measures
Since there were no significant differences detected between the objective results from the four conditions, the hypotheses (H1 and H2) were not supported; we did not find higher levels of engagement and enjoyment in those who experienced the conditions with viewer control (VI and VE). Data analysis detected no significant differences among the memory performance scores from the four conditions, so H3 is also not supported. These results indicate that, if we look at the objective responses, participants did not react to the four conditions differently. Alternatively, they did not become aware of, or care about, the different conditions applied to the virtual guided tour experience.
We assume these findings are due to three factors. The first one is that participants were unaware of any alternative other than director control, and so assumed that was in play. Since they were not told about the different mechanisms running in the background, nor did they know the original arrangement of the segments, in their eyes, the experience was just a series of 360-degree videos. Even for the VE condition, participants regarded controller input as a separate task to perform (and several of them did not even use it), while the virtual tour itself was still treated as "a series of 360-degree videos I need to watch." This assumption is also supported by the subjective observations we are going to discuss in the next section.
The second factor is related to the content. As stated previously, we set constraints in advance and focused only on one type of story structure, "hub-and-spoke." Also, none of the experimenters involved in this work was a professional filmmaker or scriptwriter, so narrative intensity and creativity were limited in the content we made. We expect participants' feelings of enjoyment and engagement would have been higher if the content had been prepared and creatively made by professionals.
The third factor is the amount and granularity of viewer interaction. In conditions VI and VE, viewer interaction was only used to drive the narrative progression. More specifically, among the various elements within a storytelling experience, what we allowed the viewers to control was "which segment will I go to next after this one" instead of any specific visual elements or action choices with consequences in the scene. In other interactive, immersive storytelling experiences, we have seen that viewers can interact in a much richer manner than simply picking segments on a playlist. One might interact with an object within the scene or even interfere with a character's behavior. It is possible that, compared to controlling only narrative progression, viewers might show noticeable changes in their feelings and memory performance if they are able to interact with more specific and visible elements in the scene (perhaps also linked to narrative progression/branching).

Subjective Feedback and Observations
In this part, we will discuss several topics related to viewer behaviors that we collected from three sources: 1) the answers from the interviews with the participants, 2) observations of participant behaviors while they were watching the 360-degree videos, and 3) the heat maps showing the distribution of viewer attention across each segment of the tour.

Viewer's Main Impression and the Role in Control
In the previous section, we summarized the answers from the interviews and analyzed them for themes and sub-level topics. We matched the transcription of each participant's responses to individual topics and counted the number of times each topic was mentioned. We then investigated the conditions that each participant experienced and separated the numbers into two groups, director control (DC and DR) and viewer control (VI and VE). The final results are shown in Table 2. The total count of sub-level topic 3 in theme 1 is not further divided as it is not condition-dependent.
For theme 1, we believe the viewer's main impression of the immersive experience changes according to whether the viewer has been given control or not. When a viewer is given control over the narrative progression and notices her agency within the experience, her main impression or recall will mainly be the feeling of "interactivity" or "being immersed," rather than the content. This conclusion is supported by the distribution of counts between the director-control conditions (DC and DR) and the viewer-control conditions (VI and VE) for the two sub-level topics under theme 1. This observation is further strengthened by the distribution of condition origins between the two sub-level topics under theme 2. From that distribution we can also conclude that "viewer control" leads users to remember more about the experience, rather than the content.

Exploratory Behaviors
On the heat maps in Figure 7, we examined where the red clouds gathered (the area a viewer focused on for the longest time) and the size of each cluster (a small cluster means the viewer almost never moved her head when looking at that area). We noticed that most viewers' focus fell on the host during the virtual tour because they had been reminded to pay attention to the content, as there would be questions about it at the end. Most of the viewers took it as a "test" and tended to listen carefully to the host because of this statement. Several participants also indicated that they did it out of politeness, as "the guide was visible in the scene, I naturally felt I should look at him while listening."
A difference in exploratory behaviors between the "Introduction" clip and the main content segments was also discovered. Many viewers looked around and drifted away from the host more frequently during the Introduction, while the same behavior was observed much less during the content segments. In the latter case, the viewers mainly dwelled only on the host and the object being introduced. We also noticed that several participants chose to look at the ROIs. The reasons suggested in the interviews were mainly that 1) they noticed some topic that they were interested in and would like to know more about, and 2) they were attracted by something in the video and wanted to have a closer look (which was not supported in the 360-degree videos).
Another group of viewers showed behavior patterns different from the previous two. They kept actively looking around during the whole tour. They stated that they were new to this form of 360-degree video and new to the place shown in the video (the VR lab), and thus were driven by curiosity to look around actively. This was also verified when we looked at their answers in the background questionnaires. Most of the participants who showed "curiosity" behaviors answered "never" or "only a few times a year" when asked about previous VR use or watching 360-degree videos.
We conclude that the participants' tendencies to watch passively or to look around actively could have been affected by their preconceived impression of the experience. During the experiment, before starting the viewing sessions, we described the experience as "a virtual guided tour made from a series of 360-degree videos," and we did not mention the actual mechanism behind the scene because we did not want the participants to be aware of the differences between conditions. We also emphasized that there would be "content-related questions at the end," and that they "might want to pay attention to the content and the details." These descriptions led participants to form expectations before the tour that these were "videos" and that "they should pay attention to the content." They then showed fewer exploratory behaviors in the main segments. However, first impressions are not the only factor; personal experience and interests also contribute to exploratory behaviors.

The Non-Stop Flow of Time and the Control Over Pacing
The need for pace control is a factor we noticed in viewer behaviors. Several participants in the VE condition reported in the interview that they felt there was not enough time for them to make selections. Some of them saw the cards appearing in the scene, but did not have enough time to consider the consequences and make a selection.
This indicates that pace control could also be a vital aspect of the user experience in immersive storytelling. Alongside control over story direction and progression, viewers might also desire control of the speed of the progression. Unlike movies, where the viewers are just passively watching and the director has total control over the pace of progression, in interactive storytelling, the viewer will carry out active input and bear the task of determining the consequences of each option. Thus, proper pace control put in the viewer's hands could be substantial. In other words, when the storytelling process encompasses viewer interaction with the narrative, enabling the viewer to perceive herself in a role involved in the story, the control of story progression might also be rightly handed over to the viewer, at least partially, in order to match the agency a viewer now has.

Making Choices, or not Making Choices
We added an extra question during the interview for participants in the VE condition to ask about their motivations for making their selections. In this condition, we found both participants who actively made selections when the cards appeared and participants who did not make any choices during the tour, even though the controller was in their hands, so we discuss them as separate groups.
For those who made selections, the responses show that their motivations mainly fell into two categories: 1) making choices based on personal interest, and 2) making choices by comparing the running-times of the options, and choosing the shortest one.Both of these were linked to information presented on the cards (Name of the ROI, and running time of the segment).The text on the cards became an important reference for viewers to support their choices.
For the participants from VE who did not make choices, many reported they were not sure whether making a selection would cause the system to cut directly to the next scene (like in music players), or just queue it up. They wanted to finish the current segment before proceeding, so they hesitated and eventually gave up on making selections. Once they noticed the system made default selections, they switched to a fully passive attitude and let the defaults run while simply enjoying whatever was coming up.
Three participants can be viewed as outliers from these two groups. Their perception of explicit control differed from what we designed, or did not follow what we explained before the experiment. One indicated that he used the cards to force "the host to present the segments in a counter-clockwise fashion," as this was his personal preference when visiting museums. Two other participants reported that they regarded the cards as a method to skip and jump forward to the next segment, which is an intention we did not foresee in the design. They indicated in the interview that they wanted to skip forward once they felt bored and thought "the cards were a button to skip the current segment so I pressed it." However, the system did not allow for that, making them feel frustrated. These participants also stated that if they had been more sure of what the consequences of the selection mechanism were, they would have chosen according to whichever segment interested them most.
We assume that this confusion is related to the wording on the cards: we wrote "Next up: [the name of the ROI]," which may have confused some participants. More generally, there could be a mismatch between the user's perception of the consequences of control and the actual consequences. As noted by Carstensdottir et al. (2019), the creator of an interactive experience needs to consider how the viewers establish a certain mental model in their minds when a control or input method is put in their hands. We need to further investigate and consider our design choices for interactive elements.

Implications for CVR Creators
We derived a first set of guidelines by comparing the interview responses of participants who went through the different conditions. We learned that the viewer's tendency to recall the experience can be mediated by whether the viewer is given control over the immersive experience. This means a CVR creator can guide the viewer's general impression of the experience towards an emotional feeling of interactive/playable progress (projects designed for fun or suspense) or towards the content itself (projects for education or training) by giving viewers different levels of control, such as solely director control, implicit viewer control, or explicit viewer control. This guideline is also a part of the APC framework we plan to provide, which will assist CVR creators in choosing proper viewer interaction designs for projects with different purposes.
Another observation is the need for balance between the viewer's natural tendency to explore and incentives to invoke exploratory behaviors in viewers. The former was observed as a result of the shift of the viewer's role in CVR. The viewer naturally has an increased tendency to actively explore the virtual environment when immersed in it, as verified by our own observations of participant behaviors during our experiments. The latter we derive from the combination of both quantitative data and qualitative observations. We believe that the viewer's tendency to actively explore in an immersive environment can be mediated by external elements, in both top-down and bottom-up forms.
Top-down mediation is related to the tasks a viewer carries out within the immersive experience. As we saw during the experiment, while carrying out the task of "I need to pay careful attention to the speech from the tour guide," viewers showed fewer exploratory behaviors. On the other hand, when they were told to "look for interactive cards in the scene and choose which segment you want to watch next," viewers' exploratory activities increased (as seen from the heat maps generated from the VE condition).
Bottom-up mediation seems more delicate and complex. We have not collected enough data in this user study to fully understand this mechanism. Thus, we only provide our preliminary insights here. When viewers were provided with a controller, they treated it as an entry point for potential interaction, and tried to interact with many things. However, their willingness to explore and interact hit a roadblock after several tries, as there were no interactables. This left viewers unclear about the consequences of their actions and made them step back into a passive-viewing mode. To sum up, we believe CVR content creators need to be aware that viewers' intention to explore actively has a firm grounding in immersive storytelling, and that viewers must strike a balance between being willing to act and being afraid to act, mediated by several factors, some under creator control and some not.
Returning to the two central questions raised near the beginning, the decision of who should control narrative progression and the visibility of the viewer's interaction option, we sum up the following guidelines for CVR creators as a starting point of applying the APC framework:

FIGURE 1 | Two types of participant setups. (A): for conditions DC, DR, and VI, the participant does not use a controller, but simply wears a headset and sits on a swivel chair, with the freedom to look around in the scene. (B): for condition VE, the participant also holds a VR controller in the right hand, and uses it to make explicit choices.

FIGURE 3 | A top-down view of the 360-degree spherical screen showing the gates scattered around, covering the Regions of Interest (ROIs) behind them and waiting to detect collisions with the viewer's head orientation vector (represented as a thin yellow line in this screenshot), which register gaze dwelling over the ROIs. Because the image is a top-down view, the gates appear as green rectangles. In the 3D scene they are actually thin boards standing in front of the ROIs; they serve only as detection mechanisms and are not visible to the viewers.

FIGURE 4 | A screenshot from the Unity editor showing the cards distributed around the scene. They are interactable with the virtual controller in the hand of the viewer. In this figure, all of the cards are visible and pointed out with yellow arrows for illustration purposes only. In the actual scene the yellow arrows are not visible. Also, the cards do not all appear at once as they do in this figure. They are programmed to show at certain locations and times according to the progress of the current segment.

FIGURE 5 | Screenshots of what participants see in the four conditions. (A): For conditions DC, DR, and VI, the participants simply sit and watch the 360-degree videos. The Helper HUD is always visible in the scene and fixed to the lower left corner of the view. It is highlighted with the yellow box. The yellow box is not visible in the experiment. (B): For condition VE, the participant will also see a virtual controller with a red laser pointer in her right hand. There are also interactive cards with text visible in the scene at certain times (like the one in the screenshot). The participant then uses the virtual controller to point and make explicit choices.

TABLE 1 | The mean values of the results for Engagement, Enjoyment, General user experience, and Memory performance for each condition. The highest values for each measure are highlighted in bold.

TABLE 2 | The number of participants' responses grouped by themes and sub-level topics, with their distribution in the different conditions experienced.