Abstract
Large media façades are reshaping buildings and public spaces into immersive environments, yet empirical knowledge of how pedestrians behave inside these media spaces is still limited. This study introduces a fully automated pipeline for in-the-wild behavior analysis that integrates a stereo-depth camera, an object detection model with a multi-target tracking algorithm, and GPT-4o with visual reasoning. Deployed at London's immersive media building Now Arcade, the system captured 2 h of depth-enhanced video and produced more than six hundred anonymised visitor trajectories without any manual annotation. It reliably identified three recurrent behaviors: passing-by, lingering, and shooting (photographing or filming). To reveal where these actions occur, we propose Behavior Instance Density (BiD) heat-maps that project frame-level behavior instances onto a floor-plan grid of 0.5m × 0.5m squares. A comparative BiD study of two 1 h-long content loops, one with static high-contrast imagery and one with dynamic low-contrast animation, shows clear content-driven behavior differences. Static saturated graphics encourage longer stays and more filming at the building's entrance and exit thresholds, while dynamic darker visuals maintain a predominantly transit-oriented flow through the corridor. The proposed pipeline uses a compact, cost-effective sensing setup, safeguards privacy by discarding raw images after processing, and can be scaled for long-term or multi-site deployments. The resulting behavioral insights offer concrete guidance for media-architecture design and lay the groundwork for responsive façades that update their digital content in real time according to observed human engagement.
1 Introduction
As large-scale digital screens gradually integrate into urban buildings, media architecture has emerged—an immersive three-dimensional environment that fuses digital displays with architecture and public space (Tomitsch et al., 2015). These settings are transforming cityscapes, serving as interactive hubs that draw diverse audiences and redefine how people engage with public spaces. In these complex environments, understanding human behavior is crucial for informing digital content design and enhancing viewers' experience. This research challenge aligns with the interdisciplinary domain of Human-Building Interaction (HBI), which focuses on the relationship between digitally enhanced architectural environments and human behavior (Alavi et al., 2019; Fatah gen Schieck et al., 2025).
Despite the growing prevalence of media architecture, empirical research on viewers' behavior in three-dimensional, multi-screen environments remains relatively scarce. Existing work largely revolves around theoretical models or simplified flat displays, relying on manual observation and qualitative methods (Müller et al., 2012; Behrens et al., 2013; Fischer, 2015; Tzortzi and Fatah gen. Schieck, 2025). These approaches are labor-intensive and lack scalability, limiting the ability to leverage reliable data insights to improve media architecture design.
Recent advances in automated behavior analysis offer promising avenues for addressing these limitations. For example, computer vision techniques (e.g., YOLO) have been widely used in pedestrian detection and behavior analysis (Kajabad and Ivanov, 2019; Oltean et al., 2019). Nevertheless, such methods often require extensive labeled data to handle specialized recognition tasks, which is time-consuming and resource-intensive. By contrast, depth cameras (e.g., ZED 2i) and multimodal large language models (e.g., GPT-4o) offer a more flexible approach, enabling the identification, classification, and localization of human actions or behaviors never encountered during training, without manual labeling (OpenAI et al., 2024). These tools open new possibilities for large-scale, automated analysis of human behavior in complex environments and pave the way for more interactive and responsive media architecture.
In this paper, we present an automated analysis pipeline that combines a depth camera, computer vision, and large language models (LLMs) to capture and examine crowd behavior in a three-dimensional media architecture setting. Specifically, we integrate a ZED 2i depth camera, the YOLOv10x object detection framework (Wang et al., 2024) with DeepSORT tracking (Veeramani et al., 2018), and GPT-4o's visual capabilities (OpenAI et al., 2024) to identify and classify viewers' behaviors. To offer actionable insights into spatial interaction patterns, we introduce an innovative visualization technique, the Behavior Instance Density (BiD) heatmap, to represent the spatial intensity and distribution of behaviors. We applied this pipeline to the Now Arcade building in London, an immersive public media space equipped with multiple digital screens. Two separate 60-min datasets were collected in the wild under two distinct types of digital content: Rainbow Arcade with static high-contrast imagery and Space in Between with dynamic low-contrast animation. In Rainbow Arcade, 304 viewers were recorded, while 291 viewers were recorded in Space in Between. We analyzed and compared how the two types of content influenced viewer behavior, focusing on passing-by, lingering, and the special behavior of shooting (photographing/filming).
Our results demonstrated that the Rainbow Arcade, with its vibrant and static imagery, elicited higher levels of passing-by, lingering, and shooting behaviors compared to the dynamic and subdued Space in Between. The BiD heatmaps revealed distinct spatial behavior patterns, with greater engagement concentrated near entrance areas. Furthermore, our method, which combines motion analysis from depth camera data with behavior detection via GPT-4o and YOLOv10x, achieved around 90% overall accuracy in classifying key behaviors, demonstrating its effectiveness for real-world viewer analysis. These findings underscore the potential of integrating automated depth sensing and multi-modal analysis to inform media architecture design and foster richer viewer engagement.
2 Related work
2.1 Theoretical models of behavior
Public displays and media façades have long been a key area of interest in media architecture, which blends insights from Human-Computer Interaction (HCI) with architecture and urban design. Researchers in this domain focus on how digital elements embedded in the built environment influence human behavior, exploring the interfaces between architectural design and interaction design.
A wide range of theoretical models has been proposed to classify and interpret viewer behavior around public displays. One line of work centers on categorizing viewers by their roles or levels of engagement with the display. One such approach proposed the “Performance Triad” model, dividing viewers into observers, participants, and performers, emphasizing that these roles are equally integral to the digital performance context (Sheridan et al., 2005). Similarly, another segmentation categorized users into bystanders, spectators, and actors, where bystanders show little interest, spectators passively engage but do not actively interact, and actors become deeply involved in the displayed content (Finke et al., 2008). Although these role-based models provide an initial theoretical framework for understanding viewer engagement around public displays, they do not fully address how viewer roles may shift dynamically within complex social or spatial environments.
A second set of models emphasizes the stages of interaction and the social dynamics surrounding public displays. Brignull and Rogers (2003) were the first to map the social choreography around large public displays—showing how passers-by drift toward the screen, cluster, and evolve from casual onlookers into active participants—while Wouters et al. (2016) extended this line of work with their “Honeypot Model,” which traces the successive engagement trajectories and social cues that entice observers to approach, learn from others, and ultimately join the interaction. Building upon this, another study introduced the “Audience Funnel” model, breaking interactions into six stages, ranging from simply passing by to deeper forms of engagement (Michelis and Müller, 2011). This model offers a more nuanced understanding of how viewers coordinate their activities and levels of participation. Further extending these insights, a separate study proposed the PACD model (Passing-by, Approaching, Coalescing, Departing), which highlights turn-taking and the dynamic interplay among viewers in front of a display (Memarovic et al., 2012). Their work demonstrates how active engagement can stimulate more interaction—even influencing those who initially only pass by. These models have been particularly effective in analyzing interactions around smaller-scale public screens.
To address the complexities of larger and more immersive public displays, one approach introduced the “Urban HCI” model. This model broadens the scope beyond small displays and classifies urban spaces around large public screens into zones such as “Display Space,” “Interaction Space,” and “Comfort Space” (Fischer and Hornecker, 2012). In a similar vein, the ELSI model integrates viewer roles and spatial relationships across displays of varying scales (Memarovic et al., 2015). The ELSI model defines five viewer roles and organizes the space around displays into distinct zones, including the Potential Active Engagement Space (which encompasses the Active Engagement Space), the Passive Engagement Space, and the Display Awareness Space. By synthesizing engagement patterns across different display sizes, the ELSI model provides a multi-functional framework for examining how viewers respond to and interact with public displays.
However, these spatially oriented models primarily focus on flat or isolated displays and thus do not capture the complexities of linked, distributed, or three-dimensional configurations. This gap limits our understanding of how viewers behave in more intricate environments, such as those incorporating multiple interconnected display surfaces or media façades that form an immersive media space. Addressing this limitation is crucial for advancing research on viewer behavior within media-rich architectural contexts, where non-traditional display formats may provoke new forms of engagement.
2.2 Methods for tracking viewers' behavior
To understand how viewers behave around public displays and media architecture, researchers have developed various methods to capture and analyze behavioral data. Early studies often relied on manual observation and data extraction, which, while capable of providing rich insights, were costly, time-consuming, and challenging to scale for large-scale data collection (Behrens et al., 2013; Fischer, 2015; Qin et al., 2020; Psarras et al., 2019; Memarovic et al., 2016; Tzortzi and Fatah gen. Schieck, 2025). For example, one study performed qualitative analysis and image-based observations to categorize typical behavioral patterns among different viewer roles (e.g., actors, spectators, passers-by) interacting with urban displays (Behrens et al., 2013). Another investigation conducted a twelve-week manual analysis of snapshots and interaction logs to examine viewer engagement across multiple urban settings (Memarovic et al., 2016). While these approaches yielded valuable insights, their reliance on manual coding limited the scope and efficiency of the research. Complementing qualitative techniques with quantitative ones can provide richer and more detailed data. A specialized tool called the “Comprende Mapper” was introduced for manually mapping viewer trajectories in video footage (Fischer, 2015), while other researchers used cameras to record movements, later determining each individual's position and gaze direction manually (Qin et al., 2020) or computationally (Psarras et al., 2019). Despite offering fine-grained details, these manual methods remain labor-intensive and difficult to scale to larger contexts.
As technology advanced, automated methods emerged. One study employed LiDAR to track visitor movements in museums, benefiting from improved privacy protection though lacking the ability to capture subtle behavioral nuances (Rashed et al., 2016). In contrast, combining RGB cameras with deep learning-based computer vision techniques like YOLO enables more detailed observation of how people interact with their environment. Another approach utilized the YOLOv3 algorithm to detect pedestrians and differentiate behaviors such as walking, running, or standing, thus offering a more scalable and data-rich approach than manual methods (Kajabad and Ivanov, 2019; Oltean et al., 2019).
Depth cameras further enhance the capture of three-dimensional spatial information. Earlier investigations explored the use of Kinect sensors to track pedestrian movements and interests (Seer et al., 2014; Chen et al., 2016). However, the Kinect's limited field of view and maximum measurement distance (under approximately 5 m) constrained its applicability in large public spaces. Stereo-depth technologies, such as the ZED 2i camera, address these limitations by offering an extended measurement range of up to 20 m (Stereolabs, 2025). Abdelsalam et al. (2024) found that the ZED 2i can reliably capture depth up to 18 m in HD1080 and HD2K. Moreover, in settings with very dark objects and poor lighting, the ZED 2i camera produced more accurate and reliable depth maps than the RealSense cameras (Tadic et al., 2022).
Comparative studies have benchmarked the ZED 2i against RealSense devices for skeleton tracking, showing that the ZED 2i outperformed its counterpart across multiple criteria (Sosa-León and Schwering, 2022; Aharony et al., 2024). Although the ZED 2i has not yet been specifically used to examine viewer behavior around public displays, similar technologies have shown promise in related fields. For instance, depth camera-based methods have been used to detect pedestrians in autonomous driving contexts (Harisankar and Karthika, 2020; Abughalieh and Alawneh, 2020), and similar techniques have been applied to locate potted flowers in agricultural settings (Wang et al., 2022). Nonetheless, depth-camera pipelines coupled with computer-vision detectors still require heavy task-specific annotation to capture semantic context in uncontrolled public-display settings, revealing a gap for label-efficient, semantics-aware approaches—motivating the integration of multimodal large language models.
2.3 Integrating large language models
OpenAI's introduction of GPT-4v in 2023 marked a major breakthrough in integrating visual capabilities into LLMs (OpenAI, 2023). This model excelled in tasks requiring visual comprehension (Yang et al., 2023). In 2024, OpenAI released GPT-4o, further enhancing multimodal capabilities and achieving even stronger performance in vision-related tasks (OpenAI et al., 2024; Shahriar et al., 2024). These advancements support the application of LLMs like GPT-4v to tasks such as human feature recognition and activity identification, addressing scalability challenges that traditional methods often face (Ogawa et al., 2024; Hirano et al., 2024; Fujimoto and Bashar, 2024; Limberg et al., 2024).
LLMs have shown great potential in automating complex tasks, including multi-attribute classification. One approach used GPT-4v to automatically categorize and annotate large image datasets without manual labeling (Fujimoto and Bashar, 2024). By designing suitable prompts and feeding single-person images into GPT-4v, the model was able to generate multiple attributes—such as clothing colors, hairstyle, age, and gender. Its performance surpassed that of a ResNet-50 convolutional neural network model that had been enhanced with manual annotations.
Beyond feature recognition, LLMs have also been applied to human activity recognition. GPT-4v was combined with knowledge graphs to analyze multimodal datasets, aiming to assess safety risks in elderly households (Ogawa et al., 2024; Hirano et al., 2024). By leveraging multimodal large language models (MLLMs) like GPT-4v's reasoning capabilities, these studies addressed complex challenges in daily activity detection.
While LLMs excel at generating context-aware descriptions and high-level reasoning, they struggle with precise object localization. In contrast, object detection neural network models like YOLO (Redmon et al., 2016) are renowned for accurate detection and positioning but lack semantic and contextual understanding. Integrating YOLO with MLLMs like GPT-4o can leverage their complementary strengths: YOLO focuses on detection and localization, while the LLM provides semantic enrichment and relational reasoning. One framework illustrated this synergy by introducing a Visual-Language Agent (VLA) (Yang et al., 2024), in which YOLO acts as the “visual agent,” ensuring precise detection, while GPT-4v serves as the “language agent,” refining results through spatial and contextual analysis. This combined approach enhances accuracy and contextual coherence. Another investigation evaluated YOLO and GPT-4v in aerial scenarios captured by drones (Limberg et al., 2024), finding that YOLO-World efficiently identified individuals and their positions, while GPT-4v improved scene interpretation by filtering out irrelevant areas. This demonstrates that combining object detection with multi-modal reasoning significantly boosts activity recognition performance.
Although integrating YOLO and multi-modal LLMs like GPT-4v or GPT-4o offers a powerful and flexible framework for tackling complex tasks in human feature and activity recognition, these techniques have yet to be applied to viewers' behavior analysis around public displays or validated in uncontrolled, real-world environments. Our research seeks to address this gap by developing and evaluating such a framework in the wild, as outlined in the following methodology section.
3 Methodology
We conduct frame-level tracking and behavior analysis in the wild at a busy public media-architecture site rather than in a controlled laboratory (Rogers and Marshall, 2017). In-situ tracking entails uncontrolled illumination, crowding, occlusions, and shifting backgrounds that can degrade detection fidelity, depth accuracy, and temporal consistency; yet it affords higher ecological validity by capturing authentic interactions that are hard to reproduce in lab settings. Accordingly, this section (1) characterizes the site and screen content (Context Analysis); (2) specifies the technical framework—ZED 2i stereo depth, YOLOv10x for person detection, DeepSORT for multi-target tracking, GPT-4o for zero-shot semantic labeling—together with coordinate calibration and speed estimation (Technical Framework Development); (3) documents the in-situ data-collection protocol, including camera placement, schedule, and ethical considerations (Data Collection); and (4) details our visualization strategy via Behavior Instance Density (BiD) heat maps on a 0.5m × 0.5m floor grid (Data Visualization).
3.1 Context analysis
3.1.1 Site analysis
This study focuses on Outernet in central London, a new form of media architecture that provides an immersive digital experience. Outernet consists of several public areas whose interior walls and ceilings are covered with LED screens, primarily encompassing three core zones: the Now Building, Now Trending, and Now Arcade (Berber et al., 2024). Our research centers on the Now Arcade, which attracts a large number of viewers and fosters unique behavior patterns by presenting various digital content.
According to Figure 1a, Now Arcade, located at 21 Denmark Street, serves as a corridor connecting Denmark Street and Denmark Place. The corridor has a total length of 24 m, a width of 4.5 m, and a height of 6 m, and it offers a main passage to other Outernet areas. Its architectural design skillfully combines digital screens with physical space, forming a distinctive immersive media space.
Figure 1
The principal digital displays in Now Arcade are composed of three big LED screens. The west wall and ceiling are covered by the West Screen and the Ceiling Screen, respectively. The West Screen measures 20 m in length and 5.5 m in height, while the Ceiling Screen is also 20 m long and 4.5 m wide. In addition, the east wall features an East Screen of the same dimensions as the West Screen (20 m in length and 5.5 m in height), though part of its display is obstructed by an entrance leading to the vehicle lift lobby and stairwell. Together, these screens constitute the core infrastructure for digital content display within Now Arcade.
Now Arcade is open daily from 10:30 a.m. to 11:30 p.m., extended until midnight on Fridays and Saturdays. During operating hours, the screens loop various video content. Pedestrians on Denmark Street can see these displays when passing by the main entrance, capturing their attention. Meanwhile, pedestrians entering from Denmark Place must pass through Now Arcade to reach Denmark Street, further increasing exposure to the displayed media content.
3.1.2 Screen content analysis
For this research, we specifically selected two different digital content sequences shown from 6:00 p.m. to 7:00 p.m. each day, which we have designated as Rainbow Arcade and Space in Between. Each type of content forms a complete 60-min sequence (Figures 2a, b).
Figure 2
Rainbow Arcade (Sequence A) consists of six identical sub-sequences, each lasting 10 min. Every subsequence is further divided into two segments: a 30-s opening segment called Clip0, and a 570-s main segment called ClipA (Figure 2c). By contrast, Space in Between (Sequence B) also comprises six sub-sequences, each divided into four segments. The opening segment, Clip0, runs for 30 s, followed by three content segments played in a loop—ClipB1 (110 s), ClipB2 (30 s), and ClipB3 (50 s)—collectively adding up to 570 s (Figure 2d).
Sequence A contains relatively static visuals, focusing on color and graphical presentations, whereas Sequence B features more dynamic content with rapidly changing images and complex animations. This contrast offers a unique perspective for observing how different visual stimuli shape viewer interaction behavior, revealing which content elements may significantly influence viewer engagement and movement paths. Further analysis of the visual content in the sub-sequences is presented in Figure 3.
Figure 3
3.1.3 Observed behaviors
After studying the two content sequences (Sequence A and Sequence B) in Now Arcade, we observed several typical behavioral patterns among viewers interacting with the media architecture. By recording and analyzing these behaviors, we found that different types of digital content not only affect viewers' movement status but also shape their specific modes of interaction. Among these, photographing or filming proved particularly significant.
In Sequence A, for instance (Figure 4a), a man entered Now Arcade through the main entrance, paused to watch the displays, and then took out his phone to record a video. After a brief stop, he returned to the main entrance and exited Now Arcade. This process illustrates the dynamic progression of a viewer's engagement, moving from a simple pause to more active participation (e.g., filming). In Sequence B (Figure 4b), another man entered Now Arcade from the main entrance but was attracted by the more dynamic screen content, slowing his pace as he moved further into the media space. After walking a few steps, he recorded videos from multiple angles before continuing to the secondary exit. These two scenarios suggest that the type of digital content affects not only the viewer's movement trajectory but also the depth, duration, and form of engagement.
Figure 4
Based on our observations in situ, viewers' behavior can be categorized into two primary motion statuses: Passing-by and Lingering. Passing-by describes viewers who walk through the arcade without significant pauses. This is generally a brief interaction, with limited viewer attention to the screen content; they simply move through the environment without deeper engagement. Lingering refers to viewers who choose to remain in Now Arcade, perhaps wandering around, closely observing the screen content, or even interacting to some degree. Lingering behaviors reflect a higher level of interest in the content and demonstrate how media architecture can capture attention and foster engagement, for example, through occasional touches of the screen.
Furthermore, we observed that photographing or filming (Figure 4c), collectively referred to as shooting, is a specialized behavior that cuts across both passing-by and lingering modes. In some cases, shooting accompanies passing-by—some viewers walk and record simultaneously, quickly capturing the visual elements of the media space. In other cases, shooting is paired with lingering. Some viewers stop, record videos or take photos from various angles, and spend additional time carefully framing shots and capturing details. The variety of shooting behaviors is also reflected in viewers' motives and styles. Some focus on capturing the architecture and screens themselves, while others film friends or passers-by, and still others take selfies to integrate themselves into the media environment. Whether in a passing-by or lingering context, shooting not only serves as a personal mode of interaction but also has broader implications for dissemination. When viewers share their videos or photos on social media platforms, the digital content reaches an even wider audience. Consequently, we regard shooting as an important indicator of digital content quality, with higher shooting frequencies typically signaling more compelling material and greater potential for further exposure (Liu and Li, 2021; Alaily-Mattar et al., 2023).
3.2 Technical framework development
Following the site analysis and in-situ behavior observations, we iteratively designed and implemented a technical framework for frame-level viewer-behavior analysis (Figure 5). This section first outlines the hardware setup at Now Arcade, then details the data-processing and analysis pipeline—which ingests RGB frames and aligned 3D depth from ZED SVO files and integrates YOLOv10x (Wang et al., 2024) for person detection, DeepSORT (Veeramani et al., 2018) for multi-target tracking, and GPT-4o's visual capabilities (OpenAI et al., 2024) for behavior recognition and localization—and finally describes the data collection and visualization.
Figure 5
3.2.1 Hardware setup
The data acquisition system consists of one ZED 2i stereo depth camera and a laptop computer. We selected the highest-performing ZED 2i 2.1mm lens version (Aharony et al., 2024), offering a maximum field of view of 110° with 20 m for 3D recognition and 40 m for 2D recognition (Stereolabs, 2025). To improve depth data accuracy in public spaces, the ZED 2i is equipped with a polarizing filter that reduces glare and reflections from LED screens, thereby optimizing data quality (TEGARA Co., 2022). The camera connects to the laptop via a USB-C 3.0 cable, and data are captured using Stereolabs' ZED Explorer software.
The ZED Explorer software runs on a laptop equipped with an Nvidia RTX3070 GPU, leveraging CUDA acceleration to ensure smooth and real-time data acquisition. For convenient data collection in real-world scenarios, the system can run continuously for approximately 80 min on the laptop's battery power. The recorded data are saved in SVO format—a custom file format by Stereolabs that contains both RGB video and corresponding 3D depth data.
In our experiments, the ZED 2i was configured to run at 1080P resolution and 30 FPS, which is considered the optimal performance setting for the ZED 2i depth camera. This configuration delivers high-definition video quality while maintaining accurate depth data within an 18-m range (Abdelsalam et al., 2024). By combining these hardware components and settings, the system efficiently captures and processes the visual and depth data required for our study.
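For illustration, the sketch below shows how an equivalent capture configuration could be expressed with Stereolabs' Python SDK (pyzed). This is a minimal sketch, not the deployed recording code—recording in this study was performed with ZED Explorer—and the output file name and loop duration are placeholder assumptions.

```python
# Minimal sketch of the capture settings described above, using the
# Stereolabs Python SDK (pyzed). The study recorded with ZED Explorer;
# this snippet only illustrates an equivalent configuration.
import pyzed.sl as sl

init_params = sl.InitParameters()
init_params.camera_resolution = sl.RESOLUTION.HD1080  # 1080p, as in the study
init_params.camera_fps = 30                           # 30 FPS capture rate
init_params.coordinate_units = sl.UNIT.METER          # depth expressed in meters
init_params.depth_maximum_distance = 18.0             # reliable range (Abdelsalam et al., 2024)

zed = sl.Camera()
if zed.open(init_params) != sl.ERROR_CODE.SUCCESS:
    raise RuntimeError("Could not open the ZED 2i camera")

# Record RGB + depth into an SVO file, mirroring ZED Explorer's output.
rec_params = sl.RecordingParameters("now_arcade.svo", sl.SVO_COMPRESSION_MODE.H264)
zed.enable_recording(rec_params)
runtime = sl.RuntimeParameters()
for _ in range(30 * 60 * 60):         # ~60 min of frames at 30 FPS
    zed.grab(runtime)                 # each successful grab appends a frame
zed.disable_recording()
zed.close()
```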
3.2.2 Data processing and analysis framework
In this framework (Figure 5), a “frame image” refers to a single image captured by the ZED 2i camera at 30 frames per second (FPS) during RGB video recording. A viewer detected in the frame image is defined as one “instance.” If multiple viewers are detected, multiple instances are present. We use these instances to progressively identify and analyze viewer movement status (lingering/passing-by) and special behaviors, while extracting and calibrating the relevant 3D depth coordinates for visualization. The on-site qualitative basis for our behavioral classification was discussed in Section 3.1.3.
First, the YOLOv10x model is employed to detect viewers in the video. Each viewer is then tracked across frames using the DeepSORT algorithm. Trained on the COCO dataset (Lin et al., 2014), YOLOv10x efficiently detects people in RGB frames and assigns each person a unique ID, along with a bounding box for the detected region in the image. DeepSORT subsequently associates these bounding boxes across consecutive video frames based on the unique ID, enabling trajectory tracking for the same individual. Although DeepSORT reliably maintains most identities, dense occlusions sometimes split one visitor into several short-lived tracks. For BiD heat-map generation (see Section 3.4 for details), this fragmentation is inconsequential because the BiD metric aggregates frame-based behavior instances. However, to obtain accurate per-person statistics we performed a post-hoc merge of duplicate IDs.
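The following sketch illustrates this detect-then-track stage. It assumes the `ultralytics` package for YOLOv10x and the `deep-sort-realtime` package for DeepSORT—the paper specifies the models rather than these particular implementations—and the video path is a placeholder.

```python
# Sketch of the detect-then-track stage: YOLOv10x finds people, DeepSORT
# links their bounding boxes across frames under persistent track IDs.
import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

model = YOLO("yolov10x.pt")            # COCO-pretrained person detector
tracker = DeepSort(max_age=30)         # keep lost tracks alive ~1 s at 30 FPS

cap = cv2.VideoCapture("now_arcade_rgb.mp4")  # placeholder: RGB stream from the SVO
while True:
    ok, frame = cap.read()
    if not ok:
        break
    result = model(frame, classes=[0], verbose=False)[0]  # class 0 = person
    detections = []
    for box, conf in zip(result.boxes.xyxy.tolist(), result.boxes.conf.tolist()):
        x1, y1, x2, y2 = box
        detections.append(([x1, y1, x2 - x1, y2 - y1], conf, "person"))
    tracks = tracker.update_tracks(detections, frame=frame)
    for t in tracks:
        if t.is_confirmed():
            x1, y1, x2, y2 = t.to_ltrb()   # bounding box with a persistent ID
            print(t.track_id, x1, y1, x2, y2)
```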
We adopt a sampling frequency of once per second by analyzing only frames whose numbers are multiples of 30, as the camera records video at 30 frames per second. These frames are referred to as “sampled frames.” For each viewer in a sampled frame, we record the ID, frame number, bounding box coordinates, and the original 3D spatial coordinates. The 3D coordinates are obtained from the ZED 2i's depth data by referencing the bounding box's horizontal center point and vertical upper-quarter point. This spatial information underpins further analyses of viewer position and movement patterns.
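A minimal sketch of this 1 Hz sampling and depth look-up, assuming SVO playback through the pyzed SDK and a hypothetical helper `boxes_for_frame()` that returns the tracked bounding boxes for a given frame:

```python
# Sketch of the 1 Hz sampling and 3D look-up: every 30th frame, read the
# point cloud at each box's horizontal center / vertical upper-quarter point.
import pyzed.sl as sl

init_params = sl.InitParameters()
init_params.set_from_svo_file("now_arcade.svo")   # placeholder SVO path
init_params.coordinate_units = sl.UNIT.METER
zed = sl.Camera()
zed.open(init_params)

cloud = sl.Mat()
runtime = sl.RuntimeParameters()
records = []
while zed.grab(runtime) == sl.ERROR_CODE.SUCCESS:
    frame_no = zed.get_svo_position()
    if frame_no % 30 != 0:                        # sample once per second at 30 FPS
        continue
    zed.retrieve_measure(cloud, sl.MEASURE.XYZRGBA)
    for viewer_id, (x1, y1, x2, y2) in boxes_for_frame(frame_no):  # hypothetical helper
        u = int((x1 + x2) / 2)                    # horizontal center of the box
        v = int(y1 + (y2 - y1) / 4)               # vertical upper-quarter point
        err, xyz = cloud.get_value(u, v)          # (x, y, z, rgba) in camera coordinates
        if err == sl.ERROR_CODE.SUCCESS:
            records.append((viewer_id, frame_no, xyz[0], xyz[1], xyz[2]))
```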
To compute each viewer's speed in a sampled frame, we compare their 3D spatial coordinates in the current sampled frame with the coordinates from the prior non-sampled frame (i.e., the frame immediately preceding it in the video sequence). Speed v is calculated using the following equation:

$$v = f \sqrt{(x' - x)^2 + (y' - y)^2 + (z' - z)^2} \tag{1}$$

In Equation 1, f represents the frame rate of the video (30 frames per second), and (x, y, z) and (x′, y′, z′) denote the viewer's spatial coordinates in the prior and current frames, respectively.
We then apply GPT-4o's vision capability for zero-shot detection by cropping an individual image of each viewer from their bounding box in the frame. Two sets of prompts are used:
Prompt 1: Determine if the person could be [Special Behavior]. Please reply in this format: “Behavior: TRUE/FALSE.”
Prompt 2: Determine if the person could be walking or standing. Please reply in this format: “Status: WALK/STAND.”
Prompt 1 checks whether a viewer in the sampled frame exhibits any special behaviors (e.g., taking pictures, taking selfies, jumping, or dancing). These special behaviors can be defined based on the specific research context.
Prompt 2 determines whether a viewer in the sampled frame is standing or walking. We do not directly classify a viewer as lingering or passing-by because lingering may involve both standing and slow walking. Additionally, a single static frame of slow walking may visually resemble passing-by, making visual analysis alone insufficient. Thus, only the viewer's velocity, as defined in Equation 1, can help distinguish between these states.
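The sketch below shows how such a zero-shot query could be issued with the OpenAI Python client; the message format follows the public chat-completions API, while the image file name and the concrete special behavior are illustrative assumptions.

```python
# Sketch of the zero-shot query: a cropped person image is sent to GPT-4o
# as a base64 data URL together with one of the two prompts above.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_crop(jpeg_path: str, prompt: str) -> str:
    with open(jpeg_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content  # e.g., "Behavior: TRUE" or "Status: WALK"

# Illustrative calls on a hypothetical crop, instantiating Prompts 1 and 2.
shooting = classify_crop("viewer_042.jpg",
    'Determine if the person could be taking pictures or filming. '
    'Please reply in this format: "Behavior: TRUE/FALSE."')
status = classify_crop("viewer_042.jpg",
    'Determine if the person could be walking or standing. '
    'Please reply in this format: "Status: WALK/STAND."')
```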
However, as the reliable depth measurement range of the ZED 2i is up to 18 m, viewer coordinates detected beyond this range may be inaccurate, introducing errors in velocity calculation. Relying solely on speed for movement status classification can therefore be imprecise. To mitigate this, we aim to integrate the LLM's visual judgments with viewer speed to infer whether the viewer is lingering or passing-by, thereby improving accuracy. The rules are set as follows:
Speed > 1 m/s and Status: WALK → Passing-by.
Speed > 1 m/s and Status: STAND → Lingering.
Speed < 1 m/s and Status: WALK → Lingering.
Speed < 1 m/s and Status: STAND → Lingering.
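A minimal sketch of this fusion rule, combining the speed from Equation 1 with the LLM's WALK/STAND label (the coordinate values in the example are illustrative):

```python
# Sketch of the fusion rule above: only a fast, visually walking viewer is
# classified as passing-by; every other combination counts as lingering.
import math

FPS = 30.0  # video frame rate f in Equation 1

def speed(prev_xyz, curr_xyz, f=FPS):
    """Speed between two consecutive frames (Equation 1), in m/s."""
    return f * math.dist(prev_xyz, curr_xyz)

def motion_status(v, status):
    """Fuse speed (m/s) with the LLM's visual status ('WALK' or 'STAND')."""
    return "Passing-by" if v > 1.0 and status == "WALK" else "Lingering"

# Example: 5 cm travelled in 1/30 s -> 1.5 m/s while visually walking.
print(motion_status(speed((0.00, 0.0, 5.0), (0.05, 0.0, 5.0)), "WALK"))  # Passing-by
print(motion_status(0.4, "WALK"))   # slow walking -> Lingering
```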
After completing these analyses, we calibrate the viewers' 3D spatial coordinates. Since the raw 3D coordinates are centered on the ZED 2i's camera coordinate system, they need to be remapped to a real-world reference system. We use Rhino, a 3D modeling software commonly used in architecture and design, and Grasshopper, a visual programming environment integrated with Rhino, to read the original 3D coordinate data and align it with point cloud data of the observation site. First, we visualize all viewers' coordinate points and the site's 3D point cloud in Rhino. Next, we select the ground surface as the reference plane and establish its origin and the x-axis and y-axis vectors, converting the viewers' raw coordinates into 2D coordinates based on that plane.
The transformation of raw 3D coordinates (x, y, z) into a 2D coordinate system on the reference plane is achieved using the following formulas:

$$x' = \frac{(x - x_0,\; y - y_0,\; z - z_0) \cdot \vec{u}}{\lVert \vec{u} \rVert^2} \tag{2}$$

$$y' = \frac{(x - x_0,\; y - y_0,\; z - z_0) \cdot \vec{v}}{\lVert \vec{v} \rVert^2} \tag{3}$$

In Equations 2, 3, (x0, y0, z0) is the origin of the reference plane, while $\vec{u}$ and $\vec{v}$ represent the direction vectors of the plane's x-axis and y-axis, respectively. The formulas first translate the original 3D coordinates relative to the plane's origin, forming the vector (x−x0, y−y0, z−z0). This vector is then projected onto the reference plane by calculating its dot product with the direction vectors $\vec{u}$ and $\vec{v}$, normalized by the squared magnitude of the corresponding direction vector. The resulting (x′, y′) coordinates provide the 2D position of the point on the reference plane, allowing for accurate alignment and visualization in subsequent analyses.
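A minimal sketch of Equations 2, 3, assuming placeholder values for the plane origin and axis vectors (in practice these are obtained by aligning the point cloud in Rhino/Grasshopper):

```python
# Sketch of Equations 2-3: project a raw camera-space point onto the ground
# reference plane defined by an origin and two axis direction vectors.
import numpy as np

origin = np.array([0.0, -2.9, 0.0])   # placeholder plane origin (x0, y0, z0)
u_axis = np.array([1.0, 0.0, 0.0])    # placeholder x-axis direction of the plane
v_axis = np.array([0.0, 0.0, 1.0])    # placeholder y-axis direction of the plane

def to_plane_2d(point_3d):
    """Return (x', y') of a 3D point on the reference plane (Equations 2, 3)."""
    d = np.asarray(point_3d) - origin          # translate relative to the origin
    x_2d = d @ u_axis / (u_axis @ u_axis)      # projection onto the plane's x-axis
    y_2d = d @ v_axis / (v_axis @ v_axis)      # projection onto the plane's y-axis
    return x_2d, y_2d

print(to_plane_2d([1.2, -2.9, 7.5]))   # -> (1.2, 7.5) with these placeholder axes
```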
Using these calibrated coordinates, we then calculate the Behavior Instance Density (BiD) to quantify the frequency of specific behaviors in each area. Following prior studies (Shi and Liu, 2014; Kuligowski et al., 2010; Pan et al., 2025), the observation area is divided into 0.5m × 0.5m grid cells, each representing an individual's space, with behavior instances counted per cell. Each cell's BiD value represents the density of viewer behaviors in that region, and heat maps are generated based on the site's floor plan layout. The detailed methodology for generating these heat maps is discussed in Section 3.4.
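The BiD computation itself reduces to a 2D histogram over the calibrated coordinates. A sketch under the grid dimensions stated in Section 3.4 (a 4.5 m × 21 m floor divided into 0.5 m cells):

```python
# Sketch of the BiD computation: calibrated 2D positions of behavior
# instances are binned into 0.5 m x 0.5 m cells covering the 4.5 m x 21 m
# floor area (a 9 x 42 grid), yielding one instance count per cell.
import numpy as np

def bid_grid(xs, ys, width=4.5, length=21.0, cell=0.5):
    """Count behavior instances per floor-grid cell."""
    grid, _, _ = np.histogram2d(
        xs, ys,
        bins=[int(width / cell), int(length / cell)],   # 9 x 42 cells
        range=[[0.0, width], [0.0, length]],
    )
    return grid  # grid[i, j] = BiD value of cell (i, j)

# Example: three lingering instances, two of them in the same entrance cell.
print(bid_grid([0.2, 0.3, 2.1], [0.4, 0.1, 12.6]).sum())  # -> 3.0
```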
3.3 Data collection
In this study of Now Arcade, we collected data on viewers' interactions with (and within) the media environment (Kirsh, 2019). A ZED 2i camera was installed on a tripod at a height of 2.9 m, positioned near the main entrance close to the east wall, and aimed toward the west screen (Figure 1b). Following instructions from the owner of Now Arcade, the tripod was positioned outside the glass entrance, in the public realm of Denmark Street. From this vantage point the ZED 2i had an unobstructed line of sight through the doorway and did not narrow the passageway or interfere with visitor flow. The external placement also simplified ethical compliance: all recordings were taken from a public right-of-way, avoiding any impression that the study intruded on private indoor space.
Data collection took place over two days, each featuring a different screen content sequence. On July 5, 2023 (Wednesday), the Rainbow Arcade (Sequence A) content was recorded from 6:00 p.m. to 7:00 p.m. Under clear and comfortable weather conditions, the system automatically detected 316 entries into Now Arcade. On the following day (July 6, 2023, Thursday), the Space in Between (Sequence B) content was recorded during the same time window. The weather was similar to that of the previous day, and 280 entries into the Arcade were detected.
Because of device battery limitations, each recording lasted 80 min, and the analysis period was limited to 60 min. This choice reflected constraints posed by filming in wild conditions—namely, practical considerations of device performance and prior findings suggesting that a shorter observation window can effectively capture interaction patterns in public spaces (Whyte, 1980; Gehl, 2001). A 60-min sample provides sufficient data for analysis while maintaining consistent environmental conditions. The number of viewers and behavior instances observed in each category are summarized in Table 1.
Table 1
| Sequence | Total viewers | Passing-by viewers | Passing-by instances | Lingering viewers | Lingering instances | Shooting viewers | Shooting instances |
|---|---|---|---|---|---|---|---|
| Sequence A | 304 | 299 | 3,930 | 283 | 5,662 | 75 | 1,144 |
| Sequence B | 291 | 281 | 3,176 | 266 | 4,905 | 39 | 924 |
Summary of viewer counts and behavior metrics for the two 60-min sequences (Sequence A = Rainbow Arcade, Sequence B = Space in Between).
Total Viewers denotes the number of unique individuals who entered the observation zone during the analysis window. For each behavior, Viewers is the number of unique individuals who exhibited that behavior at least once; Instances is the total number of contiguous behavior episodes across all viewers (sampled at 1 Hz; consecutive frames labeled with the same behavior for the same person are counted as one instance). Behavior categories are not mutually exclusive, so a viewer may contribute to multiple behaviors (e.g., a lingering viewer may also shoot).
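The episode-counting rule in this note can be made concrete with a short sketch: at the 1 Hz sampling rate, sampled frames are multiples of 30, so a gap larger than 30 frames opens a new instance.

```python
# Sketch of the instance-counting rule above: consecutive sampled frames
# carrying the same behavior label for one viewer collapse into one episode.
def count_episodes(frames):
    """frames: sorted frame numbers (multiples of 30) where the behavior held."""
    episodes = 0
    prev = None
    for f in frames:
        if prev is None or f - prev > 30:   # gap -> a new contiguous episode
            episodes += 1
        prev = f
    return episodes

# Frames 0-60 form one contiguous episode; frame 150 starts a second one.
print(count_episodes([0, 30, 60, 150]))  # -> 2
```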
3.4 Data visualization
In the study of Now Arcade, we retained viewer data within a 21-m range from the entrance (y = 0) and generated a Behavior Instance Density (BiD) heatmap using the calibrated 2D coordinate data. First, based on Now Arcade's floor plan, we established a two-dimensional coordinate plane and divided the floor area into a 9 × 42 grid, with each cell covering 0.25 m². This coordinate plane spanned much of the Arcade, measuring 21 m in length and 4.5 m in width.
Using the Space in Between (Sequence B) as an example and focusing on lingering behavior (Figure 6), we calculated the BiD for each grid cell. The BiD value represents the number of lingering behavior instances occurring within a cell during the 60-min observation period. This metric reflects how frequently—or for how long—this specific behavior appeared in each spatial location.
Figure 6
The heatmap was generated by coloring each grid cell according to its BiD value, mapped onto a gradient scale from zero to fifty. Cells with a BiD of zero were shown in dark blue, indicating no lingering instances during the timeframe, while cells with a BiD above fifty appeared in dark red, indicating more than fifty lingering instances. This visualization provides a clear representation of the spatial distribution of lingering behavior across the observation period.
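A sketch of this rendering step using matplotlib, with the gradient clipped at fifty instances as described (the grid values here are stand-ins, not study data):

```python
# Sketch of the heat-map rendering: BiD values are mapped onto a
# blue-to-red gradient clipped at fifty instances per cell.
import matplotlib.pyplot as plt
import numpy as np

grid = np.random.poisson(8, size=(9, 42)).astype(float)  # stand-in BiD values

fig, ax = plt.subplots(figsize=(10, 2.5))
im = ax.imshow(
    grid, origin="lower", cmap="jet",
    vmin=0, vmax=50,                 # dark blue at 0, dark red at >= 50
    extent=[0, 21, 0, 4.5],          # floor-plan extent in meters
    aspect="equal",
)
fig.colorbar(im, ax=ax, label="BiD (instances per 0.5 m cell)")
ax.set_xlabel("y (m, distance from entrance)")
ax.set_ylabel("x (m)")
plt.show()
```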
Each cell's BiD value represents the density of lingering behavior within that area. By sequentially generating BiD heatmaps, we captured the overall distribution of all viewers' behaviors during the specified timeframe. Additionally, by filtering data by viewer ID, the heatmap can analyze individual spatial behavior patterns, offering insights into crowd dynamics and individual engagement. This method serves as a robust foundation for optimizing media architecture design and enhancing interactive content (see details in Sections 4.4–4.6).
4 Results
4.1 Validation of multi-modal LLM classification
To validate the classification accuracy of GPT-4o using prompts for behavior detection, we randomly sampled approximately 20% of the total dataset, which was derived from the 2 h of video footage captured by the ZED 2i in real-world scenarios at Now Arcade. The dataset inherently reflected the observed class distribution, where non-shooting behavior constituted over 88% of the instances, while shooting behavior accounted for less than 12%. This imbalance aligns with the real-world prevalence of these behaviors. The evaluation results are summarized in Table 2.
Table 2
| Prompt | Behavior | Overall accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Prompt 1 | Shooting | 93.15% | 64.04% | 94.43% | 76.32% |
| Prompt 1 | Non-shooting | 93.15% | 99.21% | 92.98% | 95.99% |
| Prompt 2 | Standing | 90.56% | 91.62% | 89.28% | 90.43% |
| Prompt 2 | Walking | 90.56% | 89.54% | 91.83% | 90.67% |
Classification performance of GPT-4o under Prompt 1 (shooting vs. non-shooting) and Prompt 2 (Standing vs. Walking).
For Prompt 1, the overall accuracy for identifying shooting vs. non-shooting behaviors reached 93.15%. Non-shooting behavior, the majority class, achieved a high precision (99.21%) and recall (92.98%), leading to an F1-score of 95.99%. In contrast, shooting behavior, the minority class, had a relatively lower precision of 64.04%, though its recall was much higher at 94.43%. This imbalance between precision and recall for shooting indicates that while the model captures most shooting instances (high recall), it occasionally misclassifies non-shooting behaviors as shooting (lower precision).
For Prompt 2, distinguishing between standing and walking, the model demonstrated consistent performance with overall accuracy at 90.56%. Both behaviors showed balanced precision and recall, with F1-scores of 90.43% for standing and 90.67% for walking, indicating reliable classification for both movement statuses.
The results highlight the MLLM's robust ability to identify majority-class behaviors (e.g., non-shooting) with high precision and recall, ensuring reliable classification in real-world, imbalanced datasets. For minority behaviors such as shooting, the model's recall suggests it can effectively capture these instances, though its lower precision indicates room for improvement in reducing false positives. In our observations, false positives predominantly arose when subjects were far from the camera (resulting in low-resolution or motion-blurred crops) and when single-frame pose cues resembled filming, such as checking or typing on a phone held near the chest or face, making a phone call (hand to ear), shading eyes or blocking glare with a raised hand, pointing or gesturing toward the display or companions, or briefly adjusting glasses or hair with hands near the face. Reducing these false positives may involve refining the MLLM's prompts to require more explicit evidence of filming (e.g., a visible device oriented toward the screen) and incorporating temporal consistency (e.g., multi-frame voting) to suppress transient, gesture-like false alarms. The high accuracy and F1-scores for standing and walking further demonstrate the model's capacity to support motion status classification.
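As an illustration of the temporal-consistency idea, the sketch below applies a simple sliding-window majority vote to per-second shooting flags; the window size is an assumption, not a tuned parameter from this study.

```python
# Sketch of multi-frame voting: a shooting label is kept only if it holds
# in the majority of a short window of sampled frames, suppressing
# one-frame, gesture-like false alarms.
def voted_labels(per_frame_flags, window=3):
    """per_frame_flags: per-second GPT-4o shooting flags for one viewer."""
    voted = []
    for i in range(len(per_frame_flags)):
        lo = max(0, i - window // 2)
        hi = min(len(per_frame_flags), i + window // 2 + 1)
        votes = per_frame_flags[lo:hi]
        voted.append(sum(votes) > len(votes) / 2)   # majority within the window
    return voted

# The lone spurious TRUE at index 1 is voted away; the sustained run stays.
print(voted_labels([False, True, False, False, True, True, True]))
# -> [False, False, False, False, True, True, True]
```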
4.2 Validation of motion behavior classification
To validate the classification of motion behavior, specifically differentiating between lingering and passing-by, we conducted a detailed evaluation using a random sample of approximately 20% of the total instances, resulting in a sample size of 3,600 instances. Two classification methods were compared: one relying solely on speed calculated from the depth camera's 3D spatial data and the other integrating speed-based calculations with MLLM visual calibration. The evaluation results are summarized in Table 3.
Table 3
| Method | Motion behavior | Overall accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Speed only | Lingering | 75.51% | 83.98% | 73.13% | 78.18% |
| Speed only | Passing-by | 75.51% | 66.24% | 79.08% | 72.09% |
| Speed + GPT-4o | Lingering | 89.69% | 87.40% | 96.77% | 91.85% |
| Speed + GPT-4o | Passing-by | 89.69% | 94.22% | 79.08% | 85.99% |
Comparison of motion behavior classification performance using speed-only and speed + GPT-4o methods.
The speed-only method, which used velocity thresholds derived from the 3D coordinates of viewers, demonstrated moderate performance, with an overall accuracy of 75.51%. For lingering behaviors, it achieved a precision of 83.98%, a recall of 73.13%, and an F1-score of 78.18%. Performance for passing-by behaviors was lower, with a precision of 66.24% and an F1-score of 72.09%. These discrepancies are primarily due to inaccuracies in depth data at greater distances from the ZED 2i camera, where speed calculations become less reliable.
Incorporating LLM-based visual classification significantly enhanced the accuracy of motion status detection. The combined method used LLM prompts to visually assess walking and standing behaviors, which were then integrated with speed metrics to refine the classification. This approach corrected many instances where speed alone misclassified lingering as passing-by. The combined method achieved an overall accuracy of 89.69%, with a precision of 87.40% and an F1-score of 91.85% for lingering. Similarly, for passing-by behaviors, precision improved to 94.22%, with an F1-score of 85.99%.
While the combined method effectively recalibrated passing-by instances misclassified as lingering, it could not address errors in lingering behaviors that were already misclassified as such in the speed-only approach. Nevertheless, the substantial improvement in accuracy and precision highlights the effectiveness of integrating speed-based metrics with LLM visual assessments. This multi-modal approach provides a robust solution for motion status classification, particularly in challenging real-world environments where depth camera measurements alone may not suffice.
4.3 Statistical analysis of behavioral occurrence and instance intensity
We analyzed behavioral metrics on two complementary levels: (i) occurrence—whether a given behavior happened at all, and (ii) intensity—the number of episodes per viewer conditional on occurrence. Occurrence-rate comparisons used Yates-corrected chi-square tests (three behaviors; Bonferroni-adjusted family-wise α = 0.05, corrected α = 0.0167) with Cramer's V as the effect size. Shapiro–Wilk diagnostics indicated marked deviations from normality for per-viewer episode counts; accordingly, intensity differences were evaluated using Mann–Whitney U-tests on non-zero observations, with Bonferroni-adjusted p-values and rank-biserial r as the effect-size metric. All descriptive and inferential results are summarized in Tables 4, 5.
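For reproducibility, this procedure can be expressed compactly with SciPy. The sketch below uses the shooting occurrence counts from Table 4, while the per-viewer episode counts are stand-in data (the real counts are not published at that granularity).

```python
# Sketch of the statistical procedure: a Yates-corrected chi-square on
# occurrence counts with Cramer's V, and a Mann-Whitney U test on non-zero
# per-viewer episode counts with a rank-biserial effect size.
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

# Occurrence: [with behavior, without behavior] per sequence (shooting, Table 4).
table = np.array([[75, 304 - 75],      # Sequence A
                  [39, 291 - 39]])     # Sequence B
chi2, p, dof, _ = chi2_contingency(table, correction=True)  # Yates correction
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
p_adj = min(1.0, p * 3)                # Bonferroni over the three behaviors

# Intensity: episode counts of viewers who showed the behavior (stand-in data).
a = np.random.poisson(15, 75) + 1
b = np.random.poisson(15, 39) + 1
u_stat, p_u = mannwhitneyu(a, b, alternative="two-sided")
rank_biserial = 1 - 2 * u_stat / (len(a) * len(b))

print(f"chi2={chi2:.2f}, p_adj={p_adj:.3f}, V={cramers_v:.3f}")
print(f"U={u_stat:.0f}, r={rank_biserial:.2f}")
```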
Table 4
| Behavior | Scene | Total users | Users with behavior | Frequency (%) | χ² | p_adj | Cramer's V | Significance |
|---|---|---|---|---|---|---|---|---|
| Passing-by | Sequence A | 304 | 299 | 98.36 | 1.28 | 0.773 | 0.046 | n.s. |
| Passing-by | Sequence B | 291 | 281 | 96.56 | | | | |
| Lingering | Sequence A | 304 | 283 | 93.09 | 0.38 | 1.000 | 0.025 | n.s. |
| Lingering | Sequence B | 291 | 266 | 91.41 | | | | |
| Shooting | Sequence A | 304 | 75 | 24.67 | 11.47 | 0.002 | 0.139 | ** |
| Shooting | Sequence B | 291 | 39 | 13.40 | | | | |
Behavior frequency comparison between Sequence A (Rainbow) and Sequence B (Space) scenes.
Bonferroni-adjusted α = 0.0167. The bold values highlight p-values less than 0.05 as significant.
Table 5
| Behavior | Scene | Median [Q1–Q3] (instances) | p_adj | r | Significance |
|---|---|---|---|---|---|
| Passing-by | Sequence A | 16 [8–23] | 0.015 | 0.23 | * |
| Passing-by | Sequence B | 10.5 [5–14] | | | |
| Lingering | Sequence A | 28 [17–57] | 1.000 | −0.04 | n.s. |
| Lingering | Sequence B | 39 [14–69.5] | | | |
| Shooting | Sequence A | 12 [6–23] | 1.000 | −0.07 | n.s. |
| Shooting | Sequence B | 12 [7–30.75] | | | |
Mann–Whitney U comparison of behavioral instance counts per viewer between scenes (Bonferroni-adjusted α = 0.0167).
The bold values highlight p-values less than 0.05 as significant.
The occurrence-rate analysis shows that both “Passing-by” and “Lingering” occur at comparable rates in the two sequences and do not differ in a statistically meaningful way. For Passing-by, 98.36% of visitors in Sequence A (Rainbow Arcade) and 96.56% in Sequence B (Space in Between) engaged in the behavior; the chi-square statistic was 1.28 with an adjusted p = 0.773 and a trivial Cramer's V of 0.046. Lingering displayed a similarly negligible divergence: 93.09% in Sequence A vs. 91.41% in Sequence B, χ² = 0.38, adjusted p = 1.000, V = 0.025. Shooting, in contrast, proved sensitive to the visual context. The behavior occurred in 24.67% of Rainbow viewers but in only 13.40% of Space viewers, yielding χ² = 11.47, an adjusted p = 0.002, and V = 0.139. Although the effect size is conventionally classified as “small,” the result clearly indicates that the high-saturation static imagery of Rainbow Arcade nearly doubles the likelihood that visitors take photographs or videos.
The instance-count analysis refines the frequency results. Among viewers who actually performed a Passing-by action, the median number of walking episodes in Sequence A was 16 (IQR 8–23), compared with 10.5 (IQR 5–14) in Sequence B. This difference remains significant after Bonferroni adjustment (p = 0.015) and is accompanied by a small rank-biserial effect (r = 0.23), indicating that the Rainbow scene not only triggers the behavior more often but also increases the number of walking episodes by roughly five to six per viewer. For Lingering, the median per-viewer counts were 28 episodes (IQR 17–57) in Sequence A and 39 (IQR 14–69.5) in Sequence B; the adjusted p-value is 1.000 and r = −0.04, showing that scene type has no systematic effect on how many times a visitor remains stationary. Shooting exhibits complete parity between sequences (both medians = 12 episodes) and is likewise non-significant after correction (adjusted p = 1.000, r = −0.07).
4.4 BiD heatmap analysis of behaviors in sequence A and B
To gain deeper insights into viewers' behaviors and reactions within the media space, we conducted a comparative analysis of Sequences A and B, each lasting 60 min. The behaviors analyzed include passing-by, lingering, and shooting. The BiD heatmap was used to visualize the behavioral distribution across the media space, offering a spatial understanding of how viewers interacted within the environment. Sequences A (static high-contrast imagery) and B (dynamic low-contrast animation) display distinct spatial patterns when examined through BiD heatmaps (Figure 7).
Figure 7
4.4.1 Passing-by
Sequence A: In the entrance area (y = 0 to y = 5), there is moderate activity with pockets of higher density at y = 2 to y = 3. In the middle area (y = 5 to y = 15), a noticeable line of higher density exists along x = 1 and x = 2 from y = 10 to y = 15. Near the exit area (y = 15 to y = 21), particularly from y = 17 to y = 21, there is an increase in passing-by behavior density.
Sequence B: The entrance area (y = 0 to y = 5) shows moderate passing-by activity. In the middle area (y = 5 to y = 15), passing-by behavior is more evenly distributed. Near the exit (y = 15 to y = 21), particularly around y = 20, the passing-by behavior density increases sharply.
4.4.2 Lingering
Sequence A: A clear hotspot of lingering activity is present in the entrance area (y = 0 to y = 5). The middle area (y = 5 to y = 15) shows more scattered and less frequent lingering behavior. Near the exit (y = 15 to y = 21), there is an increase in lingering, especially around y = 17.5 to y = 20.
Sequence B: Significant lingering activity occurs near the entrance (y = 0 to y = 5), particularly between y = 0 and y = 3. In the middle area (y = 5 to y = 15), lingering activity is lower overall, with concentrated points such as the red spot near y = 12.5, x = 2. Near the exit (y = 15 to y = 21), particularly around y = 20, another high-density lingering zone exists.
4.4.3 Shooting
Sequence A: A noticeable hotspot of shooting behavior is found in the entrance area (y = 0 to y = 5). In the middle area (y = 5 to y = 15), shooting behavior is almost absent, with only isolated spots (e.g., near y = 12.5). Near the exit area (y = 15 to y = 21), there is almost no significant shooting behavior.
Sequence B: A clear concentration of shooting behavior exists near the entrance (y = 0 to y = 3). In the middle area, there is very little shooting behavior, with a few hotspots between y = 10 and y = 15. Near the exit area (y = 15 to y = 21), shooting behavior is sporadic and low in density.
The BiD heatmaps reflect behavioral differences between Sequences A and B. In Sequence A, people are more likely to enter through the main entrance and linger within the y = 1 to y = 5 range. In Sequence B, people tend to linger and take photos near the main entrance (y = 0 to y = 2.5) rather than moving further into the corridor. Spatial preferences within the Now Arcade differ between sequences; however, at middle and exit locations (y = 10 to y = 21), both sequences show similarities as people gather around the west screen area (x = 0 to x = 2). Those taking photos are usually on the east side, possibly due to a better view of the west screen. Overall, engagement is higher in Sequence A, where the static and colorful imagery encourages viewers to stop and take photos, compared to Sequence B.
4.5 BiD heatmap analysis of behaviors in subdivided sequences
The 60-min sequence was divided into six independent subsequences for a detailed comparison. We examined the consistency of BiD heatmap patterns within each subsequence to understand whether behaviors exhibited spatial consistency over time. This analysis provided a deeper temporal understanding of viewer interactions within the media space (Figure 8).
Figure 8
4.5.1 Passing-by
Subsequence A: The entrance areas (y = 0 to y = 5) show moderate passing-by activity across all subsequences, but no significant congestion or intense movement. The middle areas (y = 5 to y = 15) display moderate to low passing-by behavior across most spaces, with few high-density areas. The exit areas (y = 15 to y = 21) show varying levels of passing-by activity: A01, A02, and A06 show less passing-by activity; A03 and A05 show some moderate activity; and A04 exhibits relatively high density, especially around y = 17.5 and above.
Subsequence B: The entrance areas (y = 0 to y = 5) show moderate passing-by activity across all subsequences, but B02 shows a noticeable hotspot near the entrance (y = 0 to y = 2.5). Most subsequences show scattered passing-by activity throughout the middle areas (y = 5 to y = 15), and B06 shows a somewhat concentrated zone of passing-by behavior around y = 7.5. Most subsequences show moderate passing-by activity near the exit (y = 17.5 to y = 20).
4.5.2 Lingering
Subsequence A: The entrance area (y = 0 to y = 5) shows significant lingering activity across most heatmaps, including A01, A02, A04, A05, and A06. The middle areas (y = 5 to y = 15) show moderate lingering activity in most groups, but the distribution is more varied compared to the entrance areas. A04 and A05 have more concentrated lingering activity around y = 10 to y = 15. The exit areas (y = 15 to y = 21) also exhibit more variability in lingering behavior. A03 and A06 show very little lingering near the exit, while A05 has notable lingering near the exit, with significant clusters around y = 17.5 to y = 20.
Subsequence B: Most subsequences—such as B01, B02, B03, B04, and B06—have significant lingering activity near the entrance, particularly concentrated in the y = 0 to y = 2.5 range. Most subsequences display more widespread lingering behavior in the middle areas (y = 10 to y = 15). B01, B02, B04, and B06 have several clusters, particularly from y = 7.5 to y = 12.5. Lingering near the exit areas (y = 15 to y = 21) varies; nearly all subsequences have some concentrated lingering near the exit, with clusters appearing around y = 17.5 to y = 20.
4.5.3 Shooting
Subsequence A: A01, A02, A04, and A05 all show higher activity in the entrance areas (y = 0 to y = 5). In the middle areas (y = 5 to y = 15), A05 shows the most consistent shooting behavior, and A02, A03, A04, and A06 also exhibit some moderate shooting behavior. A02, A03, A04, and A05 show some activity near the exit areas (y = 15 to y = 21).
Subsequence B: Near the entrance areas (y = 0 to y = 5), B01, B02, and B06 show some significant shooting activity, while B03, B04, and B05 show almost no activity. In the middle areas (y = 5 to y = 15), B01, B02, and B06 have the most shooting activity, and B03 and B04 show lighter and more dispersed shooting behavior. Shooting activity near the exit areas (y = 15 to y = 21) is relatively low across all groups.
The analysis of subsequences revealed that, although some very active individuals were present, their behaviors did not significantly distort the overall BiD heatmap patterns. They constituted a small fraction of the total participants, and their behaviors were typically localized to specific areas, such as around visually prominent content; spatial patterns elsewhere remained consistent with the general pedestrian flow, indicating that these outliers did not influence the broader behavioral trends.
Additionally, the differences between Sequence A and Sequence B persisted even when the analysis was conducted on 10-min subsequences. These differences are particularly noticeable at the entrance, with some distinctions at the exit as well, consistent with our earlier analyses of the complete 60-min sequences. Reducing the time scale to 10 min therefore does not alter the overall picture: within the same building, subsequences with similar media content exhibit comparable behavior patterns, whereas those with differing media content show varied effects on people's interactions.
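For concreteness, the per-subsequence BiD binning can be sketched as follows. This is a minimal reconstruction, assuming each behavior instance is a timestamped floor-plane point with a label; the tuple layout, corridor extents, and function names are our assumptions for illustration, not the published implementation.

```python
import numpy as np

# A behavior instance is assumed here to be a (t_seconds, x_m, y_m, label)
# tuple in floor-plane metres; field layout and extents are illustrative.
def bid_heatmap(instances, width_m=6.0, length_m=21.0, cell=0.5):
    """Accumulate instances into one 0.5 m x 0.5 m BiD grid per behavior."""
    nx, ny = int(width_m / cell), int(length_m / cell)
    grids = {b: np.zeros((ny, nx)) for b in ("passing-by", "lingering", "shooting")}
    for _, x, y, label in instances:
        i, j = int(y // cell), int(x // cell)
        if 0 <= i < ny and 0 <= j < nx and label in grids:
            grids[label][i, j] += 1
    return grids

def split_subsequences(instances, window_s=600, n=6):
    """Split a 60-min instance stream into six 10-min subsequences (e.g., A01..A06)."""
    return [
        [inst for inst in instances if k * window_s <= inst[0] < (k + 1) * window_s]
        for k in range(n)
    ]
```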
4.6 BiD heatmap analysis of behavior across varied content types
In this section, we conduct a BiD heatmap analysis to compare viewers' behavior across different content types (ClipA, ClipB1, ClipB2, and ClipB3), following the content analysis outlined in Section 3.3. This comparative heatmap analysis helps identify how different content types influence viewer interaction patterns in the media architecture (Figure 9).
Figure 9
4.6.1 Passing-by
ClipA shows moderate passing-by activity near the entrance area (y = 0 to y = 5) and the exit area (y = 15 to y = 21). Passing-by behavior is consistent throughout the middle area (y = 5 to y = 15), with scattered clusters.
ClipB1 exhibits high passing-by activity near the entrance (y = 0 to y = 5). This high density gradually decreases through the middle area (y = 5 to y = 15). Near the exit (y = 15 to y = 21), the passing-by behavior remains moderate, with clusters scattered evenly throughout.
ClipB2 shows low to moderate passing-by activity at the entrance (y = 0 to y = 5). This dispersed pattern continues through the middle area (y = 5 to y = 15). The exit area (y = 15 to y = 21) follows a similar pattern, with clusters showing a low level of passing-by activity.
ClipB3 shows moderate passing-by activity near the entrance (y = 0 to y = 5). The middle area (y = 5 to y = 15) also exhibits consistent passing-by behavior. Near the exit (y = 15 to y = 21), passing-by activity remains light to moderate.
4.6.2 Lingering
ClipA shows high lingering activity near the entrance area (y = 0 to y = 5). The middle area (y = 5 to y = 15) also exhibits significant lingering behavior with clusters dispersed throughout. Near the exit (y = 15 to y = 21), there is moderate lingering activity, particularly in the y = 17.5 to y = 20 range.
ClipB1 exhibits moderate lingering behavior near the entrance area (y = 0 to y = 5) with clusters. The middle area (y = 5 to y = 15) shows scattered lingering activity, with clusters around y = 12.5. Near the exit (y = 15 to y = 21), lingering behavior is more dispersed, with some clusters.
ClipB2 shows low to moderate lingering activity near the entrance area (y = 0 to y = 5). In the middle area (y = 5 to y = 15), there is light and scattered lingering, with a few clusters around y = 10 to y = 12.5. Near the exit (y = 15 to y = 21), lingering activity remains low with spots.
ClipB3 exhibits moderate lingering activity near the entrance area (y = 0 to y = 5) with clusters. The middle area (y = 5 to y = 15) shows consistent lingering behavior, with several clusters around y = 12.5. Near the exit (y = 15 to y = 21), lingering is more dispersed, with a few spots.
4.6.3 Shooting
ClipA exhibits significant shooting activity near the entrance area (y = 0 to y = 5) with clusters. The middle area (y = 5 to y = 15) shows moderate shooting behavior, with clusters scattered throughout. Near the exit (y = 15 to y = 21), there is minimal shooting activity, with only a few spots.
ClipB1 displays moderate shooting behavior near the entrance area (y = 0 to y = 5) with clusters. The middle area (y = 5 to y = 15) shows scattered shooting behavior, with a few clusters but mostly spots. Near the exit (y = 15 to y = 21), shooting activity is minimal, with only a few spots.
ClipB2 demonstrates low shooting activity throughout the space. Near the entrance (y = 0 to y = 5), there are a few spots. The middle area (y = 5 to y = 15) shows very little engagement, with only occasional clusters. Near the exit (y = 15 to y = 21), there is minimal shooting behavior, similar to the middle and entrance areas.
ClipB3 exhibits moderate shooting activity near both the entrance (y = 0 to y = 5) and the middle area (y = 5 to y = 15). Clusters appear around y = 2.5 and y = 12.5, with the middle area showing more engagement compared to the other clips. Near the exit (y = 15 to y = 21), there is minimal shooting activity, with only a few spots.
The above analysis reveals clear differences in the BiD heatmaps between ClipA from Sequence A and ClipB1, ClipB2, and ClipB3 from Sequence B. First, ClipA is far more active than all clips in Sequence B across all behavior types. Within Sequence B, however, the distribution patterns of the three clips are quite similar; for lingering behavior in particular, the distribution remains consistent despite differences in intensity. For shooting behavior, ClipB3, with its bright-dim alternations, flickering, and geometric patterns, shows higher activity than ClipB1 and ClipB2, especially in the middle region and exit area. Although ClipB2 is mainly static in content and form, it is the least active part of Sequence B, possibly owing to its relatively dim colors and low contrast.
4.7 Analysis of selected viewer behaviors based on instance trails and BiD heatmaps
In this section, we analyze the behaviors of selected examples of highly active viewers, identified by their unique IDs, using instance trails and Behavior Instance Density (BiD) heatmaps (Figure 10). This method also allows each viewer's path to be reconstructed. The following figures provide insights into the movement patterns and behavior instances of these viewers across two sequences: Rainbow Arcade (Sequence A) and Space in Between (Sequence B). These four viewers were selected because they represent typical patterns of viewer behavior observed in our dataset.
Figure 10
Viewer A24: The instance trail of viewer A24 shows that they traversed the entire space but moved back and forth in the entrance area while doing so. The BiD heatmap reveals that viewer A24 exhibited a balanced combination of passing-by and lingering behaviors, with a notable concentration in the entrance area. Their shooting behavior aligned with their lingering, indicating that any photography or videography was done while stationary.
Viewer A90: The instance trail of viewer A90 indicates that they moved back and forth in the entrance area without traversing the entire space. The BiD heatmap reveals that viewer A90 displayed a unique pattern, with lingering and shooting behaviors concentrated in the entrance area. The overlap of shooting behavior with lingering areas suggests that they captured images or videos during their extended stay.
Viewer B7: The instance trail of viewer B7 is confined to a small part of the entrance area, indicating that they mostly stayed in place or moved back and forth within a small range. Viewer B7 exhibited minimal movement, focusing on lingering and occasional shooting behaviors in a small part of the space. It is likely that most of their time was spent engaging in static activities, such as observing or capturing photos or videos.
Viewer B36: The instance trail of viewer B36 shows a wide coverage of the space, indicating that they entered the space at the entrance, traversed the entire area, and then returned to the entrance. Viewer B36 demonstrated a balanced combination of passing-by, lingering, and shooting behaviors, covering a large portion of the space. The overlapping instances of lingering and shooting suggest that when they focused on specific areas, they slowed down to capture photos or videos.
In conclusion, analyzing the behaviors of selected viewers through their instance trails and BiD heatmaps allows us to understand how active individuals interact with the environment over time and space. This level of detail provides valuable insights for optimizing space design, enhancing viewer experience, and informing strategies for crowd management and engagement in media architecture environments.
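A per-viewer view such as those above can be derived by filtering the same instance stream by track ID. The sketch below reuses the hypothetical bid_heatmap() from the Section 4.5 sketch and again assumes an illustrative (track_id, t, x, y, label) tuple layout rather than the study's actual data schema.

```python
# Minimal sketch of per-viewer analysis: select one track ID (e.g., "A24"
# or "B36"), order its instances in time, and reuse the BiD accumulation.
def viewer_trail(instances, track_id):
    pts = sorted(
        (t, x, y, label) for tid, t, x, y, label in instances if tid == track_id
    )
    trail = [(x, y) for _, x, y, _ in pts]   # time-ordered floor-plane path
    per_viewer_bid = bid_heatmap(pts)        # this viewer's behavior densities
    return trail, per_viewer_bid
```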
5 Discussion
5.1 From manual observation to fully automated analysis
For many years, studying human behavior in media architecture has been hindered by technological limitations. Early work depended on manual annotation, observer notes, or video playback, an observer-driven paradigm that limited sample size, raised privacy concerns, and lacked objectivity (Behrens et al., 2013; Memarovic et al., 2016; Fischer, 2015; Qin et al., 2020). The fully automated three-stage pipeline introduced in this study marks a decisive break from these constraints. By fusing a ZED 2i stereo depth camera, a YOLOv10x + DeepSORT detection-and-tracking module, and GPT-4o for zero-shot multi-modal behavior recognition, the pipeline spans data capture, individual trajectory generation, and behavior labeling, with human involvement restricted to designing prompts for GPT-4o. Apart from a brief post-processing step that merges split IDs produced by DeepSORT (affecting only visitor counts, not the heatmaps), the entire workflow, from raw video to BiD heatmap, runs unattended. Importantly, the high-precision depth imagery delivered by the ZED 2i obviates the need for floor markers or overhead cameras, while GPT-4o tags behaviors such as "walking," "standing," and "taking photos" with no pre-defined training labels, underscoring the generalisability of multi-modal LLMs in complex urban scenes. End-to-end automation not only increases efficiency but also avoids long-term retention of personally identifiable footage, thereby mitigating privacy risks. That said, the findings stem from a single case and call for further validation in broader settings.
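To make the three stages concrete, the following sketch reconstructs one detection, tracking, and labeling pass using off-the-shelf components (the ultralytics YOLO wrapper, the deep_sort_realtime tracker, and the OpenAI Python client). It is a schematic approximation rather than the deployed code: the weights file name and prompt wording are assumptions, and the study's two separate GPT-4o prompts are collapsed here into a single question for brevity.

```python
import base64
import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort
from openai import OpenAI

detector = YOLO("yolov10x.pt")   # stage 2a: person detection
tracker = DeepSort(max_age=30)   # stage 2b: identity tracking across frames
client = OpenAI()                # stage 3: zero-shot behavior labeling

def label_crop(crop_bgr):
    """Ask GPT-4o for a one-word behavior label for a person crop."""
    ok, jpg = cv2.imencode(".jpg", crop_bgr)
    b64 = base64.b64encode(jpg.tobytes()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": "Is this person walking, standing, or taking photos/videos? "
                     "Answer with one word."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content.strip().lower()

def process_frame(frame):
    """One detect -> track -> label pass over a sampled frame."""
    boxes = detector(frame, classes=[0])[0].boxes  # COCO class 0 = person
    dets = []
    for b in boxes:
        x1, y1, x2, y2 = b.xyxy[0].tolist()
        # deep_sort_realtime expects ([left, top, w, h], confidence, class)
        dets.append(([x1, y1, x2 - x1, y2 - y1], float(b.conf), "person"))
    labels = {}
    for tr in tracker.update_tracks(dets, frame=frame):
        if tr.is_confirmed():
            l, t, r, btm = map(int, tr.to_ltrb())
            l, t = max(l, 0), max(t, 0)
            labels[tr.track_id] = label_crop(frame[t:btm, l:r])
    return labels
```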
5.2 Validating and extending spatial interaction models to immersive 3-D media façades
The outcome of the spatial behavior visualization revealed that the BiD heatmap aligns closely with the Interaction Space hypothesized in Urban HCI models (Fischer and Hornecker, 2012): Behavior density exhibits a sharp rise within 5 m of the entrance, with peaks for lingering and shooting nearly coinciding. This supports the ELSI model's prediction that the Potential Active Engagement Space may collapse into the Active Engagement Space (Memarovic et al., 2015). In the configuration of Now Arcade, the doorway functions simultaneously as a physical passage and a behavioral threshold where visitors pivot from passive watching to active photography. The micro-spatial salience of that threshold illustrates how a media space layout exerts structural pull on social action. By contrast, a stable “transit band” emerged mid-arcade (y = 10–15 m): sparse interaction, higher walking speed, and fewer engagements—echoing the function of the spatial layout as a transient space. Crucially, our study extends these flat-display theories by empirically mapping crowd behavior within multi-screen, three-dimensional immersive digital media spaces.
5.3 How media content shapes spatial behavior
Comparing Rainbow Arcade (static, highly saturated imagery) with Space in Between (dynamic, low-saturation imagery) reveals a differentiated pattern: the Rainbow scene significantly increases the probability of shooting but leaves the per-viewer shooting count unchanged, while it raises the number of passing-by instances without affecting their occurrence rate. The combined effect is that visitors initiate more walking segments and are more likely to take photographs, yielding a measurably slower traversal of the corridor; lingering behavior shows no systematic difference between the two scenes. Clip-level analysis reveals further nuance: ClipA triggers the densest photography near the entrance, supporting the hypothesis that "color contrast + gradual evolution" exerts a strong visual-magnet effect. Although ClipB2 is relatively static, its dim palette attracts far less attention, demonstrating that behavioral drive is governed not by motion alone but by a composite of color saturation, brightness, and shape clarity. In short, variations in visual mood and intensity modulate spatial behavior more than the simple binary of "moving vs. still." These findings carry implications for immersive-environment design: designers aiming for immersive yet smoothly flowing spaces should balance visual arousal with pedestrian flow. In our data, static, highly saturated imagery encouraged more photo taking and produced additional walking segments, implying a slower overall passage; such content may be useful when the goal is to stimulate social-media engagement, provided crowd density is monitored. Conversely, low-saturation motion graphics were associated with fewer photo triggers and quicker traversal, suggesting a combined effect of visual design and corridor layout in maintaining flow and reducing mid-corridor congestion.
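The distinction between occurrence rate and per-viewer count that underpins this comparison can be made explicit in a few lines. The sketch below assumes the same hypothetical (track_id, t, x, y, label) instance tuples used in earlier sketches and is illustrative only.

```python
from collections import Counter

def occurrence_and_intensity(instances, behavior):
    """Separate 'how many viewers show a behavior at all' (occurrence rate)
    from 'how much each exhibitor does it' (mean per-viewer instance count).

    instances: assumed (track_id, t, x, y, label) tuples for one sequence.
    """
    viewers = {tid for tid, *_ in instances}
    counts = Counter(tid for tid, *_, label in instances if label == behavior)
    rate = len(counts) / len(viewers) if viewers else 0.0
    mean_count = sum(counts.values()) / len(counts) if counts else 0.0
    return rate, mean_count

# Comparing the two sequences separates the two effects described above:
# a higher rate with a similar mean count indicates more viewers shooting,
# not individual viewers shooting more.
```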
5.4 Cross-scale corroboration between individual trajectories and aggregate patterns
Beyond aggregate heatmaps, analysis of representative individual tracks provides a two-scale lens. Viewers A24 and A90 illustrate repeated photography at both entrance and exit, implying that static, high-contrast visuals elicit deep engagement at multiple anchor points. Viewer B36 follows a classic "pass-through + momentary-photo" path, corroborating the mid-corridor transit band identified in the BiD map. Conversely, Viewer B7 confines activity to the doorway, exhibiting a "brief-pause + quick-retreat" pattern. Such fine-grained traces enrich the semantic reading of the density plots and caution against letting statistical averages obscure meaningful behavioral heterogeneity.
5.5 Challenges and value of research in the wild
Compared with the controlled variables of a laboratory, the Now Arcade—a live urban media corridor—presents a high degree of unpredictability. Weather fluctuations, restricted camera angles, and visitor surges driven by ad-hoc performances all complicate data acquisition. The camera had to be positioned outside the entrance, eliminating the canonical top-down view. Yet stereo depth reconstruction still yields accurate spatial localization, even under local occlusion. Although the venue provided a nominal content schedule,12 frequent deviations meant that the actual content displayed during the selected hours could differ from the planned program. Such volatility, however, is not purely negative. The study eventually captured two distinct content types in situ: the permanent loop Space in Between and a Pride-month special Rainbow Arcade. Securing two complete 60-min datasets (each comprising six 10-min sub-loops) under such real-world conditions—and doing so for content that is directly comparable in static vs. dynamic form—constitutes a methodological achievement in its own right.
5.6 Limitations and future research
Despite the pipeline's high level of automation and accuracy, several constraints persist. First, the ZED 2i's optimal depth range (≈18 m) covers only three-quarters of the 24-m arcade (Tadic et al., 2022); full coverage would require multiple cameras. Extreme LED glare can still degrade depth quality, even with a polarizing filter, by saturating the stereo imagers and introducing depth outliers. In future setups, glare may be mitigated through a higher-extinction external linear polariser, with careful rotation-angle calibration to match the LED panel's polarization (Zhao et al., 2025), and a lens hood or baffle to reduce stray light at oblique angles (Li et al., 2022). We also plan to explore replacing or augmenting cameras with commodity WiFi channel-state information: recent approaches show that deep networks can map the phase and amplitude of WiFi signals to dense human-pose representations, enabling multi-person pose estimation from WiFi alone (Geng et al., 2022).
Second, YOLOv10x is not fine-tuned for dense crowds; its training remains grounded in generic imagery. Moreover, GPT-4o, while accurate, incurs non-trivial cloud inference costs and seconds-scale latency that hinder long-term or real-time deployment. In our current implementation, behavior labeling requires per-person crop inference with two GPT-4o prompts (shooting vs. non-shooting; walking vs. standing) and is executed serially; consequently, the end-to-end processing time and cloud cost per sampled frame scale approximately linearly with the number of visible people and the crop resolution. This makes 24/7 deployment impractical under the current design; accordingly, we use 1 Hz sampling on recorded data for post-hoc analysis in this study. Lightweight distilled models (e.g., LLaVA13) and/or event-triggered inference (e.g., querying only when a track slows or stops; see the sketch at the end of this section) could enable more efficient on-device processing, lowering latency, cost, and exposure of raw visual data. However, MLLMs are not error-free: hallucination, prompt sensitivity, and error compounding across the detection, tracking, and labeling stages can bias estimates (Li et al., 2023; Chen et al., 2025).
Third, the 120-min dataset, though containing over 600 visitors, is skewed toward clear weekday conditions. Broader representativeness will require extended sampling across weather, workdays, and holidays, or even stratified prompting to capture specific cohorts.
Finally, deeper, targeted collaboration with screen operators could unlock content-management logs and real-time control APIs, allowing behavior-driven adaptations such as dimming brightness when dwell counts exceed a threshold. Building on such infrastructure, a further direction is to incorporate Theory-of-Mind (ToM) and metacognition-inspired AI to support deeper interpretation of social interaction: for example, maintaining explicit hypotheses about viewers' intent under ambiguity and applying reflective self-checks that report uncertainty and reduce prompt sensitivity and error compounding. Such ToM/MC layers could help distinguish superficially similar behaviors (e.g., filming vs. phone-checking) and enable context-aware adaptations while enforcing privacy-preserving constraints and avoiding over-inference (Bamicha and Drigas, 2024b,a; Xu et al., 2024; Zhang et al., 2025). Together, these feedback loops would propel media screens toward responsive architecture, where the screen can become a sensor for urban emotion.
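As one illustration of the event-triggered inference mentioned above, the sketch below gates the cloud call on a track's recent speed. The 0.3 m/s threshold, three-sample window, and function name are assumptions chosen for illustration, not parameters measured in the study.

```python
import math

SPEED_THRESHOLD = 0.3  # m/s; assumed cut-off for "slowed or stopped"
WINDOW = 3             # consecutive 1 Hz samples that must all be slow

def should_query_llm(track_xy, dt=1.0):
    """Gate the expensive GPT-4o call on a track slowing down or stopping.

    track_xy: recent floor-plane (x, y) positions of one track, sampled at
    1 Hz. Returns True only when the last WINDOW steps were all below the
    speed threshold, so fast pass-through pedestrians never trigger a call.
    """
    if len(track_xy) < WINDOW + 1:
        return False
    recent = track_xy[-(WINDOW + 1):]
    speeds = [math.dist(a, b) / dt for a, b in zip(recent, recent[1:])]
    return all(v < SPEED_THRESHOLD for v in speeds)
```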
6 Conclusion
This study contributes to human-building interaction research through the design and real-world deployment of a novel, fully automated multi-modal pipeline, capable of producing high-resolution evidence about how people inhabit media-rich architectural spaces. Leveraging a single stereo-depth camera mounted unobtrusively outside the entrance of London's Now Arcade, we integrated depth-based localization with deep-learning-based multi-object tracking and zero-shot behavioral labeling using MLLMs. The result is an end-to-end workflow that captures trajectories, extracts motion states, recognizes specific actions such as filming, and projects everything onto BiD heatmaps, representing the spatial localization of actions without any manual annotation. By removing the need for overhead rigs, floor markers, or pre-trained task-specific classifiers, our approach lowers the threshold for longitudinal “in-the-wild” studies and offers a privacy-aware alternative to conventional RGB video logging.
Applying the pipeline to two 1-h content sequences, Rainbow Arcade (static, high-saturation imagery) and Space in Between (dynamic, low-contrast animation), yielded more than 600 individual-viewer records and roughly 13,000 behavior instances. Comparative BiD visualizations show that the static, high-saturation scene chiefly affects two aspects of movement: it raises the likelihood of shooting and increases the number of passing-by instances. The spatial pattern of shooting was concentrated near the entrance, indicating that the colorful still motifs act as a visual anchor for photo capture. In contrast, the darker, animated sequence sustained an almost uninterrupted transit flow; shooting was infrequent, and dwell events were neither more numerous nor longer than in the static scene. Exemplar trajectories corroborate the aggregate picture: viewers A24 and A90 paused repeatedly at the entrance to photograph Rainbow motifs, whereas B36 executed a single "pass-through plus quick snapshot" loop in the Space sequence. These cases illustrate how vivid color, rather than mere motion, amplifies attention and gently modulates corridor speed, reinforcing the corridor's transit function without systematically altering lingering behavior.
Beyond empirical insights, our findings validate and refine established theoretical models. The 5-m zone of intense BiD activity around the doorway corresponds to the Interaction Space posited in the Urban HCI model, and its collapse of passive watching into active engagement echoes the ELSI model, whereas a mid-arcade "transit band" aligns with Urban HCI's Comfort Space (Fischer and Hornecker, 2012). By situating these abstract zones in a three-dimensional, multi-screen corridor, we extend their applicability from single flat displays to immersive media architecture and supply quantitative thresholds that can inform layout, content scheduling, and crowd-management policies.
The work also surfaces several technical and methodological limitations that chart a path for future research. First, the ZED 2i measures depth reliably only within 18–20 m; covering larger venues will require multi-camera mosaics and automated point-cloud stitching. Second, high-luminance LEDs occasionally saturate the stereo imagers, suggesting a need for adaptive exposure control or polarization-aware HDR pipelines. Third, cloud-based GPT-4o inference delivers excellent overall accuracy (≈90% for behavior recognition), yet incurs latency and cost that preclude 24/7 deployment; distilled on-device vision-language models could offer a privacy-preserving, energy-efficient alternative. Finally, although two weekdays yielded content-sensitive behavior signatures, broader generalisability demands seasonal sampling, larger demographic diversity, and integration with screen content-management systems so that displays can respond to live engagement metrics in real time.
In sum, we present the first large-scale, depth-enabled, LLM-augmented analysis of spatial behavior in a multi-screen immersive media environment, revealing how visual tone, brightness, and rhythm modulate not only what visitors do but also where they choose to do it, and showing that the influence of the spatial layout is further amplified—or in some cases altered—by the content of the media screens (Behrens et al., 2013). By coupling cost-effective sensing hardware with off-the-shelf AI, the approach offers researchers, designers, and building operators a scalable instrument for evidence-based media-architecture design and for future adaptive systems in which screens double as sensors—turning built form into an active participant in urban life.
Statements
Data availability statement
The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.
Ethics statement
The studies involving humans were approved by University College London Ethics Committee. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation was not required from the participants or the participants' legal guardians/next of kin because the study was carried out in the wild, to understand behavior of people who occupy media spaces and move through the building as part of the urban flow. Written informed consent was not obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article because the study was carried out in the wild, to understand behavior of people who occupy media spaces as part of the urban flow.
Author contributions
ZW: Writing – original draft. AF: Writing – review & editing.
Funding
The author(s) declared that financial support was received for this work and/or its publication. Open access funding provided by University College London (UCL).
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Footnotes
1. https://pjreddie.com/darknet/yolo/
2. https://www.stereolabs.com/en-se/store/products/zed-2i
3. https://openai.com/index/hello-gpt-4o/
4. https://www.outernet.com/b2b/spaces/now-arcade
5. https://www.intelrealsense.com/
7. https://support.stereolabs.com/hc/en-us/articles/207616785-Getting-Started-with-your-ZED-camera
8. https://developer.nvidia.com/cuda-toolkit
9. https://www.stereolabs.com/docs/video/recording
11. https://www.grasshopper3d.com/
References
Abdelsalam, A., Mansour, M., Porras, J., and Happonen, A. (2024). Depth accuracy analysis of the ZED 2i stereo camera in an indoor environment. Rob. Auton. Syst. 179:104753. doi: 10.1016/j.robot.2024.104753
Abughalieh, K. M., and Alawneh, S. G. (2020). Predicting pedestrian intention to cross the road. IEEE Access 8, 72558–72569. doi: 10.1109/ACCESS.2020.2987777
Aharony, N., Meshurer, A., Krakovski, M., Parmet, Y., Melzer, I., Edan, Y., et al. (2024). Comparative analysis of cameras and software tools for skeleton tracking. IEEE Sens. J. 24, 32302–32312. doi: 10.1109/JSEN.2024.3450754
Alaily-Mattar, N., Arvanitakis, D., Krohberger, H., Legner, L. F., and Thierstein, A. (2023). The performance of exceptional public buildings on social media – the case of Depot Boijmans. PLoS ONE 18:e0282299. doi: 10.1371/journal.pone.0282299
Alavi, H. S., Churchill, E. F., Wiberg, M., Lalanne, D., Dalsgaard, P., Fatah gen Schieck, A., et al. (2019). Introduction to human-building interaction (HBI): interfacing HCI with architecture and urban design. ACM Trans. Comput.-Hum. Interact. 26, 1–10. doi: 10.1145/3309714
Bamicha, V., and Drigas, A. (2024a). Human-social robot interaction in the light of ToM and metacognitive functions. Sci. Electron. Arch. 17, 1–20. doi: 10.36560/17520241986
Bamicha, V., and Drigas, A. (2024b). Strengthening AI via ToM and MC dimensions. Sci. Electron. Arch. 17, 1–14. doi: 10.36560/17320241939
Behrens, M., Fatah gen Schieck, A., Kostopoulou, E., North, S., Motta, W., Ye, L., et al. (2013). "Exploring the effect of spatial layout on mediated urban interactions," in Proceedings of the 2nd ACM International Symposium on Pervasive Displays, PerDis '13 (New York, NY: Association for Computing Machinery), 79–84. doi: 10.1145/2491568.2491586
Berber, B., Fatah gen. Schieck, A., and Romano, D. M. (2024). "Towards evaluating effects of digital sensory environments on human emotions in the wild," in Proceedings of the 6th Media Architecture Biennale Conference, MAB '23 (New York, NY: Association for Computing Machinery), 146–159. doi: 10.1145/3627611.3627626
Brignull, H., and Rogers, Y. (2003). "Enticing people to interact with large public displays in public spaces," in IFIP TC13 International Conference on Human-Computer Interaction (Zurich).
Chen, J., Rajasegarar, S., Leckie, C., and Gygax, A. (2016). "Pedestrian behaviour analysis using the Microsoft Kinect," in 2016 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops) (Sydney, NSW), 1–6. doi: 10.1109/PERCOMW.2016.7457094
Chen, X., Ma, Z., Zhang, X., Xu, S., Qian, S., Yang, J., et al. (2025). "Multi-object hallucination in vision language models," in Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24 (Red Hook, NY: Curran Associates Inc).
Fatah gen Schieck, A., Alavi, H. S., Zielinska-Dabkowska, K. M., Tajadura-Jiménez, A., Ward, J. A., and Bianchi-Berthouze, N. (2025). Editorial: re-imagining mediated human building interaction and sensory environments. Front. Comput. Sci. 7:1603742. doi: 10.3389/fcomp.2025.1603742
Finke, M., Tang, A., Leung, R., and Blackstock, M. (2008). "Lessons learned: game design for large public displays," in Proceedings of the 3rd International Conference on Digital Interactive Media in Entertainment and Arts, DIMEA '08 (New York, NY: Association for Computing Machinery), 26–33. doi: 10.1145/1413634.1413644
Fischer, P. T. (2015). Urban HCI: Understanding and Conceptualizing Urban Situations Through Media Interventions. Glasgow: University of Strathclyde.
Fischer, P. T., and Hornecker, E. (2012). "Urban HCI: spatial aspects in the design of shared encounters for media façades," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '12 (New York, NY: Association for Computing Machinery), 307–316. doi: 10.1145/2207676.2207719
Fujimoto, Y., and Bashar, K. (2024). "Automatic classification of multi-attributes from person images using GPT-4 vision," in Proceedings of the 2024 6th International Conference on Image, Video and Signal Processing, IVSP '24 (New York, NY: Association for Computing Machinery), 207–212. doi: 10.1145/3655755.3655783
Gehl, J. (2001). Life Between Buildings: Using Public Space. Copenhagen: The Danish Architectural Press.
Geng, J., Huang, D., and la Torre, F. D. (2022). DensePose from WiFi. arXiv [preprint]. Available online at: https://arxiv.org/abs/2301.00250 (Accessed October 1, 2025).
Harisankar, V., and Karthika, R. (2020). "Real time pedestrian detection using modified YOLO v2," in 2020 5th International Conference on Communication and Electronics Systems (ICCES) (Coimbatore), 855–859. doi: 10.1109/ICCES48766.2020.9138103
Hirano, T., Ozaki, K., and Morita, T. (2024). "Prediction of actions and objects through video analysis using stepwise prompt," in 2024 IEEE 18th International Conference on Semantic Computing (ICSC) (Laguna Hills, CA), 289–293. doi: 10.1109/ICSC59802.2024.00052
Kajabad, E. N., and Ivanov, S. V. (2019). People detection and finding attractive areas by the use of movement detection analysis and deep learning approach. Procedia Comput. Sci. 156, 327–337. doi: 10.1016/j.procs.2019.08.209
Kirsh, D. (2019). Do architects and designers think about interactivity differently? ACM Trans. Comput.-Hum. Interact. 26, 1–43. doi: 10.1145/3301425
Kuligowski, E., Peacock, R., and Hoskins, B. (2010). A Review of Building Evacuation Models, 2nd Edn. Technical Note (NIST TN). Gaithersburg, MD: National Institute of Standards and Technology. doi: 10.6028/NIST.TN.1680
Li, J., Yang, Y., Qu, X., and Jiang, C. (2022). Stray light analysis and elimination of an optical system based on the structural optimization design of an airborne camera. Appl. Sci. 12:1935. doi: 10.3390/app12041935
Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, X., Wen, J.-R., et al. (2023). "Evaluating object hallucination in large vision-language models," in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, eds. H. Bouamor, J. Pino, and K. Bali (Singapore: Association for Computational Linguistics), 292–305. doi: 10.18653/v1/2023.emnlp-main.20
Limberg, C., Gonçalves, A., Rigault, B., and Prendinger, H. (2024). "Leveraging YOLO-World and GPT-4V LMMs for zero-shot person detection and action recognition in drone imagery," in First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024. Available online at: https://openreview.net/forum?id=WWgUG8n4g4 (Accessed October 1, 2025).
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). "Microsoft COCO: common objects in context," in Computer Vision – ECCV 2014, eds. D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Cham: Springer International Publishing), 740–755. doi: 10.1007/978-3-319-10602-1_48
Liu, H., and Li, X. R. (2021). How travel earns us bragging rights: a qualitative inquiry and conceptualization of travel bragging rights. J. Travel Res. 60, 1635–1653. doi: 10.1177/0047287520964599
Memarovic, N., Fatah gen. Schieck, A., Schnädelbach, H., Kostopoulou, E., North, S., and Ye, L. (2016). "Longitudinal, cross-site and 'in the wild': a study of public displays user communities' situated snapshots," in Proceedings of the 3rd Media Architecture Biennale Conference, MAB '16 (New York, NY: Association for Computing Machinery). doi: 10.1145/2946803.2946804
Memarovic, N., Gehring, S., and Fischer, P. T. (2015). ELSI model: bridging user engagement around interactive public displays and media facades in urban spaces. J. Urban Technol. 22, 113–131. doi: 10.1080/10630732.2014.942169
Memarovic, N., Langheinrich, M., Alt, F., Elhart, I., Hosio, S., Rubegni, E., et al. (2012). "Using public displays to stimulate passive engagement, active engagement, and discovery in public spaces," in Proceedings of the Media Architecture Biennale Conference: Participation, MAB '12 (New York, NY: Association for Computing Machinery), 55–64. doi: 10.1145/2421076.2421086
Michelis, D., and Müller, J. (2011). The audience funnel: observations of gesture based interaction with multiple large displays in a city center. Int. J. Hum.-Comput. Interact. 27, 562–579. doi: 10.1080/10447318.2011.555299
Müller, J., Walter, R., Bailly, G., Nischt, M., and Alt, F. (2012). "Looking glass: a field study on noticing interactivity of a shop window," in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '12 (New York, NY: Association for Computing Machinery), 297–306. doi: 10.1145/2207676.2207718
Ogawa, T., Yoshioka, K., Fukuda, K., and Morita, T. (2024). "Prediction of actions and places by the time series recognition from images with multimodal LLM," in 2024 IEEE 18th International Conference on Semantic Computing (ICSC) (Laguna Hills, CA: IEEE), 294–300. doi: 10.1109/ICSC59802.2024.00053
Oltean, G., Ivanciu, L., and Balea, H. (2019). "Pedestrian detection and behaviour characterization for video surveillance systems," in 2019 IEEE 25th International Symposium for Design and Technology in Electronic Packaging (SIITME) (Cluj-Napoca: IEEE), 256–259. doi: 10.1109/SIITME47687.2019.8990686
OpenAI (2023). GPT-4V(ision) System Card. Available online at: https://openai.com/index/gpt-4v-system-card/ (Accessed October 1, 2025).
OpenAI, Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., et al. (2024). GPT-4o system card. arXiv [preprint]. Available online at: https://arxiv.org/abs/2410.21276 (Accessed October 1, 2025).
Pan, J., Cho, T. Y., Sun, M., Steemers, K., and Bardhan, R. (2025). Environmental and spatial dynamics in a flexible workspace for hybrid work: a data-driven design framework. Build. Environ. 270:112544. doi: 10.1016/j.buildenv.2025.112544
Psarras, S., Fatah gen. Schieck, A., Zarkali, A., and Hanna, S. (2019). "Visual saliency in navigation: modelling navigational behaviour using saliency and depth analysis," in International Space Syntax Symposium (Beijing).
Qin, Z., Schieck, F., and Psarras, S. (2020). "Three-dimensional visual attention heatmap in space," in Design Computation Input/Output 2020 (London). doi: 10.47330/DCIO.2020.BNRH6093
Rashed, M. G., Suzuki, R., Yonezawa, T., Lam, A., Kobayashi, Y., Kuno, Y., et al. (2016). "Tracking visitors in a real museum for behavioral analysis," in 2016 Joint 8th International Conference on Soft Computing and Intelligent Systems (SCIS) and 17th International Symposium on Advanced Intelligent Systems (ISIS) (Sapporo: IEEE), 80–85. doi: 10.1109/SCIS-ISIS.2016.0030
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). "You only look once: unified, real-time object detection," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Las Vegas, NV: IEEE), 779–788. doi: 10.1109/CVPR.2016.91
Rogers, Y., and Marshall, P. (Eds.) (2017). "Practical and ethical issues," in Research in the Wild, Synthesis Lectures on Human-Centered Informatics (Cham: Springer International Publishing), 69–78. doi: 10.1007/978-3-031-02220-3_5
Seer, S., Brändle, N., and Ratti, C. (2014). Kinects and human kinetics: a new approach for studying pedestrian behavior. Transp. Res. C: Emerg. Technol. 48, 212–228. doi: 10.1016/j.trc.2014.08.012
Shahriar, S., Lund, B. D., Mannuru, N. R., Arshad, M. A., Hayawi, K., Bevara, R. V. K., et al. (2024). Putting GPT-4o to the sword: a comprehensive evaluation of language, vision, speech, and multimodal proficiency. Appl. Sci. 14:7782. doi: 10.3390/app14177782
Sheridan, J. G., Dix, A., Lock, S., and Bayliss, A. (2005). "Understanding interaction in ubiquitous guerrilla performances in playful arenas," in People and Computers XVIII – Design for Life, eds. S. Fincher, P. Markopoulos, D. Moore, and R. Ruddle (London: Springer London), 3–17. doi: 10.1007/1-84628-062-1_1
Shi, J., and Liu, P. (2014). An agent-based evacuation model to support fire safety design based on an integrated 3D GIS and BIM platform. Civil Build. Eng. 2014, 1893–1900. doi: 10.1061/9780784413616.235
Sosa-León, V. A. L., and Schwering, A. (2022). Evaluating automatic body orientation detection for indoor location from skeleton tracking data to detect socially occupied spaces using the Kinect v2, Azure Kinect and ZED 2i. Sensors 22:3798. doi: 10.3390/s22103798
Stereolabs (2025). ZED 2i Stereo Camera. Available online at: https://www.stereolabs.com/en-se/store/products/zed-2i (Accessed October 1, 2025).
Tadic, V., Toth, A., Vizvari, Z., Klincsik, M., Sari, Z., Sarcevic, P., et al. (2022). Perspectives of RealSense and ZED depth sensors for robotic vision applications. Machines 10:183. doi: 10.3390/machines10030183
TEGARA Co., Ltd. (2022). [Verification with Video] How to tell if the ZED 2i has a Polarizer (Polarizing Filter) | Information Transmission Media for R&D. TEGAKARI. Available online at: https://www.tegakari.net/en/2022/07/stereolabs-zed-2i-how-to-check-polarizer/ (Accessed October 1, 2025).
Tomitsch, M., McArthur, I., Haeusler, M. H., and Foth, M. (2015). "The role of digital screens in urban life: new opportunities for placemaking," in Citizen's Right to the Digital City, eds. M. Foth, M. Brynskov, and T. Ojala (Singapore: Springer Singapore), 37–54. doi: 10.1007/978-981-287-919-6_3
Tzortzi, K., and Fatah gen. Schieck, A. (2025). MUSEE: Understanding Museum Architecture for Digital Experiences. Athens: Papailiou Publications.
Veeramani, B., Raymond, J. W., and Chanda, P. (2018). DeepSort: deep convolutional networks for sorting haploid maize seeds. BMC Bioinformatics 19(S9):289. doi: 10.1186/s12859-018-2267-2
Wang, A., Chen, H., Liu, L., Chen, K., Lin, Z., Han, J., et al. (2024). "YOLOv10: real-time end-to-end object detection," in NIPS '24: Proceedings of the 38th International Conference on Neural Information Processing Systems, no. 3429 (Red Hook, NY: Curran Associates Inc.), 28.
Wang, J., Gao, Z., Zhang, Y., Zhou, J., Wu, J., Li, P., et al. (2022). Real-time detection and location of potted flowers based on a ZED camera and a YOLO V4-tiny deep learning algorithm. Horticulturae 8:21. doi: 10.3390/horticulturae8010021
Whyte, W. (1980). The Social Life of Small Urban Spaces. Arlington, VA: Conservation Foundation.
Wouters, N., Downs, J., Harrop, M., Cox, T., Oliveira, E., Webber, S., et al. (2016). "Uncovering the honeypot effect: how audiences engage with public interactive systems," in Proceedings of the 2016 ACM Conference on Designing Interactive Systems, DIS '16 (New York, NY: Association for Computing Machinery), 5–16. doi: 10.1145/2901790.2901796
Xu, H., Zhao, R., Zhu, L., Du, J., and He, Y. (2024). "OpenToM: a comprehensive benchmark for evaluating theory-of-mind reasoning capabilities of large language models," in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), eds. L.-W. Ku, A. Martins, and V. Srikumar (Bangkok: Association for Computational Linguistics), 8593–8623. doi: 10.18653/v1/2024.acl-long.466
Yang, J., Yu, H., Jingxin, Y., Xu, C., Biao, Y., Sun, Y., et al. (2024). Visual-linguistic agent: towards collaborative contextual object reasoning. arXiv [preprint]. Available online at: https://arxiv.org/abs/2411.10252 (Accessed October 1, 2025).
Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.-C., Liu, Z., et al. (2023). The dawn of LMMs: preliminary explorations with GPT-4V(ision). arXiv [preprint]. Available online at: https://arxiv.org/abs/2309.17421 (Accessed October 1, 2025).
Zhang, X., Chen, Y., Yeh, S., and Li, S. (2025). "MetaMind: modeling human social thoughts with metacognitive multi-agent systems," in The Thirty-Ninth Annual Conference on Neural Information Processing Systems. Available online at: https://openreview.net/forum?id=rGMaZkn1ve (Accessed January 1, 2026).
Zhao, K., Qiu, H., Gao, C., and Jiang, L. (2025). Glare suppressed 3D mapping system based on linear polarization filtering and binocular vision fusion. Sci. Rep. 15:20998. doi: 10.1038/s41598-025-06887-w
Keywords
HBI, AI, Multimodal LLM, depth camera, spatial analysis, behavior modeling, media spaces, in the wild
Citation
Wu Z and Fatah gen. Schieck A (2026) Beyond the looking glass: multimodal LLM-based depth-sensing for spatial behavior modeling in media architecture. Front. Comput. Sci. 8:1746674. doi: 10.3389/fcomp.2026.1746674
Received
14 November 2025
Revised
29 January 2026
Accepted
06 February 2026
Published
07 April 2026
Volume
8 - 2026
Edited by
Athanasios Drigas, National Centre of Scientific Research Demokritos, Greece
Reviewed by
Victoria Bamicha, National Centre of Scientific Research Demokritos, Greece
Vana Gkora, University of West Attica, Greece
Copyright
© 2026 Wu and Fatah gen. Schieck.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Ava Fatah gen. Schieck, ava.fatah@ucl.ac.uk