Edited by: Frédéric Gosselin, Université de Montréal, Canada
Reviewed by: Caroline Blais, Université du Québec en Outaouais, Canada; Takahiro Kawabe, Nippon Telegraph and Telephone, Japan
This article was submitted to Perception Science, a section of the journal Frontiers in Psychology
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Humans can rapidly discriminate complex scenarios as they unfold in real time, for example during law enforcement or, more prosaically, driving and sport. Such decision-making improves with experience, as new sources of information are exploited. For example, sports experts are able to predict the outcome of their opponent's next action (e.g., a tennis stroke) based on kinematic cues “read” from preparatory body movements. Here, we explore the use of psychophysical classification-image techniques to reveal how participants interpret complex scenarios. We used sport as a test case, filming tennis players serving and hitting ground strokes, each with two possible directions. These videos were presented to novices and club-level amateurs, running from 0.8 s before to 0.2 s after racquet-ball contact. During practice, participants anticipated shot direction under a time limit targeting 90% accuracy. Participants then viewed videos through Gaussian windows (“bubbles”) placed at random in the temporal, spatial or spatiotemporal domains. Comparing bubbles from correct and incorrect trials revealed how information from different regions contributed toward a correct response. Temporally, only later frames of the videos supported accurate responding (from ~0.05 s before ball contact to 0.1 s afterwards). Spatially, information was accrued from the ball's trajectory and from the opponent's head. Spatiotemporal bubbles again highlighted ball trajectory information, but seemed susceptible to an attentional cuing artifact, which may caution against their wider use. Overall, bubbles proved effective in revealing regions of information accrual, and could thus be applied to help understand choice behavior in a range of ecologically valid situations.
Imagine yourself driving your car one evening. As you turn a bend, a cat appears in your headlights. Should you brake hard, or perhaps swerve left or right? Seemingly without your conscious intervention, your body has decided, and you are relieved to find that your reaction has avoided the cat without causing a more dangerous collision.
Successful speeded decision-making of this kind has been fundamental to our survival as a species, and continues to pervade everyday life. However, it is not always obvious what particular information is exploited to make speeded choices, and which potentially relevant cues are left unused. For example, when avoiding the cat, was the upcoming curvature of the road or the presence of another vehicle in the rear-view mirror taken into account? If not, might a better driver have exploited these cues?
In real-life scenarios, many cues to speeded decision-making are subtle, and training or extensive experience may be required to facilitate their use. Competitive sport provides a good example. How is it that experts are able to quickly and accurately discriminate sporting scenarios as they unfold? Previous research has revealed that elite athletes make use of visual information from their opponents' bodies in order to predict what will happen next, for example using the movement of a cricket bowler's arm and hand, just before ball release, to anticipate the trajectory of the ball that will be delivered (Abernethy and Russell,
Our knowledge about this sport's “expert anticipatory advantage” has been garnered through the application of the spatial and temporal occlusion paradigms, developed by experimental psychologists (e.g., Jones and Miles,
In competitive sports, time is of the essence. While an unfolding scenario might ultimately provide unambiguous information about the appropriate response, this will often come too late for an athlete to simply wait and then react with certainty. Examples include reacting to bowling in cricket, pitching in baseball, serving in tennis, or penalty taking in soccer. In each case, the ball's trajectory provides the clearest information about the appropriate reaction, but the interval of time between receiving this information and having to initiate a response is very brief. This necessitates some degree of guessing if the ball is to be intercepted effectively. However, this guessing may still be informed by additional cues, for example the kinematics of the opponent's body prior to ball contact or release. To investigate this issue, multiple exemplars of a sports scenario can be filmed from a decision maker's perspective—for example, tennis serves coming to either forehand or backhand—so that a realistic decision with
Early studies degraded videos by limiting information in the temporal domain, known as temporal occlusion. For example, in tennis (the sport we investigate here) one early study showed that experts were above chance (and better than intermediate or novice players) at guessing the landing position of a serve when the video was stopped at (and thus information was occluded from) 0.042 s before ball contact (Jones and Miles,
Temporal occlusion approaches can be complemented by spatial occlusion, where the video is shown after having removed a spatially constrained source of information, in order to assess its impact. In tennis, this is typically accompanied by full (temporal) occlusion following racquet-ball contact in order to isolate the spatial location of cues utilized for
The temporal and spatial occlusion approaches have provided important information about how experts extract and use information in numerous sporting domains. In principle, the approaches could even be generalized beyond sporting scenarios. However, they have some drawbacks as widely applicable methods. First, they depend upon the researcher's intuitions regarding the location of relevant information, because the researcher chooses what to occlude. It may be desirable to have sources of information emerge in a more bottom-up fashion, to make sure that cues are not overlooked (and to avoid concerns over experimenter confirmation bias). Second, the creation of stimuli is time intensive. Video manipulation of this kind, particularly for spatial occlusion, is difficult to automate, providing a barrier to potential users from new fields of experimentation.
Spatial and temporal occlusion techniques were developed by researchers in applied cognitive psychology. However, as we outline next, parallel developments in other fields, most notably sensory psychophysics, provide a natural complement to these techniques that relies on a very similar basic logic, but replaces deliberate image occlusion with
Traditional psychophysics (e.g., Graham,
Unlike
Instead of manipulating non-target content systematically, Ahumada et al. (Ahumada and Lovell,
The traditional classification-image approach in visual psychophysics makes use of pixel-by-pixel additive luminance noise, and is conceptually closely related to the technique of spike-triggered averaging applied to single-cell recordings in neurophysiology (Marmarelis and Naka,
Example trial from a bubbles experiment, in which Gaussian profiled windows of visibility are placed at random positions.
The bubbles technique has previously been applied mainly to static images, although bubbles with temporal or spatiotemporal profiles have sometimes been applied in order to reveal information use through time (e.g., Vinette et al.,
Thirty participants (7 women and 23 men) aged 19–62 (mean = 32) took part in the various stages of this experiment (with 29 participants completing each of the stages, and most participants completing all three). Participants were recruited and assigned to one of two groups on the basis of their tennis playing experience/skill. Those in the novice group (5 women and 10 men) aged 20–51 years (mean = 30) had no experience of playing tennis competitively. Those in the tennis group (2 women and 13 men) aged 19–62 years (mean = 33) had 2–35 (mean = 11) years of experience playing competitive tennis and currently played between 0 and 150 (mean = 30) competitive matches per year
Video stimuli (available on request) were recorded at a tennis club using a tripod-mounted camera (frame rate 120 Hz, frame size 1,280 × 720 pixels). Four club coaches/hitters of a good but not elite standard acted as models, and were instructed to “hit winners” without attempting explicit deception. They were situated near the baseline, and recorded against a largely uniform blue backdrop. They were recorded serving (from the right-hand side of the court) or playing forehand ground strokes (running rightwards from a central position to return near the singles side line), directing their shots toward an imaginary receiver's forehand or backhand. To increase image resolution, the camera was positioned at the net, on a line projecting from the filmed player to the imaginary receiver at the opposite baseline (height = 1.6 m, left of center line by 1.25 m for ground strokes, right of center line by 1.5 m for serves). Balls were called in or out to facilitate later rejection of videos where the ball landed out. For ground strokes, one player delivered to all of the other three models, to ensure as constant a delivery as possible, and also called for line/cross strokes (i.e., toward the right-handed model's backhand and forehand, respectively) immediately after delivery to prevent early decisions that might introduce unnatural or pre-emptive postural cues. Only these three models were included in the experimental trials (see below). The final player received deliveries from a different model, and was consequently included only in practice trials.
Videos were first transformed to eight-bit gray scale. Of 350 initial videos, 215 contained shots that landed in. These videos were retained and then rated by two authors in order to pick a subset that were unambiguous (regarding the direction of the shot—line/cross for ground strokes, T/cross for serves), relatively homogeneous in terms of the position of the players at the time of ball contact, and lacking in artifactual cues that might allow the videos to be easily remembered for future classification (e.g., an unusual delivery trajectory for ground strokes). In each video, the frame corresponding to ball contact and the position at which the ball struck the racquet head on this frame were manually identified for use in subsequent presentation and analysis (see below).
The experiment was controlled by a PC running scripts written in Matlab (The Mathworks, Natick, U.S.A.) using the Psychophysics Toolbox extension (Brainard,
Participants completed three variants of the task in separate sessions, with a constant order (temporal, then spatial, then spatiotemporal)
Video presentations began at −0.8 s relative to racquet-ball contact, and terminated at 0.2 s after racquet-ball contact, or at the time of response if earlier than this. We wished to push participants to respond as quickly as was feasible for them, while retaining some ability to perform the task, so as to extract sources of information that might be used during actual play. The practice block therefore served not only as a warm up, but also to estimate the time window within which participants could respond with ~90% accuracy. This was achieved via a
For the experimental blocks, 24 new videos (8 per player, 50% to forehand and 50% to backhand) were selected from the three players seen less often during practice. These videos were presented 16 times each in a random order, yielding a block of 384 trials. Participants were required to respond by their previously established deadline, and trials where they failed to do so (along with any trials with presentation glitches, i.e., where one or more frames were dropped after the −0.2 s time point) were re-randomized and repeated at the end of the block. Feedback about response times and correctness was provided after every trial.
Importantly, during experimental trials, the videos were subjected to random masking via the application of bubbles (see Figure
Bubble mean positions were generally selected at random within a domain extending throughout the relevant space of the video. However, in the spatiotemporal session, mean bubble positions were excluded from the first 25 frames of the video, and were further constrained to a rectangular spatial region of the video that varied across frames, capturing all player motion, in order to generate fewer bubbles in regions of null information
Pixel intensities were then calculated for display as the mean pixel intensity plus the difference between original and mean intensities (at each point) multiplied by the Bubbles profile (at that same point). Expressed in terms of Weber contrasts, pixels were displayed at their original Weber contrasts multiplied by the Bubbles profile.
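The masking rule described above can be illustrated with a minimal NumPy sketch. It is not the authors' Matlab implementation. The function names, the toy frame, the bubble count and the bubble width are all hypothetical. Two further assumptions are made here: overlapping bubbles are combined by taking their pointwise maximum, and the "mean pixel intensity" is taken as the mean of the frame being masked.

```python
import numpy as np

def gaussian_bubbles_profile(shape, centers, sigma, cutoff=3.0):
    """Build a spatial visibility profile in [0, 1] from Gaussian bubbles.

    Assumption of this sketch: bubbles peak at 1 and are combined by
    taking the pointwise maximum.
    """
    rows, cols = np.indices(shape)
    profile = np.zeros(shape, dtype=float)
    for r0, c0 in centers:
        dist2 = (rows - r0) ** 2 + (cols - c0) ** 2
        bubble = np.exp(-dist2 / (2.0 * sigma ** 2))
        # Set each bubble to zero beyond `cutoff` sigma from its centre.
        bubble[dist2 > (cutoff * sigma) ** 2] = 0.0
        profile = np.maximum(profile, bubble)
    return profile

def apply_bubbles(frame, profile):
    """Scale each pixel's deviation from the mean intensity by the profile."""
    mean_intensity = frame.mean()  # assumption: mean taken over this frame
    return mean_intensity + (frame - mean_intensity) * profile

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(72, 128)).astype(float)  # toy gray-scale frame
centers = [(rng.integers(0, 72), rng.integers(0, 128)) for _ in range(5)]
profile = gaussian_bubbles_profile(frame.shape, centers, sigma=6.0)
masked = apply_bubbles(frame, profile)
```

Where the profile equals 1 the original pixel is shown unchanged, and where it equals 0 the pixel is replaced by the mean intensity, which is equivalent to multiplying the original Weber contrast by the profile.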
The saved Bubbles profiles from each trial formed the starting point in generating classification sequences, images, or videos (for temporal, spatial, and spatiotemporal sessions, respectively), which reveal the regions from which information supporting a correct response has been extracted. We collectively term these
Next, for each participant, a weighted sum of (re-centered) Bubbles profiles (weighting profiles from correct trials positively and profiles from incorrect trials negatively) yielded the raw classification array:
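As a rough illustration of this weighted sum, the following Python sketch assigns a weight of +1 to profiles from correct trials and −1 to profiles from incorrect trials. These specific weights, the function name and the toy data are assumptions made for illustration; the weights used in the original analysis may differ.

```python
import numpy as np

def raw_classification_array(profiles, correct):
    """Sum bubble profiles, weighting correct trials +1 and incorrect trials -1.

    profiles: array of shape (n_trials, ...), one re-centered profile per trial.
    correct:  boolean array of shape (n_trials,).
    The +1/-1 weights are an assumption of this sketch.
    """
    profiles = np.asarray(profiles, dtype=float)
    weights = np.where(np.asarray(correct, dtype=bool), 1.0, -1.0)
    # Contract the trial axis, leaving an array shaped like a single profile.
    return np.tensordot(weights, profiles, axes=(0, 0))

# Toy example: three temporal profiles spanning four frames.
profiles = np.array([[1.0, 0.5, 0.0, 0.0],
                     [0.0, 0.5, 1.0, 0.0],
                     [0.0, 0.0, 0.5, 1.0]])
correct = np.array([True, False, True])
raw = raw_classification_array(profiles, correct)
```

Because the trial axis is contracted generically, the same function would apply to temporal sequences, spatial images, or spatiotemporal videos.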
However, in order to provide more intuitive values for visualizing and combining data across participants (and to make the method generalizable to cases where different participants completed different numbers of trials) raw classification arrays were normalized to a
In order to draw statistical inferences across large arrays while controlling familywise type 1 error appropriately, data from all participants were combined and assessed via both cluster and tmax (also known as pixel or single-threshold) corrected permutation tests (Blair and Karniski,
For the cluster test, a cluster was defined as the sum of contiguous
Subsets of trials forming repeated-measures comparisons (e.g., information accrued from shots to forehand vs. shots to backhand) were compared by subjecting
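For readers unfamiliar with tmax-corrected permutation testing, a minimal one-sample sketch in Python is given here. It assumes that the null distribution is built by randomly flipping the sign of each participant's whole classification array and that the test is one-sided. Both choices, along with the function name, the number of permutations and the toy data, are assumptions of this sketch rather than details confirmed by the text.

```python
import numpy as np

def tmax_permutation_test(arrays, n_perm=2000, alpha=0.05, seed=0):
    """One-sample tmax permutation test across participants.

    arrays: shape (n_participants, ...) of normalized classification arrays.
    Returns the observed t map and a one-sided critical t value that
    controls familywise error across all points in the array.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(arrays, dtype=float)
    n = data.shape[0]

    def t_map(x):
        return x.mean(axis=0) / (x.std(axis=0, ddof=1) / np.sqrt(n))

    observed = t_map(data)
    max_t = np.empty(n_perm)
    sign_shape = (n,) + (1,) * (data.ndim - 1)
    for i in range(n_perm):
        # Flip the sign of each participant's entire array (assumed scheme).
        signs = rng.choice([-1.0, 1.0], size=sign_shape)
        max_t[i] = t_map(data * signs).max()
    critical = np.quantile(max_t, 1.0 - alpha)
    return observed, critical

# Toy data: 20 participants, 50 frames, hypothetical effect in frames 40-44.
rng = np.random.default_rng(1)
arrays = rng.normal(size=(20, 50))
arrays[:, 40:45] += 3.0
observed, critical = tmax_permutation_test(arrays)
significant = observed > critical
```

Only the maximum statistic from each permutation is retained, so a single threshold applies to every frame or pixel. This is what makes the approach sensitive to strong, localized effects.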
Response deadlines were imposed in experimental sessions, based on performance during practice, in order to ensure that participants used the earliest information source available to them. Deadlines in each group, experiment and condition are shown in Table
Mean (standard deviation) of response deadlines, reaction times (RT), accuracy, and number of bubbles for novices and experts responding to ground strokes (G.S.) and serves in temporal, spatial, and spatiotemporal experiments.
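Experiment | Stimuli | Novices: deadline (s) | Novices: RT (s) | Novices: accuracy (%) | Novices: bubbles | Experts: deadline (s) | Experts: RT (s) | Experts: accuracy (%) | Experts: bubbles
Temporal | G.S. | 0.40 | 0.24 | 69 | 12 | 0.36 | 0.20 | 68 | 11
Temporal | Serves | 0.43 | 0.25 | 69 | 11 | 0.43 | 0.23 | 71 | 10
Spatial | G.S. | 0.42 | 0.25 | 66 | 14 | 0.42 | 0.26 | 68 | 13
Spatial | Serves | 0.45 | 0.27 | 68 | 13 | 0.47 | 0.28 | 70 | 13
Spatiotemporal | G.S. | 0.43 | 0.29 | 66 | 59 | 0.38 | 0.22 | 62 | 61
Spatiotemporal | Serves | 0.50 | 0.30 | 60 | 79 | 0.46 | 0.24 | 59 | 77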
Although our
The mean
Mean classification sequences for all participants in temporal bubbles experiments.
The statistical significance of these regions was assessed using cluster and tmax permutation tests. tmax tests are well suited for detecting strong and highly localized regions of information, while cluster tests are well suited for detecting more diffuse regions (Chauvin et al.,
Analyzing responses to the serve stimuli generated a similar result (Figure
Just as with other forms of data, we can perform contrasts on classification arrays to determine whether particular regions are utilized more in one condition than in another. For the temporal data, we present an example of a between-participants contrast, by comparing the tennis-playing participants to the novices when responding to videos of serves. Results are illustrated in Figure
Figure
Classification image for all participants in the spatial bubbles experiment involving ground strokes. Results are overlaid on an image of the mean of all presented videos for the frames capturing racquet-ball contact, centered on the point of racquet-ball contact (hence constituent images do not perfectly align). However, the results of the spatial analysis are not specific to any one time point.
Previously, for the temporal experiments, we presented an example of a between-participants contrast of classification sequences. It is also possible to run within-participant contrasts on the data from bubbles experiments. For example, we might ask whether different regions of the video drove decisions when the ball was delivered to forehand (on one half of all trials) compared to when it was delivered to backhand (on the other half). The results of this contrast are shown in Figure
An illustrative within-participants contrast of classification images (rightward serves to forehand vs. leftward serves to backhand) for all participants in the spatial bubbles experiment.
For contrasts of this kind, both directions of difference are potentially interesting, but a 3D visualization (Figure
Illustrative results from the inferential analysis applied to the spatiotemporal experiment are shown in Figure
Thresholded classification video for all participants in the spatiotemporal bubbles experiment involving ground strokes. Results are overlaid on the mean of all presented videos (for each frame) centered on the point of racquet-ball contact (which occurred in frame 96). Solid red/yellow (dark/light gray) colored regions were significant in cluster/tmax permutation testing, respectively, suggesting information was extracted from these parts of the video (but see main text for caveat). Transparent red (gray) regions denote non-significant clusters. In the bottom part of the figure, three frames have been selected and magnified to illustrate the loss and re-emergence of cluster significance.
The earlier cluster in Figure
Secondly, our videos may have contained subtle differences that we failed to note, which, given that each video was presented several times, observant participants might have learnt in order to aid their discriminations. We cannot rule this out, as we did not attempt any formal investigation of potential information in this region via an ideal-observer approach. However, the earlier region of the video highlighted in Figure
This region is, however, remarkably consistent, spatially, with the later-emerging region that appears (based on the preceding analysis of our spatial and temporal experiments) to be a genuine locus of information accrual. Hence we suggest that the earlier region of significance may reflect an artifact caused by spatiotemporal bubbles sometimes acting as an
We have noted in previous sub-sections of the results that the informative regions suggested by a classification array should be treated with some caution, i.e., as containing, but potentially exaggerating in scale, regions of a video that contain information utilized by decision makers. Formally, we might consider the classification array a convolution of information-carrying regions with a filter. The properties of this filter reflect the spatiotemporal extent of the bubbles used to mask the video. While this idea is familiar to bubbles aficionados, having received discussion from the outset in the bubbles literature, it is likely less obvious to potential users from other fields. Hence, to illustrate this idea, we ran a set of simulated experiments and analyses, focussing on temporal and spatial (rather than spatiotemporal) experimental procedures (as these appear more likely to yield artifact-free results). In one set of simulations, all useful information was assumed to be contained in a single frame (temporally) or pixel (spatially). Observers' behavior (i.e., their chance of guessing correctly) was modeled as a cumulative Gaussian psychometric function of image visibility (i.e., the Bubbles profile) at the critical point,
Where ϕ denotes the Normal cumulative distribution function with mean μ and standard deviation σ
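A simulation of this general kind can be sketched in Python as follows, for the temporal case with a single critical frame. Several details are assumptions made for illustration and are not taken from the original simulations: a 50% guessing floor (two-alternative task), bubbles combined by pointwise maximum, the particular parameter values, and a classification sequence computed as the mean profile on correct trials minus the mean profile on incorrect trials.

```python
import math
import numpy as np

def phi(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_temporal_bubbles(n_trials=5000, n_frames=120, critical_frame=90,
                              n_bubbles=3, bubble_sd=5.0, mu=0.5, sigma=0.15,
                              seed=0):
    """Simulate an observer who uses information from one critical frame."""
    rng = np.random.default_rng(seed)
    frames = np.arange(n_frames)
    centers = rng.uniform(0, n_frames, size=(n_trials, n_bubbles))
    # One Gaussian per bubble; the trial profile is their pointwise maximum.
    bubbles = np.exp(-(frames[None, None, :] - centers[:, :, None]) ** 2
                     / (2.0 * bubble_sd ** 2))
    profiles = bubbles.max(axis=1)  # shape (n_trials, n_frames)
    visibility = profiles[:, critical_frame]
    # Cumulative Gaussian psychometric function with an assumed 50% floor.
    p_correct = np.array([0.5 + 0.5 * phi((v - mu) / sigma) for v in visibility])
    correct = rng.random(n_trials) < p_correct
    # Correct trials weighted positively, incorrect trials negatively.
    return profiles[correct].mean(axis=0) - profiles[~correct].mean(axis=0)

classification = simulate_temporal_bubbles()
peak_frame = int(classification.argmax())
```

Although all information sits in one frame, the resulting classification sequence peaks near that frame with a spread set by `bubble_sd`. This illustrates the sense in which the classification array behaves like the information source convolved with a bubble-shaped filter.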
Mean simulated data are presented in Figure
Results from illustrative simulations showing how the choice of bubble size affects the resulting classification array. Results are shown for simulations where information comes from a single frame/pixel
From the left-hand panels of Figure
This approximates situations in which the start and end of a larger contiguous region must be perceived to support accurate responding. Results are shown in Figures
Here, we set out to evaluate whether the bubbles variant of classification-image analysis (Gosselin and Schyns,
Our results demonstrate that the bubbles technique generalizes successfully from tightly controlled psychophysical stimuli (e.g., Gosselin and Schyns,
The strengths and limitations of bubbles need to be considered carefully when any new application is being planned. Relative to traditional spatial occlusion, the demands of stimulus preparation (i.e., frame by frame video manipulation) are reduced by a stochastic methodology. However, the bubbles method is correspondingly more complex, so the front-end investment may not be worthwhile unless a lab plans to test a range of scenarios across several experiments. We have highlighted some other considerations, for example the spatiotemporal scale of the bubbles. Small bubbles reveal information sources with high acuity, but may lack power to detect spatially or temporally extended cues. We have investigated only a single bubble size here, but some variation and/or combination of bubble sizes within a single experiment may prove preferable when the scale of relevant information sources is hard to predict. Several ideas along these lines can be gleaned from previous work employing the bubbles technique (Chauvin et al.,
Our work here points to a possible attention-cuing artifact for spatiotemporal bubbles, albeit one that requires further verification. However, such an artifact would really be an extreme version of a general limitation with any masking approach, which is that the masking might itself influence an observer's strategy (or their automatic processing of information) by making the image unnatural. It remains to be seen whether other forms of masking (e.g., the additive noise used in reverse correlation) could prove less disruptive in the spatiotemporal case. Clearly, tennis players do not in general see the world through bubbles, and may adapt substantially when faced with this situation. While the possible cuing artifact in our spatiotemporal experiments appears particularly egregious, it should be borne in mind that any information source revealed by bubbles reflects performance only during a bubbles experiment, not during natural viewing. For example, consider the use of information from the head/gaze, found here when predicting the direction of forehand returns. Clearly our participants
To conclude, we have demonstrated that a combination of spatial and temporal bubbles in separate experiments can be used to determine the sources of information that guide correct decisions during the real-world scenario of tennis-shot anticipation. We recommend this approach more generally, as it does not require experimenters to intuit potential sources of information in advance or to deliberately manipulate videos in accord with these hunches. Although initially challenging, the technique is easily adapted once it has been implemented, and has potential for much wider application within psychological and human-factors research.
KY and JS conceived the experiments. SJ coded the experiments and analyses. SM ran the experiments. KY drafted the manuscript. SJ, SM, CM, JS, and KY contributed to the research design and critically revised the manuscript.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Preliminary reports based on these data were presented at three academic conferences (VSS2015, BASES 2016 & ECVP 2016) and published in abstract form in the following publications: Journal of Vision; Journal of Sports Sciences; Perception.
The Supplementary Material for this article can be found online at:
Video example of bubbled trials from the temporal experiment. Frame rates have been slowed to 1/4th actual presentation rate for clarity.
Video example of bubbled trials from the spatial experiment. Frame rates have been slowed to 1/4th actual presentation rate for clarity.
Video example of bubbled trials from the spatiotemporal experiment. Frame rates have been slowed to 1/4th actual presentation rate for clarity.
1. One participant failed to provide this information.
2. We viewed this systematic confound as acceptable, as we intended to assess the broad viability and compatibility of each approach, rather than make a detailed comparison between them, but we recognise that this choice was not ideal.
3. To speed calculations, each bubble was rounded to zero beyond 4 (temporal) or 3 (spatial and spatiotemporal) σ from its centre. We selected a larger temporal bubble width in spatiotemporal compared to temporal sessions because a larger value allowed us to utilise fewer bubbles, and this proved important in terms of the time taken to generate each trial of the experiment.
4. Motion in each video was detected via algorithm, and the estimated regions were then expanded slightly to ensure that no body motion was missed.
5. In principle, this reframing can maximise power to detect information accrual at multiple points of interest in a series of analyses, but here we present data from a single coordinate transform for a relatively simple demonstration. We did explore a body-centred frame (using the navel) but it did not reveal additional sources of information missed by the analysis we present here.
6. “4-connected” is a term from image processing and describes the manner in which connectivity is determined in a 2D or 3D space. Four-connected pixels are considered neighbours to (i.e. connected with) pixels that share a side, but not pixels that share only a corner.
7. One typical approach to clustering in 3D data would be to use 3D connectivity to establish 3D clusters. Here, we instead used 2D connectivity
8. We also found no differences between these groups for serves, or in our spatial and spatiotemporal experiments, but do not illustrate all null results in order to maintain a focussed presentation.