Differential Eye Movements in Verbal and Nonverbal Search

In addition to “nonverbal search” for objects, modern life also necessitates “verbal search” for written words in variable configurations. We know less about how we locate words in novel spatial arrangements, as occurs on websites and menus, than when words are located in passages. In this study we leveraged eye tracking technology to examine the hypothesis that objects are simultaneously screened in parallel while words can only be found when each are directly foveated in serial fashion. Participants were provided with a cue (e.g. rabbit) and tasked with finding a thematically-related target (e.g. carrot) embedded within an array including a dozen distractors. The cues and arrays were comprised of object pictures on nonverbal trials, and of written words on verbal trials. In keeping with the well-established “picture superiority effect,” picture targets were identified more rapidly than word targets. Eye movement analysis showed that picture superiority was promoted by parallel viewing of objects, while words were viewed serially. Different factors influenced performance in each stimulus modality; lexical characteristics such as word frequency modulated viewing times during verbal search, while taxonomic category affected viewing times during nonverbal search. In addition to within-platform task conditions, performance was examined in cross-platform conditions where picture cues were followed by word arrays, and vice versa. Although taxonomically-related words did not capture gaze on verbal trials, they were viewed disproportionately when preceded by cross-platform picture cues. Our findings suggest that verbal and nonverbal search are associated with qualitatively different search strategies and forms of distraction, and cross-platform search incorporates characteristics of both.


INTRODUCTION
Humans regularly engage in visual search to navigate our cluttered world. We engage in "nonverbal search" for objects throughout the day, for example by picking out clothes in the morning, finding a parking space at work, or gathering ingredients to prepare dinner. Whereas many aspects of nonverbal search are shared with other animal species (Bichot et al., 2005;Goto et al., 2014), humans are unique in having written language, and we are frequently required to engage in "verbal search" through complicated arrays of words. A typical day may involve finding stories to read on a newspaper or website, emails within our inbox, songs within a playlist, or items on a restaurant menu. This type of verbal search differs from stereotyped environment of traditional text reading, in which your eyes scan from left to right along each line of text, from the top to the bottom of a page, as sentences and paragraphs are scanned in sequential order (Rayner, 1998;Rayner, 2009). Instead, words in our modern environment are often arranged in complicated, variable, and widely spaced spatial configurations, as occurs on websites, magazines, maps, menus, calendars, schedules, and advertisements. Far less is known about how we locate word targets in these novel spatial arrangements.
Few studies have examined verbal and nonverbal search under fully-comparable experimental conditions. Paivio and Begg (1974) provide an early example of one such study. Participants were shown a cue followed by an array of 25 items (including a target and 24 distractors). Stimuli appeared as either written words or pictures, and the modality of both cue and array were manipulated in a factorial design. Behavioral results differed between the two within-platform task conditions, in which picture cues were followed by picture arrays versus when word cues were followed by word arrays. Participants were faster and more accurate on the nonverbal (picture) version of the task, compared to the verbal (word) version. This finding is consistent with a more general phenomena known as the "picture superiority effect," which describes how pictures are processed more rapidly than words, and subsequent memory for pictures is stronger as well (Nelson et al., 1976;Stenberg et al., 1995). Paivio and Begg (1974) hypothesized that more efficient nonverbal search could reflect participants' engagement in a "parallel sweep" for pictures, versus a serial (item-by-item) search for words. They reasoned that visual features present in line drawing pictures would be discriminable in peripheral vision, allowing multiple objects to be screened simultaneously via covert mechanisms. In contrast, they noted that all words share similar sets of constituent letters, and may thus need to be foveated in order to be differentiated.
The primary goal of the current study is to leverage eye tracking to more directly evaluate Paivio and Begg's hypothesis. They inferred that parallel and serial search strategies (for objects and words, respectively) were being employed based on reaction time of pointing responses, but such strategies should also result in distinct patterns of eye movements. The use of serial search should be most obvious, as evidenced by a series of low-amplitude saccades to adjacent items in an array, until the target item is finally foveated. This yields the further prediction that serial search will result in roughly half of the items in the array being fixated before finding the randomly placed target.
In general it is only possible to foveate one item at a time, so any sort of parallel processing of multiple items must depend on the use of parafoveal (peripheral) vision. This should allow participants to "skip over" (i.e. pass by rather than fixate on) items that have been peripherally screened as not being the target. In eye tracking terms, this skipping behavior should be indexed by high amplitude saccades, and by fewer than half of array items being fixated on average (Seckin et al., 2016b).
In the current study we showed participants arrays of words, and on other trials arrays of object pictures, under reasonably comparable experimental conditions, while eye movements were monitored. Participants were provided with a cue (e.g. rabbit), and then tasked with locating a thematically associated target (e.g. carrot) within each array. We expected to replicate the wellestablished picture superiority effect, such that manual touchscreen responses would be more rapid on nonverbal trials, compared to verbal trials. In keeping with Paivio and Begg's original hypothesis, we predicted that verbal trials would be more closely associated with serial viewing behavior (lower amplitude saccades and more items viewed), while nonverbal trials would be associated with parallel search behavior (higher amplitude saccades and fewer items viewed).

Additional Factors Governing Verbal and Nonverbal Search
In circumstances where participants engage in parallel search, this brings up a related question: by which basis would you skip over some objects but be drawn to foveate others? Would this be based on surface-level features, or deeper conceptual similarity to the target? Results from the commonly employed "visual world paradigm" (VWP) (Cooper, 1974), in which participants are presented with four object pictures and a spoken word cue that matches one of them, suggest both types of attributes can engender distraction. Object distractors of a similar shape to the target are viewed disproportionately longer in the VWP (Dahan and Tanenhaus, 2005;Huettig and Altmann, 2007;Yee et al., 2011), as are objects from the same taxonomic category (Sorensen and Bailey, 2007;Kalenine et al., 2012;Mirman and Graziano, 2012a;Mirman and Graziano, 2012b).
We included both types of distractors in the current paradigm, with additional controls not present in the standard VWP. Firstly, we independently manipulated shape and taxonomy, by including distractors from the same category as the target but with a dissimilar overall shape, and conversely distractors sharing shape but not category in common. This is necessary if both factors are to be examined separately, as members of the same taxonomic category almost always share visual features in common (Mirman et al., 2017), and often have the same overall shape (e.g. many tools have an elongation ending in a handle; many mammals have legs, arms, and a head). Secondly, we employed a full factorial design so that the cues and target arrays could be either words or pictures (unlike the original VWP, which is a strictly cross-platform task with word cues and picture arrays), allowing us to better disentangle the mechanisms involved in verbal versus nonverbal search. Based on previous results from the VWP, we predicted that both shape and taxonomic competitors would be viewed disproportionately on non-verbal trials. If so, this would suggest that nonverbal search is guided by a mixture of perceptual and conceptual processing of picture candidates (possibly in parallel).
In contrast, if Paivio and Begg's hypothesis is correct, and the search for well-spaced words is strictly serial, then neither shape nor taxonomy should affect processing for words in the parafoveal field. Rather, participants should be equally likely to view any sort of distractor, as they work their way around the array serially. It remains possible, however, that these factors would cause gaze to linger on words after foveation. We examined Frontiers in Communication | www.frontiersin.org September 2021 | Volume 6 | Article 654575 these possibilities by evaluating both the number of distractors viewed and the duration of gaze upon viewing (Seckin et al., 2016b). Finally, our full factorial design included not only withinplatform conditions (i.e. verbal and nonverbal trials), but also cross-platform trials where word cues are followed by picture arrays and vice versa. These not only reflect every day circumstances in which words and pictures are intermixed (e.g. advertisements, signs, etc.), but allow us to compare our results to commonly employed cross-platform tasks such as the VWP. In general we expected the cross-platform results to be largely similar to their within-platform counterparts, such that participants would engage in a parallel search through pictures even when preceded by word cues, and to engage in serial search through words preceded by picture cues.

Participants
Twenty-two undergraduates completed the study for course credit, including 13 males and 9 females, as approved by the Institutional Review Board at Cleveland State University. Participants reviewed and signed a written consent form prior to engaging in the study, and the study procedures were in accordance with the ethical standards of the Declaration of Helsinki. All participants were right-handed native English speakers, with an average age of 19.7 ± 2.6 years (M ± SD). All were screened for any history of medical or developmental conditions that could affect language or other domains of cognition.

Apparatus
A 20.5 × 11.5″ touchscreen monitor was used to present visual stimuli (1920 × 1080 resolution) and to collect touch responses. The screen was placed at an optimal distance for touch responses (M ± SD 20.4 ± 1.2″), based on each participant's arm length. Eye movements were monitored and recorded using an EyeLink 1000 Plus eye tracking system (SR Research, Mississauga, ON, Canada). The eye tracking camera was placed above the participant's head, housed in a tower mount, so touch responses did not occlude the camera's view of the participant's eyes. A chin rest also housed in the tower mount minimized head movements. A nine-point calibration procedure was administered before the experiment to ensure accurate estimates of gaze position.

Materials
One hundred and seventy-five concrete object stimuli were selected for the eye tracking test (Supplementary Table S1). All items were presented as both pictures and written words throughout the course of the experiment. Picture stimuli were shaded gray scale drawings, drawn from the Rossion and Pourtois (2004) image set (N 44) and internet image searches (N 131). The pictures were scaled to 150 × 150 pixels (visual angle 4.5°× 4.5°from the average viewing distance of 20.4″). Word stimuli were the names for each item, written in lowercase 30-point Arial font. Words ranged in size from 41-226 pixels horizontally (1.2°-6.8°) and 22-37 pixels vertically (0.6°-1.1°). All stimuli were presented on a white background.
Twenty-four items were employed as cues, and paired with another 24 thematically-associated target items (e.g. "rabbit" and "carrot"). 1 On each trial the target was accompanied by 12 distractors, whose taxonomic and visual relation to the target was systematically manipulated (Supplementary Table S1). Some distractors came from the same taxonomic category as the target, while others came from a different category. Specific categories employed included animals, body parts, foods, clothing, and tools. Some distractors had a similar overall shape to the target, for example by having outer boundaries that were long and thin ("asparagus," "flute"), round ("egg," "balloon"), or having an extended process ("key," "guitar"), while others had a different shape. Distractors were equally divided (3 each) into those sharing both category and shape in common with the target (e.g. "asparagus" for the target "carrot"), same-category but different-shape (e.g. "onion"), differentcategory but same-shape (e.g. "flute"), and different-category/ different-shape (e.g. "shirt"). This allowed us to separately evaluate the influence of taxonomic and visual (shape) similarity.
Cues were presented in the center of the screen, while targets and distractors were presented in an elliptical array with a horizontal axis of 1152 pixels (33.7°) and a vertical axis of 878 pixels (26°). This aspect ratio equates parafoveal acuity across positions when centrally-fixating (Iordanescu et al., 2011). At this degree of eccentricity from center, participants are typically able to detect whether an object is present (Thorpe et al., 2001) but are unable to discern their shapes (Nelson and Loftus, 1980). As such, the target could not be covertly identified from an initial central fixation, soliciting overt eye movements. Targets on each trial were equally distributed across the 13 possible array positions, as were the four types of distractors. Items in the array were spaced an average of 7.3°apart from one another.
Relative salience of each item in the array was examined, in order to ensure that low-level visual characteristics were not directing gaze preferentially towards the target or towards a particular class of distractors. Screen captured images of each word array and each picture array were converted into salience images using the Saliency Toolbox (Walther and Koch, 2006), based on the relative spatial distribution of intensity and line orientations across the screen. Average saliency values were averaged around the 150 pixels encompassing each of the 13 stimuli in the arrays. Helmert repeated measures ANOVA contrasts showed that the target items did not differ in salience from the distractors on each trial (F (1,95) 0.028, p 0.87). The main effect of distractor type was also non-significant (F (3,285) 1.4, p 0.24), suggesting the four types of distractors were comparable in salience.

Procedure
After a brief 9-point calibration procedure, participants completed the thematic eye tracking task. There were 96 trials in the eye tracking task, which took approximately 15 min to complete. On each trial the cue appeared in the center of the screen for 2.5 s, followed by a fixation cross for 0.5 s, followed by an elliptical array of 13 items, which remained on the screen until a touch response was detected ( Figure 1). The array included the thematically-associated target and 12 distractors (including three each of same-category/same-shape, same-category/differentshape, different-category/same-shape, and different-category/ different-shape).
Stimulus modality varied across trials. On nonverbal trials (N 24) cue and array were both pictures, and on verbal trials (N 24) cue and array were both words ( Figure 1). In the remaining cross-platform trials word cues were followed by picture arrays or vice versa (N 24 each). The experiment was administered in blocks of six trials, each starting off with an instruction screen indicating the relevant modality for that block (e.g. "Now you will match words with pictures") along with an example cue and array. Stimulus modality was changed after each block, preventing results in any modal version from being confounded by practice, fatigue, or other ordering effects. The first block of six trials were nonverbal (picture-to-picture), followed by a block of cross-platform word-to-picture trials, followed by a block of verbal trials (word-to-word), followed by a block of cross-platform picture-to-word trials. There were four cycles of this repeated block pattern throughout the experiment, resulting in 96 total trials. Each cue and target appeared four times throughout the experiment: twice as word stimuli (once within-platform, once cross-platform), and twice as picture stimuli (once within-platform, once cross-platform). Repetition of stimuli was balanced across blocks and cycles, such that the stimuli in each modal block were matched for the number of prior exposures (avoiding potential confounds between modality and degree of exposure). All participants completed these trials, blocks, and cycles in the same pseudorandomized order.

Eye Movement Acquisition and Coding
Eye movements were recorded at a sampling rate of 500 Hz, in epochs starting with the onset of the array and ending with each touch response. The EyeLink Data Viewer software package was used to classify each continuous recording into blinks, saccades, and fixation. Periods without a detectable pupil were classified as blinks. Saccades were identified as eye movements that crossed motion (0.15°), velocity (30°/s), and acceleration thresholds (8,000°/s 2 ). Any remaining period lasting ≥40 ms was classified as a fixation.
Areas of interest (AOIs) were constructed around each of the 13 items in the array (Supplementary Figure S1), and were used to classify fixations and saccades. An imaginary outer iso-acuity ellipse (with the same proportions as the stimulus array) was plotted, which extended to the left, right, top, and bottom outer borders of the screen. An inner concentric ellipse was also plotted, at the same distance from the stimulus ellipse but located closer to the center of the screen. Straight lines were then plotted between FIGURE 1 | Schematic of the eye tracking task design. On nonverbal trials picture cues were followed by an array of pictures including the target and 12 distractors. On verbal trials cue and array items were written nouns corresponding to the same items. Participants were instructed to touch the target item.
Frontiers in Communication | www.frontiersin.org September 2021 | Volume 6 | Article 654575 the inner and outer boundary ellipses, such that each line bisected the space between adjacent items in the array. The inner and outer points of each bisecting line were then connected to create a series of trapezoidal AOIs, each centered on an item in the array. This maximized our ability to assign mildly mis-calibrated fixations to the nearest item in the array. Fixations falling outside the AOIs (in the center or peripheral corners of the screen), and saccades with a starting or ending point outside the AOIs, were excluded from analysis. Eye movements on inaccurate trials were also excluded from analysis (<6% of trials per condition). Percent viewing time for each item was calculated as the summed duration of all fixations in that item's AOI, on that trial, divided by the total length of time spent in all AOIs on that trial. Percent viewing times were then used as a dependent variable in statistical analysis.

Statistical Analysis
Normality of the dependent variables, including touch response accuracy and reaction times as well as percent dwell times, was assessed via Kolmogorov-Smirnov tests. All variables were found to be normally distributed [p > 0.05 in all task conditions (verbal, non-verbal, and cross-platform trials)], with the exception of accuracy (p < 0.05 in all task conditions). Accuracy data were therefore examined via non-parametric Wilcoxon signed-ranks tests. Reaction times were analyzed via paired-samples t-tests, and percent viewing times on distractors were analyzed with 2 × 2 repeated-measures ANOVAs across shape and taxonomic category. A series of additional factors for percent viewing times (e.g. lexical frequency) were analyzed at the trial level using generalized linear mixed-effects models (GLMMs). More details are presented throughout the Results. All statistical analyses were conducted with SPSS v25 (IBM Corp, Armonk, NY, United States) and SAS version 9.2 (SAS Institute Inc., Cary, NC, United States).

RESULTS
Results from nonverbal and verbal trials will first be described in detail, followed by a summary of cross-platform results.

Saccade Classification
Scan paths such as the ones shown in Figure 2 demonstrate that participants tended to engage in a mixture of low and high amplitude saccades on nonverbal trials, but tended to adopt a more stereotyped and exhaustive "serial search" strategy (Seckin et al., 2016a;Seckin et al., 2016b) on verbal trials. In order to quantify these patterns, saccades were classified into three types based on their start and end points: within-item saccades starting and ending in the same AOI, local saccades where the start and end points are in adjacent AOIs (corresponding to neighboring items along the array), and long-range saccades where the end point is in a non-adjacent AOI. Local saccades that were preceded by another local saccade in the same direction (clockwise or counterclockwise) were further coded as being serial saccades, with the remainder of local saccades being labeled as non-serial. The mean amplitude of within-item saccades was 1.31 ± 0.17°, local saccades (including serial) had a mean amplitude of 4.38 ± 0.19°, and long-range saccades had an average amplitude of 11.34 ± 0.67°.
As shown in Figure 3, serial saccades were more prominent on verbal trials (29.45 ± 9.96% of all saccades) than nonverbal trials FIGURE 2 | Sample scan paths. Eye movement scan paths are shown from a participant during a sample nonverbal trial and a corresponding verbal trial. In both trials the cue was a whisk, and the target was a bowl.
The prevalence of long-range over serial saccades suggests participants employed covert attentional mechanisms during nonverbal search, which may have contributed to capture on those trials. A parafoveal mechanism of capture would draw gaze towards an object distractor in the periphery, making it more likely to be viewed on each trial (Seckin et al., 2016b). To assess this possibility, we calculated the number of distractors viewed on each trial, such that at least one fixation fell within their respective AOIs.

Modeling Fixation Predictors
Neither shape nor category affected percent viewing times on words, but participants were slower and less accurate on verbal trials, suggesting that other perceptual or conceptual factors may have governed word-based interference. Univariate and multivariable generalized linear mixed-effects models (GLMM) were constructed to examine which distractor attributes were associated with greater percent viewing times (as the dependent outcome measure). Predictors included distance from target (number of intervening items between the distractor and target item along the array), saliency, category (same/different), shape (same/different), word length (number of letters in each distractor word), word length difference from target (computed as the absolute value of the length of distractor word minus the length of the target word), and lexical frequency. Trials were nested within participants, with random-effects accounting for clustering. Random intercept was included with unstructured covariance structure. Distractor percent viewing times were included as outcome measures in separate GLMMs for verbal and nonverbal trials. Bonferroni correction was applied to the significance threshold in univariate models, to control for elevated type 1 error when testing seven predictors independently (α 0.007).
Rather than shape and category, lexical predictors accounted for variance on verbal trials ( Table 1). Participants spent increased FIGURE 4 | Each trial included distractors that were from the same or different category as the target, and had the same general shape or a different shape from the target. Average time spent fixating each class of distractor is shown as a percentage of total fixation time during that trial. *: p < 0.001.
Frontiers in Communication | www.frontiersin.org September 2021 | Volume 6 | Article 654575 6 time viewing words with more letters (absolute word length), words of a similar length to the target, and lower frequency words. Longer words also tend to be less frequent, causing the latter to be non-significant in the multivariable model. Lexical predictors were not associated with viewing times on nonverbal trials (Table 2), where shape and category again proved to be the relevant factors. Participants spent more time viewing items that were closer to the target (distance to target) on both trial types. Salience did not affect viewing times, and did not interact with other factors in the multivariable models.

Results From Cross-Platform Trials
There were two types of cross-platform trials: word cues followed by picture arrays (word-to-picture), and picture cues followed by word arrays (picture-to-word). Accuracy was 95.26 ± 6.07% on word-to-picture trials and 96.96 ± 4.66% on picture-to-word trials. Reaction times were 2,780 ± 613 ms on word-to-picture trials, and 2,910 ± 535 ms on picture-to-word trials. When compared directly to their within-platform counterparts, word-to-picture accuracy (Z 1.5; p 0.135; η 2 0.107) and reaction times (t (21) 0.29; p 0.772; η 2 0.004) did not differ significantly from those on nonverbal (i.e. picture-to-picture) trials, but responses on picture-to-word trials were significantly more accurate (Z 2.3; p 0.021; η 2 0.252) and faster (t (21) 2.47; p 0.022; η 2 0.225) compared to performance on verbal trials.
GLMMs were conducted to identify additional factors related to distractor viewing times on cross-platform trials. Shape and category were significant predictors on word-to-picture viewing times, as was the case for nonverbal trials (Supplementary Table  S2). Unlike nonverbal trials, absolute word length was also significantly associated with word-to-picture viewing times in the univariate model. This effect did not achieve significance in the multivariable model (p 0.055) due to shared variance with other predictors (shape, and word length difference from target). In contrast, absolute word length and lexical frequency were associated with viewing times on picture-to-word trials ( Supplementary Table S3), as was the case for verbal trials.

DISCUSSION
Participants in the current study engaged in rigorous verbal and nonverbal search tasks under comparable conditions, while eye movements were monitored. Participants were highly accurate in all task conditions, suggesting that the verbal and nonverbal tasks were matched in difficulty (Thierry and Price, 2006). There were, however, subtle but significant differences in performance between platforms. Nonverbal search was 11% faster and 3% more accurate than verbal search, consistent with the wellestablished picture superiority effect (Nelson et al., 1976).
Eye movements revealed some of the mechanisms underlying picture superiority in the context of visual search. Verbal search was dominated by serial eye movements, as evidenced by a series of clockwise or counter-clockwise low-amplitude saccades to adjacent words in the array. In contrast, the more efficient nonverbal search was associated with more frequent long-range saccades to nonadjacent items, and fewer total items being viewed on each trial.
These discrepant strategies are supportive of Paivio and Begg's initial hypothesis that, unlike words which require serial search, we are able to engage in a "parallel sweep" for objects (Paivio and Begg, 1974). Only one object can be foveated at a time, so parallel viewing depends upon the effective use of peripheral vision, leveraging covert attention to scan small clusters of objects simultaneously (Smith et al., 2014;Fluharty et al., 2016). Saccadic patterns observed in the nonverbal task are consistent with this interpretation. After executing a long-range saccade to a sector of the array, participants seem able to simultaneously screen not only the foveated item (overtly) but also adjacent pictures in the array (covertly). Participants can then execute a local saccade to the target if it happens to be present within the effective parafoveal field. If the target is not present, the participant can skip over covertly screened adjacent pictures, and instead execute another long-range saccade to a new sector of the array.
As further evidence that object features were discernable in the periphery, participants were more likely to fixate on distractors from the same category and of a similar shape to the target on nonverbal trials (i.e. viewed a greater number of those distractors on each trial). Participants fixated on fewer total distractor pictures than distractor words, however, and identified picture targets more rapidly than word targets, suggesting that employment of covert attention was largely advantageous to nonverbal search. In addition to parallel capacity, covert attention can be directed faster than overt eye movements can be executed (Posner et al., 1980), which may further serve to accelerate nonverbal search and promote picture superiority in search contexts.
Although word stimuli in the current study were presented at roughly the same size as picture stimuli, and placed at the same eccentricity along each array (7.3°), results from verbal trials suggest that word features were not processed to any meaningful extent in peripheral vision. Participants viewed an average of 6.5 out of the 13 words on each trial before initiating a touch response, which is precisely what would be predicted in a scenario where the target is placed randomly and every item needs to be directly foveated in order to be rejected. The finding that peripheral pictures but not words were discernable in our arrays is consistent with the general principle that the size of the effective parafoveal field (also known as "perceptual span") depends on the modality of the stimuli being viewed. Estimates of the effective field size for objects in scenes range from 4-10° (Parker, 1978;Henderson et al., 2003), while the effective field for letters and words is limited to 3-4 character spaces to the left of fixation and 14-15 spaces to the right of fixation, at least in the context of passage reading (Rayner, 2009).
A variety of visual attributes inherent to written English words likely contribute to their being difficult to discern in the periphery. All words have a similar overall shape, with longer horizontal than vertical axes. Whereas the visuospatial arrangement of object features is meaningful (e.g. the "business end" of objects is often distal, and many are symmetrical), letters are instead ordered via a Frontiers in Communication | www.frontiersin.org September 2021 | Volume 6 | Article 654575 8 symbolic mapping with temporal speech sounds. Theoretically, each letter within a word may be any of 26 in the English alphabet, but in practice the six most frequent letters represent over half of those encountered in text (Zim, 1966;Soanes and Stevenson, 2004;Norvig, 2012), resulting in substantial letter overlap across words. Letters themselves are comprised of a limited set of features, including curves, lines, and dots, and are closely spaced within each word. Physical similarity between letters acts to further reduce the effective span (Rayner and Fisher, 1987). Put more simply, all words look the same, and the lack of distinguishing features renders them less actionable in the periphery.
When visual features in array items are highly overlapping and "inseparable," a serial rather than parallel strategy becomes more effective (Treisman and Gelade, 1980), as was apparently the case for words in the current study. Serial viewing is also dominant in traditional reading contexts, during which some words are skipped over, but the majority are sequentially fixated in passages (Rayner, 1998;2009). We should point out that the use of a serial strategy is not mutually exclusive with use of covert attention (Cave and Wolfe, 1990). Typical adults covertly screen adjacent words during passage reading, particularly to the right of those being fixated, which conveys a "preview benefit" (Briihl and Inhoff, 1995;Miellet and Sparrow, 2004). Words in the current study were spaced further apart than in passages, however, and were vertically displaced from one another, which drastically curtails peripheral processing of word and letter content in horizontal lines and strings (Inhoff and Briihl, 1991;Pollatsek et al., 1993).
The current findings may therefore be more relevant for contemporary everyday reading behaviors, where words are distributed in variable configurations rather than traditional text passages. The content of words is less likely to draw gaze in such configurations, as further evidenced by the finding that banner advertisements are no more likely to be viewed when they are conceptually-congruent with web sites (Hervet et al., 2011). Instead, important verbal material may be more effectively highlighted by low-level manipulations such as font, color, and placement (Lohse, 1997;Kuisma et al., 2010;Slattery and Rayner, 2010). The current results may also help explain why pictures are more likely to draw gaze to advertisements than text (Pieters and Wedel, 2004), as they are identified in a wider effective parafoveal field.

Different Factors Capture Gaze During Verbal and Nonverbal Search
Saccadic patterns indicated that objects can be screened in parallel, so additional fixation analyses were conducted to determine why some objects were skipped over while others were more effective at capturing attention. Distractors similar in shape to the target object, as well as distractors from the same taxonomic category, were viewed disproportionately during nonverbal search. These findings are consistent with previous results from the VWP (Dahan and Tanenhaus, 2005;Huettig and Altmann, 2005;Huettig and Altmann, 2007;Sorensen and Bailey, 2007;Yee et al., 2009;Yee et al., 2011;Kalenine et al., 2012;Mirman and Graziano, 2012a;Mirman and Graziano, 2012b), and demonstrate that both factors are influential in entirely nonverbal task conditions as well. Importantly, taxonomic distractors were effective at capturing gaze even when they had a dissimilar shape to the target, suggesting that taxonomic competition between pictures was not limited to shared perceptual characteristics. This finding is consistent with theoretical accounts in which category coordinates are co-activated via conceptual priming (Collins and Loftus, 1975;Masson, 1995;Hutchison, 2003;Proklova et al., 2016). In summary, the relative efficiency of nonverbal search appears to be facilitated by parallel screening; objects which are perceptually or conceptually similar to the target are more likely to be foveated, while unrelated objects are more likely to be skipped over.
In contrast, the serial strategy taken by participants during verbal trials ensured that no type of distractor was more likely to be viewed. Instead, adjacent words were viewed one by one until the target was identified, regardless of their composition. It remained theoretically possible that shape and taxonomy could still capture gaze through foveal mechanisms, such that gaze would linger on those types of distractors after being fixated (Seckin et al., 2016b). These distractors were viewed for the same duration as unrelated words, however, refuting the idea that either shape or taxonomy could act as an attentional "sinkhole." Lack of shape effects for words is consistent with results from Huettig and McQueen (2007), who also failed to find any influence of shape when they conducted an all-verbal adaptation of the VWP. As described in the picture superiority literature (Nelson et al., 1976;Nelson et al., 1977;Sperber et al., 1979), features including shape are directly represented in pictures, but would need to be retrieved from long-term memory when viewing words, potentially explaining why shape did not influence verbal search.
The absence of a taxonomic effect during verbal search is more perplexing, as taxonomic priming is consistently observed when processing pairs of words in psycholinguistic paradigms, resulting in faster reaction times, lower amplitude N400 electro-and magnetoencephalographic potentials, and hemodynamic suppression for words from the same category compared to unrelated words (Bentin et al., 1985;Rossell et al., 2003;Wheatley et al., 2005;Lewis et al., 2015). The most obvious explanation would be that lack of peripheral information for words may have therefore mitigated their ability to capture attention, but this explanation is unsatisfying as taxonomic capture was observed on cross-platform trials with the exact same word arrays (see next section for discussion of cross-platform results). We are therefore left with the curious finding that taxonomic capture emerges when either cues or targets are pictures, but evaporates in entirely verbal task circumstances.
Taxonomic processing was somehow marginalized during verbal search in the current study, suggesting that, rather than being obligatory, automatic, and ubiquitous, at least some types of lexico-conceptual information are evoked in a task-dependent manner. This is consistent with (Friedman and Bourne's, 1976) contention that ". . . multiple codes or representations do not exist to be activated by the appropriate stimuli, but rather the stimuli themselves embody levels of information which are encoded and used as needed." According to their perspective (which was motivated by a levels-of-processing framework), words and objects may each be flexibly processed at different levels of depth, extracting qualitatively different types of information according to the demands of the situation. Word category becomes pertinent when search is guided Frontiers in Communication | www.frontiersin.org September 2021 | Volume 6 | Article 654575 9 by picture cues, but not when guided by word cues, suggesting that pictures somehow elicit a taxonomic "mindset" that is applied to the search task. This possibility requires further study.
The absence of taxonomic effects during verbal search begs the question as to what sorts of lexico-conceptual information, if any, were accessed as participants viewed words in the current study. Alternatively, were participants merely screening for the sequence of letters that comprise the target word, constraining processing to a superficial orthographic level rather than engaging in lexicoconceptual access? A purely orthographic strategy would be efficient, as participants could rapidly reject each distractor when the first letter does not match the target, but the results from GLMM analyses suggest this was not the case. Lexical frequency was found to significantly modulate fixation durations, such that low frequency word distractors were viewed for longer, which is a common finding in studies of passage reading (Rayner, 1998;Kliegl et al., 2004;White, 2008). This suggests that stored lexical representations corresponding to each word distractor were indeed accessed during verbal search.
Viewing time GLMMs revealed additional physical characteristics of word distractors which caused gaze to linger. The absolute length of words affected gaze, such that words with more characters were viewed for longer (irrespective of their relationship with the target word), which is also common when reading text passages (Kliegl et al., 2004). More interestingly, words closer in length to the target were also viewed for a greater duration. This suggests that participants were employing some sort of whole-word recognition strategy (Ellis and Young, 1988) as the general form of the target word was employed as a template during verbal search, providing top-down guidance (Wolfe et al., 2004;Schmidt and Zelinsky, 2009).
In summary, qualitatively different factors were found to govern verbal compared to nonverbal search, with the former being modulated by orthographic and lexical factors such as word length and frequency, while the latter was affected by the taxonomic category and shape of the picture distractors.

Characteristics of Cross-Platform Search
In addition to within-platform conditions, the factorial design of the current study also included blocks of cross-platform trials where word cues were followed by picture arrays and vice versa. As expected, saccadic patterns indicated that the strategies employed in cross-platform search were largely dictated by the modality of the stimulus arrays. Long-range saccades were prevalent in word-topicture trials, as was the case for nonverbal trials. Likewise, pictureto-word search was dominated by serial viewing of word stimuli, as was the case on verbal trials. These strategies are plausibly driven by the quality of parafoveal information available in the display, allowing pictures but not words to be viewed in parallel. Serial search is inherently slower, as reflected by greater reaction times on picture-to-word as compared to word-to-picture trials.
There were, however, subtle differences in performance on cross-platform compared to within-platform trials. Given that the physical image of the array on the retina was identical on crossplatform trials and their within-platform counterparts, these differences are attributable to the platform of the cue being held in working memory. Although Paivio and Begg (1974) found no differences in search times for word targets preceded by picture cues versus word cues, we found that search was 7% faster and 3% more accurate on cross-platform picture-to-word trials (compared to verbal trials). Analysis of fixation times revealed that participants spent proportionately less time viewing distractors on picture-to-word trials compared to verbal trials, further indicating that top-down guidance from picture cues was more effective when searching for words.
Previous studies employing picture targets found they were identified more rapidly when preceded by picture cues, compared to word cues (Paivio and Begg, 1974;Wolfe et al., 2004;Schmidt and Zelinsky, 2009). In the current study, there were no significant differences in reaction times or accuracy on word-to-picture compared to picture-to-picture (nonverbal) trials. This is unlikely to represent a type II error, as the effect was not only non-significant but the reaction time means were patterned in the opposite direction as predicted, being 25 ms faster on word-to-picture trials.
Picture cueing benefits in previous studies were interpreted as reflecting the creation of a more effective guidance template, which could then be applied during the search for a subsequent picture target. Schmidt and Zelinsky (2009) found that the more specific the cue was, the larger the benefit, with category names being the least effective cue, followed by specific object names, and picture cues being the most effective. Unlike previous studies, picture cues in the current study were thematic associates of the subsequent target, rather than identically-matching pictures. As such, the resultant guidance templates may have been less specific, failing to convey a substantial benefit for picture cues (as compared to word cues) when searching for picture targets.
In general, the same factors that captured gaze on withinplatform trials tended to elicit distraction in their cross-platform counterparts: shape and category were influential on word-topicture trials (as was the case for nonverbal trials), while word length and frequency were influential on picture-to-word trials (as was the case for verbal trials). There were some notable exceptions, however, demonstrating that the mixed cues on cross-platform trials affected subsequent distraction. Firstly, the shape effect was of higher amplitude on word-to-picture compared to entirely nonverbal trials. More strikingly, other factors were found to selectively affect viewing times only on cross-platform trials. Although performance on nonverbal trials was unaffected by any orthographic or lexical factors, word length significantly modulated gaze on word-to-picture trials. Participants viewed object pictures with longer names for a greater amount of time, suggesting that those names were covertly accessed to an extent that was consequential for performance, but only when those pictures were preceded by a word cue. Covert naming effects have also been observed when viewing pictures in the crossplatform VWP (Dahan et al., 2001).
Taxonomic capture was of equivalent amplitude on word-topicture and nonverbal trials, and was far greater on picture-to-word compared to verbal trials (with the effect being non-significant in the latter). Although taxonomic capture was observed in both types of cross-platform trials, there appeared to be different underlying mechanisms. Participants spent a disproportionate percentage of time viewing same-category distractors on both types of crossplatform trials, but only viewed a greater number of samecategory distractors on word-to-picture trials. This indicates parafoveal involvement on word-to-picture trials, such that gaze was drawn towards same-category pictures in the periphery, as was the case on nonverbal trials. In contrast, lack of peripheral discriminability meant that participants were no more likely to view same-category words on picture-to-word trials, so preferential viewing of lexical competitors was more likely based on foveal mechanisms.
Two conclusions can be drawn based on the results from crossplatform trials. Firstly, although it would be reasonable to expect that within-platform cues would provide more direct and therefore stronger guidance, the inclusion of cross-platform cues does not appear to be deleterious for visual search. Word-to-picture search was equally efficient as picture-to-picture search, while picture-to-word search was slightly more efficient than word-to-word search. Secondly, whereas separate and distinct sets of factors were found to create distraction during verbal and nonverbal search, these sets overlapped during cross-platform search. Orthographic and lexical factors affected word viewing during verbal search, while taxonomy and shape affected picture viewing during nonverbal search, but a mixture of both sets of factors were influential in the cross-platform conditions. This makes sense given that both words and pictures are present, and must be considered in relation to one another, on each cross-platform trial. Word-to-picture search appears to have encouraged verbal mediation while viewing objects, as evidenced by a word length effect. Likewise, picture-to-word search apparently encouraged nonverbal mediation while viewing words, as evidenced by taxonomic capture (which was not present in the entirely verbal condition). This raises the possibility that many of the most commonly-administered language tasks such as the VWP and the picture-word interference task (Schriefers et al., 1990) may also encourage nonverbal mediation (via the inclusion of picture stimuli), resulting in effects which would otherwise not be present in a purely linguistic design.

Limitations
The current results indicate that parallel viewing is one of the core mechanisms underlying picture superiority in the context of visual search. Picture superiority is undoubtedly supported by additional mechanisms, however, as superiority effects are observed in situations where parallel processing is not possible, such as when judging solitary items (Sperber et al., 1979;Bajo, 1988;Stenberg et al., 1995). Future studies, particularly those involving novel paradigms and convergent techniques, may further our understanding of the differential mechanisms underlying word and object recognition.
We tried to place words and objects on an equal footing in the current eye tracking paradigm, equating them for size and position in each array. The effective parafoveal field is larger, however, for objects than for words, so there was greater opportunity for parafoveal involvement on nonverbal trials. Although one could argue that these differences in effective span are an important aspect of verbal and nonverbal search in real world circumstances, future studies could also attempt to equate both types of modal arrays for effective span, in order to address a slightly different set of research questions.
We also repeated the same set of cues, targets, and distractors across platforms. Again, this was done in the interest of placing words and objects on equal footing, as the exact same items were present on verbal, nonverbal, and cross-platform trials, which eliminated a host of potential confounds. Repetition surely affects processing however, and "long-term" repetition (i.e. with lag times that fall outside the span of working memory) has been found to have differential effects across stimulus modalities, with effects being strongest in cross-platform conditions (Stenberg et al., 1995). It would be interesting to observe whether the effects reported in this study would change if repetition was either removed altogether (Schmidt and Zelinsky, 2009), or systematically manipulated.
We manipulated the shape of pictures in the current study, in order to disentangle perceptual from conceptual influences on visual search, and to further isolate the mechanisms of taxonomic capture (by removing the visual similarity between category competitors). Our shape manipulation was based on the overall outline of objects, however, which were crudely classified into being roughly round or roughly elongated. It is possible, however, for two objects to have different outer contours, but still share interior features in common, particularly when those objects are from the same category. For example, an eel and a blowfish both have eyes, mouths, and other features that allow them to be recognized as fish. It may be impossible to fully separate the physical similarity between taxonomic coordinates, as shared features are part of the rationale for grouping objects into categories in the first place (Mirman et al., 2017), but there are certainly ways in which shape can be more assiduously controlled. For example, the overlap of specific features (e.g. eyes, claws, handles) rather than contours could be manipulated, and objective computational techniques could be leveraged to quantify similarity in shape.
The current set of results revealed differential verbal and nonverbal mechanisms as typical adults searched for thematic associates. This fits comfortably with a dual route account (Hurley et al., 2012;Hurley et al., 2018), in which humans are able to draw upon both verbal and nonverbal routes to access conceptual knowledge such as thematic associations. 2 The presence of two routes would provide robustness to a conceptual system: when one route is blocked by neurological damage, knowledge may instead be accessed via an alternate route. In the healthy brain, however, the neural pathways underling word and object recognition are richly interconnected (Mesulam, 1998;Binney et al., 2012;Zhang et al., 2016), which supports our ability to rapidly invoke imagery based on nouns (Reddy et al., 2010) and name pictures covertly (Ellis et al., 2006). To the extent that there are indeed separate routes for verbal and nonverbal knowledge, it is likely that typical adults in the current study flexibly drew upon both as they searched for thematic associates, as to do so would convey a conceptual advantage (Paivio, 1986). The current study design attempted to quarantine verbal from nonverbal stimuli, but we cannot dictate where those stimuli are processed internally after being transduced. The likelihood of "transmodal promiscuity" in the healthy brain limits our ability to attribute any given effect to a single network or pathway, even on within-platform trials.
A logical next step would be to examine how search is affected by neurological syndromes where one route is selectively blocked, as occurs in disorders of language (aphasias) and object recognition (visual agnosias). Although the ventral streams of the object recognition and language networks run alongside one another in close proximity along the temporal lobe (Mesulam, 1998;Hickok and Poeppel, 2004), the critical nodes of each network may be differentially instantiated in each hemisphere. Corruption of word knowledge has been linked to left temporal lesions (Dronkers et al., 2004;Hurley et al., 2012), and inability to recognize objects has been linked to right temporal lesions (Butler et al., 2009;Hurley et al., 2018). One may thus predict temporal lobe damage to differentially affect either verbal or nonverbal search, depending on which hemisphere is most compromised.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Cleveland State University IRB. The participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
RH designed the paradigm, tested participants, analyzed data, prepared figures, and wrote the manuscript. JS analyzed data, prepared figures, and edited the manuscript. KN tested participants, analyzed data, prepared figures, and edited the manuscript. BL analyzed data, prepared figures, and edited the manuscript. WH designed the paradigm, prepared figures, and edited the manuscript. MS designed the paradigm, prepared figures, and edited the manuscript.