Learning Spoken Words via the Ears and Eyes: Evidence from 30-Month-Old Children

From the very first moments of their lives, infants are able to link specific movements of the visual articulators to auditory speech signals. However, recent evidence indicates that infants focus primarily on auditory speech signals when learning new words. Here, we ask whether 30-month-old children are able to learn new words based solely on visible speech information, and whether information from both auditory and visual modalities is available after learning in only one modality. To test this, children were taught new lexical mappings. One group of children experienced the words in the auditory modality (i.e., acoustic form of the word with no accompanying face). Another group experienced the words in the visual modality (seeing a silent talking face). Lexical recognition was tested in either the learning modality or in the other modality. Results revealed successful word learning in either modality. Results further showed cross-modal recognition following an auditory-only, but not a visual-only, experience of the words. Together, these findings suggest that visible speech becomes increasingly informative for the purpose of lexical learning, but that an auditory-only experience evokes a cross-modal representation of the words.


INTRODUCTION
From the very first moments of their lives, infants experience the multisensory nature of speech. Through face-to-face interactions, infants simultaneously hear the auditory speech signal and see the accompanying movements of the speaker's face (Altvater-Mackensen and Grossmann, 2015). In adults, visible speech conveys redundant and complementary information (Miller and Nicely, 1955; Robert-Ribes et al., 1998) that reliably enhances auditory phonetic perception (e.g., Samuel and Lieblich, 2014) and facilitates lexical recognition (Brancazio, 2004; Barutchu et al., 2008; Buchwald et al., 2009; Fort et al., 2010, 2013; Havy et al., 2017). Here we ask whether young children, who have considerably less experience in watching others' articulators, can benefit from visible speech as they learn new words.
Studies with infants document a surprisingly early preparedness for audio-visual speech perception. As they enter the world, infants are sensitive to the dynamic properties of faces (Guellaï et al., 2011) and can link specific movements of the visual articulators to the auditory speech signal across different dimensions (temporal: Lewkowicz and Pons, 2013; Baart et al., 2014; spectral: Kuhl et al., 1991). For instance, when presented with a side-by-side display of two talking faces producing two different vowels while hearing the auditory track of only one of the vowels, infants spontaneously look more at the face that matches the auditory track (e.g., by 2 months for vowels: Kuhl and Meltzoff, 1982; Patterson and Werker, 2003; Streri et al., 2016; by 6 months for consonants: Pons et al., 2009). Infants do not merely associate auditory and visible speech, but can integrate both sets of information. This can typically be seen in the McGurk effect, in which conflicting auditory (e.g., /ba/) and visible speech information (e.g., /ga/) of syllables lead to a unified percept that integrates both sensory modalities (e.g., /da/) (e.g., at 2-5 months: Burnham and Dodd, 2004; Bristow et al., 2009).
Over the course of development, multisensory perceptual abilities become increasingly more sophisticated and greatly influence language acquisition. By 6-10 months, infants deploy selective attention to the mouth of a talking face (Lewkowicz and Hansen-Tift, 2012) and can discern one language from another simply from viewing the silently articulated face (Weikum et al., 2007;Lewkowicz and Pons, 2013;Kubicek et al., 2014). Around the same age, infants use visible speech to resolve the identity of individual phonemes and learn novel phonetic categories (Teinonen et al., 2008). As they accumulate language experience, infants retain the intersensory relations that are phonemically relevant to their native language (rhythmic: Lewkowicz and Pons, 2013;Kubicek et al., 2014;phonetic: Pons et al., 2009;Danielson et al., 2017). However, little is known about how, and how early in development, visible speech begins to shape lexical representations.
Given their initial predispositions, infants should be able to represent visible speech information as they learn their first words. This is supported by evidence linking early attentiveness to visible speech to later lexical achievements. In particular, infants who pay more attention to the mouth of their interlocutor (Young et al., 2009) and respond more to incongruent audiovisual vowels at 6 months (Altvater-Mackensen et al., 2016) are more likely to have higher receptive and expressive vocabularies at 12 (Altvater-Mackensen et al., 2016) and 24 months, respectively (Young et al., 2009).
Yet, alternative evidence paints a more nuanced picture and suggests instead a relative independence of visible speech from the lexical domain until late in development. First, studies have revealed that infants' attention is not equal across auditory and visual modalities. When processing audio-visual compounds involving objects and non-speech sounds, infants exhibit an auditory dominance and allocate more attention to the auditory than to the visual information (e.g., at 8-16 months: Robinson and Sloutsky, 2004; Sloutsky and Robinson, 2008). This bias attenuates as they advance in age, especially during pre-school years (Robinson and Sloutsky, 2004).
Second, there is evidence from speech perception studies that audio-visual processing of non-sense forms (sounds, sound combinations, pseudo-words) continues to improve throughout childhood. For example, audio-visual matching is not consistently found for certain vowels until 9 months (Altvater-Mackensen et al., 2016; Streri et al., 2016) and for sine-wave speech until 6 years (Baart et al., 2015). Further, infants (e.g., at 4 months: Desjardins and Werker, 2004) and school-age children from 3 to 8 years old are less amenable to the McGurk illusion than older children (e.g., at 11 years) and adults, and are more likely to show an auditory capture (e.g., /ba/) of the conflict (e.g., auditory /ba/ and visual /ga/; McGurk and MacDonald, 1976; Sekiyama and Burnham, 2008).
Third, the literature on familiar word recognition offers diverging evidence in childhood. For example, there is evidence that infants and children can differentiate words from pseudo-words in both the auditory and the visual modalities (e.g., at 12-13 months: Weatherhead and White, 2017) and can use visible speech to boost lexical recognition in normal (e.g., at 4 and 10-14 years: Jerger et al., 2009) and adverse listening conditions (e.g., at 3-4 years: Grieco-Calub and Olson, 2015; Lalonde and Holt, 2015; at 6-14 years: Ross et al., 2011; Maidment et al., 2014). Yet, contrasting evidence also reports that children do not reliably attend to visible speech. For example, Jerger et al. (2014) found that 4-year-old children were able to use the visual input to restore the excised onset of pseudo-words presented auditorily but did not show the visible speech fill-in effect for words. Relatedly, Fort et al. (2012) found that 6-to-10-year-old children were able to identify the presence of a target phoneme within words and pseudo-words embedded in noise. Phoneme identification was higher in the audio-visual than in the auditory modality and overall was higher for words than for pseudo-words (word superiority effect). The word superiority effect was evident in the auditory and the audio-visual modalities. Yet, and unlike for adults, it was not increased by the presence of visible speech, as visible speech enhanced perception equally for words and pseudo-words. This was interpreted as evidence that early on, visible speech contributes to pre-lexical units (sounds and sound combinations which do not take into consideration their associated meaning) more than to lexical units (sounds and sound combinations which take into consideration their associated meaning).
In general, the current literature does not provide any clear-cut evidence as to whether visible speech is contained in early lexical representations. Task demands may account for part of the observed variability, as a greater visual influence has been observed on indirect measures (implicit retrieval), which require fewer processing resources than direct measures (overt responses) (Jerger et al., 2009, 2014). Critically, much of the current research does not distinguish between the two mechanisms whereby visible speech may become part of lexical representations. First, visible speech may be stored directly by encoding the visually available information from the input. Second, visible speech may be incorporated through cross-modal translation of the auditory input. To what extent these mechanisms shape early lexical representations and their precise time course of development remains largely unknown. However, one research project has begun to address this issue. Havy et al. (2017) asked whether 18-month-old English-learning infants were able to learn new lexical mappings in either auditory or visual modality. The purpose was twofold: firstly to determine whether visible speech alone can be used to guide lexical learning, and secondly whether information from either auditory or visual modalities is available through cross-modal translation of the input. The task involved learning the names of two objects (Word A-Object A and Word B-Object B) in either auditory (auditory stream only) or visual modality (silent talking face). At test, infants were asked to look at the object being named. Lexical recognition was tested either in the same modality as the one used during the learning phase (same modality test condition, i.e., auditory after auditory learning, visual after visual learning) or in the other modality (cross-modality test condition, i.e., visual after auditory learning, auditory after visual learning).
The results revealed that infants were able to learn new word-referent mappings in the auditory modality and when tested could recognize the mapping when presented either in the auditory or visual modality. However, unlike adults, infants did not show evidence of lexical learning in the visual modality. This pattern was interpreted as evidence that at 18 months, infants favor auditory speech information as they learn new lexical mappings and represent visible speech through cross-modal translation of the auditory input. This finding is consistent with the view discussed earlier, suggesting a general auditory dominance in how infants process audiovisual information (Robinson and Sloutsky, 2004;Sloutsky and Robinson, 2008).
Following on from Havy et al. (2017), the purpose of the current study is to determine at what age children reliably start to use visible speech as they learn words. We have focused on the age of 30 months, as it corresponds to a precisely timed change in the maturation of several capacities relevant to audio-visual word learning. By this age, children show greater audio-visual sensitivities (Grieco-Calub and Olson, 2015), greater fast-mapping capacities (Vlach and Sandhofer, 2012;Wojcik, 2013), more detailed word form representations (Floccia et al., 2014) and higher receptive and expressive vocabularies (Ganger and Brent, 2004;McMurray, 2007). Building on the same design as the one used by Havy et al. (2017), we ask (1) whether 30-month-old children are able to establish new lexical representations based on visible speech information alone, and (2) whether information from either auditory or visual modalities can be part of representations through cross-modal translation of the input. We reason that general achievements in different aspects of perception, language, and cognition at 30 months may positively influence how children attend to the visible correlates of speech during word learning. Alternatively, it may be the case that children are still encountering obstacles in navigating across the auditory and visual modalities. If so, children may experience difficulties in learning words visually and/or difficulties in recovering the auditory correspondents of the words which have been learned visually.

Participants
Forty monolingual French-learning children participated in the study at the University of Geneva (Table 1). All the children were healthy, full-term and with no known developmental delay or history of vision or hearing impairments. The children came from families living in Geneva and were exposed to French for more than 80% of the time. All were recruited through birth records and enrolled with their parents' consent in accordance with the recommendations of the Ethics Committee of the University of Geneva. The children's families received a small gift by way of compensation. An additional 26 children were tested but excluded from the analysis due to excessive fussiness/crying (n = 11), failure to contribute to each test condition (n = 4), calibration issue (n = 3) or poor tracking ratio (n = 8). The tracking ratio was defined as the amount of time the eye-tracker recorded the gaze coordinates over the entire task and was deemed unreliable if lower than 30%.
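The tracking-ratio criterion can be illustrated with a minimal sketch. It assumes, hypothetically, that the eye-tracker output is a list of gaze samples in which lost samples are marked as None; the names tracking_ratio and is_reliable are illustrative and are not part of the software used in the study.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical sample format: (x, y) gaze coordinates, or None when tracking was lost.
Gaze = Optional[Tuple[float, float]]


def tracking_ratio(samples: Sequence[Gaze]) -> float:
    """Proportion of samples for which gaze coordinates were recorded."""
    if not samples:
        raise ValueError("at least one sample is required")
    valid = sum(1 for sample in samples if sample is not None)
    return valid / len(samples)


def is_reliable(samples: Sequence[Gaze], threshold: float = 0.30) -> bool:
    """Apply the 30% criterion: a ratio below the threshold is deemed unreliable."""
    return tracking_ratio(samples) >= threshold


# Toy recording: 2 valid samples out of 8 (ratio 0.25), below the 30% criterion.
toy = [(100.0, 200.0), None, None, None, (110.0, 190.0), None, None, None]
print(tracking_ratio(toy), is_reliable(toy))
```

With equally spaced samples, the proportion of valid samples is equivalent to the proportion of task time with recorded gaze.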

Speech Stimuli
The speech stimuli consisted of four pairs of French-sounding words with a CVC structure (consonant-vowel-consonant: /byp/-/var/, /rik/-/fal/, /fyf/-/gel/, /mum/-/tit/). The pseudo-words were selected to afford maximum distinctiveness in both the auditory and visual modalities (see Binnie et al., 1974; Jackson et al., 1976 for phonetic confusion matrices). These contrasted by at least two features in each segment (manner, place and voicing for the consonants, backness, height and roundness for the vowels) and could be easily discriminated by 30-month-old children (Kuhl and Meltzoff, 1982; Patterson and Werker, 2003; Desjardins and Werker, 2004; Werker and Curtin, 2005; Pons et al., 2009; Yeung and Werker, 2013; Tsuji and Cristia, 2014; Streri et al., 2016). The pseudo-words were recorded by a native French female speaker in a child-friendly form of directed speech. The pseudo-words were produced in a carrier phrase: determiner ('un') + pseudo-word, so as to highlight the referential status of the word (Fennell and Waxman, 2010) and to ensure that children began attending to the speech information prior to the word onset. Three tokens of each pseudo-word were selected and matched for duration and intonation contour. Two of them (two for each pseudo-word) were used for the familiarization, learning and test phases and another one (one for each word) was used solely for the learning and test phases.
All speech stimuli were recorded using a Sony HDR-CX730 video camera at 50 frames per second. Digital capture and editing were done using Audacity (Audacity, version 2.0.5) and Final Cut Pro (Final Cut Pro, version 7.0.3). For each word, three media sequences were created: an audio-visual sequence, an auditory-only sequence (sound stream with the video stream removed), and a visual-only sequence (video stream with the sound stream removed). The mean duration of audio-visual sequences was 1208 ms (SD = 58 ms, range = 1125-1291 ms). The auditory stream was edited to ensure similar sound levels across the stimuli (60 dB). The video sequences were cropped to remove all external features above the hairline and below the neck of the talking face. The background detail of the talking face was replaced by a uniform light gray background. The videos were positioned at the center of the screen, with a display size of approximately 16° × 16° of visual angle at a viewing distance of 60 cm.

Object Stimuli
Four pairs of novel objects were created using Photoshop (Adobe Photoshop CS4, version 11.0) and Final Cut Pro (Final Cut Pro, version 7.0.3). All objects were colorful and had similar levels of detail. Within each pair, the two objects differed by about 55.86% of their RGB value, brightness and shape [pair 1: 58.08%, pair 2: 45.92%, pair 3: 52.04%, pair 4: 67.40% (Resemblejs, version 2.2.0)] and were easily discriminable. To sustain the children's visual interest, each object was presented against a black background and rotated along a vertical axis for 2750 ms (Johnson, 2010). The objects were presented alone at the center of the screen for the learning phase and together, side by side, for the test phase. From a viewing distance of 60 cm, the objects subtended approximately 16° × 16° of visual angle for the learning phase and 10° × 10° for the test phase. There was a gap of about six visual degrees between the objects when tested. Each pair of objects was randomly assigned to a unique pair of pseudo-words (pair 1: objects 1 and 2 with pseudo-words 1 and 2; pair 2: objects 3 and 4 with pseudo-words 3 and 4; pair 3: objects 5 and 6 with pseudo-words 5 and 6; pair 4: objects 7 and 8 with pseudo-words 7 and 8, Figure 1). A smooth, undulating shape with a display size of 3° × 3° of visual angle was used as an attention-getter.
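For readers who wish to relate the reported visual angles to on-screen sizes, the following sketch applies the standard geometric relation size = 2 × distance × tan(angle / 2). The function name and the printed labels are illustrative assumptions; only the angles and the 60 cm viewing distance come from the text.

```python
import math


def visual_angle_to_cm(angle_deg: float, distance_cm: float) -> float:
    """Physical extent (cm) of a stimulus subtending angle_deg at distance_cm."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


VIEWING_DISTANCE_CM = 60.0  # viewing distance reported in the text

# Angles reported in the text; the labels are only for display.
stimuli = [
    ("learning-phase object", 16.0),
    ("test-phase object", 10.0),
    ("gap between test objects", 6.0),
    ("attention-getter", 3.0),
]

for label, angle in stimuli:
    size_cm = visual_angle_to_cm(angle, VIEWING_DISTANCE_CM)
    print(f"{label}: {angle:g} deg -> {size_cm:.1f} cm")
```

At small angles the extent in centimeters is close to the angle in degrees at this viewing distance, because 60 cm × tan(1°) is roughly 1 cm.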

Apparatus
The visual stimuli were presented on a 22-inch Dell E2209W monitor with a resolution of 1680 × 1050 pixels and a refresh rate of 60 Hz. The auditory stimuli were presented through left-right loudspeakers at a conversational level. Calibration procedure and stimuli presentations were run using a Dell Latitude E6520 laptop. Data were monitored using I-view (I-view, version 2.8.26) and Experiment Center (Experiment Center, version 3.2.17) native to SMI (SensoMotoric Instruments GmbH, Teltow, Germany). The children's eye-movements were recorded by means of an SMI RED500 eye-tracking device with a sampling rate of 60 Hz.

Procedure
The study was conducted in a dimly lit, sound-attenuated room at the University of Geneva. The children were seated on a caregiver's lap at a viewing distance of approximately 60 cm from the eye-tracker monitor set-up. The caregiver was instructed to keep their eyes closed, and not to talk, point to the screen or influence the child's attention. The session was initiated by a five-point calibration routine, where a spinning wheel was shown individually at five points on the screen: one at each corner and one at the center. The calibration procedure was gaze-contingent and was repeated as necessary.
FIGURE 1 | The four word-object pairs used. The word-object association is counterbalanced across participants.
Following a successful calibration, the children were given four experimental trials. Each trial consisted of three phases: (i) a pre-familiarization phase, (ii) a learning phase and (iii) a test phase (Figure 2). The children completed the three phases of one trial before moving to the next trial.
(i) During the pre-familiarization phase, the children were introduced to two pseudo-words in their audio-visual mode (seeing and hearing a talking face). This audio-visual presentation of the pseudo-words was intended to attract the children's attention to the multisensory aspect of speech. This pre-familiarization potentially provided specific information that could facilitate learning and support cross-modal recognition of the target words. However, there was no evidence of such an effect in Havy et al. (2017), as the infants failed to learn in the visual modality. Besides, care was taken to ensure that the children would not consider the pre-familiarization as a hint for cross-modal recognition. The pre-familiarization was kept to a minimum, and even shortened in relation to the original design (Havy et al., 2017). This was motivated by evidence from categorization studies, which showed that cross-modal influences of auditory speech information on visual object processing are only evident after a certain number of trials (Althaus and Plunkett, 2016). Here, the pseudo-words were uttered only twice in alternation, with a different realization each time (pseudo-word 1: token 1, pseudo-word 2: token 1, pseudo-word 1: token 2, pseudo-word 2: token 2). Variability in realization was expected to help infants identify the relevant contrastive dimension of the words being learned (Rost and McMurray, 2009).
(ii) Immediately after the pre-familiarization, the children entered a lexical learning phase. During the lexical learning, the same two pseudo-words were presented in association with two distinct objects. One group of children experienced the words in the auditory modality (hearing the sound but seeing a black screen, Figure 2). Another group experienced the words in the visual modality (seeing the silent talking face). Each word immediately preceded the object's appearance. To foster learning and highlight the lexically relevant phonological information, the learning phase contained three different tokens of each pseudo-word, each of which was played twice (two tokens from the pre-familiarization phase and a novel one). The object pairs were arranged in the following order: three iterations of one pair followed by three iterations of the other, two iterations of the first pair followed by two iterations of the other and, lastly, one iteration of each pair. This order was intended to engage the infants' selective attention on one object at a time and thus facilitate learning. Details of the learning phase progression are as follows: pseudo-word 1: token 1 - object 1, pseudo-word 1: token 2 - object 1, pseudo-word 1: token 3 - object 1, pseudo-word 2: token 1 - object 2, pseudo-word 2: token 2 - object 2, pseudo-word 2: token 3 - object 2, pseudo-word 1: token 1 - object 1, pseudo-word 1: token 2 - object 1, pseudo-word 2: token 1 - object 2, pseudo-word 2: token 2 - object 2, pseudo-word 1: token 3 - object 1, pseudo-word 2: token 3 - object 2, with a 1 s interval between each pair. During the interval, and in order to sustain attention throughout the learning phase, children saw a smooth, undulating shape at the center of the screen.
(iii) Immediately after the learning phase, the test phase was initiated. The test phase began with the presentation of the two previously seen objects, side by side and in silence, for 4 s (pre-naming period). Next, both objects disappeared and one of them was labeled three times, each time with a different realization (pseudo-word 1: token 1, pseudo-word 1: token 2, pseudo-word 1: token 3). After labeling, both objects reappeared side by side in silence for 4 s (post-naming period). The modality of labeling varied depending on the test condition. In the 'same modality' test condition, labeling occurred in the same modality as the one used at learning: i.e., auditory after auditory learning (hearing the pseudo-word and seeing a black screen, Figure 2A), and visual after visual learning (seeing the silent talking face). In the 'cross-modality' test condition, labeling occurred in the other modality to the one used at learning: i.e., visual after auditory learning (Figure 2B); and auditory after visual learning. Each trial tested one test condition only: either the 'same modality' or the 'cross-modality' condition. Out of the four experimental trials children received during the session, two of them tested the 'same modality' condition and the other two the 'cross-modality' condition. Sessions lasted approximately 20 min.

Counterbalancing
Participants of each learning group were assigned randomly to one of three protocols. Protocols were made up by varying the order of presentation of each word-object pair. For each word-object pair, the modality of labeling during the test phase ('same modality' vs. 'cross-modality') and which of the two objects was labeled (object A vs. object B) were balanced. All protocols started with a 'same modality' test trial to facilitate the children's understanding of the task, followed by three other trials testing either condition ('same modality' vs. 'cross-modality') in a different order.
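The presentation order of the learning phase described in (ii) can be reproduced with a short sketch. The function name and the tuple layout (pseudo-word, token, object) are illustrative assumptions; the order itself follows the sequence listed in the text.

```python
def learning_sequence():
    """Return the 12 (pseudo-word, token, object) presentations of one learning phase.

    Blocks follow the order given in the text: three iterations per pair,
    then two, then one, always starting with pseudo-word 1.
    """
    # Tokens used in each block; together, each token of each word occurs twice.
    blocks = [[1, 2, 3], [1, 2], [3]]
    sequence = []
    for tokens in blocks:
        for word in (1, 2):
            for token in tokens:
                # Pseudo-word 1 is paired with object 1, pseudo-word 2 with object 2.
                sequence.append((word, token, word))
    return sequence


for word, token, obj in learning_sequence():
    print(f"pseudo-word {word}: token {token} - object {obj}")
```

Listing the blocks explicitly makes it easy to check that every token of every pseudo-word is presented exactly twice.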

Data Analyses
Analyses were performed using BeGaze (BeGaze, version 3.2.28). Data were analyzed with respect to two areas of interest (AOI): one corresponding to the location of the target object, the other corresponding to the location of the distractor. AOIs were defined by dividing the entire screen into two equal parts. We

Data Cleaning
Data cleaning consisted of a series of six filters successively applied to the initial dataset. The dataset initially included 176 trials for 44 participants, and upon filtering consisted of 137 trials for 40 participants (auditory learning: 68/84 trials; visual learning: 69/76 trials). Filters were applied in line with Havy et al. (2017). We first trimmed ten trials from the initial dataset in which the children were not fixating on the monitor during the familiarization and learning phases (Filter 1: auditory learning: 6/88 trials; visual learning: 4/88 trials). We then removed six visual test trials (auditory learning: cross-modality; visual learning: same modality) in which the children were not looking at the model during the visual-only naming period of the test phase (Filter 2: auditory learning: 4/44 trials; visual learning: 2/44 trials). Based on criteria commonly used in two-choice word learning and word recognition tasks (Swingley and Aslin, 2000;Havy et al., 2017), we discarded 11 additional trials in which children were not fixating on the monitor during the pre-naming and/or post-naming periods of the test phase (Filter 3: auditory learning: 7/88 trials; visual learning: 4/88 trials). We then controlled for biases in spontaneous object preferences and identified three trials in which children attended to either one of the objects during the pre-naming period (auditory learning: 2/88; visual learning: 1/88). These trials were included in the analyses, as the bias did not last throughout the post-naming period (Filter 4, Delle Luche et al., 2015). On the remaining 149 trials, we screened for atypical data points falling outside normality. We considered data points larger than 2 SD from the mean as outliers (Fernald et al., 2006) and disregarded four additional trials (Filter 5: auditory learning: 1/88 trials; visual learning 3/88 trials). 
The exclusion of these four trials resulted in the exclusion of the four corresponding participants, who no longer contributed to the experimental conditions (Filter 6: auditory learning: n = 1; visual learning n = 3; Fernald et al., 2006). The exclusion of these four participants resulted in the removal of eight trials (Filter 6: auditory learning: 2/88 trials; visual learning: 6/88 trials). A total of 39 trials (auditory learning: 20/88 trials; visual learning: 19/88 trials) were thus removed from the original dataset.
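The trial counts reported for the successive filters can be cross-checked with a small tally. This is only an arithmetic sketch: the dictionary layout and variable names are assumptions, while the numbers are those given in the text (Filter 4 removed no trials and is therefore omitted).

```python
# Trials in the initial dataset, per learning group (44 participants x 4 trials = 176).
initial = {"auditory": 88, "visual": 88}

# Trials removed by each filter, as reported in the text.
removed = {
    "Filter 1": {"auditory": 6, "visual": 4},
    "Filter 2": {"auditory": 4, "visual": 2},
    "Filter 3": {"auditory": 7, "visual": 4},
    "Filter 5": {"auditory": 1, "visual": 3},
    "Filter 6": {"auditory": 2, "visual": 6},
}

removed_per_group = {
    group: sum(counts[group] for counts in removed.values()) for group in initial
}
remaining_per_group = {
    group: initial[group] - removed_per_group[group] for group in initial
}

# Trials left after Filters 1-3, i.e., the set screened for outliers.
after_filter_3 = sum(initial.values()) - sum(
    sum(removed[name].values()) for name in ("Filter 1", "Filter 2", "Filter 3")
)

print("removed:", removed_per_group, "total", sum(removed_per_group.values()))
print("remaining:", remaining_per_group, "total", sum(remaining_per_group.values()))
print("after Filters 1-3:", after_filter_3)
```

The tally agrees with the totals stated in the text: 39 trials removed, 137 retained (68 auditory, 69 visual), and 149 trials remaining before outlier screening.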

Mixed Effects Model
Statistical analyses were performed using SPSS (SPSS, version 21).
To assess the contribution of learning and test conditions to the children's performance, we ran a linear mixed effects model, using the percentage of target-looking time per trial and per child as a dependent measure. As fixed effects, we entered the learning condition, the test condition and the naming period in the model, as well as all interactions between these three predictors. The learning condition referred to the modality of labeling at learning ('auditory' vs. 'visual'). The test condition referred to the modality of labeling during the test, namely the same modality as the one used during learning ('same modality' test) or the other modality ('cross-modality' test). The naming period corresponded to the period of time prior to labeling ('pre-naming' period) and after labeling ('post-naming' period) of one of the two objects. The random part of the model initially included random intercepts for participants and items and random slopes which allowed for the differing effects of the naming period and the test conditions across participants. The inclusion of random slopes typically corrects Type I error rates and ensures that the results are not driven by a restricted set of participants or items (Baayen et al., 2008; Jaeger, 2008). However, these terms were subsequently removed due to a lack of convergence. The results reported here stem from an intercept-only model. The model was fitted using maximum-likelihood estimation. Estimates, standard errors and t-values are reported, with |t| > 2 being interpreted as significant. T-tests were also performed to compare the mean proportion of looking times (averaged over the trials for each condition) against chance (set at 50%, since each response involved a choice between two equally probable possibilities).
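The dependent measure and the comparison against chance can be sketched as follows. This is a minimal illustration using only the Python standard library, not the SPSS analysis reported here; the function names and the toy values are hypothetical, and the sketch returns only the t statistic and degrees of freedom, without a p-value.

```python
import math
import statistics


def target_percentage(target_ms: float, distractor_ms: float) -> float:
    """Percentage of target looking: 100 * Target / (Target + Distractor)."""
    total = target_ms + distractor_ms
    if total <= 0:
        raise ValueError("no looking time recorded on either object")
    return 100.0 * target_ms / total


def one_sample_t(values, chance: float = 50.0):
    """One-sample t statistic and degrees of freedom against a chance level."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return (mean - chance) / (sd / math.sqrt(n)), n - 1


# Hypothetical per-child looking times (ms) in the post-naming period.
toy_looks = [(2400, 1200), (1900, 1700), (2600, 1000), (1500, 1800), (2200, 1300)]
percentages = [target_percentage(t, d) for t, d in toy_looks]
t_value, df = one_sample_t(percentages)
print(f"t({df}) = {t_value:.2f}")
```

A baseline-adjusted score of the kind plotted in looking-time studies would simply subtract the pre-naming percentage from the post-naming percentage for the same trial.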
The results of the mixed effects model yielded a significant main effect of the naming period [β = 13.69%, SE = 4.70%, t(223.94) = 2.91, p < 0.01]; a significant naming period × test condition interaction [β = −15.65%, SE = 6.69%, t(223.94) = −2.34, p = 0.02]; and a significant naming period × learning condition × test condition interaction [β = 19.51%, SE = 9.50%, t(223.94) = 2.05, p = 0.04] (Table 2).
Other effects and interactions did not reach significance (all ts < 2). Wald Z statistics revealed that the variation in the participants' (Wald Z = 0.78, p = 0.44) and items' (Wald Z = 1.74, p = 0.08) intercepts was not confounded by our effects of primary theoretical interest.
Table 2 note: Parameter estimates include the mean performance (Mean) in the different conditions, the estimated coefficient (Estimate) of the fixed effects and t-test statistics, the Variance of the random effects and Wald Z statistics. Planned comparisons test naming effects (naming period: pre-naming vs. post-naming) in each learning (auditory vs. visual) and test (same modality vs. cross-modality) condition. Significant main effects and interactions are in bold.
FIGURE 3 | Mean proportion of target-looking times [Target/(Target + Distractor)] after adjusting for baseline preferences (post-naming period minus pre-naming period). Positive values indicate greater looking at the target object rather than the distractor upon naming. Individual data points are overlaid on group means for the auditory learning group (A) and the visual learning group (B) in the two 'same modality' and the two 'cross-modality' test trials. Each participant contributes at least one data point in each test condition ('same modality' vs. 'cross-modality'). This makes a total of 68 observations for the auditory learning group (same modality: 34/42, cross-modality: 34/42) and 69 observations for the visual learning group (same modality: 35/38, cross-modality: 34/38). Error bars represent the standard deviation from the mean.

DISCUSSION
The purpose of the current study was to identify two mechanisms whereby visible speech might become part of lexical representations. First, we asked whether 30-month-old children are able to use visible speech alone to guide lexical learning. Secondly, we examined whether information from either auditory or visual modalities could be part of new lexical representations through cross-modal translation of the input.
To test this, we used the same word-learning task as in Havy et al. (2017). Children were taught new lexical mappings in either auditory or visual modalities, then tested for recognition either in the same modality as at learning ('same modality' test condition) or in the other modality ('cross-modality' test condition). First, our results revealed that 30-month-old children are able to form new lexical mappings after a short auditory exposure to the word forms. This pattern is consistent with previous evidence documenting early auditory word-learning capacities in different laboratory settings (Werker and Curtin, 2005; Havy and Nazzi, 2009; Vouloumanos and Werker, 2009; Yoshida et al., 2009; Havy et al., 2017). Of primary interest, our results further revealed that 30-month-old children are able to learn new lexical mappings solely based on visible speech information. This finding is in line with Havy et al. (2017), who showed that in adults, visible speech may be stored directly from the input without support from the auditory domain. Critically, the current finding, in conjunction with Havy et al. (2017), revealed a transition in the development of this mechanism. In particular, it showed that, unlike 18-month-olds who primarily attend to the auditory speech signal, 30-month-olds exploit either source of information. This suggests that visible speech becomes increasingly informative for lexical learning.
With regard to our second question, our results demonstrated that visible speech can be part of lexical representations, even after an auditory-only experience of the words. This pattern fits well with Havy et al. (2017), who showed cross-modal translation of the auditory input at 18 months and in adults. However, unlike in adults, auditory recognition was not found for visually learned words. This suggests that despite greater sensitivities to the visible correlates of speech, representations of visually learned words are still not adult-like.
To summarize, the current evidence underscores the existence at 30 months of two individual mechanisms for encoding visible speech into the lexicon: one building directly on the visually available information and another building on cross-modal translation of the auditory input. This work also highlights an important milestone in how children appreciate visible speech information. While younger children incorporate visible speech into the lexicon only indirectly, older ones use both a direct and an indirect processing route. This distinction, which highlights the contribution of visible speech, also raises important questions about what could influence the observed developmental change.
Firstly, it is possible that the observed change could be attributed to a general improvement in audio-visual speech perception. This is supported by accumulated evidence showing greater responsiveness to visible speech information in normal and adverse listening conditions (Ross et al., 2011) and greater visual capture of the McGurk effect over the course of development (McGurk and MacDonald, 1976; Desjardins and Werker, 2004; Sekiyama and Burnham, 2008).
However, broader perceptual reorganizations occur concurrently that may also influence performance. During the 2nd year of life, children show an increased interest in a broad range of visual stimuli (e.g., objects: Bornstein and Arterberry, 2010; Zamuner et al., 2014; faces: Frank et al., 2014). This greater precision in processing visual information may also promote the use of visible speech. Yet, initial interest in visual information is not the same in all situations and is substantially higher when visual input is accompanied by speech sounds (Waxman and Braun, 2005; Ferry et al., 2010) or non-speech sounds that are given a clear communicative function (tones: Ferguson and Waxman, 2016; conspecific calls: Perszyk and [...]) than non-speech sounds that have no linguistic interpretation (Robinson and Sloutsky, 2004). This suggests that multisensory associations are constrained with regard to the linguistic nature of the information under consideration, which in turn challenges the possibility that the observed shift is linked to domain-general mechanisms.
This leads us to move beyond considerations of perceptual sensitivities alone and to identify what could possibly influence the use of visible speech in the lexical domain. Several achievements are noteworthy. One of them concerns the overall attention devoted to the word form. Early on, infants possess remarkable discrimination capacities. However, there is considerable variability in how they appreciate the auditory detail of word forms as they associate words with objects (Werker et al., 2002; Havy and Nazzi, 2009). As they advance in age, children construct more accurate phonetic representations of words (Floccia et al., 2014). These capacities, which have been documented for the most part in the auditory speech domain, may also extend to the visible speech domain. Future studies will aim to test this possibility by considering how children visually treat words that are minimally distinct.
Another possibility is that children form different expectations about the type of signal that can be accepted as a word. For example, there is evidence that by 30 months, children are less likely to accept manual gestures as words (Sheehan et al., 2007; Suanda et al., 2013). This decline in receptivity to gestural labels, in conjunction with our finding of an increased sensitivity to visible speech, suggests a different selectivity for the type of visual signals that are primarily linked to the lexicon. Future research will have to test this possibility by comparing how children learn from visible speech and manual gestures in the same design.
Finally, children's emerging ability to produce words may change the emphasis and weight given to visible speech. During the 2nd year of life, lexical production dramatically increases and becomes more accurate (Fenson et al., 1994; Ganger and Brent, 2004). Moreover, studies document that infants (Young et al., 2009; Yeung and Werker, 2013; Altvater-Mackensen and Grossmann, 2015; Streri et al., 2016) and children (Desjardins et al., 1997) are more sensitive to the visible speech information they can appropriately produce. This functional link with production and its feedback to perception may influence children's selective attention to visible speech. Future research will be aimed at exploring visual word learning when articulators are temporarily restrained, or more permanently impaired, as in individuals with cleft palates.
In total, the current work demonstrates that children attend to visible speech more reliably as they establish new lexical representations. But it is important to note that, unlike adults, they show cross-modal recognition only after auditory learning. This indicates that the representations/mechanisms involved during auditory and visual learning may be substantially different. What could explain such differences? In the literature, there is evidence that auditory speech and visible speech provide different amounts of information about the identity of the signal. While auditory speech has only one visual correspondence, visible speech can be matched with more than one auditory template. This uncertainty in the auditory translation of the visual input may place a particular demand on children who have to keep several alternative possibilities in their memories. Future research will need to evaluate this claim by manipulating the number of auditory candidates for the same visual input.
In line with the above, it is useful to consider the methodological limitations of the task. In everyday life, it is very common to hear a sound without seeing the corresponding articulatory movement, yet it is much less common to watch a silent face. In keeping with this, children in the current task noticed the absence of sound but did not react to the absence of a face. The lack of naturalness of the visual learning condition may be detrimental to performance, especially at a younger age. One direction for future studies could be to modify this condition by simultaneously playing the corresponding auditory signal at a low signal-to-noise ratio.
As part of the methodological limitations, it is also important to note the influence of the pre-familiarization phase. Prior to lexical learning, children were briefly introduced to the audiovisual form of the words. This pre-familiarization may have fostered lexical learning and assisted in the establishment of a recoverable cross-modal representation. But the audio-visual experience was short and there was no evidence of cross-modal recognition upon visual learning. Besides, the literature reports that, in adults learning word-object associations, audio-visual experience with a word may either facilitate or hinder auditory and visual recognition of this word (Bernstein et al., 2013; Eberhardt et al., 2014). It is therefore unclear whether and how the multisensory information available during pre-familiarization influences lexical learning at 30 months and, if so, whether this effect changes over the course of development. Future studies will need to investigate the role of the pre-familiarization phase in the current task, possibly by excluding this phase or by testing a condition using different words during pre-familiarization and lexical learning.
Along with this methodological limitation, it is important to note the very limited number of pseudo-words that has been considered. Future studies will need to replicate this work whilst using a broader set of pseudo-words. Language-specific effects may have also influenced the current finding. Previous evidence documents various sensitivities to the McGurk illusion depending on the language being used (Sekiyama and Burnham, 2008). Future studies will need to investigate how the sensory format of lexical representations develops across languages.
Another avenue for future research will be to determine the nature and number of representations that are at play. One possibility is that, at 30 months, children form two modality-specific representations: one based on the available input and another that is uniquely activated through cross-modal translation of the auditory input. If so, differences in performance between the auditory and visual learning conditions may reflect differences in the cross-modal transfer mechanisms, either in terms of presence/absence of the mechanisms or in terms of relative robustness. The visual-to-auditory transfer may emerge later than the auditory-to-visual transfer or be already present but not reliably used. Another possibility is that children form one single representation. This means that after auditory learning, the representation that is formed is cross-modal and contains information from both sensory modalities. After visual learning, the representation may be cross-modal or sensory-specified. If cross-modal, the observed differences may reflect differences in the robustness and confidence associated with the representations. If sensory-specified, the observed differences may reflect differences in the sensory format of the representations, from sensory-specified to cross-modal. It is also important to note that the nature and the number of representations involved may be susceptible to developmental change. Children may move from two separate representations to a more abstract cross-modal representation as they gain more language experience.

CONCLUSION
The current study reveals the contribution of visible speech to 30-month-olds' lexical representations. It provides evidence of two mechanisms for encoding visible speech, each associated with a different timing of emergence. Future studies will need to explore more closely the developmental transition observed between 18 and 30 months and how cross-linguistic differences in attending to visible speech influence the sensory format of early lexical representations.

ETHICS STATEMENT
This study was carried out in accordance with the recommendations of the ethics committee of Geneva University with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the ethics committee of Geneva University.

AUTHOR CONTRIBUTIONS
MH designed the study, tested the participants and wrote the paper. PZ provided insightful comments that improved the quality of the manuscript.

FUNDING
This work was funded by a SNSF grant (100014_159402) to MH and PZ.