Three Kinds of Rising-Falling Contours in German wh-Questions: Evidence From Form and Function

Zahner-Ritter, Katharina; Einfeldt, Marieke; Wochner, Daniela; James, Angela; Dehé, Nicole; Braun, Bettina

doi:10.3389/fcomm.2022.838955

ORIGINAL RESEARCH article

Front. Commun., 14 April 2022

Sec. Psychology of Language

Volume 7 - 2022 | https://doi.org/10.3389/fcomm.2022.838955

Three Kinds of Rising-Falling Contours in German wh-Questions: Evidence From Form and Function

Katharina Zahner-Ritter ^1,2^*

Marieke Einfeldt ²

Daniela Wochner ²

Angela James ²

Nicole Dehé ²

Bettina Braun ²

1. Phonetics, Department II, University of Trier, Trier, Germany
2. Department of Linguistics, University of Konstanz, Konstanz, Germany

Article metrics

View details

Citations

6,4k

Views

978

Downloads

Abstract

The intonational realization of utterances is generally characterized by regional as well as inter- and intra-speaker variability in f0. Category boundaries thus remain “fuzzy” and it is non-trivial how the (continuous) acoustic space maps onto (discrete) pitch accent categories. We focus on three types of rising-falling contours, which differ in the alignment of L(ow) and H(igh) tones with respect to the stressed syllable. Most of the intonational systems on German have described two rising accent categories, e.g., L+H^* and L^*+H in the German ToBI system. L+H^* has a high-pitched stressed syllable and a low leading tone aligned in the pre-tonic syllable; L^*+H a low-pitched stressed syllable and a high trailing tone in the post-tonic syllable. There are indications for the existence of a third category which lies between these two categories, with both L and H aligned within the stressed syllable, henceforth termed (LH)^*. In the present paper, we empirically investigate the distinctiveness of three rising-falling contours [L+H^*, (LH)^*, and L^*+H, all with a subsequent low boundary tone] in German wh-questions. We employ an approach that addresses both the form and the function of the contours, also taking regional variation into account. In Experiment 1 (form), we used a delayed imitation paradigm to test whether Northern and Southern German speakers can imitate the three rising-falling contours in wh-questions as distinct contours. In Experiment 2 (function), we used a free association task to investigate whether listeners interpret the pragmatic meaning of the three contours differently. Imitation results showed that German speakers—both from the North and the South—reproduced the three contours. There was a small but significant effect of regional variety such that contours produced by speakers from the North were slightly more distinct than those by speakers from the South. In the association task, listeners from both varieties attributed distinct meanings to the (LH)^* accent as opposed to the two ToBI accents L+H^* and L^*+H. Combined evidence from form and function suggests that three distinct contours can be found in the acoustic and perceptual space of German rising-falling contours.

Introduction

In spoken communication, speakers use intonation, primarily cued by f0, to mark sentence type (e.g., question vs. statement), information structure (e.g., focus and topic), information status (e.g., given vs. new information), hierarchical discourse structure, and attitudinal meaning (cf. Lehiste, 1975; Ladd, 2008; Prieto, 2015). The intonational realization of utterances is characterized by a lot of variability in f0—both within and across speakers (Atkinson, 1976; Gandour et al., 1991; Niebuhr et al., 2011; Grice et al., 2017), as well as across regional varieties, as shown, for instance, for German (Atterer and Ladd, 2004; Ulbrich, 2005; Braun, 2007; Mücke et al., 2009), or English (Grabe, 2004; Fletcher et al., 2005; Smith and Rathcke, 2020). Such variability includes, among others, the alignment of tonal targets [i.e., the position of low (L) and high (H) turning points with respect to the segmental string], their scaling (i.e., the tonal height), or the shape of intonational events (e.g., the slope or curvature). Category boundaries hence remain “fuzzy” and the question of whether and how the acoustic space can be split into distinct categories is non-trivial (cf. Arvaniti, 2019; Lohfink et al., 2019, for discussion). On the one hand, these categories need to do justice to the variability in the signal; on the other hand, they need to allow for generalizations. In the present paper, we contribute to this debate by addressing the distinctiveness of rising-falling f0 contours in German in an integrative approach that accounts for both form and function.

Previous research has demonstrated that nuclear intonation contours in German crucially differ with respect to f0-peak alignment (Kohler, 1991b; Grice et al., 2005; Niebuhr, 2022). The f0 peak may either precede the stressed syllable (H+L^*, early-peak accent), or follow it (L^*+H, late-peak accent), or be aligned within the stressed syllable (L+H^*, medial-peak accent). While in H+L^* the accentual movement falls onto the stressed syllable, L+H^* and L^*+H accents are considered rising accents, with the rising movement being perceptually very prominent (Baumann and Röhr, 2015; Baumann and Winter, 2018). In the present paper, we focus on the two rising accents L+H^* and L^*+H with a subsequent low boundary tone, along with a third rising-falling contour that lies between the two [henceforth (LH)^*, which is our own descriptive label], see Figure 1. An earlier study has highlighted the potential existence of a third category between L+H^* and L^*+H in German (termed “late-medial peak,” Kohler, 2005; cf. Niebuhr, 2022). The present study is designed to corroborate this preliminary evidence and sharpen the scope of the category, specifically with reference to rhetorical questions where (LH)^* was recently observed in production (Braun et al., 2019).

Figure 1

Schematic representation of three rising-falling contours in German realized on a four-syllable sequence *denn Mandalas* “PRT Mandalas;” gray shading indicates the stressed syllable (tonic syllable) with which the pitch accent is associated. **(A–C)** show the three different alignment configurations analyzed in the present study.

Rising-falling contours in German have received different phonological representations in intonational phonology (Kohler, 1991a; Mayer, 1995; Grice et al., 2005; Peters, 2014). Most of these descriptions distinguish between two kinds of rising-falling contours (Figures 1A,C), transcribed as L+H^* and L^*+H in the German ToBI system (Grice et al., 2005). These accents have been related to differences in meaning: new information vs. self-evident information/information conflicting a speaker's belief (Kohler, 1991b; Grice and Baumann, 2002; Niebuhr, 2007b; Kügler and Gollrad, 2015) or attitudinal information, such as sarcasm (Lommel and Michalsky, 2017). Recent production data on German rhetorical questions (Braun et al., 2019) and verb-first exclamatives (Wochner, 2021) reveal another accent type which falls between the two more established ones (Figure 1B). In this contour, both the low and the high tonal target are realized within the stressed syllable (Figure 1B), similar to the late-medial peak reported in Kohler (2005, p. 90). This alignment pattern differentiates (LH)^* from the more established accents L+H^* and L^*+H. Here, we take a fresh look at the acoustic and interpretative space of German rising-falling contours to discuss whether there is evidence to model three kinds of rising-falling contours: L+H^*, L^*+H, and (LH)^*. To this end, we employ combined evidence from imitative productions (form) and judgments on the connotative meaning (function) to determine whether (LH)^* is a pitch accent category on its own in German or, alternatively, whether (LH)^* might be a variant of one of the two other pitch accents (L+H^* or L^*+H). If there are three distinct pitch accents in speakers' mental grammars, we expect three distinct contours in production (form, Experiment 1) and different connotative meaning attributions in perception (function, Experiment 2). The overall aim of our study is hence to probe the fuzziness in German rising-falling contours and discuss ways to model them appropriately.

In section “Background”, we first provide background information on rising-falling contours in German in the different systems of intonational description, before we review approaches attempting to model variability in intonational contours from a broader perspective. Section “The Present Study: Rationale and Hypotheses” outlines the rationale of our study and our hypotheses. In sections “Experiment 1: Delayed Imitation Study” and “Experiment 2: Paraphrasing of connotative question meaning”, we present the two experiments before discussing the combined experimental results in section “General Discussion”.

Background

Rising-Falling Contours in German

According to the autosegmental-metrical (AM) theory of intonation (Arvaniti and Fletcher, 2020 for overview; Pierrehumbert, 1980; Ladd, 2008), pitch accents are represented by sequences of low and high tonal targets. In intonation languages, pitch accents are associated with the metrically stressed syllable, which is a lexical property of the word functioning as an anchor point for pitch accents (highlighted with gray shading in Figure 1). The actual alignment of the tonal targets with regard to the position in the stressed syllable varies, which results in different pitch accent types. Different models of German intonation, i.e., German Tones and Break Indices (GToBI, Grice et al., 2005), The Kiel Intonation Model (KIM, Kohler, 1991a; Niebuhr, 2022), Intonationsgrammatik des nördlichen Standarddeutschen “Intonation grammar of the Northern Standard German” (Peters, 2006, 2014) and Transcription of German Intonation: The Stuttgart System (STGTsystem, Mayer, 1995) have separated the space of possible pitch accents in rising-falling contours differently: The stylized realizations in Figures 1A,C are modeled after the tonal contrast in GToBI (Grice et al., 2005), which is widely used in research on German intonation and which is easily comparable to other ToBI systems in different languages. The realization depicted in Figure 1B, i.e., the contour found in rhetorical questions and exclamatives, (LH)^*, does not occur in this system. The STGTsystem (Mayer, 1995), an alternative version of GToBI developed in Stuttgart, lists only one accent type for rising-falling contours (L^*HL), whose phonetic description resembles the stylization in Figure 1C (L^*+H in GToBI). Peters (2014, pp. 45–48) describes the contour in Figure 1A as a fall (H^*L) and the contour in Figure 1C as L^*H. KIM (Kohler, 1991a; Niebuhr, 2022) is a contour-based account and distinguishes so called “medial peaks” from “late peaks,” whereby the medial peak can be projected onto H^*/L+H^* and the late peak onto L^*+H in an AM framework (cf. Niebuhr and Ambrazaitis, 2006; Niebuhr, 2007b). Importantly, KIM added a contour to its system, based on subsequent research done on the model (Kohler, 2005; Niebuhr, 2022, for overview): In particular, Kohler (2005) used semantic scales to show that the peak alignment continuum contains an additional category that falls between the medial (L+H^*) and late peak (L^*+H)—a contour called “late-medial peak” in KIM and described as (LH)^* in the present paper.

Recent production data further provide evidence for a consistent and meaningful use of (LH)^* as a pitch accent signaling rhetorical illocution in wh-questions (Braun et al., 2019). This accent was specific to rhetorical questions and did not occur in information-seeking wh-questions (Dehé et al., 2022). Crucially, it did not only occur in contexts of tonal crowding, but also when post-tonic syllables were available. (LH)^* has also been observed with verb-first exclamatives, while string-identical information-seeking questions were realized with a high-rising contour (Wochner, 2021). The occurrence of (LH)^* in these (non-canonical) utterance types, as opposed to information-seeking questions, suggests that it may be phonemic, rather than a phonetic variant of another accent. At the same time, (LH)^* has been described as an allophonic variant of the established accents described above, occurring in suboptimal segmental contexts, in which there are not enough syllables to realize the late-peak contour (cf. Mayer, 1995; Kügler, 2007; Peters, 2014), hence resembling the configuration in Figure 1B. In practice, (LH)^* realizations caused difficulties in transcription because they share the alignment of the L tone with L^*+H and that of the H tone with L+H^*. Taken together, the status of (LH)^* in German (i.e., whether it is phonetic or phonological) is by far not clear and it raises issues for the mapping between acoustic realization and phonological categories. The specific question of the present paper is how many distinct (meaningful) contours need to be modeled within the broad category of rising-falling contours in German.

A complicating factor for this question is that natural productions are not as clearly distinct as the stylizations in Figure 1 may suggest, but are subject to variability both within and across speakers (Atkinson, 1976; Gandour et al., 1991; Niebuhr et al., 2011; Grice et al., 2017; Lohfink et al., 2019; Roessig et al., 2019; Roessig, 2021).¹ Clearly, such individual variation blurs the boundaries of intonational categories. Regional variety, which is one of the foci of the present paper, additionally pushes the notion of categories to its limits as distributions between categories might overlap (Atterer and Ladd, 2004; Grabe, 2004; Gilles, 2005; Peters, 2006; Braun, 2007; Mücke et al., 2009): Indeed, a main discriminating aspect of the above-cited models on German intonation are their geographical origins. On a north-south axis, KIM (Kohler, 1991a) is located farthest in the north, followed by the system developed by Jörg Peters in Oldenburg (Peters, 2006, 2014). The STGTsystem (Mayer, 1995), in turn, originates in the South of Germany (Stuttgart). GToBI is a collaborative approach developed at universities in Saarbrücken, Stuttgart, Munich, and Braunschweig (Grice et al., 2005, p. 62). It is possible that the apparent differences in intonation labels and pitch accent contrasts are in part influenced by differences in regional variety (cf. Gilles, 2005; Peters, 2006; Kügler, 2007).

In fact, there is experimental evidence that Southern German speakers produce pitch accents in declarative sentences differently from Northern German speakers, at least in prenuclear position: Atterer and Ladd (2004), for instance, reported that Southern German speakers (from Bavaria) aligned prenuclear accentual rises significantly later than speakers from the North-West of Germany (cf. Braun, 2007; Mücke et al., 2008); in nuclear position, alignment differences went in the same direction but were not significant (Mücke et al., 2009). In his analysis of the tonal inventory of Swabian, an Alemannic variety in the South of Germany, Kügler (2007) shows that speakers predominantly produced L^*+H accents in declarative sentences (see also Kügler, 2004). Distributional analyses of Northern German speakers (Kiel), in turn, reveal medial peaks to occur more frequently than late peaks (Peters et al., 2005). This suggests that the distribution frequency in tonal inventories might also differ across regions (cf. Fitzpatrick-Cole, 1999; Leemann, 2012, on Swiss German), such that the acoustic space of rising-falling contours in Southern German speakers is shifted toward the right end of the spectrum [recall that the southern STGTsystem (Mayer, 1995) only accounts for one rising-falling contour]. In the present paper, we directly compare speakers from two different regions (North vs. South) on the three-way tonal alignment contrast.

Modeling Intonational Categories

The question of how phonological representations—typically thought of as distinct categories—and phonetic modification—typically understood as a gradual change—interrelate has been an issue of on-going debate (e.g., Ohala, 1990; Niebuhr, 2007a; Pierrehumbert, 2016; Arvaniti, 2019; Barnes et al., 2021; Roessig, 2021). It is uncontroversial that some form of generalization is necessary to systematize interfaces with other core areas, such as semantics or pragmatics. At the same time, clear-cut boundaries cannot be maintained given the variability in the speech signal and the fuzziness of the mapping between acoustic form and phonological category.

In intonational research, different tasks have been employed to study the relation between the continuous signal and intonational categories (cf. Prieto, 2012 for overview): Focusing on intonational form, identification and discrimination tasks have been used in classic categorical perception paradigms (Kohler, 1987, 1991b; Ladd and Morton, 1997; Schneider and Lintfert, 2003; Niebuhr, 2007b). Kohler (1991b), whose work is directly related to our question, showed categorical perception for early vs. medial peaks (i.e., H+L^* vs. L+H^*), two accents that differ in the direction of the accentual movement. The difference between the two rising-falling contours (medial vs. late peaks, i.e., L+H^* vs. L^*+H), in turn, was less clear-cut. Categorical perception results were similar for speakers from Northern and Southern Germany (Kiel vs. Munich, Kohler, 1991b, p. 149ff.). Beyond tonal alignment, the shape of the contour also seems to influence the categorical perception of rising-falling contours, leading to a more or less clear-cut perception between L+H^* and L^*+H (Niebuhr, 2007a, for effects of peak shape and intensity transitions); see also Barnes et al. (2021) for a study corroborating the relevance of the shape of the interpolation between L and H for the distinction between L+H^* and L^*+H in English. Another paradigm testing the distinctiveness in intonational form is imitation, which is based on the idea of a perception-production loop (e.g., Pierrehumbert and Steele, 1989; Braun et al., 2006; Dilley and Brown, 2007; Dilley, 2010; Chodroff and Cole, 2019b; Petrone et al., 2021). In imitation tasks, participants are typically presented with one stimulus at a time and have to imitate it. The productions are analyzed in terms of the overlap (or non-overlap) in the distributions of relevant parameters (such as tonal alignment) or overall shape. In imitation, task difficulty or working memory seem to affect outcome patterns: Braun et al. (2006), for instance, employed an iterative imitation paradigm in which speakers first imitated a set of randomly generated f0 tracks and then iteratively repeated their previous productions. They showed that speakers retained some detail in immediate imitation, which was lost, however, over successive repetitions. The authors argue for attractors in the perceptual space of intonation that function as a perceptual magnet. Participants in Chodroff and Cole (2019b), American English speakers, had to imitate one of eight nuclear contours and transfer the respective contour to a novel sentence with the same rhythmic structure, hence making generalization necessary. In their study, speakers primarily maintained the distinction between rising and falling contours, similar to the attractor contours in Braun et al. (2006). Petrone et al. (2021) showed that when working memory capacity is smaller, speakers have difficulties in reproducing contours correctly: Specifically, speakers with high working memory capacity were more accurate in the imitation of phonological events, both for obligatory events (pitch accents and boundary tones) and optional events. In sum, the harder the imitation task (due to either task demands or cognitive capacities), the smaller the set of reproduced contours. A challenging imitation task hence seems to us an appropriate method for the question of whether there are three distinct rising-falling contours.

Studies that have addressed the functional distinction between intonational contours have employed semantic scales (e.g., Dombrowski, 2003; Kohler, 2005; Dombrowski and Niebuhr, 2010; Kügler and Gollrad, 2015; Wochner, 2021), free association tasks (Kohler, 1991b), acceptability judgment tasks (Baumann and Grice, 2006), or psycholinguistic methods such as eye-tracking (e.g., Braun and Biezma, 2019). Kügler and Gollrad (2015), for instance, showed that German listeners differentiated between a contrastive and a broad focus reading based on differences in the scaling of the H tone (the L tone did not affect perceptual ratings, but see Ritter and Grice, 2015). Based on a free association task, Kohler (1991b) reports that medial peaks were associated with information that was new to the discourse in declaratives and with an information-seeking notion in questions. Late peaks also signaled new information (similar to medial peaks) but also added attitudinal meanings, such as astonishment or self-evidence (see also Grice et al., 2005; Lommel and Michalsky, 2017). Kohler (2005) corroborated these findings using semantic differentials; the contour that falls between the medial and late peak, the late-medial peak, tended to be associated with unexpectedness or surprise. Braun and Biezma (2019) used an eye-tracking paradigm to investigate the contrastive nature of nuclear L+H^*, prenuclear L+H^*, and prenuclear L^*+H. They showed that listeners interpreted prenuclear L^*+H and nuclear L+H^* contrastively (more fixations to a referent that contrasted with the accented word) as opposed to prenuclear L+H^*. Here, we use a combined approach of form and function to understand the sources of fuzziness surrounding rising-falling accents in German and to model it successfully.

The Present Study: Rationale and Hypotheses

To test the distinctiveness of the three kinds of nuclear rising-falling contours in German [L+H^*, L^*+H, (LH)^*; cf. Figure 1], we employ wh-questions and make use of two tasks: a delayed imitation task and a free association task. The delayed imitation task requires a kind of storage (beyond access to echoic memory), and hence taps into phonological representations of intonational contours. The working memory model by Baddeley and Hitch (1974) assumes that acoustic information decays after ~2 s (phonological short-term memory, cf. Plomp, 1964; Gathercole et al., 1997), unless it is refreshed by a sub-vocal articulatory rehearsal process (Baddeley and Hitch, 1974; Baddeley, 1986, 2003). Moreover, Crowder (1982) reports that in terms of discrimination accuracy for vowel formants “[t]he auditory memory loss seems to be asymptotic at about 3 s” (Crowder, 1982, p. 197). Hence, there seems to be a threshold of about 2 to maximally 3 s up to which acoustic information is readily available and after which acoustic information decays. Based on this threshold, we designed our delayed imitation task with a 2,000 ms delay and a following sine tone with a duration of 500 ms. The free association task seems to be the best-suited paradigm for our study since the functional scope of (LH)^* is not clear yet, which makes it hard to establish pre-defined connotative meanings required in other tasks.

For both experiments, speakers from Southern and Northern Germany were recruited in order to investigate regional variation (Mayer, 1995; Atterer and Ladd, 2004; Ulbrich, 2005; Braun, 2007; Kügler, 2007; Mücke et al., 2009). Speakers were allocated to either the Northern or the Southern German group according to where they were born and grew up. The Northern German group comprised speakers north of the Benrath line, an isogloss separating Low German and High German dialects (based on the High German consonant shift). The Southern German group comprised speakers from Baden-Wuerttemberg and Bavaria (south of the Speyer line, an isogloss that additionally separates Upper German dialects from Central German dialects), cf. Waterman (1991/1966).

In Experiment 1 (form) participants imitate three resynthesized nuclear rising-falling contours on wh-questions [L+H^*, (LH)^*, and L^*+H] in a delayed imitation paradigm addressing phonological processing (cf. Baddeley and Hitch, 1974; Crowder, 1982; Baddeley, 1986, 2003). Methodologically, we input a three-way alignment contrast, and analyze the productions of Experiment 1 holistically, using general additive mixed models (GAMMs, Wood, 2006, 2017) on time-normalized utterances. This method allows us to capture the f0 contours as a whole and compare when in time two contours differ from each other significantly (cf. Wieling, 2018; van Rij et al., 2019; Sóskuthy, 2021). Using GAMMs hence not only provides information about tonal alignment (Atterer and Ladd, 2004), but also about tonal onglides (Ritter and Grice, 2015; Roessig et al., 2019), f0 excursions and scaling, and the overall shape of the contour (Niebuhr, 2007b; Niebuhr et al., 2011; Barnes et al., 2012, 2013, 2021). GAMMs furthermore allow us to test for interactions between intonation condition and regional variety over time, hence informing us on whether regional variation affects the distinctions between contours differently. In that sense, GAMMs represents an ideal statistical technique to disambiguate the fuzzy data patterns existent in f0 contours in order to unravel the meaningful underlying structure of intonational phonology.

Participants' imitative productions will be informative on how many distinct contours we need to model in the acoustic space of German rising-falling contours: Three distinct rising-falling contours in the imitative productions of the speakers will provide evidence for (LH)^* as a third kind of rising-falling contour in German next to the two more established L+H^* and L^*+H contours, hence corroborating the three-way-contrast initially laid out in Kohler (2005) and also observed in Braun et al. (2019). Given that our delayed imitation task requires storage of the contours, the evidence would go beyond phonetic details and clearly speak in favor of phonological processing. If, on the other hand, speakers reproduce two contours in their imitative productions, this will provide evidence in favor of collapsing the range of rising-falling contours into two contours (cf. Braun et al., 2006, on English; Chodroff and Cole, 2019b). Reproduction of only one contour would suggest that the task is too hard (since there is plenty of independent evidence in favor of two rising-falling contours in German, see sections “Introduction” and “Background”). With respect to regional variation—although a direct comparison of studies reporting occurrence frequency is difficult, medial-peak contours (H^*/L+H^*) have been shown to be more frequent than late-peak contours (L^*+H) for Northern German speakers (Peters et al., 2005). For Southern German speakers, in turn, rising accents with a late L and H alignment have been reported to occur frequently (described as L^*+H in Kügler, 2004; see also Truckenbrodt, 2007, for the prenuclear position). Based on these differences in occurrence frequency, it is conceivable that L+H^* functions as a perceptual attractor (magnet) for Northern German speakers, while L^*+H serves this function for Southern German speakers, along the lines of what is known on magnets on the segmental level (Anderson et al., 2003; cf. Braun et al. (2006) and Roessig et al. (2019) for attractor-based accounts of intonation). Given that (LH)^* may be less strongly anchored in the intonational grammar due to its more restricted function and hence less frequent occurrence, it may be yet more prone to merger effects (cf. Braun et al., 2006). Under this assumption, we predict the distinction of contours to differ between regions, with a smaller distinction between L+H^* and (LH)^* in the North than in the South [L+H^* as merger with (LH)^*], and conversely, a smaller distinction between L^*+H and (LH)^* in the South than in the North [L^*+H as merger with (LH)^*].

Experiment 2 (function) tests whether the three rising-falling contours in wh-questions are interpreted differently. To this end, we conducted a qualitative study in which participants, different from the ones in Experiment 1, paraphrased the connotative meaning of the stimuli in Experiment 1 in their own words. The paraphrases of Experiment 2 were recoded into superordinate categories and analyzed using conditional inference trees (CTrees, Hothorn et al., 2006). This method allows us to test whether and how intonation condition and regional background affect participants' responses. We predict that if there are in fact three distinct contours, they will lead to different interpretations: Drawing on the available literature, we expect that L+H^* leads to descriptions relating to an information-seeking nature (Kohler, 1991b; Baumann and Grice, 2006; Braun et al., 2019), while L^*+H is expected to trigger descriptions related to contrast and/or attitudinal meanings (Grice et al., 2005; Niebuhr, 2007b; Lommel and Michalsky, 2017); (LH)^* is hypothesized to be interpreted as rhetorical (Braun et al., 2019), or to signal surprise or obviousness (Wochner, 2021), or unexpectedness (Kohler, 2005). In terms of regional variation, we cannot make strong predictions regarding meaning—recall that Kohler (1991b, p. 149ff.) showed constant semantic judgments for different contours across Northern and Southern German listeners. If anything, we expect a merger effect in terms of meaning for the most frequent accent type (L+H^* in Northern and L^*+H in Southern German speakers).

Experiment 1: Delayed Imitation Study

Methods

Participants

In total, 28 monolingual native German participants, half from Northern Germany (mean age = 25.7 years, SD = 5.0 years, 10 female, 4 male) and half from Southern Germany (mean age = 25.5 years, SD = 4.4 years, 1 diverse, 8 female, 5 male), who had not learned a second language before the age of six, took part in the imitation study. Speakers from the Southern German group spent most of their lives in Baden-Wuerttemberg (N = 14), while speakers in the Northern German group came from Berlin (N = 1), Brandenburg (N = 1), Hamburg (N = 1), Mecklenburg-Vorpommern (N = 1), Lower Saxony (N = 3), North Rhine-Westphalia (N = 2), and Schleswig-Holstein (N = 5), all north of the Benrath Line. Due to restrictions imposed by COVID-19, testing started in the lab for eight Southern German speakers and then was continued via the online platform SosciSurvey (https://www.soscisurvey.de, Leiner, 2018) for all Northern speakers and the six remaining Southern German speakers.

Materials

Four target wh-questions were constructed that consisted of the wh-word wer “who,” a monosyllabic verb, the particle denn, and a trisyllabic object noun with initial stress, see (1).

(1)

a. Wer heißt denn Melanie? (“Whose name is Melanie?”)
b. Wer spielt denn Libero? (“Who plays sweeper?”)
c. Wer malt denn Mandalas? (“Who draws/colors mandalas?”)
d. Wer trinkt denn Malibu? (“Who drinks Malibu cocktails?”)

Nouns with two post-tonic syllables were chosen to avoid tonal crowding (Prieto, 2011; Hanssen, 2017; Rathcke, 2017); also, their segments were as sonorous as possible, especially in the first two syllables, to ease f0 analysis. The propositions of the questions were chosen so that they did not elicit strong (positive or negative) feelings but were mainly perceived as neutral². The four questions were recorded by a female native speaker from Northern Germany (31 years at time of recording), who grew up with a Southern German parent and who is familiar with intonational phonology. She produced the wh-questions in two conditions: (i) with a nuclear L+H^* accent and (ii) with a nuclear L^*+H accent (see Supplementary Material S1 for acoustic analysis). She was instructed to focus on the alignment of the tonal targets. The recordings were then manipulated in three steps (splicing, duration manipulation, f0 manipulation) using Praat (Boersma and Weenink, 2016). First, splicing ensured that pitch accent realizations were not affected differently by the preceding part of the wh-question. To this end, for each item, the auditorily best “precontext” (wh-word, verb, particle) was selected. Likewise, the best productions of the object nouns (one for L+H^*, one for L^*+H for each item) were selected and cut at positive zero-crossings. Both parts were scaled to 63 dB. To reduce variability across items, the precontexts were manipulated in terms of duration using PSOLA resynthesis. This way, the constituents had an equal average duration for each item (in the three intonation conditions). The same was true for the three syllables of the noun. Second, the precontexts were cross-spliced to the nouns. Finally, the alignment of the tonal targets of the noun (L1: start of the f0 rise, H: f0 peak, L2: end of the f0 fall) was manipulated, based on the alignment of the naturally recorded stimuli for L+H^* and L^*+H and the values reported in Braun et al. (2019) for (LH)^*, see Table 1. Table 1 shows the locations of the three tonal targets (L1, H, L2) within the rising-falling contour (in the particle denn, the first, second or third syllable in the object noun). Percentages refer to the total duration of the respective unit, e.g., the f0 peak (H) occurred after 71% of first syllable of the noun in L+H^*, and after 94% in (LH)^*; for L^*+H, it occurred after 71% of the second syllable of the noun. Note that Figure 1 shows a visual representation of Table 1. The f0 values in the rising-falling contours were set at 166 Hz for L1, at 273 Hz for H, and at 170 Hz for L2 (based on the mean values in natural productions), leading to a pitch range of 8.6 semitones (st) for the rising part and 8.2 st for the falling part of the contour.

Table 1

	L1	H	L2
L+H*	In [n] from denn (22.0%)	In syllable 1 of noun (71.2%)	In syllable 2 of noun (69.6%)
(LH)*	In syllable 1 of noun (45.4%)	In syllable 1 of noun (94.4%)	In syllable 2 of noun (69.6%)
L*+H	In syllable 1 of noun (76.3%)	In syllable 2 of noun (71.1%)	In syllable 3 of noun (31.6%)

Alignment of tonal targets (L1, H, L2) in rising-falling contours in experimental stimuli; L+H^* and L^*+H values based on natural recordings, (LH)^* values based on Braun et al. (2019).

Percentages refer to the total duration of the respective unit.

The f0 contours of naturally produced L+H^* accents and naturally produced L^*+H accents (4 items each) were resynthesized into the three intonation conditions, leading to a total of 24 test stimuli (4 items × 3 target contours × 2 manipulation origins). We used two manipulation origins to exclude the possibility that spectral effects could have affected the imitations, which was not the case (see below). Natural recordings of (LH)^* were avoided because this contour might be realized with breathy voice in wh-questions (Braun et al., 2019), which might be a confounding cue. Furthermore, recording the contours at the ends of the continuum (Figures 1A,C) will allow us to resynthesize intermediate steps in future studies. In the present study, we start with three contours, the two more established L+H^* vs. L^*+H, and one intermediate contour (LH)^*. We further selected four additional wh-questions to be used as practice trials. They had the same syntactic structure but different target words (Thymian “thyme”, Komiker “comedian”, Kolibris “hummingbirds”, Tombolas “tombolas”). These questions were resynthesized into the more established accents L+H^* and L^*+H.

There were two experimental lists with a pseudo-randomized order of trials to avoid priming of contours between trials. Lists did not contain sequences with the same item or the same contour in a row. The second experimental list was a mirror list of the first list such that the first trial in list 1 was the last in list 2. This was done to avoid order effects. Experimental lists were randomly assigned to the participants. Prior to the 24 experimental trials, there were four practice trials to familiarize participants with the procedure and voice of the speaker.

Procedure

Each trial was initiated by a sine tone (at 300 Hz, 500 ms duration) to signal the beginning of the trial. Participants listened to the questions via headphones. Each target question was also orthographically displayed on screen. Participants were instructed to imitate the utterances as closely as possible with a special focus on their speech melody. They were told to choose a pitch register that appeared suitable for them. This was done to avoid a mimicry of pitch and vocal characteristics of the speaker.

Each utterance was played only once, followed by a 2,000 ms period of silence and a sine tone of 500 ms (presented pseudo-randomly at 450 or 150 Hz) before participants started to imitate the question. The sine frequencies meet the floor and ceiling register frequencies of the speaker who produced the stimuli; the sine tones were played to overwrite any acoustic trace that might be kept after the 2,000 ms silence. We used two different frequencies for the sine tone, in random order, so that participants could not anticipate and adapt to it. After participants had imitated the respective utterance, they pressed a key to proceed to the next trial. Recordings were done via the microphone of the participants' computers in the remote setting. In the lab setting, recordings were done with a head-set microphone (DPA 4088F) onto a MacBookPro in a sound-attenuated booth.

Data Processing and Statistical Analysis

Dataset

In total, we collected 672 sound files (28 participants × 24 imitated questions). Note that each sound file has one imitation. Nine files were excluded due to mispronunciations, bad sound quality, or a technical error on the online platform that led to data loss. The final data set for the analysis consisted of 663 sound files, see Table 2 for a breakdown of the distribution of files across different groups.

Table 2

	Northern German speakers	Southern German speakers
L+H* contour	108	112
(LH)* contour	111	111
L*+H contour	110	111
Sum	329	334
Total		663

Overview of imitated productions per group in final dataset of Experiment 1.

Data Processing

Sound files were first segmented semi-automatically using the software Web Maus (Kisler et al., 2017) with boundaries being manually adjusted according to standard segmentation criteria (Turk et al., 2006). The critical segments were [n] in the particle denn, and the first, second, and third syllables of the object nouns. Subsequently, f0 values were extracted for these four intervals using Prosody Pro (Xu, 2013).³ Figure 2 shows an actual imitation of one target question in the three conditions, along with the annotation. We used 50 measurements per time interval. To detect and remove f0-tracking errors, which primarily occurred in word-final fricatives and word-medial stops, we used a custom-made algorithm in Python that replaced likely octave jumps by “NA” so that they were excluded from the analysis. In R (R Development Core Team, 2015), raw f0 values were transformed into semitones to ease interpretation of (perceptible) differences across contours and downsampled (10 values per interval for statistical analysis).⁴

Figure 2

Imitative productions of the German target question *Wer trinkt denn Malibu*? ‘Who drinks Malibu?' in the three intonation conditions (vp19, Northern German group, female, 24 years). Top panel: L+H*, mid panel: (LH)*, bottom panel: L*+H. Tier 2 served as input tier for the extraction of f0 values; all other tiers are for illustration purposes only.

Statistical Analysis

We used GAMMs (Wood, 2006, 2017) to test the distinctiveness of the three different intonational contours. GAMMs were chosen as they allow for a direct comparison between f0 contours by modeling non-linear dependencies of a response variable (here f0) and different predictors (here intonation condition and region) over time via smooth functions. They use a pre-specified number of base functions of different shapes (Baayen et al., 2018; Wieling, 2018; van Rij et al., 2019; Sóskuthy, 2021). GAMMs also allow us to model interactions over time (e.g., condition × region), which test whether the distinctiveness of contours differs between speakers of Northern and Southern German (cf. van Rij et al., 2019, p. 8ff.; Wieling, 2018, p. 106ff.). For the model fitting of the GAMMs, we used the R package mgcv (Wood, 2011, 2017); the package itsadug was used to plot the model results (van Rij et al., 2017), which is essential to interpret model outputs.

The response variable was the f0 value (in st) at different time points (10 values per interval), which was roughly normally distributed. All models were corrected for autocorrelation in the f0 data using an autocorrelation parameter rho, determined by the acf_resid()-function from the package itsadug (van Rij et al., 2017).⁵ Models were initially fitted using the maximum likelihood (ML) estimation method in order to be able to compare models with different complexity (Sóskuthy, 2021, p. 16; Wieling, 2018, p. 89). We first tested whether the modeling of different curves for the three intonation conditions over time is warranted. Since this was the case, we then assessed the interaction between intonation condition and region (see below for details). Model fits were checked using gam.check() and the number of base functions (k) was adjusted if necessary. Also, models were re-run with the scaled t distribution (family = “scat”, closely following the suggestion in van Rij et al., 2019, p. 17) due to tailed residuals. All steps of the analyses can be found in the Supplementary Materials to this paper (http://doi.org/10.17632/yhv7nmjmgf.2).

Results

Figure 3 shows the raw data, i.e., the average f0 contours on time-normalized utterances (in st) of imitated productions in the different intonation conditions for Northern German (left) and Southern German speakers (right). Data from both manipulation origins (i.e., whether the contours were resynthesized from L+H^* or L^*+H) were collapsed since the resynthesis procedure did not affect the realization of contours (see analysis on Mendeley for details). Note that syllable durations in the imitated questions did not differ across intonation conditions (all p > 0.12).

Figure 3

Average f0 contours (in st) of imitative productions in the three different intonation conditions [L+H* in gray, (LH)* in orange, and L*+H in blue], for Northern German (left) and Southern German speakers (right). The x-axis displays the time-normalized questions (from [n] of the particle denn followed by the trisyllabic sentence-final object, e.g., *Mandalas*).

The initial GAMM included condition and region as parametric effects along with a smooth for the interaction of intonation condition over (normalized) time, s(Normtime, by = intonation condition), and factor smooths for participants and items. Model comparisons using the function compareML() revealed that this model was superior to a simpler model without the smooth for condition over time [ = 1323.09, p < 0.0001], corroborating the existence of different contours. We then assessed the interaction between intonation condition and region over time to test whether the distinction of contours differed across regions. To do so, we refitted the model including an interaction variable RegCond (6 levels, 2 regions × 3 intonation conditions). The interaction model had a better fit [ = 91.40, p < 0.001], indicating that the speakers from the North made different distinctions than speakers from the South. The best-fitting model was re-run with the scaled t distribution specified (family = “scat”). Based on this new model, we again determined a value for the rho parameter to account for autocorrelation. The outcome of this final model⁶ is visualized in Figure 4. It explained 68.5% of the deviance, see Supplementary Material S2.1 for details on model evaluation in terms of residuals.⁷

Figure 4

GAMM results. **(A)** Predicted f0 values in the three intonation conditions [L+H* in gray, (LH)* in orange, and L*+H in blue] for Northern German speakers (left panel) and Southern German speakers (right panel). **(B)** Predicted difference curves (pairwise comparisons between contours), for Northern German speakers (left panel) and Southern German speakers (right panel). The gray shading displays the 95% CI (confidence interval) of the predicted mean difference. The difference becomes significant if zero is not included in the 95% CI. This is marked by the vertical red lines.

Figure 4A shows the f0 contours in the three intonation conditions as predicted by the final GAMM, split by region (Northern German speakers are shown in the left and Southern German speakers in the right panel). The difference between two contours is directly displayed in so called difference curves, where f0 values in one condition are subtracted from f0 values in the other condition, see Figure 4B for the three pairwise contour comparisons for the two regions:

- Comparison L+H^*vs. (LH)^*. For speakers from Northern Germany, the imitative productions significantly differed already in the pre-tonic interval (the [n] of the particle denn), with L+H^* contours being slightly higher than (LH)^* contours. L+H^* furthermore had an earlier low turning point than (LH)^* in Northern German speakers (i.e., an earlier start of the rise), which accounted for a large difference in the stressed syllable. For speakers from Southern Germany, by contrast, the difference in alignment of the low turning point was less obvious and contours only differed for a small part of syllable 1 of the noun. The difference was around half a semitone, which is very subtle (Batliner, 1989). Similarly, a later alignment of the peak in (LH)^* than in L+H^* led to differences in the contour at the end of syllable 1 and the beginning of syllable 2 in the noun in the Northern German group. This difference was again less pronounced in the Southern German group. Overall, the distinction between L+H^* and (LH)^* seems to be larger for Northern German than for Southern German speakers (see Supplementary Material S.2.2 for interaction model). Importantly though, both speaker groups distinguished between the two contours.
- Comparison (LH)^* vs. L^*+. For both speaker groups, the two contours differed significantly in syllables 1 and 2 of the noun. The alignment of the low turning point was comparable but the (LH)^* had a steeper rise with an earlier peak than the L^*+H accent, which led to considerable differences in the stressed syllable and the post-tonic syllable. Supplementary Material S.2.2 show that regional differences are minor for this distinction.
- Comparison L+H^* vs. L^*+H. As expected, the imitative productions of the two established accents differed significantly in syllables 1 and 2 of the noun, with a later alignment of the low turning point and the peak for L^*+H as compared to L+H^*. The f0 of the L+H^* hence rose earlier than for L^*+H, leading to strong differences in realization. Interestingly, the contours deviated already in the pre-tonic syllable ([n] of the particle denn), similarly in both varieties. The distinction between contours was generally more pronounced for Northern German speakers than for Southern German speakers (see Supplementary Material S.2.2 for interaction model).

Taken together, the two more established accent types L+H^* vs. L^*+H are clearly distinct in their form, showing significant differences across both syllable 1 and 2 in both regional varieties (the difference was larger for Northern German speakers). Crucially though, the f0 contour in the (LH)^* condition is also significantly different from the f0 contour in L+H^* (syllable 1) and from the f0 contour in L^*+H (syllables 1 and 2)—for both speaker groups, but the distinction between L+H^* and (LH)^* is smaller for speakers from the South than for speakers from the North; in fact, these two contours are very similar in the productions of Southern German participants. The distinction between L^*+H and (LH)^* is more pronounced and clearly maintained in both regions, with differences in contours occurring both in syllables 1 and 2 of the object noun. The significant differences in contours across all pairwise comparisons suggest that speakers maintain a three-way contrast in rising-falling-contours, with regional variety modulating the extent of the distinction.

Interim Discussion

Experiment 1 tested the distinctiveness in realization of three rising-falling contours in Northern and Southern German speakers in a delayed imitation task that addressed phonological processing (Chodroff and Cole, 2019b; Petrone et al., 2021). Our findings show that all three contours are distinguished from each other. However, it is not unambiguously clear at this point whether participants imitated the tonal targets (i.e., pitch accent categories), a communicative function associated with the different contours (e.g., information-seeking vs. rhetorical question), or differences in perceived prominence. In prominence perception tasks, steeper slopes are judged more prominent than shallower ones (Rietveld and Gussenhoven, 1985; Baumann and Röhr, 2015; Baumann and Winter, 2018). Clearly, imitative productions had to be retrieved from stored representations, which may consist of aspects of tonal alignment properties, prominence, and meaning—possibly also a combination of the three.

Regional variety mediated the extent of the distinction between contours such that speakers from Northern Germany had more distinct productions than speakers from Southern Germany, especially regarding the distinction between L+H^* vs. (LH)^*. In section “The Present Study: Rationale and Hypotheses”, we hypothesized about mergers toward the more frequent accent type, i.e., mergers toward L+H^* in Northern German and toward L^*+H in Southern German speakers due to a high occurrence frequency of these complementary accents in the respective varieties (Kügler, 2004, 2007; Peters et al., 2005)—an account that had predicted less clear-cut distinctions between L+H^* and (LH)^* in the North and between L^*+H and (LH)^* in the South. This prediction was clearly not borne out: Instead, Northern German speakers were more distinct in all pairwise comparisons than Southern German speakers, especially with regard to the distinction L+H^* and (LH)^*, which almost seemed to converge for large parts of the contours in speakers from the South. We will discuss this finding and its implications in more detail in the General Discussion, including results from the association task in Experiment 2. Summarizing the main findings from Experiment 1, our results indicate that speakers maintain a three-way contrast in their imitative productions. Crucially, (LH)^* significantly differs in its form from the more established accents L+H^* and L^*+H in all experimental conditions, in particular for speakers from the North.

Experiment 2: Paraphrasing of Connotative Question Meaning

In Experiment 2, we tested whether the three rising-falling contours evoke different connotative meanings. To this end, listeners paraphrased the pragmatic meaning they associated with the stimuli from Experiment 1.