How (not) to look for meaning composition in the brain: A reassessment of current experimental paradigms

When we use language, we draw on a finite stock of lexical and functional meanings and grammatical structures to assign meanings to expressions of arbitrary complexity. According to the Principle of Compositionality, the meanings of complex expressions are a function of constituent meanings and syntax, and are generated by the recursive application of one or more composition operations. Given their central role in explanatory accounts of human language, it is surprising that relatively little is known about how the brain implements these composition operations in real time. In recent years, neurolinguistics has seen a surge of experiments investigating when and where in the brain meanings are composed. To date, however, neural correlates of composition have not been firmly established. In this article, we focus on studies that set out to find the correlates of linguistic composition. We critically examine the paradigms they employed, laying out the rationale behind each, their strengths and weaknesses. We argue that the still blurry picture of composition in the brain may be partly due to limitations of current experimental designs. We suggest that novel and improved paradigms are needed, and we discuss possible next steps in this direction. At the same time, rethinking the linguistic notion of composition, as based on a tight correspondence between syntax and semantics, might be in order.


1. Introduction
Linguistic communication rests on our capacity to combine the meanings of morphemes and words into complex semantic structures. This basic property of language has been a central concern in linguistics for decades. More recently, it has attracted the attention of neurolinguists, as the need to understand its neurobiological underpinnings has become pressing. Research on "composition," "unification," "combinatorics," or "integration" is now common in cognitive neuroscience. Yet, the mechanisms by which meaning is composed in the brain remain at present elusive: neural correlates of composition, invariant across experiments using different paradigms and methods, have not yet been established. The delay in our understanding of composition in the brain may partly stem from limitations inherent in the paradigms used so far: we will argue that none of them currently affords the direct comparisons between conditions that could reveal a correlate or signature of composition. As we review these paradigms, we will identify a number of requirements that future experiments should meet to achieve that goal. But how should composition be defined? At the computational level (Marr, 1982; Baggio, 2018) in formal semantics and adjacent fields, composition is the operation that, for any given complex expression E, takes as input E's immediate constituent meanings and E's constituent structure and outputs E's meaning (Heim and Kratzer, 1998). Compositionality is the idea that there is a strong parallelism, or one-to-one correspondence, between the operations that build syntactic structures and meaning composition: each application of meaning composition mirrors the application of syntactic structure-building operations. In the Minimalist Program, Merge is used to derive hierarchical constituent structure by recursively forming sets of syntactic objects in pairs (Adger, 2003).
In standard versions of formal semantics, composition amounts to the "saturation" of "unsaturated" meanings (e.g., a verb by its arguments) via the operation known as Functional Application, where a function is applied to arguments of appropriate type (Heim and Kratzer, 1998). All these operations are characterized atemporally in any formal system that strives to model the syntax and semantics of a language. At Marr's (1982) algorithmic and implementational levels of analysis, instead, these operations are modeled as processes unfolding in time.
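The notion of saturation can be made concrete with a toy sketch. The Python below is illustrative only: the miniature model, the tuple encoding of binary-branching trees, and the names `LEXICON` and `interpret` are assumptions introduced here, not part of any formalism cited above. It derives the meaning of a tree by applying whichever daughter denotes a function to its sister.

```python
# Toy model (hypothetical facts): individuals are strings,
# predicates are functions that eventually return a truth value.
SLEEPERS = {"Ann"}
LIKES = {("Ann", "Bob")}  # (liker, liked) pairs

LEXICON = {
    "Ann": "Ann",
    "Bob": "Bob",
    "sleeps": lambda x: x in SLEEPERS,
    # Transitive verb, curried: takes the object first, then the subject.
    "likes": lambda y: lambda x: (x, y) in LIKES,
}

def interpret(tree):
    """Return the meaning of a tree: a word (str) or a pair of subtrees."""
    if isinstance(tree, str):
        return LEXICON[tree]
    left, right = (interpret(daughter) for daughter in tree)
    # Functional Application: the functional daughter is saturated by its sister.
    if callable(left):
        return left(right)
    if callable(right):
        return right(left)
    raise TypeError("neither daughter is a function; composition is undefined")
```

Under these assumptions, `interpret(("Ann", ("likes", "Bob")))` evaluates to `True`, whereas a pair of two names such as `("Ann", "Bob")` raises an error because neither meaning can be saturated. The sketch is atemporal in the sense discussed above: it says nothing about when each application would take place during comprehension.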
Our focus is on studies investigating local composition of linearly adjacent functional or lexical items, where there is a direct correspondence between logic and time, or between the deployment of composition at the computational level and its algorithmic and neural execution. This correspondence becomes more complex with non-adjacent constituents, which pose specific problems for theories and experiments. Moreover, our focus will be on on-line language comprehension, not on production: little is known (and perhaps can be known experimentally) about whether and how meanings are composed during early stages of conceptualization and message generation.
The question of the neural bases of composition and Compositionality has only recently been brought to the foreground of research. But this move did not provide the hoped-for advancements: the way meaning is composed in the brain remains an unsolved problem (Pylkkänen, 2019). Current experiments have not been based on paradigms that reliably vary the presence vs. absence of composition. At a minimum, the field has not benefited from enough discussion on whether accepted and presently used paradigms achieve the intended aims. This paper tries to fill this gap. We will not focus so much on the results of each study: those cannot be confidently interpreted unless the validity of paradigms is thoroughly assessed. Published research may report spatiotemporal activity that differs between conditions, but those effects may not entirely reflect the processes of interest, if the baseline conditions cannot fully prevent composition. Furthermore, it is not obvious that limitations of current paradigms can be mitigated by using higher-resolution neural recordings or more advanced methods for analyzing data. Progress is needed on several fronts simultaneously: here we concentrate on the paradigms front.
We should emphasize that the same paradigm or design may be inadequate for studying composition and perfectly suitable for other aims, for example the identification of brain signatures of syntactic or semantic processing complexity. On the one hand, this implies that some paradigms are "almost good enough", in that they successfully target processes closely associated with composition.
On the other hand, this should remind readers that our aim is not to disqualify certain paradigms, designs, studies, or research programs as such, or even as viable approaches to the experimental study of syntax and semantics in the brain, but only and specifically as they relate to syntax-driven meaning composition, as defined above. We will thus discuss studies that manipulate the inputs of composition (constituent meanings and syntax) and ask whether the chosen conditions are adequate to identify, upon subtraction or comparison of neural responses, a correlate or signature of meaning composition. Even if a paradigm was not originally or primarily intended to study composition, we can still ask whether it can be leveraged to do that.
Research on composition makes the rather plausible assumption that, for meanings that are regimented by Compositionality, we should be able to identify experimentally neural events that instantiate composition in comparison to conditions where the requirements of composition are not met, because constituent structure or meaning cannot be derived or the meanings of the parts are unavailable. The challenge is indeed to utilize control or baseline conditions that can prevent the system from engaging in composition. Syntactic and semantic processing are however correlated. One problem for isolating composition in the brain is that studies that vary the structure of a stimulus tend to vary its meaning as well (Pylkkänen, 2019), and covarying neural signals can be difficult to disentangle. A second challenge is that composition is correlated or co-occurring with other processes, including non-strictly-compositional processes, like conceptual combination, pragmatic processing, inference, etc. (Baggio et al., 2016). Thirdly, linguistic theories and processing models do not yet fully agree on the steps by which structure and meaning are built, and linking hypotheses that can effectively connect levels of analysis and guide experimental research are scarce (Baggio et al., 2012a; Pylkkänen and Brennan, 2019; Baggio, 2020).
Most paradigms that have been used to study compositional processes vary the presence or absence of syntax or lexical semantics. By subtracting the compositional and baseline responses, they attempt to isolate only that which differs between the two: neural events associated with syntax-driven meaning composition. We discuss paradigms that use this approach (Section 2) or that exploit particularities of languages to vary semantics while keeping structure constant or vice versa (Section 3). Although we take structure to be an essential ingredient in meaning composition, studies that attempt to isolate composition should look not just at structure building per se, but at the derivation of meaning guided by structure. Thus, we will also consider studies investigating syntactic composition that used stimuli with compositional meaning (e.g., Pallier et al., 2011). Instead, experiments on syntactic structure in designs where meaning is absent, such as artificial grammars, will not be considered, along with studies using classical semantic or syntactic violations (e.g., Ni et al., 2000; Friederici et al., 2004). These designs are not suitable for isolating syntactic or semantic composition. They include well-formed sentences where semantic or syntactic constraints are violated on single words, but no comparisons that can reveal syntactic or semantic composition. Furthermore, the brain might still attempt to derive meaning in anomalous sentences, even though that meaning may not be licensed by .

2. The beaten path: Three classical paradigms

2.1. Scrambling linear order
Sentences are not linear sequences of words, but recursive, hierarchical combinations of words and phrases. One widely used paradigm compares syntactically well-formed and meaningful expressions with stimuli where the linear order of words is scrambled.
This manipulation is assumed to prevent the formation of syntactic structures at all levels of the hierarchy (phrases, clauses), thus disrupting compositional operations. Experiments using this approach can be separated into two groups, based on the "size" of the linguistic structures used for comparison: sentences or narratives.

2.1.1. Well-formed sentences vs. lists of words
One type of paradigm compares well-formed sentences with word lists where the linear order is broken, and thus syntactic hierarchies and complex meanings cannot be formed (Hashimoto and Sakai, 2002; Kuperberg et al., 2000). This paradigm would seem to target Compositionality directly: the meaning of a sentence is not just given by the meanings of constituents; syntactic structure plays a role, too. One assumption behind this paradigm is that lists and sentences only differ in one respect: syntactic structure. As we will see in this section, however, that assumption does not always hold.
This paradigm has been used in combination with other baselines, such as pseudoword sentences ("Jabberwocky") and pseudoword lists, to be discussed below. Word lists and scrambled sentences are used in fMRI studies to identify broad patterns of activation for language processing (Fedorenko et al., 2010) and isolate specific functional components of language (e.g., syntactic and semantic processing). In these studies, fully well-formed meaningful sentences are compared either to scrambled versions of the same sentences, where the same content and function words are presented in random order, or to lists of words not present in the original sentences. The assumption is that processes engaged at the single word level (e.g., lexical retrieval) are equally present in lists and sentences, so subtraction (sentence − list) will isolate neural responses that differ between conditions, such as a putative neural correlate of syntax-driven composition. Across studies, there is variability in how baseline conditions with word lists are built: as we will see soon, this is indirect proof of the challenges that arise when constructing stimuli in this paradigm.
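The subtraction logic, and the way it depends on the baseline, can be illustrated with a deliberately simple additive sketch. All numbers below are made up (arbitrary units) and the names `LEXICAL`, `COMPOSITION`, and `subtraction_contrast` are hypothetical; the sketch assumes that responses are the sum of a lexical component and a composition component, which is itself a simplification.

```python
def mean(values):
    return sum(values) / len(values)

def subtraction_contrast(sentence_responses, list_responses):
    """Mean response to sentences minus mean response to lists."""
    return mean(sentence_responses) - mean(list_responses)

# Made-up component magnitudes, for illustration only.
LEXICAL = 2.0      # assumed identical for words in both conditions
COMPOSITION = 1.0  # assumed to be added whenever composition takes place

sentences = [LEXICAL + COMPOSITION] * 4
clean_lists = [LEXICAL] * 4                      # baseline fully blocks composition
leaky_lists = [LEXICAL + 0.5 * COMPOSITION] * 4  # baseline allows partial composition

print(subtraction_contrast(sentences, clean_lists))  # 1.0: full composition term
print(subtraction_contrast(sentences, leaky_lists))  # 0.5: term is underestimated
```

In this toy setting, the contrast recovers the composition term only when the list baseline prevents composition altogether; any composition that "leaks" into the baseline shrinks the contrast.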
Sentences compared to unstructured lists involve the construction of sentence structure and meaning. Some studies have thus included manipulations of meaning to tease them apart. Vandenberghe et al. (2002) used PET with a blocked design comparing sentences to lists to determine the contribution of syntax to composition. The lists used scrambled content and function words from the sentences. Similarly, Humphries et al. (2006) ran an fMRI study using semantically congruent sentences (1a) and lists (1b):
(1) a. The man on a vacation lost a bag and wallet
    b. On vacation lost then a and bag wallet man then a
Both studies used semantic manipulations to disentangle compositional semantics from syntax. Semantically "random" sentences (1c) were compared with semantically "random" lists (1d); from Humphries et al. (2006):
(1) c. The freeway on a pie watched a house and a window
    d. A ball the a the spilled librarian in sign through fire
The incongruent condition (1c) is intermediate between a congruent sentence and a list: it has structure, but meaning is deviant. Incongruent sentences should control syntactic structure and lexical retrieval, but differences related to contextual activation of specific lexical items still exist between these conditions. Further, plausibility or meaningfulness manipulations may not prevent composition. Semantically anomalous sentences used in these studies appear felicitous up to the first few words, allowing participants to initially compose meaning ("Youths resented a sketch of the forest"). The results indicate that a subset of regions active for sentences is also active for anomalous sentences and lists, as if the brain engaged in composition in all conditions, albeit possibly to different extents or at different positions across items. Goucha and Friederici (2015) compared well-formed and meaningful sentences (2a) with well-formed incongruent sentences (2b) and scrambled lists of unrelated words (2c):
(2) a. The complexity of the regulations had shocked the unhappy kingdom
    b. The vicinity of the constipation had ironed the uncanny wisdom
    c. Vicinity the of had constipation wisdom ironed uncanny the
In Goucha and Friederici (2015), lexical information is matched across sentences and lists. To reduce the risk of incidental syntactic structure building in word lists, Humphries et al. (2006) created lists by randomly sampling function words from the stimulus set and by replacing them in sentences before shuffling their word order. Lexical content cannot be matched exactly, but randomly picked function words might be less likely to combine with the given content words. Even so, this is unlikely to fully block syntactic processing.
Another scrambling approach was used by Kaufeld et al. (2020). They found increased neural tracking, at the phrase frequency, for sentences (3a) vs. lists (3b), which suggests the brain is building hierarchical structure for sentences but not for lists. This could be taken to indicate that meaning composition too is happening only for sentences, tracking hierarchical structure. But stringing together locally words from the same category in lists could lead to compounding attempts (N-N), or could engage other syntactically viable modes of combination (e.g., adjective stacking, "bange bruine"). This issue is not specific to this study, but applies widely to list paradigms. Composition may then occur in both sentences and lists, at least locally. Another variant of this paradigm disrupts syntactic structure parametrically, resulting in conditions with different degrees of scrambling. Pallier et al. (2011) studied the neural mechanisms of hierarchical structure building using stimuli with five levels of scrambling. These varied in the size of the constituents, ranging from a full sentence (4a) to a word list (4f), with lists of constituents of different sizes in between, from 6 to 2 words, (4b) to (4e). As constituents were extracted from the sentence condition and concatenated randomly, lexical material was matched across conditions, but not within each item set. Activation was modulated by constituent size in the left superior temporal sulcus (STS) and inferior frontal gyrus (LIFG partes triangularis and orbitalis). In a replication study, Shain et al. (2021) suggest that these effects may not reflect syntactic structure building, but the fact that shorter constituents may not fully engage the language network. Larger chunks may then be easier to recognize by the language network as stimuli to be processed.
Parametric variation of constituent size can be a way of overcoming the poorer temporal resolution of BOLD fMRI and can be used to track how composition unfolds step by step as structure and meaning are built. However, as noted by Grodzinsky et al. (2021), these designs are not without issues. The conditions do not form minimal pairs: e.g., there are additional differences in category labels and number of structural units between them. A similar study is Matchin et al. (2017), who aimed to dissociate the effects of bottom-up syntactic computations from those of top-down predictions, by comparing lists of words ("rabbit the could extract protect") to lists of two-word phrases ("the fencer the baby the bill") and full sentences ("the poet will recite a verse"). Zaccarella et al. (2017) matched as much as possible semantic content between conditions, comparing sentences ("The ship sinks") to prepositional phrases that contained a matched noun ("on the ship"). The word list baseline contained a further control measure: the nouns in the lists were in the same positions as in the sentence or phrase ("stem ship juice"; "leek mouth ship"). Matchin et al. (2017) and Zaccarella et al. (2017) find effects in the left IFG and posterior STS (pSTS) for syntactic structure building, but only the former study reports effects in left pSTS for sentences and phrases. In an fMRI study, Mollica et al. (2020) compare well-formed sentences (5a) to scrambled sentences with 1 swap (5b), 3 (5c), 5 (5d) or 7 swaps (5e), and a list of content words:
(5) a. on their last day they were overwhelmed by farewell messages and gifts
    b. on their last day they were overwhelmed by farewell and messages gifts
    c. on their last they day were overwhelmed farewell by and messages gifts
    d. on their last day were overwhelmed they farewell messages by gifts and
    e. their last on they overwhelmed were day farewell by messages and gifts
The novelty here is that word order is disrupted, but the message can still be recovered. A second experiment included a condition where scrambling was so severe that syntactic and semantic relations between words could not be established:
(5) f. last day farewell gifts on were and they by they overwhelmed message.
The results show that, if dependencies between words can be recovered, linear order has little impact on processing: activation levels were similar across conditions, irrespective of scrambling. The exception is (5f), where scrambling was such that words cannot form dependencies: here activation levels were lower, closer to the level of content word lists.
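To make the idea of graded scrambling concrete, the following Python sketch generates scrambled versions of a sentence by repeatedly exchanging adjacent words. It assumes that a "swap" means exchanging two neighboring words at a randomly chosen position; the procedure actually used by Mollica et al. (2020) may differ, and the function name `scramble_with_swaps` and its `seed` parameter are hypothetical.

```python
import random

def scramble_with_swaps(sentence, n_swaps, seed=0):
    """Apply n_swaps exchanges of adjacent words at random positions."""
    words = sentence.split()
    if len(words) < 2:
        return sentence  # nothing to swap
    rng = random.Random(seed)  # seeded so that stimuli are reproducible
    for _ in range(n_swaps):
        i = rng.randrange(len(words) - 1)  # left member of the pair to swap
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

sentence = "on their last day they were overwhelmed by farewell messages and gifts"
for n in (1, 3, 5, 7):
    print(n, scramble_with_swaps(sentence, n))
```

Because positions are drawn independently, later swaps can partly undo earlier ones, so the number of swaps is only an upper bound on how far the output departs from the original order. A stimulus set built this way would still need to be inspected for locally recoverable dependencies, which is the concern raised in the text.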
This study is a reminder of the importance of carefully constructed baseline and control conditions. When scrambled words are linearly close to other words with which they can plausibly enter a dependency relation, there are no differences between the baseline and compositional conditions. One possibility, compatible with Mollica et al.'s interpretation of these results, is that scrambled sentences (5b-e) cannot prevent extraction of meaning from input: the brain is quite "aggressive" in its urge to compose. There is another lesson one could draw here. The extent to which interpretation requires (hierarchical) syntactic structure is open to question (Culicover and Jackendoff, 2006; Baggio, 2018, 2021; Nefdt and Baggio, 2023). Participants might use linear order as a proxy for syntactic structure, or extract meaning without (fully) reconstructing structure. If participants seek to compose meaning even in the scrambled conditions, either they do not need syntactic structure to compose or they are trying to fix the disrupted mapping between syntax and word order.
Note that linear order feeds in a systematic way off of structure, but it is not completely determined by it. Different languages have different base orders and tend to allow for variation in word order for the same message.
Thus, linear order is not an exact proxy for syntactic structure.
In most fMRI studies using this paradigm, activation levels are averaged over the whole sentence and are compared with the average signal from word lists. Because of the slow temporal evolution of the BOLD response, these studies cannot zoom in on syntactic or compositional processes at specific points in a sentence, but can only indirectly associate composition to regions that activate more with the presence vs. the absence of structure, a binary variable that applies to the entire stimulus (Matchin et al., 2019a). Composition, however, is a time-sensitive process that may not occur in the same form at each word (there may be differences for optional vs. obligatory elements, function vs. content words etc.), that may not happen at every single word (if certain constructions imply storage of material, e.g., with long-distance dependencies), and that may be revised at subsequent processing stages (Baggio et al., 2008; Baggio, 2018). A fine-grained map of composition operations, as realized in the brain, may only be obtained from measures with sufficient temporal resolution and with experimental designs that harness that resolution. M/EEG have the advantage of sampling brain activity with a millisecond resolution. Hultén et al. (2019) use MEG to compare sentences (e.g., "I like to read nice books in my spare time") to lists containing the same words as the sentences, but in a scrambled order. For every word in the sentence, they found activity around 400 ms in the left posterior temporal cortex (LPTC), left inferior frontal cortex (LIFC), and left anterior temporal lobe (LATL). Fedorenko et al. (2016) used cortical-surface EEG (ECoG) with lists and sentences. They observed a monotonic increase of gamma power over frontal and temporal areas as the sentence unfolded.
For lists, this was only seen until the third word, after which activity dropped, suggesting that participants may initially attempt to process lists much as they do sentences. Their results also show increased gamma activity for word lists relative to Jabberwocky and nonword conditions, suggesting that composition might be engaged in that condition too, as constituents in word lists may still be formed. Using ECoG, Nelson et al. (2017) compared sentences vs. scrambled lists and found high gamma decreases for words closing syntactic phrases. These studies point to possible gamma-band signatures of structure building or syntax-driven composition (but see Murphy, 2020 for a different account). However, word lists do not allow researchers to exploit the superior temporal resolution of M/EEG: as word order in lists is disrupted, one cannot compare the same word across conditions at any given time point while controlling for properties of the left context. Independent improvements of this paradigm would therefore be needed to fully take advantage of better recording resolution or advanced data analysis methods.
In these experiments, lexical material is matched between the items being compared but the presence of function words may still trigger structure building attempts also in lists, as suggested by Zaccarella et al. (2017). Their meta-analysis shows function and content words in lists can activate language regions, e.g., the left IFG. Affixes and function words carry grammatical information and can therefore guide syntactic processing. In an fMRI study of the neural correlates of syntax and semantics, Friederici et al. (2000) compared spoken German sentences (6a) to word lists (6b): The cook silent cat velocity yet honor.
They removed function words and inflectional morphology from lists and omitted verbs: German word order can make verbs within lists trigger syntactic processing. Still, their lists are considerably less diverse lexically than sentences. They reported activations for sentences relative to lists in the bilateral superior temporal gyrus (STG). This region has been associated with phonological processes. Given the differences in length or duration of words in lists (only content words, minus verbs) vs. in sentences (content and function words), it is difficult to establish whether the STG effect here is due to composition or to processing of phonological or auditory properties of stimuli. A similar concern applies to recent work, such as Branco et al. (2020), who also used lists with only content words as baselines. They find activation for sentences relative to word lists across left frontal and temporal areas, but this result may include any area sensitive to the distinction between function and content words, as opposed to combinatorial processes more specifically.
A possible approach to isolating composition would be to remove confounding variables by modifying the stimuli in a stepwise fashion. Humphries et al. (2005) compare spoken sentences (7a) to unstructured lists with and without prosodic cues. The lists served as a baseline and could contain function and content words (7b) or only content words (7c):
(7) a. The man was looking forward to an upcoming road trip in his expensive new car.
    b. That the in the wearing students the blonde expensive south up waits in performing the ate.
    c. Bank calm school bathtub workers home car tambourine neail waill hat beach umbrella street head.
Permuting the words within each sentence would run the risk of accidental composition: semantically related words might prompt speakers to reconstruct a meaningful message, as was noted by the authors. They thus randomly picked words from the sentence set for scrambling, keeping the stimulus length and number of syllables constant within items. The conditions were matched lexically over the entire set, but not for each item or each sentence position. The left anterior STS, toward the middle temporal gyrus (MTG), was active for sentences regardless of prosody; the left posterior STS was active for sentences with list prosody; the posterior bilateral STS showed a prosody × structure interaction.
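The difference between set-level and item-level matching can be sketched in a few lines of Python. The sketch below pools the words of a whole sentence set and deals them out into lists that match each sentence in number of words. It is not the procedure of Humphries et al. (2005): sampling without replacement is an assumption, syllable matching is not implemented, and the name `sample_word_lists` is hypothetical.

```python
import random

def sample_word_lists(sentences, seed=0):
    """Build one word list per sentence, equal to it in number of words,
    by drawing without replacement from the pooled words of the whole set."""
    rng = random.Random(seed)
    pool = [word for s in sentences for word in s.split()]
    rng.shuffle(pool)
    lists, start = [], 0
    for s in sentences:
        n = len(s.split())
        lists.append(" ".join(pool[start:start + n]))
        start += n
    return lists

sentences = ["the man lost a bag", "a dog ran", "she reads nice books daily"]
for sentence, word_list in zip(sentences, sample_word_lists(sentences)):
    print(sentence, "|", word_list)
```

By construction, the lists contain exactly the words of the sentence set taken as a whole, but a given list need not share any word with the sentence it is paired with, nor match it position by position. This is the sense in which conditions are matched over the set rather than per item.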
Lists and sentences are difficult to match in all relevant respects except for composition. Law and Pylkkänen (2021) embedded lists of nouns ("lamps, dolls, guitars") into sentences (8a) or lists (8b) in an MEG study aimed at isolating correlates of syntactic composition:
(8) a. The eccentric man hoarded lamps, dolls, guitars, watches and shoes
    b. Forks, pen, toilet, rodeo, lamps, dolls, guitars, wood, symbols, straps
Their results show increased activity in the left inferior frontal cortex at 250-300 ms, at 300-350 ms in the LATL, and at 330-400 ms in the left posterior temporal cortex for lists in sentences relative to lists in lists. This design affords better control over local syntactic and semantic context, and the use of bare plural nouns may help prevent N-N compounding in lists. However, the conditions are different beyond the immediate local context: lists do not include any function words, and content words before critical words differ between conditions, which might impact processing complexity and preactivation. Additionally, as noted by the authors, a word's meaning in a sentence could differ from the same word's meaning in a list.

2.1.2. Composition beyond sentences: Structured narratives vs. scrambled sentences
Experiments using single sentences may be argued to lack the ecological validity needed to draw inferences about how compositional machinery is used in everyday life (Hasson et al., 2018). We rarely communicate in isolated utterances: the messages that we convey often span multiple sentences. Recent studies have thus used multi-sentence narratives, typically presented in the auditory modality as naturalistic speech. Narratives have been compared to lists of scrambled words from the same story, to lists of words matched in lexical variables with words in the story, or to lists of unrelated sentences (Mazoyer et al., 1993; Xu et al., 2005; Pylkkänen, 2012, 2017). Lerner et al. (2011) compared brain responses to stories in the auditory modality with scrambled versions at different levels of structure: word, sentence, and paragraph, plus a condition with the story played backward. Using structured narratives results in more ecologically valid conditions and increases the variety of expressions investigated. But these studies also use lists of words or sentences as baseline conditions, incurring the problems raised above. Further, the size of the stimuli makes it difficult to zoom in on local composition: interpretation of most words in narratives is influenced by the discourse model built up to that stage, engaging processes beyond composition (Baggio et al., 2016; Baggio, 2018).

2.1.3. Problems with lists: Interim summary
Some paradigms have tried to align experimental and baseline conditions by controlling lexical frequency, length, and word class across sentences and lists, by scrambling words from the same sentences, by combining words from different sentences in the stimulus set, by leaving out function words, or by matching local contexts while varying aspects of global contexts.
Such strategies may not always achieve minimality or precise matching of conditions (Grodzinsky et al., 2021).
Beyond minimality, the potential risk of accidental syntactic or semantic composition in lists always looms over the interpretability of experimental results, particularly when the words used in lists are drawn from the critical sentences and shuffled in random order. An inspection of the stimuli used in many studies reveals that phrase-level dependencies can sometimes still be formed (Mollica et al., 2020). Matchin et al. (2017) too point this out as a possibility in their list condition. The task used might encourage participants to impose syntactic structure on unstructured lists (Matchin et al., 2017). Some studies use block designs as a remedy, but drawbacks can be habituation effects or the emergence of expectations and processing strategies. There are also further differences in sentences vs. lists that are rarely discussed, for example that sentences introduce more information to be encoded in memory. Lists could engage attention and control more than sentences, if there is an active effort to interpret the stimulus.
We define a "minimal pair" as two conditions that only differ in the variable of interest: e.g., conditions that only differ in that one involves composition and the other does not or where the mode of composition is different. An exact matching between conditions might prove impossible at the level of the stimulus, but a close matching might still obtain if the processes in the two conditions are identical except for the one of interest. Examples of steps in this direction are discussed in Section .
An additional level of complexity is introduced by the interaction of problems related to the choice of methods (fMRI vs. M/EEG) with challenges that arise from problems in the paradigms themselves. With respect to minimal pairs, one question is whether the effect of noise or variability from different lexical items is more dangerous than the addition of function words in non-composition baselines, or vice versa. In fMRI, where localization is the goal, it may be more appropriate to get rid of function words than to be rigid about matching words in each comparison. With M/EEG, the trade-off might go the other way, given the prominence in measured signals of preactivation and related effects of content words, which should then be matched as much as possible. Sentence-level comparisons, for example using fMRI, would work only if differences between lists and full sentences were spatially localized on a "macro" level. Even then, fMRI's lack of temporal sensitivity still largely threatens non-minimal paradigms, if the goal is to isolate basic composition: the effects of pure composition will interleave with other linguistic operations and smear out over the total fMRI signal over the course of a sentence. This problem is exacerbated with longer discourses. Our assessment of studies using lists, scrambling, or constituent chunking is summarized in Table 1. Anomalous sentences and lists with function words are, in our view, the most problematic. Lists without function words may reduce chances of accidental composition, but the resulting contrasts are less minimal compared to lists with function words and scrambled sentences. In terms of minimality and naturalness, scrambled sentences are superior to lists with function words.

. . The Jabberwocky alteration: Form without content
Lists of words aim to disrupt linear order and thus prevent composition. However, this type of stimulus cannot be used to dissociate meaning and grammar: sentences and lists of words differ both in structure and compositional semantics (Grodzinsky et al., 2021). Differences between the two conditions will then reflect both aspects of composition.
One type of design, meant to dissociate syntax from semantics, relies on baseline stimuli that are devoid of lexical meaning, but still grammatical. Jabberwocky sentences consist of phonotactically, morphotactically, and grammatically well-formed strings that lack content. Structure building is assumed to proceed unimpeded, but meaning composition is blocked by the unavailability of constituent meanings. In typical Jabberwocky experiments, all content words are replaced with phonotactically licensed pseudowords, maintaining all function words and affixes ("The gar was swabbing the mume from atar"; Fedorenko et al., 2016). The pseudowords are usually derived by replacing phonemes in real words while making sure that the resulting pseudowords do not exist in the given language. In Jabberwocky, syntactic constituents and dependencies are thus maintained in the absence of meaning. Some studies match low-level properties of Jabberwocky to real language by controlling variables such as bigram frequency, syllable length, and phoneme length (Heim et al., 2005; Humphries et al., 2006; Branco et al., 2020). By comparing a normal sentence (e.g., "The poet will recite a verse") with a Jabberwocky version matched in syntactic structure, but not in content (e.g., "The tevill will sawl a pand"; Matchin et al., 2017, 2019a), one can reveal brain activity that reflects processes necessary to derive compositional meaning. There are however differences in how Jabberwocky and pseudoword sentences are used across studies. Friederici et al. (2000) maintain morphological and capitalization rules of German to give Jabberwocky the "feel" of German: "Das mumpfige Fölöfel föngert das apoldige Trekon". In addition to pseudowords, Fedorenko et al. (2016) used a low-level condition with strings of "nonwords" (e.g., "Phrez cre eked picuse emto pech cre zeigely"). This condition is meant to control for low-level orthographic processing in the absence of lexical processing and composition.
Sometimes pseudowords and function words are scrambled within a sentence (e.g., "rooned the sif into lif and the and the foig aurene to"). The normal sentence vs. Jabberwocky sentence contrast is used to identify the effects of compositional semantics when structure is held constant (Röder et al., 2002), while the Jabberwocky sentence vs. Jabberwocky list contrast is used to isolate syntactic structure building in the absence of meaning (Goucha and Friederici, 2015). This is seen as a viable strategy, if the goal is to dissociate syntactic from semantic processing. But as with word lists, Jabberwocky and pseudowords, let alone nonwords, raise concerns about the minimality of the stimuli compared; for example, some phonological and lexical variables cannot be measured and matched between the two conditions.
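The derivation of pseudowords described above (replacing segments in real words and rejecting results that exist in the language) can be illustrated with a short sketch. It rests on several simplifying assumptions of ours: letters stand in for phonemes, a vowel is only ever replaced by a vowel and a consonant by a consonant as a crude proxy for phonotactic licensing, and `TOY_LEXICON` and `FUNCTION_WORDS` are hypothetical placeholders for a full lexical database and a closed-class word list. It is not the procedure of any particular study.

```python
import random

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"
FUNCTION_WORDS = {"the", "will", "a"}  # hypothetical closed-class list
TOY_LEXICON = {"poet", "recite", "verse", "poem", "port", "post", "verge"}

def make_pseudoword(word, lexicon, rng, max_tries=100):
    """Replace one letter with another of the same class (vowel or
    consonant) and reject candidates that are listed in `lexicon`."""
    for _ in range(max_tries):
        i = rng.randrange(len(word))
        pool = VOWELS if word[i] in VOWELS else CONSONANTS
        candidate = word[:i] + rng.choice(pool) + word[i + 1:]
        if candidate != word and candidate not in lexicon:
            return candidate
    raise ValueError(f"no pseudoword found for {word!r}")

def jabberwocky(sentence, lexicon, rng):
    """Keep function words; replace every other word with a pseudoword."""
    words = sentence.lower().split()
    return [w if w in FUNCTION_WORDS else make_pseudoword(w, lexicon, rng)
            for w in words]

rng = random.Random(1)  # arbitrary fixed seed for reproducibility
print(" ".join(jabberwocky("The poet will recite a verse", TOY_LEXICON, rng)))
```

With a lexicon this small, the check cannot rule out accidental real words (replacing one letter of "poet" can yield "pout"), which is why actual stimulus sets require a comprehensive word list and, ideally, matching on variables such as bigram frequency.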
The question of whether specific areas of the language network are sensitive to syntactic structure, word meanings, and their interactions is often debated in the field (Fedorenko et al., 2012, 2020; Hagoort and Indefrey, 2014). Several studies used pseudoword sentences vs. unstructured pseudoword lists to disentangle syntax and semantics in the brain (e.g., Fedorenko et al., 2016; Matchin et al., 2017). Branco et al. (2020) use pseudoword lists, lists of content words, real word sentences, pseudoword sentences, and a non-linguistic baseline with symbols matched in length and visual features to the linguistic stimuli. A similar design is used by Humphries et al. (2006), who compared conditions assumed to be minimally different in the presence or absence of syntax or semantics. In addition to normal sentences (1a), incongruent sentences (1c), and lists (1b), they used pseudoword sentences and pseudoword lists containing real function words:
(9) a. The solims on a sonting grilloted a yome and a sovir
b. Rooned the sif into lilf the and the foig aurene to
Structured stimuli were compared to lists to establish a main effect of syntax: activation differences were seen in the left anterior STS. The effect of compositional semantics was derived by comparing normal sentences to incoherent sentences: these conditions both involve lexical processing, but only normal sentences result in a meaningful proposition. This contrast revealed effects in the left inferior temporal gyrus, the left STS, and the left AG. Comparisons were performed between incoherent and pseudoword sentences (with activation in left anterior, middle, posterior STS) and between normal and pseudoword sentences to determine effects of lexical processing (anterior, middle, posterior STS and MTG). The analysis was limited to temporal areas, but the results show that semantics is subserved by a wider network of areas in the temporal lobe than syntax. Stromswold et al. (1996) used a variation of this paradigm with conditions in which only one word in a sentence was replaced by a pseudoword (10a) vs. center-embedded (10b) and right-branching (10c) sentences:
(10) a. The economist predicted the recession that chorried the man
b. The limerick that the boy recited appalled the priest
c. The biographer omitted the story that insulted the queen
By manipulating both syntactic complexity and the possibility of deriving compositional meaning, this study asks whether brain areas subserving syntax as opposed to semantics can be isolated. They found increased activation in LIFG for syntactically more complex sentences and in the inferior frontal gyrus, superior temporal gyrus, and supramarginal gyrus for normal sentences vs. sentences with a pseudoword.
Another experiment using pseudowords to investigate syntactic composition is Segaert et al. (2018). To minimize the effect of semantics, they used sentences where the subject is a pronoun and the verb is a pseudoword with inflectional morphology ("She grushes"). The baseline is a list of pseudowords matched in length to the sentences ("pob grushes"). The pronoun is assumed to trigger syntactic composition, whereas the pseudoword list should not. However, structure building could also occur in lists, as morphological marking on the second word could allow speakers to parse the list as a pseudo-subject noun followed by a pseudo-verb. The study found increases in EEG alpha power over left fronto-temporal channels for sentences vs. lists, for the first and second words, interpreted as predictive and syntactic processes respectively (see also Hardy et al., 2023).
It could be argued that Jabberwocky still involves formal compositional semantics, even though lexical and conceptual semantics are absent. Grammatical cues could license the assignment of thematic roles toward an interpretation: e.g., "The tevill will sawl a pand" refers to an event (sawl) that will be initiated by an entity (the tevill) affecting another (a pand). This is compatible with the results of studies such as Branco et al. (2020), which did not find differences in activation between real sentences and pseudoword sentences. Goucha and Friederici (2015) exemplify this observation in a parametric design. To identify areas of the left inferior frontal gyrus selectively involved in syntax and semantics, they used several types of pseudoword sentences as baselines. Their Jabberwocky sentences contained phonologically licensed pseudo-content words and real function words, with inflectional and derivational morphology (10a). They then removed derivational morphemes (10b) and, in addition, inflectional morphology, replacing determiners with pseudowords (10c):
(10) a. The pandexity of the larisations had zapped the unheggy wogdom.
b. The pandesteek of the larisardens had zapped the enhegged fordem.
c. Thue pandesteek of thue larisarden feg zopp thue enheg fordem.
Their fMRI results show a different pattern of activation for pseudoword sentences with vs. without derivational morphology, suggestive of residual morphosyntactic processing.
Another known issue is that pseudowords, due to their resemblance to real words, might trigger a "search" in the lexicon which will return no results. This might make them more difficult to process than real words, undermining the assumption that pseudowords can serve as a baseline involving fewer/simpler processes. Iwabuchi and Makuuchi (2021) use pronounceable letter strings as placeholders for real words, adding relevant morphology to form hierarchical structures in Japanese. They also included a syntactic manipulation with sentences with the canonical SOV word order (11a), more complex OSV order (11c), as well as nonsemantic sentences containing placeholders, but with the same syntactic structures as the natural sentences (11b, d): This type of design aims at dissociating syntactic from semantic processes in the brain, without using an additional condition of pseudowords and word lists. Using fMRI, they found an effect in the LATL for sentences vs. pronounceable non-sentences regardless of word order. BA44, premotor, and parietal cortices were more active to the placeholders. This latter finding might be attributed to the perceptual and/or phonological differences between placeholders and real words. The effect of syntax was less robust: activations in BA45 and pMTG were observed only before correcting for multiple comparisons.
. . . Problems with Jabberwocky: Interim summary
The Jabberwocky paradigm tries to create an impoverished language, where meaning is removed but syntactic structure is preserved: the goal is to block semantic composition while keeping syntactic composition and other grammatical processes going. However, pseudoword sentences do not entirely lack compositional meaning, and function words, when present, can trigger the construction of a minimal formal semantic representation. Comparing sentences to Jabberwocky, with the purpose of isolating processes specific to meaning composition, can therefore result in the loss of precisely the signal that is relevant to that process. Pseudowords and real words differ in frequency, familiarity, and the cognitive resources allocated to them, for example lexical recognition and search. Pseudoword sentences are used as part of designs also including (pseudo-)word lists, but pseudowords and lists of real words differ in their levels of salience and intelligibility, making direct comparisons difficult (Bautista and Wilson, 2016). Studies attempting to isolate syntactic and semantic components of language processing using word lists and pseudoword sentences can fail to create true minimal pairs: these conditions differ from sentences on other dimensions than just the presence or absence of syntax and semantics (Grodzinsky et al., 2021). Our assessment of studies using pseudoword sentences or Jabberwocky is provided in Table 1. Lack of minimality and naturalness of these stimuli are the main limitations and what renders these paradigms overall problematic for studying meaning composition.

. . Minimal phrases
Sentences involve processes that can obscure purely compositional operations. Semantic associations and other memory-based processes, conceptual combination, preactivation, prediction, and inferential, referential, and elaborative processes, among others (Baggio, 2018), contribute to meaning construction over and above composition. These processes interact with each other to ease demands on processing of downstream inputs (Bemis and Pylkkänen, 2013a; Zaccarella et al., 2017). In none of the paradigms reviewed above can composition be fully disentangled from co-occurring processes. Previous sentence-level studies have focused on delineating linguistic distinctions, such as lexicon vs. grammar, under the assumption of large-scale differences in localization. Interpreting their results to make claims about Compositionality requires linking hypotheses on the role of syntax in composition, e.g., whether syntax is the only driver vs. one constraint among many, or whether composition differs for lexical content vs. logical syntacto-semantic relations.
In order for a compositional algorithm to be set in motion, it needs to be fed at least two elements (e.g., words) to produce the meaning of their combination. From a generativist standpoint, elements are combined in pairs. This combination then becomes an element too, to be combined with another in a further step of the derivation. The minimal phrase paradigm, by Pylkkänen and collaborators, uses two-word phrases as the main object of investigation. Bemis and Pylkkänen (2011) "truncate" the pseudowords and lists designs in order to adapt them for the study of composition in simple phrases. Their compositional stimulus was a two-word uninflected adjective-noun phrase ("red boat") to be compared to a baseline consisting of an unpronounceable letter string followed by the same noun ("xkp boat"). The noun "boat", at which the comparison is made, can enter composition in the first but not in the second condition. The use of an unpronounceable letter string, as opposed to a pseudoword, would serve to prevent composition attempts. To control for influences of the lexical material before "boat" in the two word conditions, they included non-combinatorial lists of two nouns ("cup, boat"). However, the brain is eager to extract meaning from input, and there is a possibility of noun-noun compounding in lists (e.g., a plastic or paper cup made to float like a boat). Bemis and Pylkkänen then introduce an additional task manipulation. The task required participants to compose the meaning of the two words and to check whether the combination matched a subsequent picture of a colored object (composition task) vs. read each word to verify whether one matches the picture following each trial (noncomposition task). Composition only takes place at the second word, where contextual processes are minimized. This makes minimal phrases a better fit for time sensitive M/EEG methodology than other paradigms. 
In the auditory modality, pink noise can be used as a baseline instead of nonwords (Bemis and Pylkkänen, 2013b). Activity in the LATL, from around 200 ms from the onset of the second word, has emerged as a possible signature of semantic combination (Pylkkänen, 2019).
This paradigm combines a tightly controlled stimulus set with manipulations of the task to ensure that the recorded brain activity is related to the process at issue. For example, Bemis and Pylkkänen (2013a) compare canonical adjective noun phrases ("red boat") with reversed counterparts ("boat red") and nonword-word strings ("xhl cup", "frw red"). The key manipulation is the task, which involves a colored shape (compose) or two pictures, one of a colorless shape and one of a colored blob (non-compose): participants had to respond whether the probe matched both words. This study tested whether composition can also be deployed in ungrammatical sequences and whether it is automatic enough to be engaged even when the task does not require it. They found that the LATL is engaged in reversed sequences only when the task requires composition and with canonical word order regardless of the task. Fló et al. (2020) show that, when the task manipulation is eliminated, the effects of composition are no longer observed with EEG.
The minimal phrase experiments achieve something which has been challenging for the previously discussed paradigms: matching between conditions the word which has to be composed or not, at the position at which the neural signal is measured. The pre-critical content in non-combinatorial conditions (nonwords and nouns in lists), however, differs in several respects from the adjective used in the compose conditions. These differences might affect the signal recorded at the critical word. For example, a nonword at the start of a trial might make participants less engaged in processing the following words. At the same time, preactivations resulting from processing of a noun in lists and of an adjective in compositional trials will differ. Additionally, the two word list condition might trigger a process of compounding and thus involve composition regardless of explicit task.
Some minimal phrase studies have used multiple and different baseline conditions. Neufeld et al. (2016), Baggio (2020, 2022), and Kochari et al. (2021) use pseudowords and nonwords to disentangle semantic and syntactic processes, and Bemis and Pylkkänen (2013a) use a reversed word order condition ("boat red"). Del Prato and Pylkkänen (2014), instead of lists of nouns, use lists of adjectives and lists of numerals as baselines, which match in category to the precritical words used in the combinatorial contexts. Graessner et al. (2021a,b) contrast meaningful two-word phrases ("fresh apple") to anomalous phrases ("awake apple") and adjective-pseudoword phrases ("fresh gufel"). In an ECoG experiment, Murphy et al. (2022) compare adjective-noun phrases ("red boat"), which are assumed to involve composition at the noun and prediction at the adjective, to adjective-pseudoword phrases ("red neub"), involving just prediction, and to pseudoword-noun phrases ("zuik boat"), which involve neither.
Some minimal phrase studies have tested how different semantic contexts interact with composition, for example how specificity of the noun modulates LATL activity (Zhang and Pylkkänen, 2015) and the impact of semantic properties of adjectives (e.g., see Ziegler and Pylkkänen, 2016; Baggio, 2020, 2022; Kochari et al., 2021). Kim and Pylkkänen (2019) look for MEG correlates of composition in adverb-verb constructions, testing whether different classes of adverbs (eventive "slowly" vs. orientative "reluctantly") show LATL effects similar to those in adjective-noun phrases. Manipulations of the precritical word target the interplay of composition and prediction, via the use of different pronoun types (Strijkers et al., 2019), and between composition and semantic properties of nouns, such as relationality or eventivity (Boylan et al., 2017; Williams et al., 2017). Studies have revealed early LATL responses for Adj-N phrases in the auditory and visual modalities. However, Kochari et al. (2021) failed to replicate this finding. The sensitivity of LATL to variables that syntax-driven composition should, according to theory, not be sensitive to (e.g., specificity) has led to the conclusion that the LATL does not perform composition, but rather conceptual combination (Pylkkänen, 2019). Moreover, the angular gyrus (AG) and the ventromedial prefrontal cortex (vmPFC) are involved in semantics, though they do not always activate across studies. Murphy et al. (2022) find effects of composition in portions of the pSTS using iEEG/ECoG. With EEG and minimal phrases, Neufeld et al. (2016) link the N400 to combinatorial semantic processing (Hagoort et al., 2009; Baggio and Hagoort, 2011; Baggio, 2012; Nieuwland et al., 2020), and Baggio (2020, 2022) find and replicate P600 effects for adjective-noun composition.
The relatively tight control over experimental items offered by minimal phrases has also been used to tackle more fine-grained and theoretically relevant questions on the nature of composition. One question is whether composition in different syntactic structures or environments, such as modification and predication, is carried out by different neural processes. Westerlund et al. (2015) test the distinction between composition operations of argument saturation and predicate modification (Heim and Kratzer, 1998): the former mode of composition includes verb-noun (e.g., "eats meat"), preposition-noun ("in Italy"), and determiner-noun ("Tarzan's vine") combinations; the latter includes adjective-noun (e.g., "black sweater"), adverb-verb ("never jogged"), and adverb-adjective ("very soft"). In keeping with the standard design, each expression was compared to a nonword followed by a matched noun in order to establish effects of composition. Boylan et al. (2015) use a similar design, crossing mode of composition (argument type: "eats meat", "with meat" or adjunct type: "eats slowly", "tasty meat") with presence or absence of a verb. The baseline was non-compositional phrases in which the nonword was either the first or the second element of the sequence ("eats fghjl"/"fghjl eats"). A similar approach is used by Schell et al. (2017). Matchin et al. (2019b) matched word forms exactly within the phrases, while varying syntactic structure for noun-adjective (e.g., "the frightened boy") and verb-noun ("frightened the boy") composition. A potential confound might arise in these designs, as also noted by Matchin et al. (2019a). Whereas a noun composed with a modifier may be interpreted as a saturated structure on its own, a noun in the object position, composing with a verb, results in incomplete syntactic and semantic structures. Boylan et al. (2015) report activity in the left AG, regardless of mode of composition, for "eats meat" vs. "tasty meat". Westerlund et al. (2015) found that the LATL is involved in both argument saturation and predicate modification. Matchin et al. (2019b) show that activity in the left IFG and pSTS increases for verb-noun composition, while there is no difference between the two syntactic structures in AG and LATL activation.
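The formal distinction between argument saturation and predicate modification that these experiments build on (Heim and Kratzer, 1998) can be made concrete with a small sketch. Argument saturation corresponds to functional application, where a function-denoting expression is applied to its argument; predicate modification conjoins two one-place predicates. The toy domain, the facts in it, and the treatment of "meat" as an entity constant are simplifying assumptions of ours, chosen only to mirror the examples "eats meat" and "black sweater".

```python
# Toy extensional model; all entities and facts are invented for illustration.
DOMAIN = {"tarzan", "jane", "meat", "sweater1", "sweater2", "coat1"}

# Type <e,t>: one-place predicates as characteristic functions over DOMAIN.
def black(x):
    return x in {"sweater1", "coat1"}

def sweater(x):
    return x in {"sweater1", "sweater2"}

# Type <e,<e,t>>: a transitive verb takes its object first, then its subject.
EATS = {("tarzan", "meat")}  # (eater, eaten) pairs

def eats(obj):
    return lambda subj: (subj, obj) in EATS

def functional_application(function, argument):
    """Argument saturation: apply a function to its argument."""
    return function(argument)

def predicate_modification(f, g):
    """Modification: conjoin two <e,t> predicates into a new <e,t> predicate."""
    return lambda x: f(x) and g(x)

def extension(predicate):
    """Set of entities in DOMAIN of which the predicate is true."""
    return {x for x in DOMAIN if predicate(x)}

eats_meat = functional_application(eats, "meat")        # type <e,t>
black_sweater = predicate_modification(black, sweater)  # type <e,t>

print(extension(eats_meat))      # {'tarzan'}
print(extension(black_sweater))  # {'sweater1'}
```

Both operations return an object of the same type here, which illustrates why the two modes of composition are hard to tell apart from their outputs alone: the difference lies in the operation applied and in the types of the inputs.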
It is worth mentioning two more studies that extend the minimal phrase paradigm. Kim and Pylkkänen (2021) use hashtags in various positions in sentences to study subject-verb composition vs. verb-object composition (e.g., "kids toss objects" vs. "### toss objects" vs. "### ### objects"). However, hashtags can discourage participants from composing meaning for the rest of the sentence, as noted by the authors. Lau and Liao (2018) used coordinated adjective-noun phrases (e.g., "sunlit ponds and green umbrellas") vs. those noun phrases separated by hashtags ("sunlit ponds ### green umbrellas") vs. Jabberwocky versions to isolate brain correlates of building coordinated structures. They find sustained anterior negative ERPs from the first word in the second phrase for coordinated constructions.

. . . Problems with minimal phrases: Interim summary
The elegance and simplicity of the minimal phrase paradigm has provided fertile ground for testing core linguistic ideas with M/EEG. The main advantage of this paradigm is the control it affords over experimental stimuli, enabling the minimization of processes not strictly reflecting local combinatorics. However, minimality comes at a cost, for example a loss of naturalness or ecological validity of stimuli (Hasson et al., 2018). Full sentences may not be the most frequent type of utterance in spoken language corpora, but neither are NPs or VPs as used in these experiments; when those occur, they are elliptic phrases, interpretable in the context of other utterances. Most of these experiments used written stimuli: in written corpora disconnected noun or verb phrases may be even less common than in spoken corpora. However, one could argue that composition must take place for any given phrase, regardless of whether a naturalistic context is available. Another issue is that the baselines used in these experiments may differ from phrases in other respects than just composition. Our assessment of the minimal phrase paradigm is given in Table 1. This paradigm compares favorably to many others currently in use and is the one with the best balance between different limitations.

. Alternative and emerging approaches
. . Theory-inspired and language-specific manipulations
For the paradigms just discussed, linguistic theory only covers combinatorial conditions, and possibly Jabberwocky and semantically anomalous sentences, but offers no analysis of conditions with lists of words, pseudowords, nonwords, and scrambled sentences. To bridge levels of analysis with linking hypotheses that can be evaluated empirically, both combinatorial and baseline conditions should be covered by formal theories: ideally, our theories should state why and how composition applies to some cases but not to others.
To design experiments capable of addressing composition, theoretical distinctions must be identified in the linguistics literature and stimuli reflecting those distinctions must be constructed. Consider complement coercion (Pylkkänen, 2008). Semantically, aspectual verbs, such as "begin" and "finish", require event-denoting complements (e.g., "begin the fight"), but syntactically, they may be combined with entity-denoting complements (e.g., "begin the book"): the denotation of the NP must then be coerced from entity to event, or an equivalent (e.g., inferential) operation must recover an eventive interpretation of the NP. In coercion constructions, syntactic structure is simple, but composition load varies: it is greater for entity-denoting than for event-denoting NPs (Piñango and Deo, 2016). Baggio et al. (2010) and Kuperberg et al. (2010) compared control conditions (12a) with coercion constructions (12b) and semantic anomalies (12c). Similar conditions were also used by Pylkkänen and McElree (2007) and Husband et al. (2011):
(12) a. The journalist wrote the article
b. The journalist began the article
c. The journalist astonished the article
These studies did not use non-combinatorial baseline conditions that attempt to prevent composition, but vary processing load between two conditions that require composition, while keeping plausibility and semantic associations from the context before the critical noun ("article") as constant as possible. This strategy has also been applied to metonymic constructions (Schumacher, 2013) and aspectual coercion (Paczynski et al., 2014). Baggio et al. (2010) and Kuperberg et al. (2010) find N400-type ERP negativities. Using MEG, Pylkkänen and McElree (2007) find increased activation of vmPFC for coercing sentences. Schumacher (2013) reports late positivities for container-for-content metonymies (e.g., "The baby drank the bottle"). Paczynski et al. (2014) demonstrate that aspectual coercion (i.e., composition of punctual verbs and durative adverbs, e.g., "For several minutes, the cat pounced on the toy") is indexed by a late anterior negative ERP. In these studies, the conditions are closely matched, but precritical material is not kept constant. The focus on semantic differences between conditions, motivated by theory, is a valid way forward to investigate the online processing of these constructions and has the potential to refine linguistic theories. Still, the variable results emerging from these studies point to effects specific to the different linguistic phenomena investigated by each study as opposed to a neural correlate unique to composition.
Other studies are designed around syntactic or semantic properties of languages. Flick and Pylkkänen (2020) use properties of English in an attempt to vary syntax while keeping meaning constant. In English, attributive adjectives occur canonically before a noun, but they may also occur post-nominally in specific constructions. They compared declarative sentences with postnominal modifiers ("There are many trails wide enough for a bear") to questions with post-nominal predicative adjectives ("Are many trails wide?"). A novel aspect here, which is not found in minimal phrases, and to which we return later, is that the critical and pre-critical words form identical sequences across conditions ("... trails wide . . . "). The authors find an effect of structure in the left posterior temporal lobe (PTL) around 200 ms after the onset of the adjective, and an effect of semantic fit between the adjective and noun in the LATL. Parrish and Pylkkänen (2022) use semantic and syntactic properties of English to vary the point of composition. They compare expressions where an adverb and an adjective enter into local composition (e.g., "pleasantly sunny days") to expressions where two adjectives compose with a noun, but not locally with each other (e.g., "pleasant sunny days"). In this study, the precritical word was matched across conditions at the lemma or concept level, but not in its grammatical form. A further comparison involved structures such as "this herbal tea", where "tea" and "herbal" readily combine with each other, to conditions where they do not because of a gender mismatch: "these herbal tea . . . ". In this case, participants must wait until they see a noun that closes the phrase, like "these herbal tea drinkers". A non-combinatorial condition was created by placing the critical word at the start of the sentence, where it has no previous material to combine with: "Tea drinkers hate coffee". 
Composition in LATL can proceed in the absence of syntactic phrase closure, but syntax can also influence activity in this region, with the highest activity seen for phrases that were both syntactically and conceptually straightforwardly composable. Matchin et al. (2019b) exploit the fact that English participial adjectives and past-tense verbs have the same form to construct modification and predication pairs (e.g., "the frightened boy" vs. "frightened the boy"), plus a list baseline (e.g., "frightened, scrubbed, wounded"). They found no differences in BOLD responses in the left ATL and AG. The left posterior STS and LIFG showed greater activity for predication (VP) vs. modification (NP). Matar et al. (2021) use unique properties of the Arabic language to achieve minimally differing stimuli where only syntactic composition varies. In Arabic, an adjective follows the noun it modifies. If the adjective and noun both carry the definiteness marker (e.g., "al", in "al-kursi al-banafsaji", the purple chair) the result is an NP; if only the noun does (e.g., "al-kursi banafsaji", the chair is purple), a full sentence results. These two conditions were further compared to an indefinite NP (e.g., "kursi banafsaji", a purple chair). There were no MEG effects of syntactic structure in the left IFG, ATL, and AG. The left posterior temporal lobe (LPTL) was engaged more for indefinite NPs than for definite NPs, and least of all for sentences. The direction of this effect (NP > S) is opposite to that reported by Matchin et al. (2019b) (VP > NP) in the same region of the left posterior temporal cortex. Using a similar approach, Artoni et al. (2020) used Italian sentences with noun phrases or verb phrases containing homophonous two-word sequences, e.g., "la porta" in (13), which is either a Det-N phrase (13a) or a clitic followed by a verb (13b) (the fragment "domani la porta" is in fact structurally ambiguous: Adv-VP vs. Adv-NP):
(13) a. Pulisce la porta con l'acqua.
[He/she] washes the door with water.
b. Domani la porta a casa.
[He/she] tomorrow takes her/it at home.
Using direct cortical EEG recordings, they found increased gamma activity above 150 Hz for VPs compared to NPs in large portions of the left hemisphere, beyond the LIFG and posterior STG/STS. The studies presented in this section compare conditions where the degree or type of composition varies to identify correlates responsible for the difference. However, to isolate composition, true non-combinatorial conditions that do not have the limitations discussed so far would be needed.

. . Frequency tagging paradigms
Another approach to the study of structure building, and indirectly of meaning composition, is the frequency tagging (or neural tracking) paradigm. Using rhythmically presented stimuli, recent studies have shown that neural oscillations in particular frequency bands can align with chunks at different levels of syntactic structure, as shown by peaks in the power spectrum of particular frequency bands (Ding et al., 2016) or by increases in mutual information (MI) between auditory stimuli and neural oscillations (Kaufeld et al., 2020). Ding et al. (2016) and Sheng et al. (2019) compared scrambled syllable sequences with 4-syllable sentences and 4-syllable NPs and VPs, matched in length but differing in the point at which structural dependencies are formed. They found rhythmic brain activity tracking each level of structure: syllable, phrase, sentence. There were no prosodic cues or breaks between sentences in a sequence: those effects can therefore be attributed to synchrony of neural activity to internally generated structures (see Kazanina and Tavano, 2023, for discussion). While Sheng et al. (2019) use MEG, Ding et al. (2016) also present ECoG data. They found activity modulated at the phrase frequency in bilateral pSTG, and in the left IFG and pSTG at the sentence frequency. Coopmans et al. (2022) compared normal sentences (14a) to idiomatic sentences (14b), anomalous prose (14c), Jabberwocky (14d), and scrambled sentences (14e):
(14) a. De jongen gaat zijn zusje met haar huiswerk helpen.
The boy will help his sister with her homework.
b. De directie zal een vinger aan de pols houden.
The directorate will keep a finger on the wrist.
c. Een prestatie zal het concept naar de mouwen leiden.
An achievement will lead the concept to the sleeves.
d. De jormen gaat zijn lumse met haar luisberk malpen.
The jormen will malp his lumse with her luisberk.
e. De gaat jongen zusje huiswerk zijn haar helpen met
The will boy sister homework his her help with
This study shows how a combination of different baseline conditions and advanced data analysis techniques allows us to track neural dynamics across conditions. At the phrase frequency, there were no differences in MI between sentences and anomalous prose, or between sentences and idioms, but there was increased neural tracking in sentences compared to lists and Jabberwocky, as in Kaufeld et al. (2020). ERPs show differences between all of these conditions, but neural tracking reveals similarities across conditions containing structure and content words, pointing to a common mechanism for composition. Burroughs et al.
(2021) adapt the paradigm used by Ding et al. in an experiment aimed at disentangling the effects of word category repetition from those of structure building. They found that the neural signal tracks syntactic structure, with increased tracking in the delta band for lists of phrases ("cold food loud room tall girl") vs. lists of words with repetitions of syntactic categories without structure ("rough give ill tell thin chew"). This effect is, however, modulated by syntactic category, with reduced tracking when the list of phrases did not contain repetition of syntactic categories ("that word send less too loud"). These results suggest that previous studies using the frequency tagging paradigm may have also included spurious effects of syntactic category repetition. Glushko et al. (2022) use EEG to disentangle the effects of syntax from those of prosody. They used four-word sentences of the form NP-VP, with the NP consisting of 1 word (1+3 Syntax) or 2 words (2+2 Syntax), presented without prosody. These were then compared to trials containing the same syntactic structures but with a prosodic contour compatible with the 2+2 Syntax condition. Their results show an interaction between prosody and syntactic structure, suggesting that the generation of implicit prosody affects syntactic composition and that previously reported effects using the neural tracking paradigm can be partially explained by prosody effects. Kalenkovich et al. (2022) used Russian sentences with the same number of words and the same lexical content, differing only in a single suffix, which yields a different syntactic structure. They created sentences with words grouped into two phrases (Genitive 2-2) and sentences containing a noun in the dative case, with the same words grouped into a 1-word NP and a 3-word VP (Dative 1-3).
Interestingly, the spectral peaks between conditions at the 2-word frequency did not differ, suggesting that factors like repetition of lexical category might explain previous effects.
The frequency tagging paradigm has become popular since its introduction by Ding and colleagues. The conclusions originally drawn from those experiments have been recently challenged on empirical and theoretical grounds (Kazanina and Tavano, 2023), suggesting that the rhythmicity of stimulus presentation may introduce processes that stand in the way of observing neural correlates of structure building.
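The logic behind spectral peaks in frequency tagging can be illustrated with a toy simulation. The sketch below is not the analysis pipeline of any study cited here: the sampling rate, presentation rates, amplitudes, noise level, and the helper names `simulate`, `power_spectrum`, and `peak_ratio` are all assumptions chosen for illustration. A "sentence" signal contains rhythms at the syllable, phrase, and sentence rates, while a "scrambled" signal contains only the syllable rhythm; power at each tagged frequency is then compared with the mean power of neighbouring frequency bins.

```python
import cmath
import math
import random

FS = 16.0        # sampling rate in Hz (illustrative assumption)
DURATION = 8.0   # seconds of simulated signal (illustrative assumption)
RATES = {"syllable": 4.0, "phrase": 2.0, "sentence": 1.0}  # assumed rates in Hz


def simulate(amplitudes, seed=0, noise_sd=0.2):
    """Toy 'neural' signal: cosines at the tagged rates plus Gaussian noise."""
    rng = random.Random(seed)
    n = int(FS * DURATION)
    return [
        sum(a * math.cos(2 * math.pi * RATES[level] * i / FS)
            for level, a in amplitudes.items())
        + rng.gauss(0.0, noise_sd)
        for i in range(n)
    ]


def power_spectrum(signal):
    """Naive DFT power for bins from 0 Hz up to (not including) Nyquist."""
    n = len(signal)
    power = []
    for k in range(n // 2):
        coef = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                   for i, x in enumerate(signal))
        power.append(abs(coef) ** 2 / n)
    return power


def peak_ratio(power, freq, flank=2):
    """Power at `freq` divided by the mean power of its neighbouring bins."""
    k = round(freq * DURATION)  # bin index: frequency resolution is 1 / DURATION
    neighbours = [power[j] for j in range(k - flank, k + flank + 1) if j != k]
    return power[k] / (sum(neighbours) / len(neighbours))


# Structured input: rhythms at all three levels. Scrambled input: syllables only.
sentences = power_spectrum(simulate({"syllable": 1.0, "phrase": 0.5, "sentence": 0.5}))
scrambled = power_spectrum(simulate({"syllable": 1.0}, seed=1))

for level, freq in RATES.items():
    print(level, round(peak_ratio(sentences, freq), 1),
          round(peak_ratio(scrambled, freq), 1))
```

In this toy setting, peaks at the phrase and sentence rates appear only when the signal contains rhythms at those rates. Whether such peaks in real data reflect structure building, category repetition, or prosody is exactly the question raised by the studies reviewed above.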

. . The cut-compose paradigm
The studies reviewed in Sections 1-2 investigate composition by comparing well-formed language to baselines that are assumed not to engage composition. It is unclear to what extent pseudowords and word lists prevent composition: composition-related signal can be lost if both conditions under comparison engage composition. A second challenge is that those baselines can differ from compositional expressions on several levels besides composition, leaving mixed signals after subtraction or comparison. A third difficulty is that pseudoword sentences, word lists, and phrases are not as natural and informative as full sentences and can require additional pragmatic support, when they do not violate pragmatic constraints altogether.
We describe a novel paradigm for studying composition which tries to take into account the three limitations of previous paradigms: lack of minimality, lack of naturalness, and unsuccessful prevention of composition. The goal here is to learn from the successes and failures of previous studies and to explore possible new avenues in experimental design.
The Cut-Compose paradigm makes use of natural, well-formed, and complete sentences, varying the presence or absence of composition at specific points in the input string. The idea is to force or prevent composition in well-formed, meaningful sentences or pairs of sentences by exploiting syntactic boundaries: the same sequence of two words can occur as part of the same constituent, in (15a), the Compose condition, or as separated by a syntactic boundary, in (15b), the Cut condition, in this case also marked by punctuation. The first EEG study using this design, by Olstad et al. (2020), removed punctuation marks in order to match the precritical (e.g., "grey") and critical ("elephants") words. Additional safeguards had to be implemented to prevent accidental composition in the Cut condition. First, syntactically, the adjective "grey" has a predicative role, so it cannot modify "elephants". Second, semantically, "Some birds are completely grey elephants" would be anomalous. Third, the critical word initiates a new sentence, rather than a new phrase in the same sentence; this should block composition of larger constituents (e.g., phrases or clauses) higher up in the syntactic structure. One challenge is to match the precritical context in length, grammatical complexity (e.g., in syntactic nodes or arcs) and semantic associations: this is crucial for experiments using hemodynamic methods, while M/EEG studies should also attempt to control the factors that affect composition locally, around the boundary. The difference between Compose and Cut is meant to reveal that which differs between the two conditions, namely the composition of the adjective "grey" with the noun "elephants" in (15a) but not (15b). Similar to other paradigms, Cut-Compose also affords the possibility of investigating the compositional mechanisms involved in different semantic and syntactic contexts. Olstad et al.
(2020) compared modification, as in (15), with predication constructions, as in (16), to assess whether these two different "modes of composition" (Predicate Modification vs. Functional Application, Adjoin vs. Merge) correspond to different neural events. As the study was conducted in Norwegian, the Cut sentence was created by fronting the object:
(16) a. bråk er slitsomt men noen [hører musikk] blant alle lydene
noise is tiring but some [hear music] among all the sounds
b. bråk er innimellom noe man hører musikk er flott
noise is sometimes something one hears music is nice
In (16a), the proposition is incomplete without "musikk", as the verb "hører" requires two arguments to be saturated. This contrasts with Cut (16b), where the verb's argument slots are all already filled at "hører", leaving no room for "musikk" to compose with the verb. Different modes of composition can be directly compared in the same experiment, as the noun at which the M/EEG signal is measured can be held constant across environments. The sentences in (17)
Olstad et al. (2020) found different ERP signals for the different modes of composition, providing support for the theoretical distinction between predication and modification, as well as preliminary evidence for the viability of the Cut-Compose paradigm.
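The comparison at the heart of Cut-Compose can be sketched as a difference wave: single-trial epochs time-locked to the critical noun are averaged within each condition, and the Cut average is subtracted from the Compose average. The snippet below is a minimal illustration under stated assumptions, not the analysis used by Olstad et al. (2020): the function names `erp` and `difference_wave` are hypothetical, and the voltage values are made-up placeholders, not data.

```python
from statistics import mean


def erp(epochs):
    """Average equal-length single-trial epochs sample by sample."""
    return [mean(samples) for samples in zip(*epochs)]


def difference_wave(compose_epochs, cut_epochs):
    """Compose-minus-Cut difference of the condition averages."""
    return [c - k for c, k in zip(erp(compose_epochs), erp(cut_epochs))]


# Placeholder epochs (arbitrary units), time-locked to the critical noun.
# Each inner list is one trial; each position is one time sample.
compose = [[0.0, 1.0, 3.0, 1.0],
           [0.0, 1.0, 5.0, 1.0]]
cut = [[0.0, 1.0, 1.0, 1.0],
       [0.0, 1.0, 3.0, 1.0]]

print(difference_wave(compose, cut))
```

Because the critical noun is the same word in both conditions, whatever survives the subtraction is, on the paradigm's assumptions, attributable to composition with the preceding word rather than to lexical properties of the noun itself.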
Does composition not happen at all in the Cut condition? In both conditions, the critical noun is eventually composed into a higher-order representation: it is combined with the previous words in Compose, while it is yet to be combined with subsequent material in Cut. However, in the Cut condition, composition does not occur between the noun and its preceding context, and this is the key difference from Compose. In contrast to artificial stimuli such as nonword or pseudoword strings, in Cut/Compose participants should be equally engaged in reading both types of sentences, implying a more equal distribution of cognitive resources (attention, memory, etc.) across conditions. Additionally, both Cut and Compose are covered by theory: all formal linguistic theories on the market predict that composition is triggered in one case but not the other, at the point of measurement.
Like other paradigms, Cut/Compose has limitations related to the baseline condition. One potential issue is the use of punctuation, which is necessary in order to make the stimuli as natural and as unambiguous as possible. Adding a period after the precritical word in Cut sentences creates a perceptual difference between the two conditions. An additional perceptual difference is capitalization of the first letter of the critical word in Cut. Olstad et al. (2020) avoided the use of punctuation and capitalization, relying on the structural properties of sentences to ensure that the noun is interpreted as starting a new sentence in the Cut condition. Follow-up experiments are needed to investigate the effects of both punctuation and capitalization in the visual modality, whether they affect the detection and quantification of composition signals, and the corresponding impact of appropriate prosody or intonation around the Cut boundary in the auditory modality.
Another possible issue is that critical nouns in the Cut condition introduce a new phrase and sentence, and may therefore engage different processes than nouns in the Compose condition which close a phrase or sentence. This issue may be partly addressed in future experiments where the syntactic cut is not a sentential boundary but a phrasal one. Note that inferences drawn regarding different modes of composition should still be valid, as opening a new sentence in the Cut condition should involve the same processes for both predication and modification contrasts. A different issue is that of discourse processing. The second sentence in the Cut condition is not disconnected from the first one. At the critical noun, the participant might try to integrate it into the discourse model instead of waiting to read the rest of the second sentence. However, integration with the preceding context also happens in Compose sentences, though the discourse representation in that case is not organized into multiple sentential or propositional units.
Similar to constituent chunking studies, like Pallier et al. (2011), Cut/Compose relies on manipulating the number of syntactic units between conditions, while it tries to control more precisely the immediate context of the critical word as well as aspects of the wider semantic context. Cut-Compose can be used with a variety of constructions, differing in semantic or syntactic properties, complexity and length. Many questions that have been of interest for other paradigms can also be tested with Cut/Compose: coercion, different classes of adjectives, adverbs, nouns and verbs, as well as the composition of functional and lexical elements. In the long run, we will be able to inch closer to the mechanisms by which the brain builds structure and meaning only by integrating results from different paradigms, different measures and data analysis methods. Cut/Compose aims to make a contribution to this longer-term project, and might also prompt the development of new and improved paradigms beyond the currently available ones.
The paradigms included are those reviewed in Section 2. We report the results for the comparisons between well-formed meaningful sentences or phrases and the relevant baselines (specified in columns 1 or 2).
We have reviewed studies using different paradigms that tried to isolate composition in brain signals. The limitations of the paradigms discussed here are not entirely unknown and have been occasionally pointed out before (e.g., see Humphries et al., 2006; Matchin et al., 2017, 2019a). In this section, we reflect on what has been achieved so far in mapping semantic composition in brain space and time (for an earlier assessment, see Baggio, 2018). Table 1 summarizes our evaluation of the paradigms discussed above, and Table 2 is an overview of the main results of different studies. Our recommendation for the field includes developing new paradigms that overcome the limitations of current ones. A parallel strategy is to integrate results across studies and paradigms, in the hope that paradigms with complementary strengths and limitations would support each other and allow more reliable inferences from data. We briefly pursue this avenue here.
Despite their limitations, the word list and scrambled sentence paradigms allow lexical variables between stimuli to be matched. Although comparing sentences with scrambled versions may result in loss of signal (see above), scrambled sentences should still involve "less composition". Results from studies using this paradigm could help narrow down the search space of correlates of composition: regions engaged across studies using different baselines are candidate correlates of composition; regions that differ across studies may be related to processing of the particular stimuli used. The left posterior STS/STG, ATL, and AG consistently show up in normal vs. scrambled sentence contrasts. The left IFG is active in studies with difficult or engaging tasks, in studies using lists without function words, or words not in the original sentences. Further research is needed to understand how different baselines affect comparisons with normal sentences.
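The narrowing-down logic described above amounts to simple set operations. In the sketch below, the study and region labels are hypothetical placeholders and do not correspond to any result in Table 2: regions reported by every study are candidate correlates of composition, and the remainder are candidates for stimulus-specific or task-specific processing.

```python
# Hypothetical placeholders: NOT results from any study discussed here.
regions_by_study = {
    "study_1": {"region_A", "region_B", "region_C"},
    "study_2": {"region_A", "region_B", "region_D"},
    "study_3": {"region_A", "region_B"},
}

# Regions engaged in every study: candidate correlates of composition.
shared = set.intersection(*regions_by_study.values())

# Regions engaged only in some studies: possibly tied to particular stimuli.
study_specific = {
    study: regions - shared for study, regions in regions_by_study.items()
}

print(sorted(shared))
print({study: sorted(regions) for study, regions in study_specific.items()})
```

A strict intersection is the most conservative choice. A looser criterion, such as regions reported in a majority of studies, would be equally easy to express and may suit a literature where null results are common.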
Jabberwocky sentences are a clever way of disentangling syntax from semantics, though formal aspects of meaning remain in stimuli with real function words and affixes. In this sense, like lists of words, Jabberwocky may involve semantic composition, but to a lower degree. Studies using this design often either reveal regions that overlap with those from studies using word lists, or show no effects in comparisons to sentences. Negative findings may suggest that lists are a better baseline than Jabberwocky, while overlapping results may indicate either that they are both equally effective or that both have issues with the same impact on brain signals. Minimal phrase designs using real word lists or pseudowords in baseline conditions have arguably made the most progress in narrowing down the space of correlates of composition. Zaccarella et al. (2017) and Matchin et al. (2017) implicate left IFG and pSTS in composition, while Murphy et al. (2022) localize effects of phrasal composition in pSTS around 200-300 ms from word onset. Inconsistencies remain across studies using minimal phrases as to the regions involved (left IFG, AG, vmPFC), with one frequently reported region being the left ATL. Yet, the LATL is mostly sensitive to conceptual composition.
Integrating results from the studies in Table 2, we thus find a network in the left perisylvian cortex, with possibly the most functionally critical node in the posterior superior temporal gyrus and sulcus. Section 3 considers alternative strategies, including testing theoretical distinctions (Section 3.1), using advanced analysis methods (Section 3.2), and developing new paradigms (Section 3.3). We believe that initiating a discussion on the need to refine our paradigms is a crucial step forward, but a combination of approaches, as suggested in Section 3, as well as comparing results across methods (Table 2), is already leading to testable new hypotheses about the likely cortical seats and time course of syntax-driven meaning composition.
Our assessment of the different paradigms in Table 1 suggests that they are not all equal in their strengths and limitations. But the important lesson here is that while paradigms can be assessed on design grounds alone, they must also be evaluated empirically based on the plausibility and consistency of the results they generate: it is impossible to know a priori exactly how the brain reacts to the different conditions, and thus how severe the issues identified on design grounds may actually be. Comparing results across different paradigms can not only help us restrict the search for correlates of composition to fewer candidates: it can also provide indirect evidence of the actual impact of the limitations of particular paradigms. That said, this complex evaluative exercise remains fraught with difficulties, and is ultimately based on researcher choices, expertise, and judgement. For this reason, the way forward for the field should also involve the development of new paradigms and cannot be based entirely on comparison and integration of results across existing ones.

. Conclusion
This review has examined experimental paradigms and designs used to search for neural correlates of syntax-driven meaning composition. Our aim was to dissect each paradigm, presenting the ways in which it has been implemented in specific studies, bringing forth its goals and assumptions, and uncovering its strengths and weaknesses. One conclusion concerns the lack of baseline or control conditions that can fully prevent composition at specific points in time. Without such conditions, interpreting comparisons with phrases or sentences remains difficult: any claim that a given signal is a correlate of composition is undermined if the conditions compared do not differ only in whether composition is engaged. This may partly explain why M/EEG or fMRI studies have not revealed correlates of composition invariant across studies or paradigms (Table 2). But as noted, the challenge ultimately involves more than just experimental design: finding the neural mechanisms of composition will also require progress in integrative theory (Baggio, 2020, 2021), recording resolution, and data acquisition and analysis.
Here, we have focused on a neglected, yet essential ingredient of research methodology: the internal validity of experimental paradigms and designs. Our critique is not meant to devalue the ingenuity of experimental designs used by researchers throughout the years: we have contributed to this research ourselves, and we have used several of the classical paradigms in our work. Some of the issues raised here were also noted by others, but we believe it is useful to assess different paradigms comparatively and systematically, using the same standards. In addition to examining the limitations of baseline conditions, we should reconsider the theoretical assumptions about composition that we build into our experimental designs. The brain may not always automatically compute meaning taking syntactic structure into account: if syntax is not always deployed during comprehension, or if lexical processing in well-formed and meaningful sentences always engages a set of independent operations in addition to syntax-driven composition, then any comparison of conditions, even assuming adequate baselines, will reveal either less or more in terms of neural signals than syntax-driven composition (Baggio, 2018, 2022). Consider the "standard view" of meaning composition from generative syntax and formal semantics. As a computational implementation of composition, that view may not quite provide what psycholinguists and neurolinguists need to derive specific predictions and explain existing experimental results. One reason is that there is still no real consensus on the atoms and structures of syntax in the first place, their relation to lexical encoding, and the semantic primitives of combination they correspond to.
The logical calculus of formal semantics works equally well for very different choices of syntactic and semantic ontology: the existence of a syntax-semantics interface that respects function-argument composition does not, in and of itself, provide a unique answer to what those syntactic and semantic primitives are. Moreover, there are indications that the basic combinatoric building blocks assumed in formal semantic theory do not map in any systematic way to basic differences and measures at the neurolinguistic level (Pylkkänen and McElree, 2006).
Putting aside open questions of what the minimal parts and modes of combination are, one could disagree with the particulars of this narrow formulation, and specifically with the centrality of Compositionality (Baggio et al., 2012b; Baggio, 2018, 2021). But the key insight here is that human languages have algorithms for building meanings predictably from their parts. Predictability and generativity of meaning should be taken seriously as computational constraints modulating language processing and its outputs, even though not all complex meanings may be equally subject to Compositionality. Developing better experimental paradigms should go hand in hand with theoretical and modeling efforts aimed at charting the different ways in which brains actually build meaning.