What Has Replication Ever Done for Us? Insights from Neuroimaging of Speech Perception

Replication of a previous scientific finding is necessary to verify its truth. Despite the importance of replication, incentive systems in science favor novel findings over reliable ones. Consequently, little effort is devoted to reproducing previous results compared to making new discoveries. This is particularly true of brain imaging, in which the complexity of study design and analysis, and the high costs and time-intensive data collection, act as additional disincentives. Unfortunately, functional imaging studies often have small sample sizes (e.g., n < 20), resulting in low statistical power and inflated effect sizes, making them less likely to be successfully reproduced (Carp, 2012; Button et al., 2013; Szucs and Ioannidis, 2016; Poldrack et al., 2017). This, in addition to discovered errors in analysis software (Eklund et al., 2016; Eickhoff et al., 2017) and wider concerns about the reliability of psychological research (Simmons et al., 2011; Open Science Collaboration, 2015), has led to a crisis of confidence in neuroscientific findings. Recent work has begun to address issues around the reproducibility of brain imaging (see Barch and Yarkoni (2013) for an introduction to a special issue). Indeed, there have been some notable successes, for example, in identifying features of study design and analysis that influence reproducibility (Bennett and Miller, 2013; Turner and Miller, 2013), as well as in the development of tools to facilitate data sharing (Poldrack et al., 2013; Gorgolewski et al., 2016b), to evaluate data reliability (Shou et al., 2013) and to aid the reporting and reliability of data processing and analysis (Poldrack et al., 2008; Carp, 2013; Pernet and Poline, 2015; Gorgolewski et al., 2016a). However, despite these advances, relatively few functional imaging replication studies have been conducted to date. Recently in the speech perception domain, there have been some notable replication attempts; here I discuss what has been learnt from them about speech perception and the replication endeavor more generally.
Defining replication is difficult as replications can take different forms. A broad distinction exists between direct replication, in which an identical procedure is repeated with the aim of recreating the previous experiment in its entirety, and conceptual replication, in which a previous result or hypothesis is tested with different methods (Schmidt, 2009). There have been a number of recent conceptual replication attempts in the field of speech perception research. As might be expected, the outcome of these studies has been mixed. For example, Arsenault and Buchsbaum (2016) failed to replicate evidence for somatotopic mapping of place of articulation distinctions in response to hearing spoken syllables, a finding originally demonstrated by Pulvermüller et al. (2006). This failure to replicate proved controversial, with the original authors suggesting that differences in methodology explained the discrepancy (Schomers and Pulvermüller, 2016). Whilst failures to replicate have become newsworthy, successful replications are sometimes perceived as less noteworthy, despite the fact that they often provide new knowledge, as well as confirming what was already known. Here, I describe in detail the outcome of successful replications of a paradigm investigating the neural basis of spoken sentence comprehension (Scott et al., 2000). This paradigm has been replicated several times, twice by researchers associated with the original study (Narain et al., 2003; Evans et al., 2014) and once by an independent group (Okada et al., 2010) (see Table 1 for a summary of the studies). Using these studies as an example, I demonstrate how advances in methodology in combination with replication have advanced our understanding of the neural systems supporting speech perception.
The original Scott et al. study is influential. To date it has received 921 Google Scholar citations (Scholar.google.com., 2017) and has played an important role in shaping models of speech processing (Scott and Johnsrude, 2003; Scott and Wise, 2004; Rauschecker and Scott, 2009). Prior to this study, researchers typically compared neural activity elicited by speech to activity evoked by simple sounds like tones or noise bursts. These sounds failed to capture the complexity of the speech signal. The study was the first to use a more appropriate baseline: spectrally rotated speech. Spectral rotation involves flipping the frequencies of speech around an axis such that high frequencies become low, and vice versa. This renders speech unintelligible but maintains spectral and temporal structure. The original Positron Emission Tomography (PET) study employed an elegant factorial design in which participants listened to clear and noise-vocoded speech (an intelligible speech stimulus with reduced spectral detail), and their unintelligible rotated equivalents. This design isolated neural responses associated with speech comprehension, by contrasting the response to clear and noise-vocoded speech with the average of their unintelligible rotated equivalents, and responses associated with spectral detail, by comparing the average of clear and rotated speech to their noise-vocoded equivalents. Activity was found in the left anterior superior temporal sulcus (STS) for speech comprehension and in the right superior temporal gyrus (STG) for spectral detail. Further, regions of the left posterior superior temporal cortex showed elevated activity to intelligible clear and noise-vocoded speech, and unintelligible rotated speech, in the context of reduced activity to rotated noise-vocoded speech. Given that clear, noise-vocoded and rotated speech contain acoustic-phonetic information, while rotated noise-vocoded speech does not, this provided evidence for a hierarchical processing pathway that transformed acoustic-phonetic information to meaningful speech along a posterior-anterior axis. This fit well with work in non-human primates suggesting multiple streams of processing in the brain, including a hierarchically organized, anteriorly directed sound-to-meaning pathway (Rauschecker, 1998; Kaas and Hackett, 1999; Rauschecker and Tian, 2000; Tian et al., 2001).
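The idea behind spectral rotation can be illustrated with a minimal Python sketch. It is not the procedure used to create the stimuli in these studies: it relies on the simplifying assumption that the rotation axis sits at one quarter of the sampling rate, in which case multiplying every second sample by -1 moves a component at frequency f to fs/2 - f. The sampling rate, the test tone and the helper names (`rotate_spectrum`, `dominant_frequency`) are hypothetical choices made for illustration only.

```python
import math


def rotate_spectrum(samples):
    """Flip the spectrum around fs/4 by multiplying sample n by (-1)**n.

    A component at frequency f moves to fs/2 - f, so low frequencies become
    high and vice versa, while the magnitude of every sample is unchanged.
    """
    return [x if n % 2 == 0 else -x for n, x in enumerate(samples)]


def dominant_frequency(samples, fs):
    """Return the frequency (Hz) of the largest DFT bin between 0 and fs/2."""
    n = len(samples)
    best_bin, best_mag = 0, -1.0
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(samples))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(samples))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best_bin, best_mag = k, mag
    return best_bin * fs / n


fs = 8000      # hypothetical sampling rate in Hz; the rotation axis is fs / 4 = 2000 Hz
tone_hz = 500  # hypothetical low-frequency test tone
tone = [math.sin(2 * math.pi * tone_hz * i / fs) for i in range(160)]
rotated = rotate_spectrum(tone)

print(dominant_frequency(tone, fs))     # frequency of the original tone
print(dominant_frequency(rotated, fs))  # frequency after rotation: fs/2 - tone_hz
```

Because only the sign of alternate samples changes, the sample-by-sample magnitude of the signal is preserved, which gives an intuition for why rotated speech keeps much of its temporal structure while its spectral content is inverted. A practical stimulus pipeline would also band-limit the signal before rotating it, a step this sketch omits.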
A later functional Magnetic Resonance Imaging (fMRI) replication found elevated activity in left anterior STS to intelligible speech, as well as in the posterior part of the sulcus (Narain et al., 2003). The authors applied the global null conjunction (Price and Friston, 1997), which identified conjoint effects for the two simple intelligibility contrasts, [clear speech - rotated speech] and [noise-vocoded speech - rotated noise-vocoded speech], by testing for regions in which there was an averaged effect of intelligibility, in the absence of differences between these effects. This suggested a common mechanism for processing different forms of intelligible speech. However, the fixed effects analyses used in this and the previous study did not allow inferences to be extended to the wider population.
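The logic of this conjunction can be set alongside a stricter alternative, the minimum-statistic test of the conjunction null (Nichols et al., 2005), in which every simple effect must be significant in its own right. The Python sketch below is purely illustrative: the t-values and the threshold are hypothetical, the function names are invented for this example, and applying the same threshold to the difference test is a simplifying assumption rather than a description of either published analysis.

```python
def averaged_effect_without_difference(t_average, t_difference, threshold):
    """Pass if the averaged intelligibility effect is significant and the two
    simple effects do not differ significantly from one another."""
    return t_average > threshold and abs(t_difference) < threshold


def conjunction_null(t_simple_a, t_simple_b, threshold):
    """Pass only if both simple effects individually exceed the threshold,
    i.e. the minimum statistic is significant."""
    return min(t_simple_a, t_simple_b) > threshold


# Hypothetical t-values for a single voxel (illustrative numbers only).
voxel = {
    "clear_vs_rotated": 4.0,        # simple effect 1
    "vocoded_vs_rot_vocoded": 2.0,  # simple effect 2
    "average_effect": 4.2,          # averaged effect of intelligibility
    "difference": 1.4,              # difference between the simple effects
}
threshold = 3.1  # hypothetical critical t-value

passes_averaged = averaged_effect_without_difference(
    voxel["average_effect"], voxel["difference"], threshold
)
passes_conjunction_null = conjunction_null(
    voxel["clear_vs_rotated"], voxel["vocoded_vs_rot_vocoded"], threshold
)

print(passes_averaged, passes_conjunction_null)
```

In this made-up example the voxel passes the averaged test but fails the minimum-statistic test, because one simple effect is weak on its own. This illustrates why the averaged approach is regarded as the more liberal of the two.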
Another fMRI replication by Okada et al. (2010) conducted random effects analyses, extending inferences beyond the tested participants. They found activity predominantly within lateral temporal cortex for the averaged response to intelligible speech, with bilateral activity found in the anterior and posterior superior temporal cortex. The authors also conducted multivariate pattern analyses (MVPA) (O'Toole et al., 2007; Mur et al., 2009; Pereira et al., 2009). This approach considers the pattern of activity over multiple voxels, allowing weakly discriminative information to be pooled over multiple data points, affording, in some instances, greater sensitivity (Haynes and Rees, 2006). Neural patterns were first normalized to remove the mean signal for each trial, ensuring that the MVPA did not recapitulate the results of the univariate analysis. Using this approach, Okada et al. showed that intelligible speech could be discriminated from unintelligible sounds within regions of interest (ROIs) in early auditory cortex. This was unexpected within the context of hierarchical accounts of speech perception, in which early auditory regions engage in acoustic, rather than higher order language functions, and given that rotated speech was thought to be a close acoustic match to speech. A more expected finding was that bilateral anterior and posterior temporal ROIs successfully discriminated between intelligible and unintelligible speech. In an effort to identify regions that were sensitive to intelligibility in the absence of sensitivity to acoustics, Okada et al. expressed accuracies for intelligibility classifications relative to those for spectral detail, to create an "acoustic invariance" metric. This showed that the left posterior and right mid temporal cortex differed from primary auditory cortex on this metric, suggesting a more intelligibility-selective response in these regions.
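The normalization step described above can be illustrated with a minimal Python sketch. It is not the pipeline used by Okada et al.: the four-voxel patterns are invented, and a leave-one-out nearest-centroid classifier stands in for whatever classifier a real analysis would use. All function names are hypothetical. The point is only that, once each trial's mean across voxels has been subtracted, the classifier can no longer rely on an overall difference in signal level and must use the spatial pattern.

```python
def remove_trial_mean(pattern):
    """Subtract a trial's mean across voxels so only the spatial pattern remains."""
    mean = sum(pattern) / len(pattern)
    return [value - mean for value in pattern]


def centroid(patterns):
    """Average a list of patterns voxel by voxel."""
    return [sum(column) / len(patterns) for column in zip(*patterns)]


def squared_distance(a, b):
    """Squared Euclidean distance between two patterns."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def leave_one_out_accuracy(patterns, labels):
    """Classify each held-out trial by the nearest centroid of the remaining trials."""
    correct = 0
    for i, held_out in enumerate(patterns):
        train = [(p, l) for j, (p, l) in enumerate(zip(patterns, labels)) if j != i]
        centroids = {
            label: centroid([p for p, l in train if l == label])
            for label in sorted(set(labels))
        }
        predicted = min(
            centroids, key=lambda label: squared_distance(held_out, centroids[label])
        )
        correct += predicted == labels[i]
    return correct / len(patterns)


# Hypothetical four-voxel patterns: the conditions differ in overall level
# (about 3.0 versus 2.0 on average) and in their spatial pattern.
intelligible = [[4.0, 2.0, 4.0, 2.0], [4.2, 2.1, 3.9, 1.8], [3.8, 1.9, 4.1, 2.2]]
unintelligible = [[1.0, 3.0, 1.0, 3.0], [1.1, 3.2, 0.8, 2.9], [0.9, 2.8, 1.2, 3.1]]

patterns = [remove_trial_mean(p) for p in intelligible + unintelligible]
labels = ["intelligible"] * 3 + ["unintelligible"] * 3

accuracy = leave_one_out_accuracy(patterns, labels)
print(accuracy)
```

After mean removal every trial sums to (approximately) zero across voxels, so the mean difference between conditions that a univariate analysis would detect is gone, yet the two conditions remain separable on the basis of their spatial patterns.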
Notably, however, the authors did not directly compare the strength of univariate responses between temporal lobe regions, nor did they examine multivariate responses beyond the superior temporal cortex.
The final replication by Evans et al. (2014) also combined univariate and multivariate analyses. The univariate main effect of intelligibility was associated with bilateral activity within lateral temporal cortex, spreading along the STS from posterior to anterior in the left and from mid to anterior in the right hemisphere. Only the left anterior STS was significantly activated by both simple effects, this time testing for the conjunction null (Nichols et al., 2005) rather than the more liberal global null conjunction. Follow-up tests indicated that the left anterior STS showed the strongest univariate intelligibility response. MVPA was conducted using a searchlight technique (Kriegeskorte et al., 2006), in which classification was conducted iteratively on small patches across the entire brain. The authors elected not to use an acoustic invariance metric, as Okada and colleagues had done, because they noted that noise-vocoded speech differs from clear speech in both intelligibility and spectral detail, making the measure difficult to interpret. Using this approach, successful classifications of intelligible speech were found in a much wider fronto-temporo-parietal network. Interestingly, when classification accuracies were compared within the same ROIs in which univariate activity had been compared, posterior rather than anterior STS regions showed the highest classification accuracies. This highlighted the possibility that there may be multiple ways in which intelligibility could be encoded and that this may differ in anterior versus posterior regions. Evans et al. (2014) also conducted a fully factorial univariate analysis, interrogating for the first time the interaction between intelligibility and spectral detail. This revealed that the right planum temporale responded more to rotated speech than to all other sounds. This was unexpected, given the assumption that the baseline would activate early auditory regions equivalently to speech. This result, alongside Okada et al.'s finding of sensitivity to intelligibility in and around Heschl's gyrus, emphasized the difficulty of finding an appropriate non-speech baseline.
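The 2 x 2 factorial logic of the paradigm can be written out as contrast weights over the four conditions. The Python sketch below is an illustration rather than the analysis code of any of the studies: the condition ordering, the response values for the two example regions and the function name `contrast_value` are all hypothetical. The interaction weights are simply the element-wise product of the two main-effect weights.

```python
CONDITIONS = ["clear", "noise_vocoded", "rotated", "rotated_noise_vocoded"]

# Main effect of intelligibility: intelligible (clear, vocoded) versus rotated.
intelligibility = [1, 1, -1, -1]
# Main effect of spectral detail: full detail (clear, rotated) versus vocoded.
spectral_detail = [1, -1, 1, -1]
# Interaction: element-wise product of the two main-effect weights.
interaction = [a * b for a, b in zip(intelligibility, spectral_detail)]


def contrast_value(weights, condition_means):
    """Weighted sum of condition means for one region."""
    return sum(w * condition_means[name] for w, name in zip(weights, CONDITIONS))


# Hypothetical mean responses (arbitrary units) for two example regions.
additive_region = {
    "clear": 3.0, "noise_vocoded": 2.0, "rotated": 2.0, "rotated_noise_vocoded": 1.0,
}
rotation_selective_region = {
    "clear": 1.0, "noise_vocoded": 1.0, "rotated": 2.0, "rotated_noise_vocoded": 1.0,
}

print(interaction)
print(contrast_value(interaction, additive_region))
print(contrast_value(interaction, rotation_selective_region))
```

In the first made-up region the two factors combine additively, so the interaction contrast is zero. In the second, a response that is elevated only for rotated speech produces a non-zero interaction, the kind of profile that the main effects alone would not single out.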
So what have we gained from these studies? These investigations are successful replications; elevated univariate activity in response to intelligible speech was found in the left anterior STS across all studies. In addition, these replications extended the initial findings by delineating a much broader fronto-temporo-parietal sentence processing network (Davis and Johnsrude, 2003; Rodd et al., 2005; Obleser et al., 2007; Friederici et al., 2010; Davis et al., 2011; Abrams et al., 2012; Adank, 2012), consistent with the notion of multiple comprehension streams rather than a single one (Peelle et al., 2010). Indeed, converging evidence suggests that both anterior and posterior STS play an important role in resolving speech intelligibility and that the relative balance of importance depends on how it is measured. This might suggest that speech intelligibility is encoded at different spatial scales across the temporal cortices.
As well as revealing a broader intelligibility network, these replications raise important questions about non-speech baselines. Rotated speech has proven a useful tool to separate "low level" acoustic from "higher level" linguistic processes (Boebinger et al., 2015; Lima et al., 2015; McGettigan et al., 2015; Evans et al., 2016; Meekings et al., 2016). However, the replications discussed here unexpectedly showed that primary auditory cortex could distinguish between rotated and clear speech, and that some neural regions responded selectively to rotation as compared to clear speech. Why might this occur? It may reflect differences in the acoustic profile of rotated speech. For example, spectral rotation of fricatives results in broadband high frequency energy that is pushed into low frequency regions, a feature not characteristic of speech. Equally, it may reflect the fact that early auditory areas are capable of higher order linguistic processing (Formisano et al., 2008; Kilian-Hutten et al., 2011) either by virtue of local responses or via co-activation with higher order language regions. Taking a broader perspective, these findings demonstrate the difficulty of synthesizing non-speech baselines with the same acoustic properties as speech. Indeed, philosophically, the search for the perfect baseline is doomed to failure as the best baseline is speech itself. This, in combination with recent behavioral studies suggesting intermediate representations between speech-specific and more general acoustic processes (Iverson et al., 2016), calls into question the logic of speech versus non-speech baseline subtraction. This is not to suggest that we abandon this approach altogether; rather, it highlights the need to integrate evidence across multiple baselines and methodological approaches. One such alternative is to exploit similarities and differences between different kinds of speech to separate linguistic from acoustic processes (Joanisse et al., 2007; Raizada and Poldrack, 2007; Correia et al., 2014; Evans and Davis, 2015).
What insights can we gain concerning replication from these neuroimaging studies? First, they highlight the difficulty of defining "successful" replication. Evidence in favor of replication in behavioral studies may be reduced to the presence or absence of an effect. This distinction is much more complex in neuroimaging as multiple hypotheses are tested at tens of thousands of measurement points. How similar do two statistical brain maps have to be to constitute a successful replication? Further, the complex data collection and analysis pipelines involved in functional neuroimaging likely reduce the likelihood of successful replication. Given this, it is surprising how similar the results are across the studies described. Second, these studies show that successful replications can provide new knowledge, and they highlight the role that methodological advancements can play in that process. Indeed, much less would have been gained from replicating the original study as it had been first performed. In this instance, advances in analysis played a crucial role in providing new insights into brain function and into the experimental paradigm itself. In this respect, given the fast pace of methodological change, neuroimaging arguably has the most to gain from replication going forward.

AUTHOR CONTRIBUTIONS
The author confirms being the sole contributor of this work and approved it for publication.

ACKNOWLEDGMENTS
I would like to thank Carolyn McGettigan and Cesar Lima for providing comments on an earlier draft. Thank you to the reviewers for their contribution in improving this manuscript.