The Cologne Picture Naming Test for Language Mapping and Monitoring (CoNaT): An Open Set of 100 Black and White Object Drawings

Language assessment using a picture naming task crucially relies on the interpretation of the given verbal response by the rater. To avoid misinterpretations, a language-specific and linguistically controlled set of unambiguous, clearly identifiable and common object–word pairs is mandatory. We, here, set out to provide an open-source set of black and white object drawings, particularly suited for language mapping and monitoring, e.g., during awake brain tumour surgery or transcranial magnetic stimulation, in German language. A refined set of 100 black and white drawings was tested in two consecutive runs of randomised picture order and was analysed in respect of correct, prompt, and reliable object recognition and naming in a series of 132 healthy subjects between 18 and 84 years (median 25 years, 64% females) and a clinical pilot cohort of 10 brain tumour patients (median age 47 years, 80% males). The influence of important word- and subject-related factors on task performance and reliability was investigated. Overall, across both healthy subjects and patients, excellent correct object naming rates (97 vs. 96%) as well as high reliability coefficients (Goodman–Kruskal's gamma = 0.95 vs. 0.86) were found. However, the analysis of variance revealed a significant, overall negative effect of low word frequency (p < 0.05) and high age (p < 0.0001) on task performance whereas the effect of a low educational level was only evident for the subgroup of 72 or more years of age (p < 0.05). Moreover, a small learning effect was observed across the two runs of the test (p < 0.001). In summary, this study provides an overall robust and reliable picture naming tool, optimised for the clinical use to map and monitor language functions in patients. 
However, individual familiarisation before the clinical use remains advisable, especially for subjects that are comparatively prone to spontaneous picture naming errors such as older subjects of low educational level and patients with clinically apparent word finding difficulties.


INTRODUCTION
The correct identification and semantic retrieval of object names in a behavioural task is the basis of investigating conceptual knowledge of objects in the human brain (1). When using an overt object naming task, expressive speech motor functions (i.e., articulation) are also involved. This task, therefore, combines important language domains, which might have led to its wide use in the assessment and monitoring of language functions, e.g., for language mapping and monitoring in the context of awake neurosurgery (2). Controlling the correctness of the verbal answer is essential to assess either object identification, lexical/semantic retrieval, or word articulation. Several linguistic factors are known to affect the ease of the retrieval process and task performance in general. Three of these important factors are addressed in this work: First, the uniqueness of the object drawing to be named and the unambiguity of the corresponding word to be retrieved are crucial pre-requisites of reliable testing and call for objects that can be easily depicted graphically as well as for the non-existence of alternative expressions (i.e., synonyms) to name the respective object [see (3) for review]. Second, word frequency, i.e., how often a certain word is typically used in a certain language, is described as an objective and highly relevant factor influencing lexical access in naming tasks [e.g., (4) for review, (5,6)], given the association of higher frequency words with a lower error rate as well as with a faster retrieval process (6). A third relevant factor is word length, here expressed by the number of syllables, since longer words are associated with a higher error rate (7). All factors vary, however, with respect to age or educational level as well as cultural background and language so that existing stimuli and procedures cannot be directly transferred from one language to another (8,9).
Although overt object naming tasks are widely used in both neurocognitive science and clinical practice, linguistically controlled and validated open-source assessment tools are scarce. As a result, to date, there is no consensus tool for intraoperative monitoring of language functions during awake surgery of cerebral lesions or related pre-surgical investigations, especially for the German language. Providing a linguistically controlled and validated stimulus set for use in the German language might be of great value, e.g., to allow for data comparison in multicentre studies and to assure a state-of-the-art testing procedure that is robust to erroneous interpretations arising from low reliability of the test protocol itself.
In the context of neurosurgery, the precise delineation of the boundaries of eloquent brain areas by intraoperative direct cortical stimulation (DCS) is extremely important not only to achieve maximum tumour control and improve survival but also to avoid permanent neurological deficits (10). For language, this is particularly relevant since the anatomical correlates of function are subject to a much higher variability as compared to, e.g., primary motor functions, in both the healthy (11) and, even more, the diseased brain (12)(13)(14)(15).
Since its introduction by Penfield and Roberts (16), visual object naming has become the most common task for intraoperative language mapping and monitoring (17). Apart from its inclusion in neuropsychological and language-related assessment batteries and its use for non-invasive functional imaging [e.g., magnetoencephalography, functional magnetic resonance imaging and positron emission tomography; (18)(19)(20)], the object naming task has also been used for neuronavigated, repetitive, task-locked transcranial magnetic stimulation (TMS). This technique simulates the intraoperative situation during awake surgery where task execution is temporarily hampered by local electrical stimulation (i.e., DCS) of a cortex site, also referred to as "virtual lesion" (21)(22)(23).
Like neurocognitive and language assessment for diagnostic purposes, the results of both TMS and DCS rely crucially on the ad hoc (intraoperative) or post-hoc (post-operative) interpretation of the given verbal response by the rater. Here, a language-specific and linguistically controlled set of unambiguous, clearly identifiable and common object-word pairs is particularly important.
Existing stimulus sets are of limited usability for German-speaking subjects due to language specificity of the normative data and/or the stimuli, mostly designed for English native speakers [e.g., (24)(25)(26)], and/or due to copyright protection [e.g., (27,28)]. We, therefore, set out to validate and provide an open-source set of black and white object drawings, specifically for German-speaking subjects, intended for both research and clinical use: The Cologne Picture Naming Test for Language Mapping and Monitoring (CoNaT). We expected high correct object naming rates and a strong correlation between the given answers and hypothesised that both word-related linguistic characteristics, i.e., higher number of syllables and lower word frequency, have a significant negative impact on object naming performance. Moreover, we expected better task performance from subjects of young age and high educational level. Apart from investigating the robustness of the task and the influence of these word- and subject-related factors on the object naming performance in a representative cohort of healthy adults of all age groups, we also assessed the suitability of the CoNaT as a reliable language monitoring instrument in a pilot cohort of brain tumour patients.

General Study Design
A set of 112 black and white drawings was tested in respect of correct object identification as well as correct, prompt and reliable object naming in a representative series of 132 healthy subjects and a clinical pilot cohort of 10 brain tumour patients.
For the development of the picture set, we generally included concrete monomorphematic simple nouns (no compound nouns) for which a clear and unambiguous pictorial illustration was feasible (29). In addition, two linguistic factors (i.e., word frequency, number of syllables) were considered to build four equally large subgroups of object-word pairs (see Stimuli Set section).
We set out to assess (i) the feasibility as expressed by the overall rate of correctly identified items and (ii) the test-retest reliability of the object naming performance, both of which are important to qualify the CoNaT, e.g., for intraoperative monitoring, as well as (iii) the influence of stimulus- and subject-related characteristics on correct object recognition and naming reliability. Moreover, we investigated whether or not a correlation between object naming performance and the test result of a standard assessment of word finding difficulties (i.e., Bielefeld Screening for word finding difficulties for mild aphasia [BIWOS]; (29)) could be found in the pilot cohort of patients with at most mild to moderate clinical signs of aphasia. Both groups, healthy subjects and brain tumour patients, performed the naming task twice, in two consecutive runs.
The study was carried out according to the declaration of Helsinki [(30), last revision 2013] and was approved by the local ethics committee.

Healthy Subjects
A total of 132 healthy subjects between 18 and 84 years of age were prospectively enrolled between 2016 and 2019. Subjects were characterised by age (group 1: 18-35 years; group 2: 36-53 years; group 3: 54-71 years; group 4: 72 years or older), gender, handedness, and general educational level (i.e., holding vs. missing university entrance diploma, generally corresponding to ≥/<12 years of general school education). Here, technical college entrance qualification was considered as equivalent to a university entrance diploma. Inclusion criteria were as follows: age of at least 18 years; German language skills on native speaker level; no intake of alcohol, drugs or psychoactive agents prior to the experiment with risk of reduced attention and/or alertness levels; and sufficient vision (i.e., ≥0.7 corrected visual acuity). Subjects with neurological or psychiatric diseases (including brain lesions and seizures) in medical history were excluded.
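The age stratification described above can be written down as a small helper. The following is only an illustrative sketch: the function name `age_group` is hypothetical and not part of the study materials, and ages are assumed to be given in completed years.

```python
def age_group(age_years: int) -> int:
    """Map an age in completed years to the age groups 1-4 named in the text.

    Group 1: 18-35, group 2: 36-53, group 3: 54-71, group 4: 72 or older.
    The function name and the integer-age input are assumptions made for
    illustration only.
    """
    if age_years < 18:
        # Inclusion required an age of at least 18 years.
        raise ValueError("age below the inclusion criterion of 18 years")
    if age_years <= 35:
        return 1
    if age_years <= 53:
        return 2
    if age_years <= 71:
        return 3
    return 4
```

For example, `age_group(47)` falls into group 2 and `age_group(76)` into group 4.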

Patients
In addition, 10 adult patients with clinical signs of mild to moderate aphasia were included in this study in order to test the protocol under clinical conditions. All patients were newly diagnosed with a focal brain tumour of the left hemisphere.
The additional inclusion criteria were identical for both healthy subjects and patients. In contrast, specific exclusion criteria for patients were as follows: (i) neurological/psychiatric diseases unrelated to the brain tumour, (ii) clinical signs of moderate to severe cognitive dysfunction as indicated by a Mini Mental State Examination [MMSE; (31)] score of <20/30, and (iii) severe word finding difficulties according to a screening of object naming competence using 10 pictures (which were not included in the protocol). Here, correct naming of at least 7 out of the 10 objects was required to qualify for study inclusion.
The severity of word finding difficulties of all participating patients was characterised using the BIWOS assessment. Of note, the BIWOS was chosen since it tests for a comprehensive set of semantic and lexical language skills for diagnosing word finding difficulties by a series of well-standardised tasks (i.e., antonyms, rhymes [free, category specific], hyperonyms, verbal fluency [lexical, semantic], word composition, semantic feature analysis, naming by definition) but does not include visual object naming so that a low level of interference was expected. The BIWOS was analysed according to the standard procedure given in the manual, resulting in separate scores for lexical and semantic word finding skills as well as a total score and corresponding severity levels to describe the word finding difficulties.
Of note, all complementary examinations (i.e., MMSE, screening of object naming competence, BIWOS) were administered prior to the beginning of the object naming tests.

Stimuli Set
The entire picture set (N = 112) consisted of four different categories (A-D as defined by number of syllables and high vs. low word frequency) and included a total of 12 back-up illustrations to allow for a posteriori selection of the 100 best suited pictures ( Table 1). All object-word pairs were chosen based on the pilot data by a clinical neuroscientist together with an experienced linguist (i.e., authors CWL and KJ) and were controlled regarding the following criteria: (i) (gender neutral) word frequency [cf. (33,34)], (ii) number of syllables, and (iii) unambiguity of both the object illustration and the expected verbal response (i.e., good recognizability of the illustrated object, expected non-existence of synonyms for the object name in German language as well as the absence of semantically related attributes, which could lead to compound nouns and overspecified verbal responses such as "egg cup" instead of "egg").
Illustrations were black and white drawings (presented on a white screen), drawn by author CWL and were either (i) freely designed (n = 53) or inspired (ii) by the Snodgrass & Vanderwart picture set [n = 25; (24)] or (iii) by the pictures included in the commercial software Nexspeech (Nexstim Oy, Helsinki, Finland; n = 22). A total of n = 12 drawings (i.e., three drawings per class A-D) were omitted due to poor performance in respect of either correctness or unambiguity of the naming responses (mean correct naming rate: 87 ± 7%; mean Goodman and Kruskal's gamma [referred to as "GK-gamma" throughout the manuscript]: 0.94 ± 0.05) and were, thus, not considered for further statistical analysis (see Supplementary Table 1 for details). The remaining selection of n = 100 objects is provided in Table 2 (see Supplementary Material for stimuli, i.e., drawings). Example drawings are shown in Figure 1.

Test Protocol and Scoring
Pictures were presented in a pseudorandomised sequence on a white screen. The display time for each stimulus was 500 ms, interleaved by a time interval of 3 s for healthy subjects and 5 s for patients. No feedback was provided regarding the task performance (i.e., correctness of picture naming) during the experiment. Between the two consecutive sessions, a break of up to 10 min was allowed if required by the test subject, e.g., in case of tiring.
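The presentation timing can be sketched as follows. This is a hedged illustration rather than the software actually used: the function names are hypothetical, a plain seeded shuffle stands in for the unspecified pseudorandomisation scheme, and the 3 s (or 5 s) interval is assumed to be the blank period between the offset of one picture and the onset of the next.

```python
import random
from typing import List, Sequence

DISPLAY_MS = 500  # display time per stimulus, as stated in the text


def run_order(items: Sequence[str], seed: int) -> List[str]:
    """Return a reproducible shuffled copy of the item list.

    Assumption: a simple seeded shuffle; the study's actual
    pseudorandomisation constraints are not described.
    """
    order = list(items)
    random.Random(seed).shuffle(order)
    return order


def onset_times_ms(n_items: int, interval_ms: int) -> List[int]:
    """Stimulus onsets in ms, assuming the interval follows each picture."""
    return [i * (DISPLAY_MS + interval_ms) for i in range(n_items)]
```

Under these assumptions, `onset_times_ms(3, 3000)` gives the onsets for the first three pictures shown to a healthy subject, and `interval_ms=5000` would correspond to the patient setting.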
For each run, the verbal responses were audio-taped for additional post-hoc assessment of promptness, accuracy, and reliability of object recognition and naming (Table 3) to account for both the uniqueness of the illustration and the unambiguity/simplicity of the semantic word retrieval and its articulation. Here, more specific object names compared to the expected verbal response like "sparrow" instead of "bird" as well as compound nouns instead of simple nouns such as "church bell" for "bell" were rated as over-specification and thus fell into the category of unexpected naming variants (i.e., category III, cf. Table 3). In contrast, generalisations like "animal" instead of "bird" were categorised as wrong naming response (i.e., category V; cf. Table 3). Response delays were assessed by acoustic evaluation, a common procedure in clinical practice (e.g., for presurgical and intraoperative language mapping using TMS/DCS), hereby considering the individual baseline response latency. For further analyses, correct responses were assigned to the types (A) "correct object naming", including only correctly recognised and expectedly named objects (i.e., categories I-II), and (B) "correct object recognition", including also correctly recognised but unexpectedly named objects (i.e., categories I-III; Table 3).
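The two outcome types can be derived mechanically from the five response categories. The sketch below assumes that each trial has already been rated with a category from I to V; the enum and function names are hypothetical and serve only to illustrate the scoring rule.

```python
from enum import IntEnum
from typing import Sequence, Tuple


class Response(IntEnum):
    """Response categories I-V (names are illustrative, not from the study)."""
    PROMPT_CORRECT = 1      # category I
    DELAYED_CORRECT = 2     # category II
    UNEXPECTED_VARIANT = 3  # category III (e.g., over-specification, dialect)
    SELF_CORRECTED = 4      # category IV
    WRONG_OR_NONE = 5       # category V


def is_correct_naming(r: Response) -> bool:
    """Type (A): categories I-II."""
    return r <= Response.DELAYED_CORRECT


def is_correct_recognition(r: Response) -> bool:
    """Type (B): categories I-III."""
    return r <= Response.UNEXPECTED_VARIANT


def correct_rates(responses: Sequence[Response]) -> Tuple[float, float]:
    """Return (correct naming %, correct recognition %) for a list of trials."""
    n = len(responses)
    naming = 100.0 * sum(is_correct_naming(r) for r in responses) / n
    recognition = 100.0 * sum(is_correct_recognition(r) for r in responses) / n
    return naming, recognition
```

By construction, the correct object recognition rate can never be lower than the correct object naming rate, since categories I-II are a subset of categories I-III.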

Table 3 note: Overt naming responses were categorised as follows: (i) prompt and correct, (ii) correct but delayed, (iii) unexpected naming variants like dialectal, cultural, or other previously unexpected synonyms (e.g., "Beelzebub" instead of "devil"), over-specification, diminutive, or plural, (iv) wrong but self-correction, and (v) wrong or non-response, with (i-ii) being considered as correct object recognition expressed by a correct and expected naming response (referred to as "correct naming response") and (i-iii) being considered as correct object recognition, including both expected and unexpected naming responses (referred to as "correct object recognition" throughout the manuscript).

Statistics
Normality of data distributions was tested according to Shapiro-Wilk. The reliability of naming performance (categorical data; five levels; see above) between the first and the second run was assessed using GK-gamma for each stimulus item. An analysis of variance (ANOVA) was performed to test for the influence of stimulus- and subject-related factors on the results, i.e., on the average rate of correct object recognition as well as correct object naming (two levels: right vs. wrong; see above) in percent of total trials and the reliability of the naming responses (five levels i-v; Table 3) as expressed by GK-gamma. GK-gamma is a symmetric measure of association, based on a sorted list of paired observations, which ranges from −1.0 to +1.0, with +1.0 indicating perfect correlation. Please note that, for the ANOVA and for calculation of correlations, GK-gamma = 1 was assumed if GK-gamma could not be calculated due to perfect naming rates (i.e., 100% correct naming in both sessions). For ANOVA with GK-gamma as dependent (outcome) variable, outliers (i.e., >2 SD deviation from average) were omitted. Of note, this outlier removal had to be applied only for subjects/stimuli where the confidence interval was zero due to very low incidence of errors. In total, this procedure removed 10% (subjects)/13% (stimuli) of the total data. Levels of significance according to ANOVA are indicated without leading zeros (e.g., "p < 0.01") throughout the manuscript to allow for better distinction from results of group mean comparisons and correlations.
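For readers unfamiliar with GK-gamma, the coefficient is defined as (C − D)/(C + D), where C and D are the numbers of concordant and discordant pairs of observations and tied pairs are ignored. The following is a minimal sketch of that textbook definition, not the software used in the study; the function name is hypothetical, and the inputs are assumed to be the response categories (1-5) of the same trials in run 1 and run 2.

```python
from itertools import combinations
from typing import Optional, Sequence


def gk_gamma(run1: Sequence[int], run2: Sequence[int]) -> Optional[float]:
    """Goodman-Kruskal gamma for paired ordinal ratings.

    Returns None when no concordant or discordant pair exists (e.g., all
    responses identical in a run), i.e., when gamma cannot be calculated.
    """
    if len(run1) != len(run2):
        raise ValueError("both runs must contain the same number of ratings")
    concordant = 0
    discordant = 0
    for (a1, b1), (a2, b2) in combinations(zip(run1, run2), 2):
        product = (a1 - a2) * (b1 - b2)
        if product > 0:
            concordant += 1   # both runs order the pair the same way
        elif product < 0:
            discordant += 1   # the runs order the pair in opposite ways
        # product == 0: tie in at least one run, ignored by gamma
    if concordant + discordant == 0:
        return None
    return (concordant - discordant) / (concordant + discordant)
```

The `None` return value corresponds to the situation described above in which GK-gamma could not be calculated; substituting 1 in that case, as done for the ANOVA, would be a separate step applied by the caller.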
Post-hoc comparisons of means between paired data (e.g., correct naming rates of session 1 vs. session 2) were calculated using paired t-tests or Wilcoxon's signed rank test, depending on the normality of the data distribution (as assessed by the Shapiro-Wilk test). Accordingly, for comparison between independent groups, Wilcoxon's rank-sum test was applied in case of not normally distributed data.
Pearson's correlation was calculated to test for significant relationships between metric variables (i.e., behavioural scores).
In cases of comparisons between more than two groups (e.g., between different word groups: A-D), the levels of significance were adjusted using the false discovery rate (FDR) correction (35).
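As an illustration of such an adjustment, the sketch below computes FDR-adjusted p-values with the Benjamini-Hochberg step-up procedure. That this is the variant behind reference (35) is an assumption, and the function name is hypothetical.

```python
from typing import List, Sequence


def fdr_bh(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value and enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A comparison would then be reported as significant if its adjusted p-value falls below the chosen level, e.g., 0.05.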

Healthy Subjects
Of the 132 subjects included in the study, 64% (n = 84) were female. With a median age of 25 years (range: 18-84 years), most healthy participants were of relatively young age (group 1 [

Patients
Ten patients (two females, median age 47 years, range 24-76 years) with normal to moderate word finding skills according to the BIWOS results were included in the clinical pilot part of the study. Most patients were right-handed (80%) and had a high educational level (78%; Table 4).

Healthy Subjects
Overall, mean correct object recognition and picture naming rates were in the range of 98 ± 4 and 97 ± 4% and were significantly higher in the second as compared to the first run (object recognition: 98.3 ± 3.6 vs. 97.9 ± 4.0%, p < 0.001; object naming: 97.7 ± 3.9 vs. 97.2 ± 4.3%, p < 0.0001; Table 5). Of note, the rate of delays decreased from the first to the second run (p = 0.001), whereas no significant differences between runs were observed for the other error categories (Table 6). However, the overall reproducibility of object naming in-between both runs was excellent, as expressed by an overall GK-gamma coefficient of 0.95 ± 0.004 [confidence interval: 0.95; 0.96] (Table 5). The two most common error categories were wrong item naming (43% of all errors) and delay (25%; Table 6).

Table footnote (fragment): (29); UED, university entrance diploma.

Influence of Word Characteristics on Object Naming Correctness and Reliability
A two-factorial ANOVA including the factors SYLLABLES (two levels: one, two) and FREQUENCY (two levels: high, low) revealed neither an influence of either factor on the GK-gamma coefficients as a measure of reproducibility nor an interaction between them (Table 7). In contrast, a significant main effect was found for the factor FREQUENCY on the correct object recognition rates (F(1,96) = 6.471; p < 0.05) as well as on the correct picture naming rates (F(1,96) = 4.166; p < 0.05), whereas the factor SYLLABLES had no main or interaction effect on the correct object recognition or naming rates (Table 7). Accordingly, post-hoc tests revealed significantly higher correct object recognition rates for high vs. low word frequency (99 ± 2 vs. 98 ± 3%; p < 0.05) and a concordant statistical trend regarding the correct object naming rates (98 ± 2 vs. 97 ± 3%, p = 0.06; Figure 2). Post-hoc comparisons revealed the lowest rates of delays for word class A (high WF, one syllable) as compared to all other classes (p < 0.0001, FDR-corrected, Table 6). In contrast, category III responses (e.g., dialect-related variants; see Table 3 and Supplementary Table 2) were more frequent when naming one-syllable words and were highest in word class C (C-B: p < 0.01; C-D: p < 0.001; A-D: p < 0.01, FDR-corrected; Table 6). Self-corrections were equally distributed across the stimulus classes. Of note, all unexpected correct naming alternatives (e.g., dialect variants) encountered in the study are provided in the supplement (Supplementary Table 2). In line with our hypothesis, the rate of wrong object namings increased with the difficulty level and was particularly high in the stimulus classes of low WF (A-B: p < 0.01; A-CD: p < 0.0001; B-C: p < 0.001; B-D: p < 0.05, FDR-corrected; Table 6).

Influence of Subject Characteristics on Object Naming Correctness
To analyse the influence of subject characteristics on correct object recognition and naming rates (sum of both runs), we performed a three-factorial ANOVA with the factors GENDER (two levels), EDUCATION (two levels) and AGE GROUP (four levels). Here, we found a significant main effect of the factors AGE GROUP and EDUCATION on both correct object recognition and naming rates as well as a significant interaction between those two factors (Table 8). In contrast, the factor GENDER had no significant main effect on either object recognition or naming correctness and showed no interactions regarding the dependent variable object naming correctness. However, we observed an interaction with the factor AGE GROUP when analysing the effects on object recognition correctness (Table 8).
Second-level one-factorial ANOVA confirmed a significant main effect of the factor AGE in both the subgroups of low and high education levels on the correct object recognition rates (low: F(1,35) = 12.3, p < 0.01; high: F(1,90) = 12.2, p < 0.001) as well as on the correct object naming rates (low: F(1,35) = 12.3, p < 0.01; high: F(1,90) = 17.0, p < 0.0001), thus suggesting the strongest influence of age on object naming in highly educated subjects. Of interest, post-hoc tests revealed a significantly lower rate of correct recognition as well as object naming for elderly subjects (age group 4) compared to all other age groups (p < 0.01, FDR-corrected; Figure 3). In addition, subjects of slightly advanced age, i.e., between 54 and 71 years, showed similar object recognition performance (p > 0.1) but worse object naming rates compared to younger individuals (age group 3 vs. 1 [2]: p < 0.05 [p = 0.07], FDR-corrected; Figure 3). These findings go along with a larger variance and less skewed data distribution in the elderly, particularly when less educated, as compared to young age (Supplementary Figures 1, 2).
In contrast, analysed by age categories, a one-factorial ANOVA showed a significant main effect of the factor EDUCATION on the picture naming performance only for young subjects that represented the largest age group (object recognition: F(1,64) = 4.0, p < 0.05; object naming: F(1,64) = 5.0, p < 0.05). No noteworthy effect of this factor was found in the other groups, apart from the elderly group, which showed a statistical trend (object naming: F(1,11) = 3.4, p = 0.09). Post-hoc tests confirmed a statistical trend towards better picture recognition and naming for the subgroups of young and elderly subjects (p = 0.09, FDR-corrected; Figure 3). In summary, the effect of the subject's age, and particularly the affiliation to the age group of 72 or more years, seems to clearly outweigh the effect of the educational level on correct object identification and naming.

Table note: Object names are given in English (for original German words, please see Table 2). Correct object naming rates are provided in brackets following the correct object recognition rates if differing from those. The percentage of delayed (however correct) object namings was maximum 5.3% and is indicated by colour-encoding for each run: white = no delay; light yellow = <1% delays; yellow = 1-3% delays; orange = >3% delays. Reliability measures (GK-gamma) are provided for each word and overall, including the confidence interval. Light grey: GK-gamma could not be calculated due to perfect object naming in at least one run. Dark grey: Confidence interval was zero due to very low number of errors in at least one run; thus, the respective GK-gamma values were not considered for further analysis. γ, GK-gamma; CI, confidence interval.

Table note: Percentages of error rates are shown relative to the total amount of errors and relative to all stimuli (in brackets). For a more comprehensive description of the error types, please consider Table 3. For a descriptive overview of the delay rates by stimulus/word, cf. [reference truncated].

Influence of Subject Characteristics on Object Naming Reliability
Analogous to the analysis of naming performance, we here analysed the influence of subject characteristics on the retest reliability of the object naming, i.e., on GK-gamma coefficients, using a three-factorial ANOVA that included the factors GENDER (two levels), EDUCATION (two levels) and AGE GROUP (four levels).
In line with our results regarding object recognition and naming correctness, the factor AGE GROUP had a significant main effect on naming reliability (F(1,93) = 5.3, p < 0.05). However, no main effect was found for the factors EDUCATION and GENDER. Although no two-way interactions were observed, the ANOVA revealed a significant interaction between the factors AGE GROUP × EDUCATION × GENDER (F(1,93) = 6.0, p < 0.05).
Post-hoc tests showed that higher age was associated with worse test-retest reliability of the naming responses. Accordingly, lower GK-gamma coefficients were found in the age group of 72 years or older as compared to subjects younger than 54 years (i.e., groups 1 and 2; p < 0.05, FDR-corrected; Figure 4).

Patients
In the pilot cohort of patients, showing evidence for impaired lexicosemantic word finding skills according to the BIWOS score in at least half of the cases, results for correct object recognition and naming ([…] Table 3) was high, as expressed by a GK-gamma coefficient ranging from 0.74 (A) to 0.94 (D; Table 9). In line with our results from the healthy population indicating high age (above 72 years) as the major factor influencing task performance, we here observed the lowest correct object recognition and naming rates in the two oldest patients (patient 5: 91/88%; patient 7: 94/91%). In contrast to the healthy subjects, the better object recognition performance in run 2 could not be reproduced in the patients (run 1: 97 ± 2/97 ± 3% vs. run 2: 97 ± 4/95 ± 5%; p > 0.1). However, we found a significantly lower frequency of delayed responses in the second run (run 1: 5.4 ± 7.7 vs. run 2: 2.7 ± 5.7%; p < 0.001; Table 9). There was no significant correlation of correctness, delay or reliability of object identification or naming with the clinical aphasia score (BIWOS).

DISCUSSION
This work provides the first freely available data set of pictures, developed for experimental and clinical use (e.g., in the context of pre-surgical and intraoperative functional language mapping), specifically for German-speaking subjects. The CoNaT was especially designed for the context of language mapping using picture naming, where highly reliable naming performance is a pre-requisite of successful testing. The picture set, consisting of 100 black and white drawings, stratified by word length (number of syllables) and word frequency, showed excellent correct object recognition and naming rates as well as high reliability coefficients across all item categories and subjects. However, a small learning effect was observed across the two runs of the test. Moreover, we found a significant negative effect of low word frequency and high age (older than 72 years) on the task performance.

Influence of Subject Characteristics
Amongst subject-related factors, age had the strongest effect on both the picture naming correctness and the test-retest reliability of the naming responses. Its negative effect on the task performance increased with age and was most evident in elderly subjects who are 72 years or older. The high effect size of the factor AGE was also reflected by its significant correlation with the object naming performance in the patient cohort, despite the small sample size of n = 10. This finding is widely in line with previous research that also found an effect of age on language skills in general and picture naming in particular (36)(37)(38). Furthermore, multiple subject-related factors including vision impairment, general cognitive decline, reduced attention span, slowed perceptual analysis (37,39,40), as well as linguistic factors such as weakening of semantic connections within the language system (36,41) have been discussed to affect language performance.

Table note: Correct object recognition and naming rates as well as GK-gamma coefficients are indicated by stimulus class, run and overall. Please note that correct object naming rates are indicated in brackets following the corresponding object recognition rates if different from those. The average percentage of delayed (however correct) object namings per word class was maximum 7.2% and is indicated by colour-encoding for each run: yellow = <3% delays; orange = 3-5% delays; light red = >5% delays.
In line with previous publications of other groups (42, 43), a high general educational level (i.e., qualification for admission to university or equivalent) was associated with higher rates of correct picture recognition and naming in our data set. This effect was most prominent in the subgroups of elderly participants (i.e., 54 years or older) for which the factor education was more balanced as opposed to the mostly highly educated younger participants (cf. Limitations). The finding, however, that educational level did not correlate with the test-retest reliability of the responses might reflect the robustness of the factorial influence on naming correctness, independent of possible learning effects between both runs.
From the clinical point of view, the clearly impaired and less reliable task performance of elderly healthy subjects, especially when their level of education is low, points out that language mapping and monitoring results should be interpreted with particular caution to avoid false-positive results. In such cases, a more rigorous selection of the items to be included in the picture set prior to the clinical use might be advisable to reduce the risk of misinterpretations, e.g., by omitting items with generally suboptimal correct naming rates and delayed responses (cf. Table 5). Moreover, increasing the usual number of individual test runs may be helpful to make sure that potentially problematic items are excluded.
As opposed to age and educational level, we found no significant influence of the factor GENDER on picture naming correctness, indicating that the selected items can be considered gender neutral and appropriate for testing procedures with both male and female participants. This finding could explain the disagreement, e.g., with the previous work of (42), who reported a gender effect with mostly better performance of male subjects in a picture naming task, which they explained by specific components of their picture set [e.g., items like "tripod," "compass," and "dart"; cf. Table 3 in (42)]. In this regard, the result of "gender neutrality" met our expectations, given that we excluded words with an assumed gender effect, e.g., "screwdriver," from our picture set a priori in order to establish a robust, gender-independent picture set for clinical use.
The robustness of the picture set is also reflected by the overall excellent and highly reliable picture naming performance of the patient cohort, showing no significant difference in naming correctness rates compared to a matched group of healthy subjects. At least for the tested cohort of patients with at most mild aphasic symptoms, we also found no significant correlation between picture naming correctness and lexicosemantic performance according to the formal testing using the BIWOS. This finding underlines the intention of the picture set, which was not designed to be used as a sensitive screening instrument for (even mild) aphasia but rather as a reliable and robust monitoring tool, also suited for patients with mild aphasic symptoms.

Response Correctness
In this study, we investigated the influence of two important word characteristics, i.e., the word frequency and the number of syllables, on the correctness of picture recognition and verbal naming responses. Here, we used the factor lexical word frequency as the most common and standardised measure of frequency of (word) use in everyday life. We found a significantly better performance, i.e., higher correct object recognition and naming rates, when the subjects were asked to name high-frequency words. In addition, there were fewer delays when naming high-frequency words, at least in the subset of monosyllables. These results agree well with previous research that also showed an effect of word frequency on naming accuracy [e.g., (44-46)].
In contrast to the word frequency, the factor word length, expressed by the number of syllables (mono- vs. bisyllabic), had no significant influence on either the correctness of picture recognition and naming or the retest reliability of the naming responses between the two runs. This finding is in line with the results of Santiago et al. (47), who also did not find a significant influence of the number of syllables (also comparing mono- vs. bisyllabic words) on the occurrence of errors in a standard picture naming task.

Delay
We observed significantly fewer delays (as a measure of response latency) for class A words (i.e., monosyllabic, high WF) as compared to all other word categories.
This finding indicates an influence of word frequency on response latency only for monosyllabic words and, vice versa, an influence of the number of syllables only for high-frequency words, thereby reflecting the heterogeneous results of previous studies regarding the effect of word length and word frequency on response latency. In line with others (46,48), Alario et al. (49) identified word frequency but not the number of syllables as a significant contributor to the prediction of response latency. Other research groups, in contrast, could not confirm an effect of word length on response latency (50-52).
The divergent study results could be explained by methodological differences across studies such as the distinct characteristics of the applied picture sets. For instance, we here used comparatively high median word frequencies and a small range of word lengths (number of syllables), due to the primary objective of our study to develop a robust language monitoring tool rather than a very sensitive screening instrument. Further possible influencing factors include (i) the different age ranges of the study participants (usually university students younger than 30 years compared to a wide age range of 18-84 years in our study), (ii) interactions with other item- or word-related characteristics [e.g., lexical/conceptual characteristics such as age of acquisition, animacy, relevance to everyday life, frequency of syllables or word form characteristics such as phonological or morphological complexity; cf. (49,53,54)], and (iii) priming processes (55,56) inherent to the respective picture sets, which were not controlled in this study (see also Limitations).
Taken together, because word characteristics influence both naming correctness and response latency, it might be advisable to start the clinical testing routine for patients with relatively advanced aphasic symptoms by presenting the items of the stimulus classes A-D consecutively in alphabetical order. In severe cases, items might even be omitted class-wise.
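The class-wise ordering suggested above can be sketched in a few lines of Python. This is a minimal illustration, not part of the published test material: the `Item` structure, the function names, and the example words are hypothetical, and the split of the low-frequency classes into C (monosyllabic) and D (bisyllabic) is an assumption made by analogy with classes A and B.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One picture-word pair (hypothetical structure for illustration)."""
    word: str
    high_frequency: bool  # True = high word frequency (WF)
    syllables: int        # 1 = monosyllabic, 2 = bisyllabic


def stimulus_class(item: Item) -> str:
    """Map an item to stimulus class A-D.

    A = high WF, monosyllabic; B = high WF, bisyllabic.
    ASSUMPTION: C = low WF, monosyllabic; D = low WF, bisyllabic.
    """
    if item.high_frequency:
        return "A" if item.syllables == 1 else "B"
    return "C" if item.syllables == 1 else "D"


def order_items(items, omit_classes=()):
    """Return items sorted A -> D, optionally dropping whole classes.

    sorted() is stable, so the original order within a class is kept.
    """
    kept = [it for it in items if stimulus_class(it) not in omit_classes]
    return sorted(kept, key=stimulus_class)


# Hypothetical example words, not taken from the CoNaT item list.
example = [
    Item("cactus", False, 2),
    Item("ball", True, 1),
    Item("hammer", True, 2),
    Item("raft", False, 1),
]
ordered = order_items(example)
reduced = order_items(example, omit_classes=("C", "D"))
```

Here `ordered` runs from class A to class D, while `reduced` keeps only the high-frequency classes, as one might consider for a severely affected patient.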

Alternative Naming Variants and Clinical Implications
In addition to different response delay rates between mono- vs. bisyllabic high-frequency words, the word-class-wise analysis also showed a higher rate of unexpected, alternative responses, such as over-specifications and dialectal or cultural variants, for monosyllabic words. In accordance with our hypothesis that rather short, monosyllabic words are generally more prone to over-specification (e.g., "water glass" for "glass"), over-specification accounted for two thirds of the unexpected alternative responses in word class A in our study.
In clinical practice, e.g., for monitoring during awake surgery using DCS or for preoperative language mapping using TMS, where robustness of the test is of particular importance to ensure correct identification of transient language impairments, it might be advisable to reduce the picture set by avoiding items with relatively high alternative naming rates (cf. Supplementary Table 2). However, given the overall excellent reliability of the naming responses as expressed by high GK-gamma coefficients (cf. Table 5), alternative naming responses should usually be identifiable in the preparatory test run, i.e., the baseline investigation, which allows the picture set to be tailored on an individual basis (cf. Influence of Subject Characteristics section). In general, a baseline investigation of naming performance is highly recommended, especially regarding the clinical application in patients using TMS and/or DCS for language mapping, in order to identify speech and language difficulties such as increased response latencies (delayed naming) related to distinct stimuli/words.
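As a rough illustration of how a baseline investigation could be used to tailor the picture set individually, the following Python sketch keeps only those items that a subject named correctly and promptly in every baseline run. The outcome labels, the function name, and the strict "all runs correct" rule are assumptions chosen for illustration; a clinical protocol may well apply a different criterion.

```python
from collections import defaultdict

# Hypothetical outcome labels for a single naming response.
CORRECT = "correct"          # expected word, named promptly
DELAYED = "delayed"          # expected word, but with a delay
ALTERNATIVE = "alternative"  # e.g., over-specification or dialectal variant
ERROR = "error"              # wrong word or no response


def tailor_picture_set(baseline_responses):
    """Return the items that were named correctly in all baseline runs.

    baseline_responses: iterable of (item, run_number, outcome) tuples.
    ASSUMPTION: an item is dropped as soon as one baseline response is
    delayed, alternative, or erroneous.
    """
    outcomes = defaultdict(list)
    for item, _run, outcome in baseline_responses:
        outcomes[item].append(outcome)
    # Dicts preserve insertion order, so the original item order is kept.
    return [item for item, results in outcomes.items()
            if all(result == CORRECT for result in results)]


# Hypothetical baseline with two runs and three items.
baseline = [
    ("ball", 1, CORRECT), ("glass", 1, ALTERNATIVE), ("hammer", 1, DELAYED),
    ("ball", 2, CORRECT), ("glass", 2, ALTERNATIVE), ("hammer", 2, CORRECT),
]
individual_set = tailor_picture_set(baseline)
```

In this toy example only "ball" survives: "glass" was over-specified in both runs and "hammer" was delayed once.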

Learning Effect
Although the overall reproducibility of the object naming between both runs was excellent (GK-gamma = 0.95), the mean correct object recognition and naming rates improved slightly from the first to the second run in the healthy volunteers. In line with this finding, we found a concordant decrease in the rate of delayed namings. These findings might result from a repetition priming effect, which is considered an implicit learning phenomenon of non-hippocampal origin described for repeated picture naming, correlating with reduced neural activity in repeated conditions [e.g., (57)], and which lasts for at least several weeks [cf. (58) for review]. In this regard, the observation of a learning effect further supports the evidence that high word frequency (as a measure of repetition) correlates with better picture naming performance.
In contrast to healthy subjects, the second run was not associated with a higher overall rate of correct object naming or recognition in patients, which could be attributed to the much smaller sample size as well as to the comparatively stronger effects of reduced attention, lower cognitive resilience or exertion fatigue in this cohort [cf. (59-61)]. However, repetition had a significant and, compared to the healthy subjects, relatively strong facilitating effect on the rate of delayed namings in this cohort. This finding indicates that naming delays are particularly prone to repetition priming effects in patients. Accordingly, our data support the assumption that the risk of spontaneous naming errors, unrelated to TMS/DCS stimulation, decreases with the number of repetitions. On the other hand, it seems likely that the susceptibility to TMS interference, expressed by naming errors in general and by prolonged naming latencies in particular, decreases along with the repetition of stimuli during a TMS/DCS mapping. Therefore, it seems mandatory to define an optimal trade-off regarding the size of the stimuli/word set to be used during language mapping, as well as to take the number of stimuli/word repetitions into account when analysing the mapping results. A more detailed investigation of this topic, however, lies beyond the scope of this study and deserves to be further addressed in the future.
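For readers who wish to compute a comparable reliability coefficient for their own two-run data, Goodman-Kruskal's gamma is defined as (C - D) / (C + D), where C and D are the numbers of concordant and discordant pairs of observations and tied pairs are ignored. The following Python sketch implements this definition directly. The ordinal coding of responses in the example (0 = error, 1 = delayed, 2 = correct) and the data are assumptions for illustration; they are not the scoring scheme or the data of this study.

```python
from itertools import combinations


def goodman_kruskal_gamma(x, y):
    """Goodman-Kruskal's gamma for two equally long ordinal sequences.

    gamma = (C - D) / (C + D), with C = concordant and D = discordant
    pairs; pairs tied on x or on y are ignored.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when all pairs are tied")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical per-item scores in run 1 and run 2
# (ASSUMED coding: 0 = error, 1 = delayed, 2 = correct).
run1 = [0, 1, 1, 2]
run2 = [0, 1, 2, 1]
gamma = goodman_kruskal_gamma(run1, run2)  # 3 concordant, 1 discordant -> 0.5
```

The pairwise loop is quadratic in the number of observations, which is unproblematic for a set of 100 items.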

Limitations
As the intended use of the picture set is to serve, i.a., for clinical mapping and monitoring of patients with brain tumours, which mostly occur in advanced age, our study cohort comprises a broad age range, in contrast to the vast majority of previous, similar studies. However, due to several constraints regarding the recruitment of older subjects (i.e., reduced access to the population via existing databases and media, morbidity/reduced mobility impeding on-site participation, non-matching of in- and exclusion criteria), older subjects remain underrepresented in our study cohort. Moreover, the factorial analysis regarding the influence of age and educational level suffers from an unavoidable interaction between both factors, which we attribute mostly to a considerably increased access to higher education over the past decades.
Although we analysed the influence of two major word-related factors, i.e., word frequency and the number of syllables, on picture naming performance and response delay, other possible factors such as alternative measures of word familiarity [e.g., frequency of syllables and age of acquisition; (53)] and word length as well as picture-related factors like the visual complexity of the drawing, image agreement and imageability [e.g., (49)] were not controlled in this study.
The CoNaT has been specifically designed for German native speakers, although the stimuli might also be well suited for use in other languages. In that case, the suitability of individual items should be checked prior to the test administration to ensure their fit with respect to relevant linguistic criteria.

CONCLUSION
In summary, the CoNaT provides an overall robust and reliable picture naming tool, optimised for clinical use to map and monitor language functions in patients. We here provide normative data along with practical, clinical suggestions for the administration of the picture set, taking important word- and subject-related factors of object recognition and naming into account. Based on the results, we are convinced that the entire picture set can be readily used in healthy subjects and patients, even with mild to moderate aphasic symptoms, but should always be tested and, if necessary, reduced on an individual basis, particularly in elderly subjects of low educational level and in patients. Here, it seems advisable to start testing with the most robust stimulus class A (high WF, monosyllables), followed by B (high WF, bisyllables) and then C and D (low WF), and to pay particular attention to items that are comparatively prone to alternative naming variants.

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the local ethics committee (Ethikkommission der Medizinischen Fakultät der Universität zu Köln). Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
CWL: conceptualisation, methodology, formal analysis, investigation, writing - original draft, visualisation, and supervision. JP: formal analysis, investigation, and writing - original draft. SK: investigation and data curation. CN: investigation and writing - review and editing. CG: conceptualisation and writing - review and editing. RG: funding acquisition and writing - review and editing. KJ: conceptualisation, methodology, writing - original draft, and supervision. All authors contributed to the article and approved the submitted version.