Preschoolers’ Induction of the Concept of Material Kind to Make Predictions: The Effects of Comparison and Linguistic Labels

Analogical reasoning by comparison is considered a special case of inductive reasoning, which is fundamental to the scientific method. By reasoning analogically, learners can abstract the underlying commonalities of several entities, thereby ignoring single objects’ superficial features. We tested whether different task environments designed to trigger analogical reasoning by comparison would support preschoolers’ induction of the concept of material kind to predict and explain objects’ floating or sinking as a central aspect of scientific reasoning. Specifically, in two experiments, we investigated whether the number of presented objects consisting of a specific material (one versus two standards) and the labeling of objects with the respective material name would benefit preschoolers’ material-based inferences. For each item set used in both experiments, we asked the children (N = 59 in Experiment 1, N = 99 in Experiment 2) to predict an object’s floating or sinking by matching it to the standards and to verbally explain their selections. As expected, we found a significant effect of the number of standards on the prediction task in both experiments, suggesting that children successfully induced the relevance of material kind by comparison. However, labels did not increase the effect of the standards. In Experiment 2, we found that the children could transfer their conceptual knowledge of material kind but that transfer performance did not differ among the task environments. Our findings suggest that tasks inviting analogical reasoning by comparison with two standards are useful for promoting young children’s scientific reasoning.

INTRODUCTION
Analogical reasoning by comparison is assumed to be a crucial mechanism, enabling induction and conceptual learning across different age groups and in a wide range of tasks (Loewenstein and Gentner, 2001; Alfieri et al., 2013; Schalk et al., 2016). According to the theory of structural alignment (Gentner, 2010), analogical reasoning involves individuals' identification, mapping, and evaluation of the similarities and differences of several entities. This process has benefits for building conceptual knowledge because it supports individuals' encoding of information, induction and abstraction of categories, and generalization (transfer) of knowledge (Gentner and Smith, 2012; Gentner and Hoyos, 2017). While analogical reasoning by comparison has been investigated in experimental settings on a range of conceptual learning tasks (Alfieri et al., 2013), its contribution to scientific reasoning is rarely considered. We propose that task environments that trigger comparison are relevant to scientific reasoning. Specifically, such environments may facilitate children's induction of scientific concepts as a basis for predictions and explanations by their encoding of relevant object features.
In the present research, we investigated different task environments that may facilitate preschoolers' encoding of relevant object features to generate predictions and explanations in the science context of "floating and sinking." Research with preschoolers across various science contexts has revealed that preschoolers typically hold a variety of naïve conceptions based on irrelevant and perceptually salient features and that these conceptions may affect hypothesis generation (Carey, 2000). For example, when preschoolers are asked to predict the floating or sinking of solid objects and to explain their predictions, they provide explanations such as "light things will float," "things with holes will sink," "large things will sink," or "things with air in them will float" (Penner and Klahr, 1996; Leuchter et al., 2014). Thus, preschoolers' predictions about whether objects float or sink are typically based on salient features, such as weight, size, and form, rather than on generic and more abstract aspects such as material kind or density. Children's naïve conceptions prevail even though children are typically able to name the material of solid objects, such as wood, plastic, or metal (Smith et al., 1985; Dickinson, 1987; Leuchter et al., 2014). Typically, a process of conceptual restructuring is required for children to overcome their naïve conceptions and transform them into scientifically advanced conceptions (Schneider et al., 2012; Leuchter et al., 2014). In two experiments, we investigated whether different task environments intended to trigger analogical reasoning by comparison (i.e., presenting two objects at the same time and labeling them) would improve preschoolers' predictions and explanations of floating or sinking based on the concept of material kind.

Scientific Reasoning in Preschoolers
The goals of science education encompass mastery of scientific concepts as well as learning how to engage in scientific reasoning (Driver et al., 2009; Kuhn, 2010; Dunbar and Klahr, 2012; Sandoval et al., 2014). In general terms, scientific reasoning involves individuals' knowledge-seeking by the application of scientific methods (Klahr and Dunbar, 1988; Klahr, 2000; Piekny and Maehler, 2013). Models of scientific reasoning typically refer to processes of inductive reasoning to explain individuals' knowledge construction, as well as their hypothesis and inference generation (Zimmerman, 2007; Morris et al., 2012). Inductive reasoning is regarded as a cognitive process that captures how individuals encode information, mentally represent this information, organize information into patterns, and derive inferences (cf. Chinn and Brewer, 2001; Chinn and Malhotra, 2002; Dunbar and Klahr, 2012). With respect to hypothesis generation, Klahr and Dunbar (1988) differentiate between "evoking" and "inducing" a hypothesis. When evoking a hypothesis, individuals retrieve and rely on prior knowledge. When inducing a hypothesis, individuals need to observe and encode data, and to identify patterns, before venturing an initial hypothesis. Therefore, the generation of a hypothesis requires children to encode relevant observations, to identify underlying patterns, and to draw inferences, which may rely in part on prior conceptual knowledge.
Several reviews indicate that even preschoolers can exhibit basic aspects of scientific reasoning (Chinn and Malhotra, 2002; Zimmerman, 2007; van der Graaf et al., 2016). There is substantial evidence indicating that preschoolers can appropriately generate hypotheses, identify common patterns in data, and evaluate presented evidence in specific task contexts. With regard to hypothesis generation, Piekny and Maehler (2013) found that 4- to 6-year-olds were able to construct a hypothesis based on patterns of evidence. If task contexts are kept simple, 5- and 6-year-olds are even able to identify patterns that are more complex and to form various hypotheses (e.g., Sodian et al., 1991; Ruffman et al., 1993). Moreover, an intervention study by Schulz et al. (2007) revealed that 4- to 6-year-olds were not only able to infer the causal structure of events by using experimentally collected patterns of evidence but also to predict the outcome of a novel experiment. Piekny et al. (2014) found that 4- to 6-year-olds were able to evaluate conclusive and partially conclusive evidence correctly, and Koerber et al. (2005) found that the correspondence between preschoolers' conceptions and presented evidence facilitated evaluation, whereas conflicting conceptions impeded evaluation. Even the 4-year-olds in the study were able to understand that data with perfect covariation could corroborate or disconfirm a causal hypothesis (see also Tullos and Woolley, 2009). Finally, a study by Köksal-Tuncer and Sodian (2018) revealed that 3- to 6-year-olds were not only able to generate hypotheses but also to apply hypothesis-testing strategies when presented with counterevidence.
In this prior research, selection tasks and production tasks were employed to assess young children's generation of hypotheses in scientific reasoning contexts (Koerber et al., 2005; Piekny and Maehler, 2013; Gropen et al., 2017). Selection tasks provide children with different answer options to a given problem and may be employed to assess children's spontaneous reasoning with regard to a given scientific reasoning context. By contrast, production tasks require children to come up with solutions themselves on the basis of explicit reasoning. Production tasks have therefore been employed to assess children's deliberate reasoning based on the production, explanation, and evaluation of arguments (Mercier and Sperber, 2009; Mercier, 2011). For our two experiments, we used a selection task in which we asked children to predict which object would float or sink like one or two other objects (prediction task). We also employed a production task in which we asked children to explain their respective prediction in order to assess their deliberate reasoning (explanation task). Overall, preschoolers' success in generating hypotheses is typically assessed in terms of the adequacy of their predictions and explanations based on given data patterns. However, research has rarely focused on the specific task conditions supporting the formation of predictions and explanations. We suggest that research on analogical reasoning by comparison (Gentner and Smith, 2012; Alfieri et al., 2013) may provide insights and that task contexts triggering analogical reasoning may benefit the performance of young children in a scientific reasoning context.

Analogical Reasoning by Comparison
Analogical reasoning can be understood as a special case of inductive reasoning (Holland et al., 1989; Holyoak, 2005) because it refers to individuals' ability to integrally encode commonalities and differences across a variety of entities and situations, represent and re-represent this information, and draw inferences (Gentner, 2016). Analogical reasoning has been demonstrated to be fundamental to cognitive development and conceptual learning, such as spatial orientation, word learning, learning of principles, and social comparison (for summaries, see Holyoak, 2005; Gentner, 2010; Gentner and Hoyos, 2017). One strategy for promoting analogical reasoning is to invite comparisons. According to the theory of structural alignment, comparison involves the retrieval of relevant information from long-term memory, the mapping of commonalities and differences of two (or more) presented entities or cases (thereby inducing an abstracted schema), and the projection of inferences based on this mapping (Gentner and Smith, 2012). The abstracted schema will be more general than the analogs because inferences are formed on the basis of aligned similarities and differences, leading to a merged representation, a process of "relational pattern completion" (Gentner and Smith, 2012).
Schema abstraction can be supported by simultaneously presenting two entities or examples (Gentner and Medina, 1998; Gentner and Namy, 1999; Gentner et al., 2007; Kurtz and Loewenstein, 2007; Gentner, 2010; Alfieri et al., 2013; Christie, 2020). Specifically, Gentner and Namy (1999) argue that the presentation of two standards may "promote the discovery of relatively abstract relational commonalities that could characterize the category being learned" (p. 506). Christie (2020) provides empirical support for this claim. She reports on a series of studies in which young children successfully learned categories when presented with multiple exemplars. In addition, Christie and Gentner (2010) showed that 4-year-olds were able to recognize similarity in relations (above, under) in pictures of animals in different positions by making correct choices in a selection task. Gentner and Rattermann (1991) showed, in a more complex task, that 3-year-olds were unable to carry out relational matches without additional support, evidenced by choosing an object match instead of a relational match, but that 5-year-olds were successful in this task. Despite positive evidence from prompting comparisons even in young children, presenting two cases, examples, entities, or situations simultaneously may not always be sufficient for learners to recognize similarities (Kurtz et al., 2001). In particular, it has been shown that young children benefit from additional prompts, such as the use of common labels, when presented with two objects to be compared (Gentner, 2010; Christie, 2020).

Using Language to Promote Comparison
Labels and other types of verbal scaffolds often facilitate analogical reasoning by comparison (Alfieri et al., 2013). Namy (1999, 2006), Namy and Gentner (2002), Gentner (2010, 2016), and Hespos et al. (2020) argue that language, in general, plays a decisive role in triggering analogical reasoning by comparison even in young children. Language and structure-mapping are suggested to bootstrap each other, mutually influencing cognitive and conceptual development in young and older children as well as adults (Gentner, 2010; Christie, 2020). As Gentner and Rattermann (1991, p. 260) put it, "a word can function as a promissory note, signaling subtle commonalities that the child does not yet perceive." Gentner (2010) proposes four ways in which language and structure-mapping interact: (1) Common labels invite comparison and abstraction by highlighting similarity across entities; (2) the naming of entities promotes reification since it preserves abstraction linguistically; (3) the naming of entities promotes uniform relational encoding; and (4) the use of linguistic structures invites the construction of conceptual structures. In the present research, we focus on the function of labels to invite and trigger comparison.
In a series of experiments, Gentner et al. (2007) investigated the use of common labels for triggering analogical reasoning by comparison. For example, Namy and Gentner (2002) tested two groups of 4-year-olds using a forced-choice match-to-sample task in which the children had to extend the label of one object (the so-called standard) to one of two other objects. In the no-comparison group, the experimenter labeled a single standard (e.g., a picture of an apple) with a made-up name (e.g., blicket). The children had to decide which one of the two other objects, the taxonomic item (banana) or the perceptually similar item (balloon), would have the same name as the standard. In the comparison group, the task was the same except that the experimenter showed and labeled pictures of two taxonomically related standards (e.g., an apple and a pear) with the same made-up name. Namy and Gentner found that preschoolers' correct taxonomic choices for this task increased when two standards were presented rather than just one standard (see also Gentner and Namy, 1999). Importantly, children were even more likely to make taxonomic choices when the two standards were labeled with the same noun. By contrast, when the two standards received different labels, children did not engage in comparison, as indicated by their increased selection of the perceptually similar but taxonomically unrelated item.
This effect of labeling holds not only for children's category learning but also for their generalization of properties across entities. That is, the presence of a common label enhances young children's willingness to make inductive inferences between entities (e.g., Gelman and Markman, 1986; Davidson and Gelman, 1990; Saalbach et al., 2012). For example, Gelman and Markman (1986) found that preschoolers can generalize properties across members of the same category when category membership is labeled with the same noun, but not when it is unlabeled. Thus, linguistic labels can serve as simple scaffolds to trigger young children's comparison processes since they suggest similarity between two items. Importantly, these beneficial effects of labels have typically been found in tasks that require only a low degree of prior knowledge in young children (Christie, 2020). In our study, we investigated the role of labels in a domain in which learners start with some degree of prior conceptual knowledge. Specifically, we propose that labels of material kind (e.g., "this is made of wood") will function as a cue at a superordinate level. Material labels provide rich associations with unique properties of objects, such as their texture and specific weight, as well as their empirically observable behavior of floating or sinking in water. From early on, children learn that common labels are used for things that are alike. If labels are used in an instructional context, they can elicit comparison of the respective objects and thereby highlight the relevant (underlying) commonality of different yet related entities even in the absence of perceptual similarity (Gentner, 2010). Several studies have shown that nouns provide information about classes of objects better than verbs or adjectives do (Arunachalam and Waxman, 2010; Graham et al., 2012). In addition, Johanson and Papafragou (2016) found that labeling using nouns works with similar success as labeling using facts (e.g., descriptions of properties) in ambiguous situations. In our two experiments, we systematically varied the use of superordinate-level nouns when labeling objects to promote comparison.

The Present Research
We tested whether different task environments designed to trigger analogical reasoning by comparison with or without labeling the respective objects would support preschoolers' induction of the concept of material kind to predict and explain objects' floating or sinking. Specifically, 4- to 7-year-old preschoolers were exposed to sets of material within a forced-choice match-to-sample task with variations in the number of standard objects and the use of superordinate labels. On the basis of Gentner (2010) and the results from Namy and Gentner (2002), we expected that triggering comparison via the use of linguistic labels would amplify the effect of presenting two objects simultaneously. In order to probe the effect of labeling on performance, we varied the extent to which the labels were employed across our two experiments.

EXPERIMENT 1
In Experiment 1, we crossed two factors: the number of standard objects and the use of material labels. Specifically, we presented either one standard or two standards to test the effect of comparison on children's predictions and explanations, and we either labeled the presented objects or did not label them at all in order to test the additional effect of labeling on the potential benefits of comparison. The labels referred to material kind and therefore served as superordinate category labels indicating common properties of the presented objects. Consequently, it was possible that they might amplify children's perception of the similarities between the presented objects in conditions with two standards. Before the children were randomly assigned to the four conditions of Experiment 1, their prior knowledge was assessed with a pretest and a baseline assessment.
Our research questions and hypotheses were as follows: (1) Will preschoolers induce the concept of material kind when analogical reasoning by comparison is triggered by presenting two standards with the same floating behavior?
Hypothesis (1): Preschoolers in conditions with two standards will outperform preschoolers in conditions with one standard on the Prediction and Explanation Tasks.
(2) Does the use of material labels facilitate preschoolers' induction of the concept of material kind?
Hypothesis (2): The use of material labels will improve performance in the Prediction and Explanation Tasks in the condition with two standards, as indicated by an interaction effect of Labeling and Number of Standards.

Method
Participants
Fifty-nine preschoolers from a major German city who had German as a first language and a mean age of 5 years, 3 months (min. = 4 years, 11 months; max. = 6 years) participated in this study (testing was in German). They were recruited through preschools. Parental consent for participation was obtained for all children. The children came from middle-class families living in urban and suburban areas.

Design
In a 2 × 2 between-groups design, we tested the importance of triggering a comparison (one standard or two standards) and the benefit of labeling standards with the respective material label (unlabeled versus labeled). Specifically, the four conditions were (1) One Unlabeled Standard (one_unlabeled); (2) One Labeled Standard (one_labeled); (3) Two Unlabeled Standards (two_unlabeled); (4) Two Labeled Standards (two_labeled). In the unlabeled conditions, the standards were referred to as "this"/"these"; in the labeled conditions, the standards' material was named (e.g., "this/these is/are made of wood"). In all conditions, children had to predict (Prediction Task) and explain (Explanation Task) which of the four selection objects would float or sink like the standards. Table 1 gives an overview of the experimental conditions and the respective instructions. Before the children were randomized to these four conditions, all children participated in a pretest and a baseline assessment. In the pretest, children had to match objects made of the same material (Matching Task) and to name the objects' material (Labeling Task). Subsequently, in the baseline assessment, children were presented with one standard that was not labeled and four selection objects. As in the four conditions described above, children had to predict (Prediction Task) and explain (Explanation Task) which of the four selection objects would float or sink like the standard.
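The crossing of the two factors can be made concrete with a short sketch. The condition names below are the ones given in the text; the assignment routine is only an assumption for illustration, since the source states that children were randomized but does not describe the procedure. The function name `assign_conditions`, the seed, and the balanced (round-robin after shuffling) scheme are hypothetical.

```python
import itertools
import random

# The two between-groups factors as named in the design description.
STANDARDS = ("one", "two")
LABELING = ("unlabeled", "labeled")

# Crossing both factors yields the four condition names used in the text.
CONDITIONS = [f"{n}_{lab}" for n, lab in itertools.product(STANDARDS, LABELING)]


def assign_conditions(child_ids, seed=0):
    """Hypothetical balanced random assignment (not the authors' documented procedure).

    Shuffles the children, then deals them out to the four cells in turn,
    so cell sizes differ by at most one.
    """
    rng = random.Random(seed)
    shuffled = list(child_ids)
    rng.shuffle(shuffled)
    return {child: CONDITIONS[i % len(CONDITIONS)] for i, child in enumerate(shuffled)}


assignment = assign_conditions(range(59))
```

With 59 children, a scheme of this kind produces three cells of 15 and one cell of 14; the actual cell sizes in the study may differ.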

Tasks and Procedure
All participants were tested individually in a quiet room in their preschools by an experimenter blind to the hypotheses.

Table 1. Experimental conditions and respective instructions.
Unlabeled, One Standard: Look, this one floats/sinks in water. Which of these also floats/sinks just like this one?
Unlabeled, Two Standards: Look, these two float/sink in water. Which of these also floats/sinks just like these ones?
Labeled (example: wood), One Standard: Look, this is made out of wood and it floats in water. Which of these also floats just like this one: this one made of wood, this one made of metal, this one made of metal, or this one made of glass?
Labeled (example: wood), Two Standards: Look, this one is made out of wood and it floats in water. And this one is also made out of wood and it floats, too. Which of these also floats just like these ones: this one made of wood, this one made of clay, this one made of clay, or this one made of metal?
Testing began with the pretest and the baseline assessment, comprising the Prediction and Explanation Tasks; afterward, the children took the Prediction and Explanation Tasks with different materials, depending on the conditions.

Pretest: Matching/labeling task
Two tasks were employed to assess the children's prior knowledge of materials and their labels. The children were presented with a total of 16 objects made of eight different materials (wood, stone, metal, plastic, Styrofoam, wax, glass, clay), each with two different shapes (e.g., a wooden block and a wooden spoon). In the Matching Task, the children were asked to match pairs of objects: "Find the two things that belong together and put them together on the table." After a child had matched all the objects, the experimenter rearranged them in pairs by material regardless of how the children had arranged them in the Matching Task. In the Labeling Task, the children were asked to name the material: "Tell us what the objects are made of." The experimenter did not use the term "material." The Matching Task was scored with respect to the successful matching of objects according to their material, with one point assigned for each correct material-based match of two objects (range of scores 0-8). The Labeling Task was scored with one point for each correct material label (range of scores 0-8).

Baseline assessment: Prediction and explanation task
The baseline assessment served to measure how children would predict and explain objects' floating or sinking if they were only provided with a single standard that was not labeled, i.e., a task environment without elements to support comparison. Specifically, we employed six object sets in the baseline assessment. All objects in these sets were different from the objects used in the pretest. In each set, one object of a specific material served as the standard; four objects served as the selection items. In each object set, only one of the selection objects was made of the same material as the standard, but it always had a different shape and size. The other three selection objects were distractors that were selected on the basis of children's typical misconceptions (Hardy et al., 2006). Of the three distractors, there was one with the same shape as the standard. The other two distractors had a salient size or weight. That is, if the standards floated, we used extremely light and/or small selection objects (e.g., a small needle), whereas if the standards sank, we used extremely heavy and/or large selection objects (e.g., a large piece of wax). Only the selection object made of the same material as the standards sank/floated like the standards. Half of the sets had a standard made of material that floats in water (wood, wax, Styrofoam), and the other half had a standard made of material that sinks in water (metal, plastic, clay). Table 2 provides an overview of all object sets used in the baseline assessment and the conditions of Experiment 1 and Experiment 2.
In the Prediction Task the experimenter first took the standard and said "Look, this floats/sinks in water." The experimenter then immersed the standard in a water basin, and the child observed whether the object sank or rose to the top. The experimenter then positioned this object above the four selection objects and asked "Which of these also floats/sinks, just like this one here? This one, this one, this one, or this one?" while pointing at the four selection items. The child then selected one of four objects that would float/sink just like the standard. The children received one point for choosing the selection object of the same material as the presented standard and zero points for choosing one of the other three selection objects (i.e., the range of possible scores for the Prediction Task was 0-6).
The Explanation Task followed immediately after the child had made their prediction. That is, once the child had chosen one of the four selection objects, the experimenter asked the child to explain their prediction by asking, "Why do you think so?" For every set, we coded whether the answers referred to material kind. If the child referred to the material or a corresponding quality at least once (e.g., "because it is made out of wood," "because it is made out of the same stuff," "because it is just the same"), the child received one point (i.e., the range of possible scores for the Explanation Task was 0-6).
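The scoring rules for both tasks can be summarized in a minimal sketch. The data structure, function names, and example materials are hypothetical and serve only to illustrate the rule of one point per set; the explanation codes are assumed to come from a human coder, as in the study, and are not derived automatically. The chance expectation is an added observation that holds only under the assumption of uniform random guessing among the four selection objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectSet:
    """One item set: a standard plus four selection objects (hypothetical structure)."""
    standard_material: str
    selection_materials: tuple  # materials of the four selection objects


def score_prediction(item, chosen_index):
    """Return 1 if the chosen selection object shares the standard's material, else 0."""
    return int(item.selection_materials[chosen_index] == item.standard_material)


def prediction_total(items, choices):
    """Sum the per-set points; with six sets the total ranges from 0 to 6."""
    return sum(score_prediction(item, choice) for item, choice in zip(items, choices))


def explanation_total(material_references):
    """Sum human-coded judgments (True if an explanation referred to material kind)."""
    return sum(int(bool(flag)) for flag in material_references)


def chance_expectation(n_sets, n_options=4):
    """Expected Prediction Task total if every choice were a uniform random guess."""
    return n_sets / n_options


# Illustrative placeholder sets (not the study's actual object sets).
items = [
    ObjectSet("wood", ("metal", "wood", "glass", "clay")),
    ObjectSet("metal", ("wax", "plastic", "metal", "wood")),
]
choices = [1, 0]  # index of the selected object in each set
print(prediction_total(items, choices))  # 1
print(chance_expectation(6))             # 1.5
```

Under the guessing assumption, a child choosing at random among four objects would be expected to score 1.5 out of 6 on the Prediction Task.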

Conditions: Prediction and explanation task
After the baseline assessment, the four different conditions followed. The materials used in the conditions were composed in a way that was similar to the baseline assessment. That is, we developed six novel object sets following the same logic as described above. However, all objects had different shapes than the objects used in the baseline assessment (and than the objects used in the pretest), and we chose a second standard for each set to be presented in the conditions with two standards (see Figures 1, 2 for object sets in the one-standard condition and the two-standard condition, respectively).
The instructions and the number of standards varied for the four conditions (see Table 1). The children saw either one or two standards, and the objects of a set were either labeled or unlabeled. In the labeled conditions, the standards were labeled according to their material after their floating behavior had been shown (e.g., "Look, this one is made out of wood and it floats in water. And this one is also made out of wood and it floats, too" in the two-standard condition). To increase the salience of labels, the labeling was not only applied to the standards but also to all selection objects.
In the Prediction Task, the children were asked to choose one of the selection objects that would show the same floating behavior ("Which of these also floats in water, just like these two? This one made out of iron, this one made out of wood, this one made out of glass, or this one made out of clay?"). Afterward, the children were asked to explain their selection (Explanation Task). The Prediction and the Explanation Tasks were scored as described for the baseline assessment (i.e., the range of scores was 0-6). In all conditions, the children were allowed to touch the objects and hold them in their hands, but they were not allowed to put them into the water. Only the experimenter immersed the standards into water for the children to observe. If the children expressed a wish to immerse objects into the water, they were told that they could do so after the experiment was finished. In all phases of the experiment, the children were praised for their active participation; however, no feedback was given concerning the accuracy of their replies.
Table 3 presents the means and standard deviations of the Matching Task and the Labeling Task (pretest), as well as of the Prediction Task and the Explanation Task for the baseline assessment and four conditions. The scores of the Prediction and the Explanation Tasks of the baseline assessment were employed as covariates in the respective task analyses of children's performance in the different conditions. Before presenting the results with regard to our hypotheses, we present the results of the preliminary analyses testing whether there were significant differences across conditions on the two pretest tasks and the children's performance in the Prediction Task and Explanation Task in the baseline assessment. One-way ANOVAs showed no significant differences across the experimental conditions for the pretests (Matching Task: p = 0.67, ηp² = 0.028; Labeling Task: p = 0.98, ηp² = 0.004). The Labeling Task (36% correct) was more difficult than the Matching Task (74% correct). There was also no significant difference in the Prediction Task for the baseline assessment across conditions (p = 0.47, ηp² = 0.04). There was, however, a significant difference in the Explanation Task of the baseline assessment across conditions, F(3,55) = 5.68, p = 0.002, ηp² = 0.24. Follow-up analyses of the Explanation Task indicated that children in the one_unlabeled condition had significantly higher scores than children in the one_labeled and the two_unlabeled conditions (see Table 3).
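The partial eta squared values reported alongside F statistics can be cross-checked with the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the baseline Explanation Task result; the function name is hypothetical, and the check is offered as a reader's aid, not as part of the authors' analysis.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Baseline Explanation Task: F(3,55) = 5.68
print(round(partial_eta_squared(5.68, 3, 55), 2))  # 0.24
```

The recovered value matches the reported ηp² = 0.24.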

Results
To test Hypotheses 1 and 2 (i.e., better performance in the two-standard conditions and an interaction effect of Number of Standards and Labeling), a 2 × 2 ANCOVA with Number of Standards (one vs. two) and Labeling (unlabeled vs. labeled) and the Prediction Task performance as the dependent measure was computed, using the performance in the Prediction Task in the baseline assessment as covariate. The covariate significantly predicted the dependent measure, F(1,54) = 4.35, p = 0.04, ηp² = 0.07. As expected, we found a significant effect of the Number of Standards, F(1,54) = 4.97, p = 0.03, with overall higher mean accuracy in the two-standard condition, M = 3.40 (SD = 1.83), than in the one-standard condition, M = 2.45 (SD = 1.66). However, the size of the effect was rather small (ηp² = 0.08). We also found a significant but small main effect of Labeling, F(1,54) = 4.21, p = 0.01, ηp² = 0.04, with higher mean accuracy in the labeled conditions, M = 3.69 (SD = 1.73), than in the unlabeled conditions, M = 2.20 (SD = 1.56). Contrary to our expectation, there was no interaction of Number of Standards × Labeling, p = 0.69, ηp² = 0.003. We performed the same analyses with the performance in the Explanation Task as a dependent measure, using performance in the Explanation Task in the baseline assessment as covariate. The covariate significantly predicted this dependent measure, F(1,54) = 6.26, p = 0.015, ηp² = 0.10. We found only a significant effect of Labeling, F(1,54) = 11.00, p = 0.002, ηp² = 0.17, with M = 2.34 (SD = 2.24) for the labeled conditions and M = 0.97 (SD = 1.45) for the unlabeled conditions. Contrary to our hypotheses, however, there was no significant main effect of Number of Standards, p = 0.20, ηp² = 0.03, and no significant Number of Standards × Labeling interaction, p = 0.82, ηp² = 0.001. In exploratory post hoc analyses, we checked whether age contributed to the reported effects. 
We did not find significant differences between younger children (≤64 months, N = 36) and older children (

Discussion of Experiment 1
In Experiment 1, we investigated whether the presentation of two standards of the same material would support children in relying on the concept of material kind when predicting and explaining objects' floating or sinking, and whether the labeling of objects with their respective materials would increase children's ability to induce the concept of material kind. We found that preschoolers were indeed more likely to base their predictions on the concept of material kind when comparison was triggered by the presentation of two standard objects, with an overall small but significant effect. When presented with two standard objects, children decreased their reliance on irrelevant features, such as shape or size, when predicting floating or sinking. However, our hypothesis that the effect of comparison would be intensified by presenting labels was not confirmed. While we found a main effect of Labeling in the Prediction Task, we did not find a significant interaction between the factors Number of Standards and Labeling. Unexpectedly, even in the one-standard condition, the children were more likely to choose objects with the same material if they were provided with objects' material labels. This finding suggests that children in the one-standard condition used the common label of the standard and the same-material item of the selection objects to derive conclusions with respect to their underlying commonalities.
In the Explanation Task, we found an effect only for Labeling. As for the Prediction Task, the children showed a significantly higher tendency to refer to material when explaining their choices in the conditions when labels were used. In contrast to the results of the Prediction Task and in contrast to our hypothesis, however, there was no effect of the Number of Standards factor. It would appear that the children were able to base their predictions on material kind in conditions with two standards, but they did not explicate their intuitive knowledge when prompted for explanations unless they were supported by labels. On average, the solution rates for explanations were lower than for predictions. Therefore, one might speculate that this task placed a higher demand on children with regard to the retrieval of conceptual knowledge. A significant group difference had already been detected in the Explanation Task of the baseline assessment prior to the experimental variation, and this a priori difference may have diluted the effect of producing explanations in the experimental conditions. We accounted for this difference by using the baseline performance in the Explanation Task as a covariate. Nevertheless, this unexpected baseline difference may have biased performance in the Explanation Task. Thus, these results should be treated with caution.
How may we explain the effect of labeling on children's performance? First, it is likely that the labels presented in Experiment 1 elicited comparison processes that directed the children's attention to the objects' material, especially because material names were used to label all objects. While developmental research has shown that basic-category labels prompt analogical reasoning by comparison in children from an early age (Gelman and Markman, 1986; Davidson and Gelman, 1990; Nguyen and Gelman, 2012; Childers, 2020), Experiment 1 showed that material labels that provide superordinate information may also support inductive reasoning to derive predictions and explanations for floating and sinking. Second, using material labels may activate specific prior conceptual knowledge of material kind. For example, children may know that "wood" refers to the specific quality that some floating objects are made of. Thus, the children may have been more likely to pick wooden objects due to their prior experience. As we applied material labels to the standards and to all selection objects, the children's attention was drawn to material kind as a quality of all the presented objects. Thus, the activation of prior knowledge may have been especially strong. Finally, the effects of Number of Standards and Labeling in the Prediction Task were rather small. In contrast to Namy and Gentner's (2002) study, we used a baseline assessment of children's initial performance. Using this baseline performance as a covariate increases the power to detect effects. Moreover, Namy and Gentner investigated basic-category learning, whereas our study employed tasks in a scientific reasoning context. It is likely that this type of task led to rather small experimental effects since conceptual restructuring in science is difficult to achieve with short-term instructional interventions (Schneider et al., 2012). 
Specifically, the induction of a concept of material kind may be regarded as a process of initial conceptual restructuring. Since we chose our selection objects in the Prediction and Explanation Tasks on the basis of well-known misconceptions about floating and sinking at preschool and elementary school age (Leuchter et al., 2014), this task requires children to inhibit answers due to misconceptions to make the correct prediction and provide the correct explanation.
To evaluate whether an effect of labeling is indeed due to the activation of prior conceptual knowledge, we contrasted the use of real material labels and made-up labels in Experiment 2. If the labeling effect were due to the activation of prior conceptual knowledge, then we would find effects only for real labels and not for made-up labels. To diminish the effects of drawing attention to material kind as a dimension of all objects in a set, we labeled only the standards and not the selection objects.

EXPERIMENT 2
As in Experiment 1, we varied whether one or two standards were presented and whether the standards were unlabeled (i.e., "this/these") or labeled with respect to their material (e.g., "this/these one/s is/are made of wood"). In addition to the use of real material labels, we included two conditions in which we used made-up labels (e.g., "these ones are made of idoform"). If the labeling effect in Experiment 1 was due to the triggering of analogical reasoning by comparison, then the effect would appear only in the two-standard conditions. If it was due to the triggering of prior conceptual knowledge by the use of the real material labels, then the effect would only appear in the labeled conditions with real material names. In addition to the Prediction and the Explanation Tasks of Experiment 1, we assessed whether there was evidence of conceptual knowledge transfer with regard to predicting whether novel objects would float or sink. We therefore employed a test of conceptual knowledge immediately before the baseline assessment (transfer pretest) and after the children had completed the conditions (transfer posttest).
Our research questions and hypotheses were as follows:
(3) Will preschoolers induce the concept of material kind when analogical reasoning by comparison is triggered by the presentation of two standards with the same floating behavior?
Hypothesis (3): In conditions with two standards, preschoolers will outperform participants in conditions with one standard on the Prediction and the Explanation Tasks.
(4) Does the use of material labels trigger analogical reasoning by comparison?
Hypothesis (4): The use of real material labels, but not the use of made-up labels, will improve performance in the two-standard condition in the Prediction and Explanation Tasks.
(5) Will the intervention lead to conceptual knowledge transfer?
Hypothesis (5): There will be knowledge transfer in the two-standard conditions.

Method
Participants Ninety-nine children from Central Switzerland with German as a first language participated (testing was in German). They were recruited with parental consent through preschools. The mean age was 5 years, 8 months, ranging from 4 years, 4 months to 7 years. The children were mostly from middle-class families living in suburban areas.

Design
Experiment 2 was based on a 2 × 3 between-groups design, including a baseline assessment as in Experiment 1. After the pretest and the baseline assessment, the preschoolers took a conceptual knowledge transfer test (transfer pretest). They then participated in one of six conditions: (1) One Unlabeled Standard (one_unlabeled), (2) One Standard Labeled with Real Material Label (one_real), (3) One Standard Labeled with Made-up Material Label (one_made-up), (4) Two Unlabeled Standards (two_unlabeled), (5) Two Standards Labeled with Real Material Label (two_real), or (6) Two Standards Labeled with Made-up Material Label (two_made-up). After the conditions, the conceptual knowledge transfer test was again presented (transfer posttest). Table 4 gives an overview of the different conditions and the respective instructions.

Table 4. Overview of the conditions and the respective instructions.
Unlabeled
One Standard: Look, this one floats/sinks in water. Which of these also floats/sinks just like this one?
Two Standards: Look, these two float/sink in water. Which of these also floats/sinks just like these two?
Real Label (example: wood)
One Standard: Look, this one is made out of wood and it floats in water. Which of these also floats just like this one?
Two Standards: Look, this one is made out of wood and it floats in water. And this other one is also made out of wood and it floats, too. Which of these also floats just like these two?
Made-up Label
One Standard: Look, this one is made out of feb and it floats in water. Which of these also floats just like this one?
Two Standards: Look, this one is made out of feb and it floats in water. And this other one is also made out of feb and it floats, too. Which of these also floats just like these two?
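The crossing of Number of Standards and Labeling into six conditions, together with the wording of the instructions, can be illustrated with a short sketch. This is only an illustration under our own assumptions: the function `instruction` and its parameters are hypothetical, and the English wording follows the examples in Table 4 rather than the German protocol used in testing.

```python
from itertools import product

NUMBERS = ("one", "two")
LABELINGS = ("unlabeled", "real", "made-up")
# Plural verb forms needed when two unlabeled standards are shown together.
PLURAL = {"floats": "float", "sinks": "sink"}


def instruction(number: str, labeling: str, behavior: str = "floats",
                label: str = "wood") -> str:
    """Build the experimenter's prompt for one cell of the 2 x 3 design."""
    if number not in NUMBERS or labeling not in LABELINGS:
        raise ValueError("unknown condition")
    target = "this one" if number == "one" else "these two"
    question = f"Which of these also {behavior} just like {target}?"
    if labeling == "unlabeled":
        verb = behavior if number == "one" else PLURAL[behavior]
        demo = f"Look, {target} {verb} in water."
    else:
        # Real and made-up labels share one template; only the label differs.
        demo = f"Look, this one is made out of {label} and it {behavior} in water."
        if number == "two":
            demo += (f" And this other one is also made out of {label}"
                     f" and it {behavior}, too.")
    return f"{demo} {question}"


for number, labeling in product(NUMBERS, LABELINGS):
    label = "feb" if labeling == "made-up" else "wood"
    print(f"{number}_{labeling}: {instruction(number, labeling, label=label)}")
```

In this sketch, the real-label and made-up-label conditions differ only in the label string, which mirrors the logic of contrasting the two: any advantage of real labels would then be attributable to the label's familiarity rather than to the sentence frame.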

Tasks and Procedure
The material, tasks, and procedure employed in Experiment 2 were largely similar to those of Experiment 1. In addition, we employed a conceptual knowledge transfer test that was first conducted immediately before the baseline assessment and then repeated after the children had solved the tasks in the different conditions.

Pretest: Matching/labeling task
The material and procedure for these tasks were the same as in Experiment 1.

Baseline assessment and conditions: prediction and explanation task
The object sets (see Table 2) and procedure were the same as in Experiment 1 for the baseline assessment and for the unlabeled conditions (one_unlabeled, two_unlabeled). For the labeled conditions, the procedure differed from Experiment 1 in the following way. In the conditions using real material labels (one_real, two_real), the standards were labeled while their floating behavior was demonstrated (e.g., "Look, this is made out of wood and it floats in water. And this is also made out of wood and it floats, too."). Afterward, the children were asked to select one of the selection objects ("Which of these also floats/sinks in water, just like these two? This one, this one, this one, or this one?"). Thus, in contrast to the procedure in Experiment 1, we applied the labels only to the standards and not to the four selection objects. In the conditions with made-up labels (one_made-up, two_made-up), the procedure was the same as in the real material label conditions except that the real material labels were replaced with made-up labels. For all object sets, the children had to choose one of the selection objects (Prediction Task) and to explain their selection (Explanation Task). These two tasks were scored as in Experiment 1.

Test on conceptual knowledge transfer of floating and sinking
We designed a conceptual knowledge test to measure potential knowledge transfer. This test also included Prediction and Explanation Tasks. In the Transfer Prediction Task, the children were asked to predict whether a presented object would float or sink in water. In the Transfer Explanation Task, the children were asked to verbally explain their answers. Five objects were used, all consisting of material that was also used in the object sets for the baseline assessment and in the different conditions. However, the objects in the transfer tasks had novel shapes and were selected because they represented common misconceptions based on the perceptual qualities of size, weight, or shape: a large and heavy wooden block, a thin wooden board with holes, a metal cube, a small metal needle, and a large block of Styrofoam (see also Leuchter et al., 2014, for a similar task). When predicting whether such objects float or sink in water, children typically refer to their size, weight, or shape in their explanations rather than to their material. Each object was first shown to the child, and the child was asked to touch it and to hold it. A container with water was placed onto the table next to the objects, but the children were only allowed to put the objects into it after all tasks were finished. In the Transfer Prediction Task, the child was first asked "Does this float or sink in water?" In the immediately following Transfer Explanation Task, the child was asked "Why do you think so?" In the Transfer Prediction Task, the children received one point for a correct prediction (i.e., range of scores 0-5). In the Transfer Explanation Task, the children received one point if they provided a correct explanation with regard to material kind (i.e., range of scores 0-5). 
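The scoring rule for the two transfer tasks can be sketched as follows. The answer key is an assumption derived from material kind (wood and Styrofoam float, metal sinks), and the names `TRANSFER_OBJECTS` and `score_transfer` are hypothetical. Whether an explanation refers to material kind would be coded by a rater, so it enters the sketch as a boolean; we also assume, for illustration, that the two tasks are scored independently.

```python
# Hypothetical answer key: the expected behavior follows from material kind.
TRANSFER_OBJECTS = {
    "large heavy wooden block": "float",
    "thin wooden board with holes": "float",
    "metal cube": "sink",
    "small metal needle": "sink",
    "large Styrofoam block": "float",
}


def score_transfer(responses):
    """Return (prediction_score, explanation_score), each in the range 0-5.

    `responses` maps each object name to a pair:
    (predicted behavior, whether the explanation referred to material kind).
    """
    prediction = 0
    explanation = 0
    for name, correct in TRANSFER_OBJECTS.items():
        predicted, refers_to_material = responses[name]
        prediction += int(predicted == correct)    # one point per correct prediction
        explanation += int(refers_to_material)     # one point per material-based explanation
    return prediction, explanation


# A child reasoning from size ("big things sink, small things float").
size_based = {
    "large heavy wooden block": ("sink", False),
    "thin wooden board with holes": ("float", False),
    "metal cube": ("sink", False),
    "small metal needle": ("float", False),
    "large Styrofoam block": ("sink", False),
}
print(score_transfer(size_based))  # (2, 0)
```

The example shows why these objects were chosen: a child relying on size still gets some predictions right by coincidence, but earns no explanation points.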
Table 5 presents the means and standard deviations of the six conditions for the pretest scores (Matching Task, Labeling Task), the Prediction Task, and the Explanation Task in the baseline assessment and the different conditions. Table 6 presents the respective descriptive values for the pre-and posttest scores of the Transfer Prediction Task and the Transfer Explanation Task.
To test our hypotheses derived from research questions 3 and 4 (i.e., better performance in the two-standard conditions and an improvement in performance in the two-standard condition with real labels), we conducted a 2 × 3 ANCOVA with Number of Standards (one vs. two) and Labeling (no material label vs. real material label vs. made-up label) as between-subject factors, and performance in the Prediction Task in the different conditions as the dependent measure, including the baseline Prediction Task performance as covariate. The covariate predicted performance in the Prediction Task significantly, F(1,92) = 92.77, p = 0.001, ηp² = 0.50. As expected, we found a main effect for Number of Standards, F(1,92) = 8.28, p = 0.005, with higher means in the conditions with two standards, M = 3.79 (SD = 2.18), than in the conditions with one standard, M = 2.76 (SD = 2.18). The effect size, however, was rather small (ηp² = 0.08). We found neither an effect of Labeling, p = 0.49, ηp² = 0.01, nor an interaction of Number of Standards × Labeling, p = 0.87, ηp² = 0.003. Overall, the results suggest that presenting two objects of the same material benefits children's induction of the concept of material kind. Labeling the material of the standards did not cause significant performance differences, regardless of whether real or made-up labels were used or whether labels were combined with the presentation of two objects.
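For orientation, the 2 × 3 ANCOVA described above can be written in the standard textbook form shown below. This is a generic formulation, not a specification taken from the original analysis; the symbols are our own notation, with i indexing Number of Standards, j indexing Labeling, k indexing children within a cell, X the baseline score, and Y the score in the condition.

```latex
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}
          + \gamma\,\bigl(X_{ijk} - \bar{X}\bigr) + \varepsilon_{ijk},
\qquad i \in \{1, 2\},\; j \in \{1, 2, 3\}

\eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}
```

Under this model, six cell parameters and one covariate slope are estimated, so with N = 99 the error term has 99 - 6 - 1 = 92 degrees of freedom, which matches the reported F(1,92).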
For the Explanation Task performance, we conducted the same 2 × 3 ANCOVA as for the Prediction Task. We found a similar pattern. There was a significant effect for Number of Standards, F(1,92) = 6.16, p = 0.015, ηp² = 0.063, with M = 2.92 (SD = 2.47) for the two-standard conditions and M = 1.96 (SD = 2.21) for one-standard conditions. This finding suggests that preschoolers were more likely to explain their choices by reference to material kind when two standards were presented than when only one standard was given. There was no effect for Labeling (p = 0.66, ηp² = 0.01) and no Number of Standards × Labeling interaction (p = 0.60, ηp² = 0.011). The covariate predicted performance in the Explanation Task significantly, F(1,92) = 127.67, p < 0.001, ηp² = 0.58. In additional exploratory analyses, we found no significant differences between the age groups of children younger than 60 months (N = 11), between 61 and 72 months (N = 51), and more than 72 months (N = 20)  Time × Labeling × Number of Standards, p = 0.74, ηp² = 0.006. That is, the improvement from pretest to posttest in the Explanation Task did not differ between conditions.

Discussion of Experiment 2
As expected, we found that preschoolers were more likely to induce the concept of material kind when analogical reasoning by comparison was triggered by presenting two objects of the same material rather than only one object. This effect was reflected in the Prediction Task and the Explanation Task, albeit with small effect sizes overall. As in Experiment 1, children's performance in the baseline assessment, included as a covariate in our statistical models, contributed significantly to children's performance in the Prediction and Explanation Tasks, with a large effect size. In Experiment 1, we also found that assigning a common label to all objects increased choices of objects of the same material. In contrast to Experiment 1, we used the material labels in a more restricted way in Experiment 2, applying them only to the standards and not to the selection objects. This more restricted use did not improve preschoolers' performance in the Prediction and Explanation Tasks in comparison to the unlabeled conditions. Our analyses of the Transfer Prediction Task and the Transfer Explanation Task only revealed a small overall gain in the Transfer Explanation Task. Overall, none of the conditions caused specific knowledge transfer effects. However, do our results suggest that the use of labels promotes analogical reasoning by comparison? The lack of a difference between the "two standards/no label" and "two standards/real label" conditions suggests that using material labels has no effect on top of presenting two objects. Given that the instruction and observation of the floating and sinking of two objects might have already invoked the comparison of the two standards, children may not have needed an additional linguistic prompt to align both items.

Task Effects of Comparisons of Number of Standards and Linguistic Labels
In our two experiments, we investigated whether triggering analogical reasoning by comparison and additional labeling would enhance preschoolers' induction of the concept of material kind as a basis for the generation of predictions and explanations in a scientific context for "floating and sinking." To this end, preschoolers were randomly assigned to conditions that were intended to trigger analogical reasoning by comparison by presenting objects evidenced to sink or float in water (i.e., standards). The objects were presented without labels, with their real material labels, or with made-up labels (Experiment 2 only). The children were then asked to predict which of four selection objects with an unknown status would float or sink and to explain their prediction. As expected, the presentation of two standards rather than one standard supported preschoolers' induction of the concept of material kind as a basis for generating predictions in both experiments (Ruffman et al., 1993). The benefit of two standards for the provision of explanations only emerged in the second experiment, however. This pattern of findings fits with previous research emphasizing the beneficial role for category and concept learning of triggering analogical reasoning by comparison through the presentation of two standards (Loewenstein et al., 1999; Gentner et al., 2007; Gentner, 2010, 2016). Unlike previous research on preschoolers' learning through comparison, which employed forced-choice tasks, the present experiments assessed children's induction of conceptual knowledge for hypothesis generation using tasks that required children to select an object (Prediction Task) and to explain their selection (Explanation Task).
Prior research on analogical reasoning by comparison has provided some evidence that labels as language prompts may be even more beneficial than simply juxtaposing entities or objects. Language is presumed to play a pivotal role because using common labels across different entities may function as an invitation to compare and, as such, to align the similarities and differences of the entities and to encode and abstract a generalizable schema (Gentner and Namy, 1999, 2006; Namy and Gentner, 2002; Gentner, 2010, 2016). In Experiments 1 and 2, we labeled the material comprising each object, potentially providing superordinate category labels. Contrary to our hypothesis, in Experiment 1 we found an effect of labeling in the conditions with one standard as well as in those with two standards. Since the labels in Experiment 1 were not only applied to the presented standards but also to the four selection objects, it is likely that the children's prior conceptual knowledge of material kind was activated in the condition with a single standard, facilitating responses based on the same material. The use of real material labels in Experiment 1 differed from the approach of Namy and Gentner (2002), who used made-up labels. In Experiment 1, at least some of the children were familiar with material labels from everyday life contexts, as indicated by their performance in the pretest, in which the children were asked to label the material of various objects. In Experiment 2, we therefore used labels in a more restricted way in order to differentiate between the effects of prior conceptual knowledge and the facilitation of comparison by labeling. We found no effect for labeling with this more restricted use. Children's predictions and explanations did not differ from the conditions without labels, regardless of whether made-up or real labels were used. 
In Gentner and Namy's (1999) Experiment 2, a label and a no-label condition were contrasted, as well as a compare (two standards) and a noncompare (one standard) condition. Gentner and Namy found that the label/two-standard condition significantly increased appropriate responses not only in contrast to both single-standard conditions but also to the unlabeled/two-standard condition. How may these differences be explained? We suppose that even our "two standards without labeling" condition sufficed as an invitation for analogical reasoning by comparison because the children could also observe whether these objects floated or sank as empirical evidence associated with the respective objects. These observations may have created an alignable similarity between the two objects, helping the preschoolers to induce the concept of material kind, in both Experiment 1 and Experiment 2. By contrast, Namy and Gentner only showed pictures of one versus two objects; thus, verbal support may have been more important than in our experiments. Moreover, in our experiments, all the standard objects of the same material showed the same floating/sinking behavior. That is, the covariation between objects and observed object behavior was perfect. In this regard, our findings fit with previous research indicating that even 4-year-olds can coordinate theory and evidence if there is a perfect covariation (Tullos and Woolley, 2009).
In Experiment 2, preschoolers' performance in the conceptual knowledge transfer test did not differ between conditions. Even so, the overall significant gains from pretest to posttest on the dependent measure of explanations may provide a first, albeit weak indication that a process of conceptual restructuring was initiated. Because conceptual restructuring is a process in which naïve conceptions are gradually extended and refined (Vosniadou et al., 2001), this process needs to be continued and deepened since children typically need additional opportunities to overcome their misconceptions (Leuchter et al., 2014). The slow process of conceptual restructuring may explain why the benefits of task environments inviting analogical reasoning by comparison are rather small in our experiments. Our tasks required preschoolers to overcome their misconceptions and construct novel conceptual knowledge in a science domain with a high degree of prior knowledge: the concept of material kind within the context of floating and sinking. Developing such knowledge may be more challenging than the basic-category learning studied in previous research on the benefits of comparison (e.g., Namy and Gentner, 2002).

Limitations
Throughout this paper, we suggested that analogical reasoning by comparison benefits young children's generation of predictions and explanations as a central aspect of scientific reasoning. The cognitive models of analogical reasoning by comparison provide conceptualizations of how children actually engage in this process: When learners engage in analogical reasoning by comparison, they identify similarities and differences between entities, they may align them and abstract a schema, they may re-represent their existing conceptions, and they may project inferences based on the results of these processes (Gentner, 2010). In our experiments, we investigated only the induction of one specific basic science concept: the concept of material kind. We did not investigate hypothesis generation in other science domains that vary in children's prior conceptual knowledge and that may have an impact on children's generation of predictions and explanations. Therefore, further research is needed to investigate the induction of other scientific concepts so that the claim that analogical reasoning by comparison has benefits for scientific reasoning in general may be strengthened. Moreover, while we did assess children's prior knowledge of material kind, we did not look at individual differences and their influences on the process of hypothesis generation. Thus, further research should address the interaction of prior knowledge and the generation of predictions and explanations. Generally, while hypothesis generation is a central aspect of scientific reasoning, it will be important for future research to investigate whether other elements of the scientific inquiry cycle, such as the generation of experiments, the interpretation of data patterns, and the evaluation of evidence, would also benefit from interventions aimed to trigger analogical reasoning by comparison.
While both our studies were conducted with preschoolers, we included a wide age range of children, between 4 and 7 years. Additional exploratory analyses indicated that there were no significant differences between different age groups. However, our study design does not make it possible to fully disentangle the possible impact of age on the tasks used in the present experiments. On the basis of results summarized by Gentner (2010) and Hespos et al. (2020), we expect analogical reasoning by comparison to be a cognitive mechanism available to children starting at an early age. However, further research is needed to investigate age differences with the task formats that we employed in our experiments. For example, it is possible that age may be a more important factor in the Explanation Task, which required children to produce verbal answers, than in the Prediction Task, even though we did not find effects from age in either task format. Here, we have to acknowledge that the power of the present experiments is not sufficient to detect such differences (i.e., interactions between age, tasks, and conditions). A closer look at age differences may also illuminate the differences in performance between the dependent measures of Prediction and Explanation Tasks found in Experiments 1 and 2. Therefore, task designs taking into account individual differences in children's responses could also contribute to understanding the contingencies between children's predictions and explanations.

Conclusion
In our two experiments, we investigated whether triggering analogical reasoning by comparison would benefit children's predictions and explanations of objects' floating or sinking based on the concept of material kind. On the one hand, we triggered analogical reasoning by comparison by presenting two objects of the same material simultaneously and found some evidence that this indeed benefited children's induction of the concept of material kind for hypothesis generation. On the other hand, unexpectedly, the additional provision of language prompts did not increase the effect of presenting two objects. In previous research, such additional support was often necessary for concept learning, especially with young children. It may be that our task structure provided a different kind of support for triggering comparison because our tasks included the demonstration of the floating and sinking behavior of the standards. This speculation may provide interesting directions for future research. Our experiments therefore only give a first hint that analogical reasoning by comparison may be helpful for the induction of science concepts in a scientific reasoning context.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.

AUTHOR CONTRIBUTIONS
HS and IH contributed the conception and design of the experiments. ML, HS, IH, and LS conducted Experiments 1 and 2 in Germany and Switzerland. HS and IH performed the statistical analyses. ML, HS, and IH wrote a first draft of the manuscript. All the authors contributed to manuscript revision and approved the submitted version.

FUNDING
This research was funded with internal research funds from the authors' institutions as well as internal funding from the Max Planck Institute for Human Development, Berlin, Germany.