Addressing the Ceiling Effect when Assessing STEM Out-Of-School Time Experiences

The aim of this paper is to describe an analytical approach for addressing the ceiling effect, a measurement limitation that affects research and evaluation efforts in informal STEM learning projects. The ceiling effect occurs when a large proportion of subjects begin a study with very high scores on the measured variable(s), such that participation in an educational experience cannot yield significant gains among these learners. This effect is widespread in informal science learning because participation in these experiences is self-selective: participants are typically already interested in and knowledgeable about the content area. When the ceiling effect is present, no conclusions can be drawn regarding the influence of an intervention on participants' learning outcomes, which could lead evaluators and funders to underestimate the positive effects of STEM programs. We discuss how the use of person-centered analytic approaches that segment samples in theory-driven ways could help address the ceiling effect and provide an illustrative example using data from a recent evaluation of a STEM afterschool program.


INTRODUCTION
As concerns arise about the need to increase the number of US STEM professionals in order to remain globally competitive, the pressure to emphasize STEM education particularly for adolescent youth has never been greater. Many educators and researchers recognize an urgent need to identify strategies for developing youth skills, abilities and dispositions in STEM early in life, particularly for underserved youth, to increase the potential for future academic and professional participation in STEM fields (National Research Council, 2010).
Out-of-school time (OST) activities such as afterschool programs, summer camps, and other enrichment programs (e.g., Girl Scouts science clubs) are uniquely situated to address this need with their ability to reach large numbers of young people, including low-income youth and youth of color (Afterschool Alliance, 2014). While schools often focus on delivering STEM content knowledge and science process skills (National Research Council, 2012a), OST programs emphasize the fostering or development of affective and emotional outcomes, such as STEM interest and identity, that are strongly associated with STEM persistence and increased future academic and professional participation in STEM fields (National Research Council, 2009; Maltese and Tai, 2011; Venville et al., 2013; Maltese et al., 2014; Stets et al., 2017). However, evaluating the success of such programs can be problematic due to the variable, unstructured nature of informal learning environments themselves, as well as the fact that participants often self-select programs based on their prior interests (National Research Council, 2009). Thus, although some studies have documented significant cognitive and affective gains from participation in out-of-school STEM activities such as science clubs (Bevan et al., 2010; Stocklmayer et al., 2010; Young et al., 2017; Allen et al., 2019), many others, particularly smaller programs with fewer participants, have failed to document significant increases for participants as a whole (Brossard et al., 2005; Falk and Storksdieck, 2005; Judson, 2012).
The most likely reason for this phenomenon is the presence of a measurement limitation called the ceiling effect, which can occur when a large proportion of subjects begin a study with very high scores on the measured variable(s), such that participation in an educational experience cannot yield significant gains among these learners (National Research Council, 2009; Judson, 2012). This effect is often attributed to the biased nature of participation. Informal science learning opportunities, including afterschool programs, are particularly susceptible to this effect because participants generally choose to participate when they are already interested in and potentially knowledgeable about the content area. When the ceiling effect is present, no conclusions can be drawn regarding the influence of an intervention for youth on average. This effect can hinder efforts to evaluate the success of a program by leading evaluators to underestimate the positive effects on affective or cognitive learning outcomes that are measured with standard instruments.
In this paper we describe how person-centered analytic models could help informal science evaluators and researchers address the ceiling effect while potentially providing a better understanding of the outcomes of participants in informal science learning (ISL) programs and other experiences. We use the term person-centered analytic models for approaches to data analysis that distinguish main treatment effects by participant type in meaningful (i.e., hypothesis-driven) ways. Although used frequently in other fields such as educational psychology, sociology, and vocational behavior research, person-centered analyses are still fairly uncommon in informal science education research and evaluation (Denson and Ing, 2014; Spurk et al., 2020). We begin with a short discussion of the ceiling effect in OST programs and the affordances and constraints of person-centered approaches as compared to more traditional variable-centered models for analyzing changes in outcomes over time. To further clarify the methodologies, we then provide an empirical example in which each type of approach is applied to the same data set from the authors' evaluation of 27 afterschool STEM programs in Oregon (Staus et al., 2018), and we compare the usefulness of the person-oriented and variable-oriented approaches.

Out-Of-School-Time Programs
Out-of-school-time (OST) programs are a type of informal STEM learning opportunity provided to youth outside of regular school hours that include afterschool programs, summer camps, clubs, and competitions (National Research Council, 2009; National Research Council, 2015). OST programs provide expanded content-rich learning opportunities, often engaging students in rigorous, purposeful activities that feature hands-on engagement, which can help bring STEM to life and inspire inquiry, reasoning, problem-solving, and reflecting on the value of STEM as it relates to children and youth's personal lives (Noam and Shah, 2013; National Research Council, 2015). In addition, OST STEM activities may allow students to meet STEM professionals and learn about STEM careers (Fadigan and Hammrich, 2004; Bevan and Michalchik, 2013), and can help learners to expand their identities as achievers in the context of STEM as they are actively involved in producing scientific knowledge and understanding (Barton and Tan, 2010).
Another key aspect of OST STEM time is that it is generally not associated with tests and assessments, providing a space for children and youth to engage in STEM without fear and anxiety, therefore creating a psychologically safe environment for being oneself in one's engagement with STEM. In fact, it is the non-assessed, learner-driven nature that makes OST engagement ideal for fostering affective outcomes around interest, identity, self-efficacy, and enjoyment (National Research Council, 2009; National Research Council, 2015). Consequently, many OST programs promote a number of noncognitive, socio-emotional learning (SEL) skills such as teamwork, critical thinking and problem-solving (Afterschool Alliance, 2014). Also known as twenty-first century skills, these skills are seen as essential by many employers when hiring for STEM jobs. Thus, participation in OST programs could potentially positively affect youths' later college, career, and life success (National Research Council, 2012b).
Despite the strong potential for OST programs to provide positive benefits to participants, there have been few studies that document significant changes in outcomes for youth as a result of participation in these programs (Dabney et al., 2012; National Research Council, 2015). One recent study utilized a mixed-methods approach including surveys and observations of over 1,500 youth in 158 STEM-focused afterschool programs to investigate the relationship between program quality and a variety of youth outcomes and found that the majority of youth reported increases in STEM engagement, identity, career interest, career knowledge, and critical thinking (Allen et al., 2019). The largest gains were reported by youth who engaged in longer-term (4 weeks or more) and higher quality programs as measured with the Dimensions of Success (DoS), a common OST program assessment tool.
Similarly, using a meta-analysis of 15 studies examining OST programs for K-12 students, Young et al. (2017) found a small to medium-sized positive effect of OST programs on students' interest in STEM, although the effect was moderated by program focus, grade level, and quality of the research design. For example, programs with both an academic and social focus had a greater positive effect on STEM interest, while exclusively academic programs were less effective at promoting interest in STEM. The authors found no significant effect for programs serving youth in K-5; all other grade spans showed positive effects on STEM interest. Unlike Allen et al. (2019), this study found no effects related to the duration of the programs.
In contrast to the above large-scale research projects, many researchers or evaluators have failed to document significant increases in STEM outcomes for OST program participants as a whole. In particular, evaluations of single OST programs with fewer participants may have difficulty showing significant changes in STEM outcomes as a result of participating in the program. For example, an evaluation of a collaboration between libraries, zoos and poets designed to use poetry to increase visitors' conservation thinking and language use found few significant changes in the type or frequency of visitor comments related to conservation themes or in their thinking about conservation concepts (Sickler et al., 2011). Similarly, in an evaluation of 330 gifted high school students participating in science enrichment programs, evaluators found no positive impact on science attitudes after participation in the program (Stake and Mares, 2001). Although mostly serving adults rather than children, several citizen science projects reported similar difficulty in documenting significant positive outcomes for participants (Trumbull et al., 2000; Overdevest et al., 2004; Jordan et al., 2011; Crall et al., 2012; RK and A, Inc., 2016). For example, an evaluation of The Birdhouse Network (TBN), a program in which participants observe and report data on bird nest boxes, revealed no significant change in attitudes toward science or understanding of the scientific process (Brossard et al., 2005). It is likely that there are many more examples that we were unable to access, since program evaluations in general, and studies that fail to find significant results in particular, often do not get published.
One plausible explanation for the lack of significant results in program-level evaluations like those described above is not that these programs failed to provide benefits to their participants, but that, at least in those with significant positive bias in the participants, the presence of a ceiling effect resulted in a lack of significant gains among these learners on average (National Research Council, 2009; Judson, 2012). For example, in the TBN citizen science study mentioned above, participants entered the program with very strong positive attitudes toward the environment such that the questionnaire used to detect changes in attitudes was insensitive for this group (Brossard et al., 2005). As described earlier, the ceiling effect is a common phenomenon in OST programs, which often attract learners who elect to participate because they are already interested in and knowledgeable about STEM (Stake and Mares, 2001; National Research Council, 2009). The potential danger of the ceiling effect is that positive outcomes due to participation in the OST program may go undetected when measured by standard measures, which could lead to funding challenges or even termination of a program. Therefore, it is critical that program evaluators utilize appropriate analytic approaches that account for the ceiling effect to better understand how OST programs influence learner outcomes.

Analytic Approaches
Historically, the most common analytic methods when evaluating OST programs have involved a pre-post design using surveys administered at the beginning and end of the program to measure changes in knowledge, attitudes, and similar outcomes, presumably as a consequence of the educational experience (Stake and Mares, 2001). The pre-post data are typically analyzed with a variety of variable-centered approaches such as t-tests or ANOVAs to examine changes in outcomes of interest (e.g., content knowledge, attitude toward science) over the course of the program. However, as described above, the traditional pre-post design may be insufficient for measuring the impact of intervention programs when many participants begin the program with high levels of knowledge and interest in STEM topics and activities. This is because variable-centered analytic models produce group-level statistics like means and correlations that are not easily interpretable at the level of the individual and do not help us understand how and why individuals or groups of similar individuals differ in their learning outcomes over time (Bergman and Lundh, 2015). In other words, if subgroups exist in the population that do show significant changes in outcomes (perhaps because they began the program with lower pre-test scores), these results may be obscured by the use of variable-centered methods.
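To make the masking problem concrete, the following is a minimal simulation sketch, not an analysis of any real data. It assumes a hypothetical latent score that is observed only after clipping to a 1-5 survey scale, a hypothetical true gain of 0.5 points for every participant, and two invented starting levels ("high" near the top of the scale, "low" near the middle). All names and numbers are illustrative assumptions.

```python
import random
import statistics

def observe(latent, low=1.0, high=5.0):
    """Clip a latent score to the bounds of a 1-5 survey scale."""
    return max(low, min(high, latent))

def simulate(n_high=80, n_low=20, true_gain=0.5, seed=0):
    """Return observed mean gains for the whole sample and two subgroups.

    All values are hypothetical: 'high' starters sit near the top of the
    scale, 'low' starters near the middle, and everyone truly gains
    `true_gain` on the latent score.
    """
    rng = random.Random(seed)
    gains = {"high": [], "low": []}
    for group, n, start in (("high", n_high, 4.9), ("low", n_low, 3.0)):
        for _ in range(n):
            latent_pre = rng.gauss(start, 0.3)
            latent_post = latent_pre + true_gain
            # Only the clipped (observed) scores are available to the evaluator.
            gains[group].append(observe(latent_post) - observe(latent_pre))
    everyone = gains["high"] + gains["low"]
    return {
        "all": statistics.mean(everyone),
        "high": statistics.mean(gains["high"]),
        "low": statistics.mean(gains["low"]),
    }

result = simulate()
print({name: round(value, 2) for name, value in result.items()})
```

Under these assumptions the observed whole-sample gain is pulled well below the true gain because most of the sample cannot move upward on the scale, while the smaller low-starting subgroup shows the full gain.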
In contrast, "person-centered" analytic models are predicated on the assumption that populations of learners are heterogeneous, and therefore best studied by searching for patterns shared by subgroups within the larger sample (Block, 1971). Therefore, the focus is on identifying distinct categories or groups of people who share certain attributes (e.g., attitudes, motivation) that may help us understand why their outcomes differ from those in other groups (Magnusson, 2003). Standard statistical techniques include profile, class, and cluster analyses, which are suitable for addressing questions about group differences in patterns of development and associations among variables (Laursen and Hoff, 2006). However, because of the "regression effect" (i.e., regression to the mean) phenomenon in which those who have extremely low pretest values show the greatest increase while those who have extremely high pretest values show the greatest decrease (Chernick and Friis, 2003), subgroups must be constructed from variables other than the outcome score being measured. In addition, the selected variables that form the groups must have a strong conceptual basis and have the potential to form distinct categories that are meaningful for analyzing outcomes (Spurk et al., 2020). In the case of OST programs, one such variable may be motivation to participate.
Substantial research shows that visitors to informal STEM learning institutions such as museums, science centers and zoos arrive with a variety of typical configurations of interests, goals, and motivations that are strongly associated with learning and visit satisfaction outcomes (Falk, 2009; Packer and Ballantyne, 2002). Moussouri (1997) was one of the first to identify a typology of six categories of visitor motivations including education, social event, and entertainment, two of which (education and entertainment) were associated with greater learning than other motivation categories (Falk et al., 1998). Packer (2004) expanded on this work in a study of educational leisure experiences including museums and interpretive sites, in which she identified five categories of visitor motivations: 1) passive enjoyment; 2) learning and discovery; 3) personal self-fulfillment; 4) restoration; and 5) social contact; only visitors reporting learning and discovery goals showed significant learning outcomes. Since then, numerous informal STEM learning researchers have used audience segmentation to better understand the STEM outcomes of visitors (e.g., Falk and Storksdieck, 2005; Falk et al., 2007; O'Connell et al., 2020; Storksdieck and Falk, 2020). These studies suggest that learning outcomes differ based on learner goals or motivations, supporting the potential usefulness of this variable for person-centered analyses in informal science research and evaluation, including OST programs for youth.
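The regression effect can be illustrated with a short simulation sketch. It assumes hypothetical participants whose true score does not change at all between pre and post, with independent measurement noise added at each time point; for simplicity the scores are left unbounded rather than clipped to a survey scale. Splitting the sample at the median pretest score (the practice the paragraph above warns against) then produces apparent gains and losses even though no true change occurred. All parameter values are invented for illustration.

```python
import random
import statistics

def simulate_no_change(n=2000, seed=1):
    """Simulate pre/post scores with measurement noise and NO true change."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        true_score = rng.gauss(3.5, 0.5)        # stable underlying score
        pre = true_score + rng.gauss(0, 0.5)    # noisy pre measurement
        post = true_score + rng.gauss(0, 0.5)   # noisy post measurement
        pairs.append((pre, post))
    return pairs

def mean_change(pairs):
    """Mean of post minus pre across (pre, post) pairs."""
    return statistics.mean(post - pre for pre, post in pairs)

pairs = simulate_no_change()
median_pre = statistics.median(pre for pre, _ in pairs)

# Grouping on the pretest value of the outcome itself.
low_pre = [p for p in pairs if p[0] < median_pre]
high_pre = [p for p in pairs if p[0] >= median_pre]

change_all = mean_change(pairs)
change_low = mean_change(low_pre)
change_high = mean_change(high_pre)
print(round(change_all, 2), round(change_low, 2), round(change_high, 2))
```

In this sketch the whole-sample change is close to zero, the low-pretest half appears to improve, and the high-pretest half appears to decline, which is why grouping variables such as motivation, rather than the pretest outcome score, are preferable for forming subgroups.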
In the case of OST programs, children also participate for a variety of motivations including interest in STEM, to socialize with friends, to have fun, and because they are compelled by parents. Thus, person-centered approaches could be used to identify subgroups of participants with differing motivations for participating in the program that may affect their identity and learning outcomes. Then variable-centered analyses such as t-tests could be used to examine changes in outcomes for each subpopulation. To help clarify how the person-centered methodologies described above could address the ceiling effect problem, we provide an illustrative example in which each type of approach is used on the same data set from the authors' prior research and the findings from the person-centered approach and the variable-centered approach are compared.
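The two-step workflow described above (segment participants by motivation, then run a variable-centered test within each segment) can be sketched as follows. The records, the scores, and the helper name `paired_t` are hypothetical and are not drawn from the SBS data; the sketch uses only the Python standard library and reports the paired t statistic and degrees of freedom, leaving the p-value lookup to a statistics package.

```python
import math
import statistics
from collections import defaultdict

def paired_t(pre, post):
    """Return (mean difference, t statistic, degrees of freedom) for paired scores."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)           # sample SD of the differences
    t = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, t, n - 1

# Hypothetical records: (motivation label, pre score, post score).
records = [
    ("Interest", 4.8, 4.7), ("Interest", 4.6, 4.8), ("Interest", 5.0, 4.9),
    ("Interest", 4.7, 4.7), ("Fun", 3.9, 4.2), ("Fun", 3.6, 4.1),
    ("Fun", 4.0, 4.2), ("Fun", 3.8, 4.3), ("Compelled", 3.0, 3.6),
    ("Compelled", 3.2, 3.5), ("Compelled", 2.8, 3.5), ("Compelled", 3.1, 3.6),
]

# Step 1 (person-centered): segment by a theory-driven variable, not by the outcome.
by_group = defaultdict(lambda: ([], []))
for label, pre, post in records:
    by_group[label][0].append(pre)
    by_group[label][1].append(post)

# Step 2 (variable-centered): paired comparison within each segment.
results = {label: paired_t(pre, post) for label, (pre, post) in by_group.items()}
for label, (mean_diff, t, df) in results.items():
    print(f"{label}: mean change = {mean_diff:+.2f}, t({df}) = {t:.2f}")
```

Running several subgroup tests increases the number of comparisons, so in practice an evaluator may also want to consider a correction for multiple testing; that step is omitted here.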

AN ILLUSTRATIVE EXAMPLE: STEM BEYOND SCHOOL PROGRAM
Background
The empirical example we provide for this paper is the STEM Beyond School (SBS) Program, which was designed to better connect youth in under-resourced communities to STEM learning opportunities by creating a supportive infrastructure for community-based STEM OST programs (Staus et al., 2018). Rather than creating new programs, SBS supported existing community-based STEM OST programs to provide high quality STEM experiences to youth across the state of Oregon. The 27 participating programs took place predominantly off school grounds, served youth in grades 3 through 8, and provided a minimum of five different highly relevant STEM experiences located in their communities. The community-based programs were required to provide at least 50 h of learning connected to the interests of their youth that followed the SBS 4 Core Programming Principles (student driven, students as do'ers and designers, students apply learning in new situations, relevant to students and community-based). For comparison, elementary students in Oregon receive 1.9 h per week of science instruction (Blank, 2012). SBS was therefore a targeted investment towards dramatically increasing meaningful STEM experiences for underserved youth while also advancing the capacity of program providers to design and deliver high quality STEM activities for youth that center around learning in and from the community.
Table 1 (excerpt; component items and alphas)
(component continued) 2. I like to solve complex problems. 3. I like going to my out-of-school activities that involve science. 4. I like figuring things out. 5. I can succeed in situations that involve understanding science. 6. I would like a job that uses science when I'm an adult.
Constructive coping and resilience (4 items), alpha 0.84/0.81: 1. When I have difficulty learning something, I remind myself that this is important for my future. 2. If I get stuck, I try something different to solve the problem. 3. If I don't understand something in science, I ask for help.
SBS required programs to intentionally engage historically underserved youth, specifically youth from communities of color and low-income communities as well as youth with disabilities and those who are English-language learners. With a grant requirement that at least 70% of participants come from these groups, programs were challenged and inspired to rethink their traditional ways of reaching out to, recruiting, and retaining those students.
To ensure long-term benefits for youth, SBS provided capacity building support to the community-based programs in the form of educator professional development, program design guidance, a community of practice for participating providers, support from a Regional Coordinator, and equipment. Educators working directly with youth participated in high quality, high dose (70 h for new providers and 40 h for returning providers) professional development connected directly to their specific needs. Professional development categories included essential attributes in program quality, best practices in STEM learning environments, fostering STEM Identity, and connecting to the community. Rather than providing one-size-fits-all workshops, the program assessed the needs of the educators and then leveraged expertise from across the state to address specific training or coaching needs. This approach created a community- and peer-based "just-in-time" professional learning experience that allowed educators to modify their programming in real time.

Methods and Findings
Like many of the studies discussed earlier, our evaluation of the SBS Program used a pre-post survey design to measure changes in youth outcomes over the course of the OST experience. The survey was developed in conjunction with the Portland Metro STEM Partnership's Common Measures project, which was designed to address the limitations of current measurement tools and evaluation methodologies in K-12 STEM education (Saxton et al., 2014). The resulting STEM Common Measurement System includes constructs that span from student learning to teacher practice to professional development to school-level variables. For the purposes of the SBS Program evaluation, we chose six of the student learning constructs related to learner identity and motivational resilience in STEM-related activities as our outcome measures (Figure 1). The original Student Affective Survey (Saxton et al., 2014) was modified by revisiting its research base and examining additional research (e.g., Cole, 2012). Scales were shortened based on results from a reliability analysis of the included scales of the pre-survey in year 1 of the SBS program, and in response to concerns about length and readability from program provider feedback, which led to a redesign of the post-survey for the final measure (O'Connell et al., 2017). The final measure consisted of 24 items with three to six items per STEM component, which were slightly modified from the original to be suitable for OST programs rather than classroom environments (see Table 1 for component items and alphas). In addition to these learning outcomes, the pre-survey included demographic items (e.g., gender, age) and an open-ended question to assess youth motivation for participating ("please tell us about the main reason that you are participating in this program"). The answers to this motivation question fell into three categories: 1) interest in STEM topics and activities; 2) wanted to do something fun; 3) compelled by parents or guardians.
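For readers less familiar with scale reliability, the sketch below shows one common way such an alpha can be computed. It assumes that the "alphas" reported for each component are Cronbach's alpha, which the text does not state explicitly, and it uses a small set of invented responses rather than SBS data. The function name `cronbach_alpha` is hypothetical.

```python
import statistics

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(responses[0])                       # number of items in the scale
    items = list(zip(*responses))               # transpose: one tuple per item
    item_var_sum = sum(statistics.variance(item) for item in items)
    total_var = statistics.variance([sum(r) for r in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical responses from five youth to a three-item scale (1-5 coding).
responses = [
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))  # about 0.92 for these invented responses
```

Dropping an item and recomputing alpha is one simple way to see how shortening a scale, as described above, affects its internal consistency.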
Of the 361 youth who participated in the SBS pre-survey in year 3, 148 also completed a post-survey enabling us to examine changes in outcomes associated with SBS programming activities.
Here we present the findings in two ways: a variable-centered approach examining mean changes in outcomes for the sample as a whole, and a person-centered approach in which we identify unique motivation-related subgroups of individuals and examine changes in outcomes for each subgroup. We then discuss the usefulness of the person-oriented approach and the variable-oriented approach for addressing the issue of the ceiling effect in ISL research and evaluation projects.

Variable-Centered Analysis
We conducted paired t-tests to examine overall changes in outcomes over the course of the SBS Program and found no significant changes for five of the six outcomes (Table 2). Although there was a significant decline in cognitive engagement, the effect size was small (d = 0.16). In other words, this analysis indicated that, on average, youth who participated in SBS maintained their STEM identity and motivational resilience over the course of the program but did not show the increases in outcomes that SBS providers desired. An examination of the pre-survey scores indicated that youth on average were already at the higher end of the scale, suggesting that the lack of significant changes in outcomes may be due to the ceiling effect.
Note: Outcomes coded on a five-point scale from 1 "Strongly disagree" to 5 "Strongly agree."
FIGURE 2 | Mean scores for all youth who participated in the pre-survey by motivation class; means with an asterisk are different at the p < 0.05 level. All constructs were measured on a scale of 1 (Strongly disagree) to 5 (Strongly agree). Note: n = 202 for Interest; n = 89 for Fun; n = 70 for Compelled.
Note: Items in index were coded on a five-point scale from 1 "Strongly disagree" to 5 "Strongly agree." Interest (n = 84), Fun (n = 32), Compelled (n = 32).
Frontiers in Education | www.frontiersin.org July 2021 | Volume 6 | Article 690431

Person-Centered Analysis
In order to address the ceiling effect in our data, we segmented youth into unique subgroups based on self-reported pre-survey motivation classes (see Figure 2): interested in STEM (Interest), wanted to have fun (Fun), or compelled by parents (Compelled). As described above, theory suggests that youth in these motivation classes may experience different learning outcomes from the same educational intervention. Youth in the Interest subgroup made up 56% of the sample (n = 202) and reported significantly greater feelings of learner identity, cognitive engagement, and relevance than youth in the other motivation classes in the pre-survey. The Fun subgroup included 25% of the sample (n = 89) and reported levels of resilience, belongingness, and self-efficacy similar to those of Interest youth and relevance similar to that of Compelled youth, but significantly different learner identity and cognitive engagement than youth in the other subgroups. Finally, Compelled youth comprised 19% of the sample (n = 70) and reported significantly lower scores than youth in other subgroups on all outcome measures except relevance.
We then conducted paired t-tests for the 148 youth who completed both a pre- and post-survey. Results indicated only two significant (p < 0.05) changes over time: youth in the Interest subgroup reported a significant decrease in cognitive engagement with a moderate effect size (d = 0.32), and youth in the Fun subgroup reported a decrease in feelings of belonging with a large effect size (d = 0.66) (Table 3). None of the subgroups reported significant increases in any of the outcome measures at the end of the program.
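The text does not specify which formula was used for the effect sizes (d) that accompany the paired t-tests. As a hedged illustration only, the sketch below shows two conventions that are commonly used for pre-post data: the mean difference divided by the standard deviation of the differences, and the mean difference divided by the root mean of the pre and post variances. Function names and scores are hypothetical and are not SBS data.

```python
import statistics

def cohens_d_paired(pre, post):
    """Mean of the paired differences divided by the SD of the differences."""
    diffs = [b - a for a, b in zip(pre, post)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

def cohens_d_pooled(pre, post):
    """Mean difference divided by the root mean of the pre and post variances."""
    pooled_sd = ((statistics.variance(pre) + statistics.variance(post)) / 2) ** 0.5
    return (statistics.mean(post) - statistics.mean(pre)) / pooled_sd

# Hypothetical pre/post scores for six youth on a 1-5 scale.
pre = [4.0, 4.5, 3.5, 5.0, 4.0, 4.5]
post = [4.0, 4.0, 3.5, 4.5, 4.5, 4.0]

d_paired = cohens_d_paired(pre, post)
d_pooled = cohens_d_pooled(pre, post)
print(round(d_paired, 2), round(d_pooled, 2))
```

The two conventions can give noticeably different values for the same data, so reporting which one was used helps readers compare effect sizes across evaluations. The sign here indicates direction (negative for a decline), whereas the text reports magnitudes.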

DISCUSSION
The above example showed how using person-centered approaches in the evaluation of OST programs has the potential to address the ceiling effect. By segmenting the sample in a theory-driven way, we created three subgroups based on motivation to participate, two of which (i.e., Fun, Compelled) reported pre-survey scores low enough to leave room for detectable increases in outcomes as a result of the OST program. In our example, neither the variable-centered nor person-centered approach revealed significant positive changes in outcomes as a result of participating in the program. However, the person-centered approach provided the opportunity to identify such changes for different subgroups of participants. For example, if an OST program led to increased outcome scores for less STEM-motivated youth, such a finding could provide important evidence to funders about the efficacy of OST programs, thus promoting longevity of successful STEM-focused youth programs.
Even in the absence of significant changes in STEM outcomes, person-centered approaches provide a more nuanced view of the youth and why they participated, which is valuable information that program providers can use to inform future improvements to the program. In the case of SBS, knowing that almost half of youth participated for reasons other than interest in STEM could lead to the development of more effective educational strategies that provide a range of activities designed to engage youth in each motivational category, rather than relying on one-size-fits-all programming strategies. Indeed, a recent longitudinal study of youth STEM learning pathways highlighted the importance of customizing STEM resources in the larger learning ecosystem based on the differing interests and motivations of youth in the community (Shaby et al., 2021). For example, one youth with a strong interest in computer programming eventually lost interest because the content of the OST program he attended did not keep pace with his growing interest in learning new coding languages. While it is unclear why youth outcomes remained largely unchanged after participation in SBS, it is possible that the programming was unable to adequately serve youth with a diversity of interests and motivations for participating.
It is also possible that in addition to the ceiling effect, the study may have suffered from another common measurement challenge associated with traditional pre-post designs known as response shift bias, in which participants' comparison standard for measured items (e.g., competency and self-efficacy) differs between pre- and post-assessments (Howard and Dailey, 1979). In other words, program participants may overestimate their knowledge and ability at the beginning of an intervention, while post-survey scores may reflect more accurate assessments based on comparisons to others in the program or simply a better understanding of the constructs themselves. Either way, a response shift may exacerbate the ceiling effect and seriously hamper the assessment of true change over time for many respondents (Oort, 2005). One potential remedy to address response shift bias is the use of retrospective pre-post (RPP) designs to simultaneously collect pre- and post-assessment data at the end of a program (Howard, Ralph, et al., 1979). This design provides a consistent frame of reference within and across respondents, allowing real change resulting from an educational intervention to be detected. A growing body of evidence supports the use of the RPP design as a valuable tool to evaluate the impact of educational programs on a variety of outcomes (Little et al., 2020).
Ultimately, to avoid ceiling effects, assessment instruments must be designed to measure outcomes in such a way that participants with a strong affinity for STEM are not already at the high end of the scale when they begin the program. This includes choosing to measure constructs that are not theoretically limited in scale. For example, psychological constructs such as interest have a finite number of phases: once a learner has reached the highest level of individual interest, they will be unable to indicate an increase due to participation in an educational program (Hidi and Renninger, 2006). In contrast, measuring a learner's change in content knowledge may be less limited. Thus, although there is a strong call to use standard, published, or previously validated measures in evaluations (Noam and Shah, 2013; Saxton et al., 2014) instead of ad hoc measures adjusted to the nature of a program or the characteristics of the target audience, doing so may increase the prevalence of the ceiling effect in programs with high positive selection bias if the measures are not designed to detect changes over time at the upper end of the distribution.
While it may not be possible to avoid measurement issues such as the ceiling effect altogether in assessments of OST STEM programs, evaluators should be aware of the methodologies and analytic approaches that could be used to address them more effectively. In particular, person-centered approaches that allow the segmentation of participants into motivation-related or other theory-driven subgroups, perhaps in conjunction with retrospective pre-post survey designs, should be considered at the outset of program evaluations whenever possible.

DATA AVAILABILITY STATEMENT
The data analyzed in this study are subject to the following licenses/restrictions: The data are shared with partner organizations, whose permission we would need in order to share them publicly. Requests to access these datasets should be directed to stausn@oregonstate.edu.