Gender Performance Gaps Across Different Assessment Methods and the Underlying Mechanisms: The Case of Incoming Preparation and Test Anxiety

A persistent ‘gender penalty’ in exam performance disproportionately impacts women in large introductory science courses, where exam grades generally account for the majority of the students’ assessment of learning. Previous work in introductory biology demonstrates that some social psychological factors may underlie these gender penalties, including test anxiety and interest in course content. In this paper, we examine the extent to which gender predicts performance across disciplines, and investigate social psychological factors that mediate performance. We also examine whether a gender penalty persists beyond introductory courses and can be observed in more advanced upper division science courses. We ran analyses (1) across two colleges at a single institution: the College of Biological Sciences and the College of Science and Engineering (i.e., physics, chemistry, materials science, math); and (2) across introductory lower division courses and advanced upper division courses, or those that require a prerequisite. We affirm that exams have disparate impacts based on student gender at the introductory level, with female students underperforming relative to male students. We did not observe these exam gender penalties in upper division courses, suggesting that women are either being ‘weeded out’ at the introductory level, or ‘warming to’ timed examinations. Additionally, results from mediation analyses show that across disciplines and divisions, for women only, test anxiety negatively influences exam performance.


INTRODUCTION
To effectively promote student groups who have been historically underrepresented in science, technology, engineering, and math (STEM), we need to provide students from different backgrounds equal opportunities to perform in these fields. Results from previous studies, however, demonstrate that schools are still unable to provide all students with equal opportunities, as evidenced by gaps in performance based on gender and other demographic descriptors of student identities (McGrath and Braunstein, 1997; Kao and Thompson, 2003; DeBerard et al., 2004; Ballen and Mason, 2017). Demographic gaps in performance in higher education can be partly explained by demographic gaps in student incoming preparation (Sun, 2017; Salehi et al., 2019). However, they can also be due to biased education structures such as methods used to assess student performance in STEM fields (Stanger-Hall, 2012), introductory gateway courses that "weed out" students (Mervis, 2010, 2011), traditional uninterrupted lectures rather than high-structure active learning methods (Haak et al., 2011; Ballen et al., 2017b), feelings of exclusion (Hurtado and Ruiz, 2012), stereotype threat (Steele, 1997; Cohen et al., 2006), and discrimination (Milkman et al., 2015). While many demographics and identities remain underrepresented in STEM, such as certain racial and ethnic groups, and first-generation college students, the work described herein focuses broadly on women in STEM.
Previous work has demonstrated that using high stakes exams as an assessment method has disparate impacts on male and female students. Even after controlling for student incoming preparation, this work shows female students underperformed on exams across multiple introductory biology courses, due in part to test anxiety (Ballen et al., 2017a). This negative effect of anxiety on performance was observed only for female students, and only on exam assessments. Anxiety did not impact male student performance or female student performance on nonexam assessments such as homework and in-class assignments.
Women's underperformance on exams is troubling for two reasons in particular. First, exam scores usually constitute a high proportion of grades in introductory courses (Koester et al., 2016). If the primary assessment method in entry-level STEM courses leads to a "gender penalty" for female students, then institutions are creating an early obstacle that may prevent women from advancing to the upper-level subject material. Second, studies across STEM courses show that in some disciplines, low performance in introductory courses is disproportionately impactful for women, often resulting in the abandonment of their major, while men with similar performance are more likely to continue in the discipline (Grandy, 1994; McCullough, 2004; Rask and Tiefenthaler, 2008; Rauschenberger and Sweeder, 2010; Creech and Sweeder, 2012; Eddy and Brownell, 2016; Koester et al., 2016; Matz et al., 2017). Among women who perform well in introductory courses, Marshman et al. (2018) showed those who received high scores on a physics conceptual survey (or who were receiving A's) reported similar self-efficacy measures as male students with medium or low scores on the physics conceptual surveys (or who were receiving B's and C's). Therefore, female underperformance on exams, if generalizable across disciplines and over time, leads to a consequential gender performance gap that systematically disadvantages female students during their undergraduate pursuit of a degree.
Our previous work showed that women in introductory biology classes underperformed relative to men on exams, and that exam anxiety and interest in course content mediated the relationship between incoming preparation and exam performance (Ballen et al., 2017a). Until now, it was unclear whether the patterns we observed in undergraduate biology persist (1) in other disciplines, and (2) among students who have advanced beyond introductory science courses. First, biological sciences are among the most female-dominant fields in undergraduate STEM; ∼60% of undergraduate students in the life sciences are women (Neugebauer, 2006). If the gender gap in exam performance in introductory biology is due in part to the impact of test anxiety, this gap might be even more pronounced in male-dominated STEM fields where women are susceptible to negative social and learning experiences (e.g., tokenism, gender stereotypes about science abilities; Kanter, 1977;Miller et al., 2015).
Second, gender gaps in exam performance can be moderated by characteristics of the learning environment. Examples of characteristics that have documented impacts on student performance or experiences include group composition (e.g., gender ratio; Dahlerup, 1988; Sullivan et al., 2018), instructor traits (e.g., instructor gender: Crombie et al., 2003; Cotner et al., 2011, or attitude: Alsharif and Qi, 2014; Cooper et al., 2018b), and class size. These characteristics may vary across disciplines, divisions, or even over a single semester. For example, upper division courses differ from lower division courses in a number of ways, and performance gaps present at the introductory level might not be apparent in more advanced courses. In upper division courses, student grades are less reliant on scalable multiple-choice exams; instead, the reliance on "lower stakes" assessments might ameliorate the negative impact of test anxiety on performance. Alternatively, or additionally, capable but test-anxious women may be weeded out at the lower division, or become acclimated to high stakes exams (or develop tools to counter test anxiety) as they progress through higher education.
In this study, we examined the generalizability of the exam gender gap across different STEM fields, and across both lower and upper division courses. We also studied how underlying social psychological mechanisms that have been previously studied in the context of student performance in STEM courses (e.g., test anxiety, interest in the course material; Ballen et al., 2017a) change over time, and how they function as mediators of the gender gap in exam performance.
We address two multi-part questions as they apply across different fields (the College of Biological Sciences and the College of Science and Engineering) and divisions (lower and upper division courses):

1. Gender gap in different assessment methods (RQ1): (A) Do we observe a gender gap in performance across different assessment methods (i.e., exam, non-exam, laboratory, and course grades)? (B) To what extent can these potential gender gaps be explained by incoming preparation (as measured by students' American College Testing entrance exam score, hereafter ACT)?
2. Social psychological mediators of exam gender gap (RQ2): (A) How do test anxiety and interest in course content mediate performance outcomes on exams? (B) How do these two social psychological factors vary based on gender and over the course of a semester?

Data Collection
The study is based on a secondary analysis of previously collected data that were provided by CB and SC. The IRB of the University of Minnesota exempted this study from the ethics review process (University of Minnesota IRB 00000800).

Class Performance
Administrative data were obtained from 5,864 students between 2015 and 2017. Courses included those offered by the College of Biological Sciences (CBS) or the College of Science and Engineering (CSE). A subset of the CBS data were explored in prior reports (e.g., Ballen et al., 2017a; Cotner and Ballen, 2017). CBS is a relatively small college (∼2,500 undergraduates) with a large percentage of women (the 2018 first year class was 66% female) and is restricted to biological fields including neuroscience, ecology, and genetics ("College of Biological Sciences," 2018). However, the lower-division courses involved in this study primarily target non-biology majors, and only one of the courses enrolls students interested in pursuing biology as a major. These introductory biology courses include not only the standard curriculum but also courses customized to student interests, such as "Environmental Biology: Science and Solutions" and "Evolution and Biology of Sex," all of which fulfill introductory biology requirements for the university. The upper-division courses in CBS enroll predominantly students majoring in biology. CSE is a larger college (∼5,500 undergraduates) with a relatively small percentage of women (in 2018, 27.4% of graduate and undergraduate students were female, an all-time high), and houses the departments of chemistry, physics and astronomy, chemical engineering and materials science, computer science and engineering, and the school of mathematics ("CSE: By numbers," 2018). Lower division courses included a mix of majors and non-majors, and upper division courses primarily served students who intended to major in the discipline (e.g., chemistry, computer science). We only included students who reported their gender in our sample (N = 5766) (Table 1).
In this sample, we compared (1) average exam scores, or scores on all high-stakes assessments that accounted for a relatively large portion of a student's grade, (2) average non-exam scores including in-class assignments, credit for participation, and group work (note: these scores do not include out-of-class homework), (3) average laboratory scores, where applicable, and (4) final course grades (i.e., student cumulative performance in the course based on their performance on all exam, lecture, and laboratory activities). For each of these items, we transformed all raw percentage scores into class Z-scores (a measure of how many standard deviations a value is from the class section's mean score) for ease of interpretation. We calculated Z-scores using the formula Z-score = (X-µ)/σ, where X is the grade of interest, µ is the class mean score, and σ is the standard deviation.
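As a minimal sketch of the within-section standardization described above, the following Python snippet groups raw percentage scores by class section and applies Z-score = (X-µ)/σ. The record layout, the section labels, and the choice of the population standard deviation are assumptions made for illustration; the text does not state which standard deviation estimator was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def class_z_scores(records):
    """Convert raw percentage scores to Z-scores within each class section.

    records: list of (section_id, raw_percentage) tuples (hypothetical layout).
    Returns a list of Z-scores in the same order as the input.
    """
    # Collect the scores of each section so that mu and sigma are section-specific.
    by_section = defaultdict(list)
    for section, score in records:
        by_section[section].append(score)

    # Assumption: population standard deviation (pstdev); the source does not say.
    stats = {s: (mean(v), pstdev(v)) for s, v in by_section.items()}

    z_scores = []
    for section, score in records:
        mu, sigma = stats[section]
        # Guard against a section in which every student has the same score.
        z_scores.append((score - mu) / sigma if sigma > 0 else 0.0)
    return z_scores


# Hypothetical scores for two sections, "A" and "B".
records = [("A", 70.0), ("A", 80.0), ("A", 90.0), ("B", 50.0), ("B", 70.0)]
z = class_z_scores(records)
```

Because each score is compared only with its own section, a student at the section mean receives a Z-score of zero regardless of how difficult that section's assessments were.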

Social Psychological Factors
In addition to performance data, for a subsample, we also examined change in exam anxiety and interest in course content over time. We conducted a survey at the beginning of the semester (pre-survey) and at the end of the semester before the final exam (post-survey). The survey included measures of student interest in course content as well as test anxiety (Table 2). For both metrics, we used multi-item constructs from the Motivated Strategies for Learning Questionnaire of Pintrich et al. (1993) (MSLQ; Table 2). The MSLQ is a common tool for assessing motivated strategies for learning, with historically high reliability and validity across different student populations (e.g., Pintrich et al., 1993; McClendon, 1996; Büyüköztürk et al., 2004; Feiz and Hooman, 2013; Jakešová and Hrbáčková, 2014). Items on each subscale were rated on a 7-point scale (1 = not at all true for me to 7 = very true for me). Factor loadings of items were between 0.64 and 0.87 for interest in course content, and between 0.73 and 0.89 for test anxiety. In the reliability study, the internal consistency alpha coefficient was calculated as 0.89 and 0.88, respectively, for these two subscales.
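For readers unfamiliar with the internal consistency coefficient reported above, the sketch below computes Cronbach's alpha for a multi-item subscale using the standard formula alpha = k/(k-1) × (1 − Σ item variances / variance of total scores). The ratings are invented for illustration and are not the study's data; the function name is hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for one subscale.

    responses: list of per-student lists, each holding one rating per item
    (e.g., 1-7 Likert ratings). Uses sample variances throughout.
    """
    k = len(responses[0])                      # number of items in the subscale
    items = list(zip(*responses))              # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical ratings from four students on a three-item subscale.
ratings = [
    [2, 3, 2],
    [4, 4, 5],
    [6, 5, 6],
    [7, 7, 6],
]
alpha = cronbach_alpha(ratings)
```

Values close to 1 indicate that the items of a subscale vary together across students, which is the sense in which the two subscales above are described as reliable.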

Statistical Analyses
For our analyses, we parsed our data across colleges and divisions: lower division CBS, lower division CSE, upper division CBS, and upper division CSE. We divided data across colleges because the students may be systematically different in each college in ways that impact our outcome variables. Differences across colleges in our sample are discussed in the data collection section. We also divided data across divisions for each college, as there may be a selection bias among those who pursue upper division courses. Upper division courses target students who have already chosen a STEM field for their major, while lower division courses also target non-major students. For each of the four sub-samples, we examined: (RQ1.A) the gender performance gap across different assessment methods (e.g., exam, non-exam, laboratory, and overall course grade); (RQ1.B) the impact of

RQ1.A. Gender Gaps in Performance Across Different Assessment Methods
First, we analyzed gender performance gaps for different assessment methods without controlling for student incoming preparation and other demographic factors. These raw, "transcriptable" performance measures are what students see on their transcripts, use to assess their performance relative to their peers, and submit in graduate school applications. In order to examine the gender gap in performance, we used mixed-model regression analysis to predict student performance by gender without controlling for student incoming preparation. In this analysis, we included the fixed effect of gender, and the random effects of courses and sections to reflect the nested structure of the data (i.e., when sections are nested within courses).
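One way to write the model described above, assuming random intercepts for courses and for sections nested within courses (the text names the random effects but does not give their exact form), is sketched below. The symbols are introduced here for illustration: Z is the class Z-score of student i in section j of course k, and Female is the gender indicator.

```latex
Z_{ijk} = \beta_0 + \beta_1\,\mathrm{Female}_{ijk} + u_{k} + v_{jk} + \varepsilon_{ijk},
\qquad
u_{k} \sim \mathcal{N}(0, \sigma_u^2),\quad
v_{jk} \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
```

Under this reading, the coefficient \beta_1 is the raw gender gap in standard deviation units, while u_k and v_{jk} absorb course-level and section-level differences.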

RQ1.B. The Impact of Incoming Preparation on Gender Performance Gaps
Second, we examined the gender gap in student performance while controlling for student incoming preparation as well as their underrepresented minority (URM) status and first generation status (FGEN). Here we define URM students as those who are African American, Latino/a, Pacific Islander, and Native American. Incoming preparation was measured as students' American College Testing (ACT) score. The ACT is a standardized test that covers English, mathematics, reading, and science reasoning, and is commonly used for college admissions as well as in education research as a general measure of "incoming academic preparation." High schools in the United States vary substantially with respect to coursework, institution type (e.g., public, private, home-schooled), size, and grading scale.
Admissions officers in higher education use tests such as the ACT to place student metrics such as grades and class rank in a national perspective (https://www.act.org). However, the location of public schools in the United States also dictates financial resources committed to them, such that a district with higher socio-economic status has more educational resources going to each individual student (Parrish et al., 1995). Thus, variation observed in ACT score can also be explained by socio-economic status of students or proxies thereof (e.g., minority status, first-generation status; Carnevale and Rose, 2013).
For this analysis, to find the simplest best-fitting model, we first started with a basic additive model. Then, we added different interaction terms between variables to this basic model, and tested whether the addition of any interaction term would significantly improve the fit of the model. Our final model included gender (a factor with two levels: male = 0, female = 1), URM status (a factor with two levels: non-URM = 0, URM = 1), first generation (FGEN) status (i.e., whether the student was among the first generation in their family to attend university; a factor with two levels: continuing generation = 0, first generation = 1), and ACT score, as well as any interaction terms between these variables that improved the model fit significantly. Similar to the previous analyses, we also included the random effects of courses and sections.
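The text does not name the test used to decide whether an interaction term improved the fit. One common choice for nested models is a likelihood-ratio test, sketched below for the case of a single added term (one degree of freedom). The function name and the log-likelihood values are hypothetical and serve only to illustrate the comparison.

```python
import math


def lrt_one_extra_term(loglik_reduced, loglik_full):
    """Likelihood-ratio test for nested models that differ by one parameter.

    Returns (test statistic, p-value). For one degree of freedom the
    chi-square tail probability equals erfc(sqrt(stat / 2)).
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    stat = max(stat, 0.0)  # numerical guard: the full model cannot fit worse
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods: additive model vs. additive + one interaction.
stat, p_value = lrt_one_extra_term(loglik_reduced=-1000.0, loglik_full=-998.0)

# Keep the interaction only if it improves the fit at the 5% level (an assumed threshold).
keep_interaction = p_value < 0.05
```

Repeating this comparison for each candidate interaction and retaining only those that pass is one way to arrive at a "simplest best-fitting" model of the kind described above.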

RQ2.A. The Mediation Effect of Social Psychological Factors on Student Performance
For a subsample of students from whom we collected surveys, we used structural equation modeling (SEM) with the lavaan R package (Rosseel, 2012) in order to test the structural relationship between incoming preparation, self-reported test anxiety, interest in course content, and exam performance for different genders. SEM is a statistical tool that allows us to address mechanisms underlying documented trends (Taris, 2002; Jeon, 2015). We used CFI, RMSEA, and SRMR to evaluate model fits. In this analysis, we normalized ACT score, test anxiety, and interest in course content for the whole sample. The normalized scores represent a measure of how many standard deviations a value is from the sample mean score. For students' general levels of test anxiety and course interest, we used data from the survey administered at the beginning of the semester. The descriptive statistics of the subsample used in SEM are reported in Table 3.

RQ2.B. The Variation of Social Psychological Factors Across Genders and Time
To examine the variation of social psychological factors across genders and time, we analyzed how test anxiety and interest in course content vary over the semester for men and women. We used mixed-model multivariable regression analyses to regress either of these two psychological factors on gender and time points (beginning and end of the semester), while including the random effect of students.
Figure 1 shows the average normalized score for different assessment methods across genders. In the next section, we report the sizes of gender gaps, and their significance across colleges and divisions for different assessment methods based on mixed-model single variable regression.

RQ1.A. Gender Gaps in Performance Across Different Assessment Methods
Consistent with the pattern observed in Ballen et al. (2017b), in lower division courses in CBS, women underperformed by a relatively small but significant margin on exams (p = 0.033) (Table 4). However, they significantly overperformed relative to men in non-exam (p < 0.0001) and laboratory measures (p < 0.0001). Due to their overperformance in these measures, women overperformed relative to men in overall course grades (p = 0.044). For CSE lower division courses, which include more male-stereotyped STEM disciplines such as physics, math, and chemistry, we observed the same trend of female underperformance on exams (p < 0.0001); of note, the size of the exam gender gap in CSE was three times that of CBS (−0.24 compared to −0.08 standard deviation). However, there was no gender gap in the non-exam measure (p = 0.233), and women significantly overperformed relative to men in the laboratory measure (p = 0.033). Due in part to the difference in the size of the gender gap in exams, as well as the differential weighting of exams in the overall course grade (e.g., Cotner and Ballen, 2017), women underperformed relative to men in overall course grades in lower division CSE (p = 0.002).
For upper division students in both CBS and CSE, we observed no influence of gender on exam performance (p CBS = 0.164, p CSE = 0.987, Table 5). However, in non-exam assessments, women marginally overperformed relative to men in CBS courses (b CBS = 0.28, p CBS = 0.072), and significantly overperformed in CSE courses (b CSE = 0.54, p CSE < 0.0001). On overall course grades, we did not observe gender differences across disciplines (p CBS = 0.108; p CSE = 0.352, Table 5). Due to a left skew in our data, to be conservative we also re-ran all the above analyses using non-parametric tests. The outcomes were the same, and results of non-parametric analyses were very similar to regression analyses reported here (Tables S1, S2).

RQ1.B. The Impact of Incoming Preparation on Gender Performance Gaps
To analyze what portion of gender performance gaps described in Tables 4, 5 are due to differences in incoming academic preparation, we used mixed-model multivariable linear regression with ACT as a measure of incoming preparation.
In this analysis, we also controlled for URM and FGEN status of students. Table 6 reports the coefficients of the simplest best fitting regression models predicting performance in each assessment method for lower division courses across colleges.
In the following, we will focus on the effect of gender in this analysis. However, there are noteworthy effects of URM and first generation status on student performance that we discuss in detail in a forthcoming publication (Salehi et al., in preparation). Because our data violated the assumption of normality of residual distribution, we also analyzed the data using robust regression (Yaffee, 2002; Koller, 2015, 2016). The results of robust regression are reported in Tables S3 and S4. The results of these analyses were aligned with the following reported results. For lower division courses, we found no significant gender gap in exam performance in CBS courses (p = 0.404) after controlling for incoming preparation (Table 6). However, even after controlling for incoming preparation, women performed 0.19 standard deviation lower than men on exams in lower division CSE courses (p < 0.0001). In contrast, women overperformed relative to men in non-exam and laboratory scores by 0.23 (p < 0.0001), and 0.22 standard deviation (p < 0.0001), respectively, in lower division CBS courses; and 0.17 standard deviation (p = 0.004) in laboratory scores in lower division CSE courses. Women also marginally overperformed by 0.10 standard deviation in non-exam scores of lower division CSE courses (p = 0.090). For the overall course grade, after controlling for incoming preparation, women significantly overperformed relative to men by 0.16 standard deviation in CBS courses (p = 0.0001), but they marginally underperformed by 0.09 standard deviation in CSE courses (p = 0.057).

Table notes: Each cell reports the simplest best fitting model. The simplest best fitting model includes interaction terms only if their addition improved the fit of the model significantly. For each predictor, we report the coefficient, the standard error of the coefficient in parentheses, and p-value of that coefficient. Positive coefficients for categorical variables of URM, FGEN, and gender indicate that URM students, FGEN students, and female students overperformed relative to their counterparts, and negative values mean they underperformed. Significance codes are: ***p < 0.001, **0.001 < p < 0.01, *0.01 < p < 0.05, † 0.05 < p < 0.1.

In upper division courses, after controlling for incoming preparation, we found no gender difference in exam performance for both colleges (Table 7). However, female students overperformed in non-exam measures significantly in upper division CSE courses, and marginally in CBS courses. They had on average 0.58 standard deviation higher non-exam score in upper division CSE courses (p < 0.0001), and 0.28 standard deviation in CBS courses (p = 0.096). Upper division CSE courses in this sample did not have lab components, and in CBS courses, we did not observe differences in lab scores (p = 0.330). For the overall course grade, there was no gender gap in CBS (p = 0.178), and marginal female overperformance of 0.21 standard deviation in CSE (p = 0.095). This marginal overperformance of females in CSE can be explained by their overperformance in non-exam assessments.
In summary, female students only underperformed on exams in lower division, introductory courses. After accounting for incoming preparation through ACT score, this gender gap in exam performance closed in one college (CBS), and decreased in size in the other (CSE). In other forms of assessment, if we observed any gender difference, it was female students outperforming their male counterparts.

RQ2.A. The Mediation Effect of Social Psychological Factors on Student Performance
Previous work demonstrates that test anxiety and interest in the course content exert gender-specific impacts on exam performance in introductory biology (Ballen et al., 2017b). To test whether these patterns persisted across different disciplines and divisions, we re-tested the same model on this larger sample using structural equation modeling (SEM) analysis. We fit the hypothesized model, shown in Figure 2, to four sub-samples of data (CBS lower division, CSE lower division, CBS upper division, CSE upper division), and used gender as a grouping variable to fit this model to the data of each gender separately.
We hypothesized that for women and men in each of these four sub-samples, exam performance is influenced by incoming preparation, test anxiety, and interest in course content. Furthermore, test anxiety and interest in course content are influenced by student incoming preparation. Therefore, this model suggests that incoming preparation influences student exam performance directly, as well as indirectly through test anxiety and interest in course content. In other words, test anxiety and interest in course content partially mediate the effect of incoming preparation on exam performance. By fitting this model to the data of each gender separately, we tested whether these mediation effects are different across genders for each sub-sample.
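The hypothesized partial mediation structure can be written as a set of path equations, sketched here under the assumption that all variables are standardized (so intercepts are omitted). The coefficient names a_1, a_2, b_1, b_2, and c' are labels introduced for illustration and do not appear in the original analysis; the same system would be estimated separately for women and men within each sub-sample.

```latex
\begin{aligned}
\mathrm{Anxiety}_i  &= a_1\,\mathrm{ACT}_i + e_{1i},\\
\mathrm{Interest}_i &= a_2\,\mathrm{ACT}_i + e_{2i},\\
\mathrm{Exam}_i     &= c'\,\mathrm{ACT}_i + b_1\,\mathrm{Anxiety}_i + b_2\,\mathrm{Interest}_i + e_{3i},\\
\text{indirect effect of ACT} &= a_1 b_1 + a_2 b_2,\\
\text{total effect of ACT} &= c' + a_1 b_1 + a_2 b_2 .
\end{aligned}
```

In this notation, c' is the direct path from incoming preparation to exam performance, and a mediated path through test anxiety exists only when both a_1 and b_1 differ from zero.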
Acceptable ranges for SEM fit indices are: 0-0.07 for root mean square error of approximation (RMSEA), above 0.95 for comparative fit index (CFI), and 0-0.1 for standardized root mean square residual (SRMR) (Taris, 2002). This model had acceptable fit indices for all four subsamples, suggesting that it was an acceptable model to describe the variation in the data (CBS-lower: RMSEA = 0.064, CFI = 0.990, SRMR = 0.020; CBS-upper: RMSEA = 0.000, CFI [...]).
For women in CBS courses, test anxiety negatively influenced exam scores in both lower and upper divisions; for male students, however, test anxiety did not correlate with exam scores (Figure 3). Further, for CBS female students, ACT score was also negatively correlated with test anxiety. Therefore, this model suggests that incoming preparation influences student exam performance positively and directly, as well as indirectly through test anxiety (the red path in Figure 3). It is also notable that this indirect effect was stronger for upper division courses, as the negative relationship between test anxiety and female student exam score was stronger in upper division courses than in lower division courses. One standard deviation increase in test anxiety decreased exam score by 0.11 standard deviation in lower division courses and by 0.34 standard deviation in upper division courses.
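The acceptance ranges quoted earlier in this section for RMSEA, CFI, and SRMR can be expressed as a simple check. The sketch below is illustrative only: the function name is hypothetical, the boundaries are treated as inclusive for RMSEA and SRMR (an assumption), and the example applies the check to the CBS lower division values reported in the text.

```python
def acceptable_fit(rmsea, cfi, srmr):
    """Return True when all three fit indices fall in the stated ranges.

    Ranges follow the text: RMSEA in 0-0.07, CFI above 0.95, SRMR in 0-0.1.
    Inclusive bounds for RMSEA and SRMR are an assumption of this sketch.
    """
    return (0.0 <= rmsea <= 0.07) and (cfi > 0.95) and (0.0 <= srmr <= 0.1)


# Fit indices reported above for the CBS lower division sub-sample.
cbs_lower_ok = acceptable_fit(rmsea=0.064, cfi=0.990, srmr=0.020)
```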

Similarly, in CSE, test anxiety negatively correlated with exam scores for female students in both lower and upper divisions, but did not correlate with exam scores for male students in either division (Figure 4). However, unlike CBS, female student anxiety was correlated with their ACT scores only for the lower division courses, and not for the upper division courses. Therefore, test anxiety mediates the indirect effect of ACT score on exam performance only in lower division CSE courses. Like CBS, the negative influence of test anxiety on exam score increased in the upper division courses. One standard deviation increase in test anxiety decreased exam score by 0.14 standard deviation in lower division courses and 0.25 standard deviation in upper division courses. In summary, while the relationship between incoming preparation and test anxiety varied across women studying STEM at the University, we observed a consistently significant negative relationship between test anxiety and exam performance.
Our results suggest that regardless of discipline, exam performance for women was negatively influenced by their test anxiety, and surprisingly, this influence was more pronounced in upper division courses (Figures 3, 4).

RQ2.B. The Variation of Social Psychological Factors Across Genders and Time
In all four sub-samples, except for CBS upper division courses (p = 0.272), women reported significantly higher levels of test anxiety than men (Figure 5). Women reported on average 0.35 standard deviation higher levels of test anxiety (p = 0.0001) than men in CBS lower division courses, and 0.6 standard deviation higher level of test anxiety in both lower and upper division CSE courses (p < 0.0001). Furthermore, in CSE upper division courses, test anxiety increased by 0.29 standard deviation over the semester (p < 0.0001).
Interest in course content was not a significant factor in predicting exam performance in any of the subsamples. That said, we still examined the variation in interest across genders and over the semester. We also found no gender difference in interest in upper division courses across both colleges (p CBS = 0.257, p CSE = 0.665), and no significant change in interest over the semester for CBS (p CBS = 0.900, p CSE = 0.131). However, in lower division courses, female students expressed 0.35 standard deviation higher interest in course content than male students in CBS courses (p = 0.0001), and 0.49 standard deviation lower interest in course content than male students in CSE courses (p < 0.0001). For all students, interest increased by 0.18 standard deviation in CBS lower division courses (p = 0.023), and decreased by 0.17 standard deviation in CSE lower division courses (p = 0.005). Changes over time in interest in lower division courses were not different between genders (p CBS = 0.648, p CSE = 0.711) (Figure 6).

DISCUSSION
Gaps in academic performance are attributable to a host of different external factors, including measures of academic preparation. However, even when accounting for preparation (e.g., via the ACT, SAT, or high-school grade-point average), achievement in some disciplines can be predicted by student characteristics such as gender, underrepresented minority status, and first generation status. We explored how factors other than these unidimensional categories of student identity, such as social psychological factors, impacted performance among students in science. We focused on mechanisms that underlie the gender-based performance gaps in different assessment methods across STEM fields and divisions.
FIGURE 3 | Partial mediation analyses show differences in the significant effects of incoming preparation (ACT) on exam performance for female (Left) and male (Right) students in lower division and upper division courses in the College of Biological Sciences (CBS). Red arrows depict negative effects and blue arrows show positive effects. ACT has direct, positive effects on exam performance for all lower division students and for female upper division students. This effect was marginally significant for male upper division students. For female students in the lower and upper division courses, ACT negatively predicts test anxiety, which in turn influences exam performance. For male students at the lower and upper division, ACT negatively affects test anxiety, but test anxiety does not in turn influence exam performance. In the graphs, "e" circles indicate error terms in estimations of the structural equation model variables.

We showed that women only underperformed in high stakes examinations in lower division introductory courses across multiple STEM fields. However, in non-exam and laboratory assessment methods in these introductory courses, either there was no gender gap or female students overperformed relative to their male peers. In CBS courses the gender gap in exam performance became non-significant when incoming preparation was accounted for. However, in CSE, even after controlling for incoming preparation, we observed a significant gender gap in exam scores. For upper division courses, unlike lower division courses, there was no gender gap in exam performance; and similar to lower division courses, in non-exam and laboratory assessment methods, either there was no gender gap, or female students overperformed relative to their male peers.
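To make the idea of "accounting for incoming preparation" concrete, the sketch below compares an unadjusted gender gap with the gap estimated after adding ACT as a covariate in an ordinary least-squares regression. It is a minimal illustration only: the `ols` helper, the variable names, and all numbers are invented, and plain OLS is an assumption that may differ from the models actually fitted in this study.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * p for x, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Hypothetical data: exam scores depend on ACT only, and the women in this
# invented sample happen to have lower ACT scores on average.
act = [24, 26, 28, 30, 32, 22, 24, 26, 28, 30]
female = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
exam = [40 + 1.5 * score for score in act]

# Model 1: exam ~ intercept + female (unadjusted gap).
raw = ols([[1, f] for f in female], exam)
# Model 2: exam ~ intercept + female + ACT (gap adjusted for preparation).
adjusted = ols([[1, f, s] for f, s in zip(female, act)], exam)

print(f"unadjusted gap: {raw[1]:.2f}")
print(f"gap after controlling for ACT: {adjusted[1]:.2f}")
```

In this constructed example the unadjusted gap disappears once ACT enters the model, which mirrors the pattern described for CBS. A gap that remains after adjustment, as described for CSE, would show up as a non-zero `female` coefficient in the second model.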
The gender difference in "transcriptable" grades in introductory courses in the two colleges could be due to several factors. First, the courses included in this study in CBS are some of a number of courses that meet the university's liberal education requirement for "biology-with-lab." The majority of the CBS courses included in this study do not serve as prerequisites for any other courses nor are they specifically required for most majors. All the CSE courses in this study are prerequisites for numerous courses and are required (or one of several challenging course options) for various majors. Therefore, the pressure to perform in the introductory level courses included in this study might be very different between the colleges. The grade pressure in a biology-with-lab course that is not a requirement for a student's major is likely lower than the grade pressure in courses that are considered gateways into a major. Further, this grade pressure may differentially impact the level of exam anxiety students feel. However, we did not see meaningful differences in test anxiety between the two colleges in these lower division courses.
We examined the mediating effects of test anxiety and interest in course content on gender differences in exam performance. The underperformance of women in lower division exams was explained in part by reported test anxiety. In upper division courses, which lack gender gaps in exam performance, test anxiety still negatively impacted exam performance for women, but not for men. For the remainder of this work, we further explore the phenomenon of anxiety, both general and test anxiety, especially as it pertains to gender-biased gaps in performance in STEM fields.
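The logic of a partial mediation model (ACT predicting test anxiety, which in turn predicts exam performance) can be sketched with a simple product-of-coefficients calculation. This is a simplified, single-mediator regression illustration and not the structural equation models reported in Figures 3 and 4; the function name and all data are hypothetical.

```python
from statistics import mean


def cov(u, v):
    """Sample covariance of two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def simple_mediation(x, m, y):
    """Paths for x -> m -> y with a direct x -> y path (least-squares estimates)."""
    s_xx, s_mm, s_xm = cov(x, x), cov(m, m), cov(x, m)
    s_xy, s_my = cov(x, y), cov(m, y)
    det = s_xx * s_mm - s_xm ** 2
    a = s_xm / s_xx                             # path a: x predicts m
    b = (s_my * s_xx - s_xm * s_xy) / det       # path b: m predicts y, holding x fixed
    direct = (s_xy * s_mm - s_xm * s_my) / det  # path c': x predicts y, holding m fixed
    total = s_xy / s_xx                         # path c: x predicts y, ignoring m
    return {"a": a, "b": b, "direct": direct, "indirect": a * b, "total": total}


# Hypothetical data: higher ACT goes with lower anxiety, and higher anxiety
# goes with lower exam scores.
act = [20, 22, 24, 26, 28, 30, 32, 34]
noise = [0.2, -0.2, 0.1, -0.1, -0.2, 0.2, -0.1, 0.1]
anxiety = [5 - 0.1 * s + e for s, e in zip(act, noise)]
exam = [30 + 1.5 * s - 4 * t for s, t in zip(act, anxiety)]

paths = simple_mediation(act, anxiety, exam)
for name, value in paths.items():
    print(f"{name}: {value:.3f}")
```

The indirect effect is the product of paths a and b, and in least-squares regression the total effect equals the direct effect plus the indirect effect. A group in which path b is near zero, as reported here for men, would show little or no indirect effect through test anxiety even if ACT still predicts anxiety.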
Test anxiety is common among university students; in one sample of undergraduates, 30.0% of males and 46.3% of females reported suffering from test anxiety. In this same report, students often declined seeking help from their peers or instructor for fear of the stigma associated with test anxiety (Gerwing et al., 2015). A meta-analysis of 126 studies found a negative correlation between test anxiety and performance, reporting that overall, students who reported low test anxiety overperformed relative to students who reported high test anxiety by nearly half of a standard deviation (Seipp, 1991).

FIGURE 4 | Partial mediation analyses show differences in the significant effects of incoming preparation (ACT) on exam performance for female and male students in the College of Science and Engineering (CSE) in lower division and upper division courses. Red arrows depict negative effects and blue arrows show positive effects. ACT has direct, positive effects on exam performance for all lower division and upper division students. For female students at the lower division, ACT predicts test anxiety, which negatively predicts exam performance. For women in upper division courses, ACT does not predict test anxiety, but test anxiety negatively predicts exam performance. For male students at the lower and upper division, ACT negatively affects test anxiety, but test anxiety does not in turn influence exam performance. In the graphs, "e" circles indicate error terms in estimations of the structural equation model variables.

FIGURE 5 | Change in test anxiety over the course of the semester for students in CBS and CSE for women (green) and men (blue) in lower division courses (Top) and upper division courses (Bottom). The survey was administered at the beginning of the semester (pre-survey) and at the end of the semester (post-survey; i.e., after students completed the last in-class test, but before their final exam). On average, women (green) reported higher levels of test anxiety than men (blue) in lower division courses in the College of Biological Sciences (CBS) and in both upper and lower divisions in the College of Science and Engineering (CSE). In upper division CSE, average test anxiety significantly increased for all students over the course of the semester.
Further, women are more likely than men to be diagnosed with a generalized anxiety disorder (Wittchen et al., 1994;Kessler et al., 2005;Leach et al., 2008). Similarly, some investigators have documented higher levels of test anxiety in women than in their male peers (Osborne, 2001;Núñez-Peña et al., 2016;Ballen et al., 2017a). Our current work connects these threads by demonstrating that test anxiety negatively impacts exam performance for women, but not for men. Not only do these data confirm prior findings (Ballen et al., 2017a), but they elaborate on earlier work by identifying these trends in courses offered through multiple STEM disciplines besides biology.
While some hypothesize that heightened emotionality during an exam causes heightened anxiety, which in turn depresses performance (Maloney and Beilock, 2012;Ramirez et al., 2013), others suggest that it is the awareness of poor past performance that causes test anxiety (Hembree, 1990). In the first case, it is the anxiety that leads to the poor performance, and in the second, it is the poor performance that leads to the anxiety. Regardless of the origins of anxiety, there are certainly strategies instructors can use to minimize test anxiety and its impacts, strategies that are likely to benefit all students. And, given the demonstrated connection between introductory-level performance and retention in STEM (Seymour and Hewitt, 1997), it is worthwhile to pay closer attention to social psychological factors, such as test anxiety, that may disadvantage underrepresented groups.

How Can Instructors Address Student Anxiety?
It may be difficult to target each individual student's experience of anxiety, especially in the lower-division, high-enrollment courses. However, there are certain strategies instructors can employ to decrease the anxiety itself, or the impacts of the anxiety on a student's performance.
Rethinking assessment can be a helpful strategy that directly addresses test anxiety. Prior work in several introductory biology courses demonstrated that gendered performance gaps disappeared when exams were devalued in favor of the addition of multiple, lower-stakes assessments, possibly as a result of a reduction in test anxiety (Ballen et al., 2017a;Cotner and Ballen, 2017). The fact that women in our sample were more likely to underperform, relative to their male peers, on anxiety-inducing high stakes exams, combined with the fact that, across the board, women express higher levels of test anxiety, suggests that minimizing the impact of exams could lower performance gaps such as those documented here.
Instructors can also create a classroom environment that minimizes general anxiety. Tanner (2013) discusses several instructional strategies for creating a welcoming classroom environment and reducing general class anxiety, from playing music before class to taking time to hear a range of student voices. Avoiding anxiety-inducing behaviors such as cold-calling on individual students (England et al., 2017;Cooper et al., 2018a), and opting for less stressful options such as calling on groups via randomly appointed spokespeople, can minimize anxiety (Rocca, 2010). Creating a pattern of frequently encountered behaviors will also allow students to adjust to the specific in-class expectations of the instructor (McCroskey, 2009). Finally, simply being transparent in expectations (about grading, test content, learning goals) can minimize anxiety (Neer, 1990). These strategies target students' general anxiety in class, not specifically their test anxiety. While these two anxiety constructs can be positively correlated, they might differ significantly as well. Future work should explore whether and how reducing general anxiety in class would impact test anxiety, and how this effect is moderated by demographic status.

Recommendations for Future Research
While there is compelling evidence that test anxiety, as well as anxiety in general, affects performance and retention, there is little if any work demonstrating the impacts of the above interventions on student anxiety, or the connections between lowered anxiety and improved performance. Thus, future work could measure the impacts of experimentally reducing anxiety on student outcomes such as performance, self-efficacy, sense of social belonging, and retention. For example, instructors could incorporate weekly quizzes, instead of or in addition to higher-stakes midterm exams, to test whether this reduces test anxiety, and in turn, improves performance for those impacted by test anxiety. Additionally, adding constructed response questions to summative assessments in large enrollment courses mitigates gender-biased performance outcomes (Stanger-Hall, 2012), and future work would benefit from exploring the impacts of different types of exam questions on student anxiety. Also, offering the option of retaking high stakes exams might reduce the anxiety associated with single metrics, as could extending the time allowed to complete exams.
With this current work, we cannot explain why the gendered gaps in performance disappear in upper division courses. Are women being "weeded out" after introductory courses, are they learning to cope over time, or benefitting from small classrooms? Also, because the populations are different, representing a greater range of majors in the introductory courses, we cannot rule out the possibility that the differences seen in lower division courses are driven largely by students not intending to major in science. These questions should be experimentally addressed, and will also benefit from longitudinal studies of individual students in the STEM pathway.
This study has further limitations. First, we did not have any data about specific instructional practices employed in each particular course. Therefore, we could not examine how instructional practices in each course influenced gender gaps in different assessment methods. Second, the courses in our sample do not represent a cross-section of all courses offered in each college, nor were they selected to be contrasting cases of instructional practices. Our data collection was based on a convenience sample of instructors willing to share their data. Given that, we could only examine whether, on average, there existed gender gaps in different assessment methods in a set of different STEM courses in lower and upper divisions. Despite differences in student composition across two diverse colleges, the similar results we observed suggest these trends are generalizable to science majors and non-majors. Future studies might explore how different instructional practices affect demographic performance gaps, and which STEM fields have been more successful in employing these equitable instructional practices.
Finally, we used ACT as a measure of incoming preparation. We recognize that the ACT itself is a crude measure of student incoming preparation, and that it is also a high stakes examination. Other metrics, such as high-school ranking or GPA, might give a more accurate snapshot of a student's incoming preparation. Given the evidence from this study and previous studies, it is clear that the way in which we assess students should be reconsidered, not only within colleges and universities, but also in the admission process of higher education.

CONCLUSION
For investigators, there is still much work to be done to establish the salient connections between student affect, performance, and retention in STEM. And for instructors, it's clearly time to reconsider long-standing norms related to assessment strategies. Specifically, it may be time for a shift away from reliance on high stakes, timed examinations, which have negative effects on female students and may not be telling of a student's ability to succeed in a discipline. Rather, we encourage the use of evaluation that measures relevant skills, encourages growth, and allows instructors and students to better assess student progress.

DATA AVAILABILITY STATEMENT
The datasets generated for this study are available on request to the corresponding author.

ETHICS STATEMENT
We received IRB exemption to work with student data from University of Minnesota, IRB 00000800.

AUTHOR CONTRIBUTIONS
SS, SC, and CB contributed to the conception and design of the study. CB organized the database. SS performed the statistical analysis. SS and CB wrote the first draft of the manuscript. SC wrote sections of the manuscript. All authors contributed to manuscript revision, read, and approved the submitted version.