Gender Stereotypes in Student Evaluations of Teaching

This paper tests how gender stereotypes may result in biased student evaluations of teaching (SET). We thereby contribute to an ongoing discussion about the validity and use of SET in academia. According to social psychological theory, gender biases in SET may occur because of a lack of fit between gender stereotypes and the professional roles individuals engage in. A lack of fit often leads to more negative evaluations. Given that the role of lecturer is associated with masculinity, women might suffer from biased SET because gender stereotypes indicate that they do not fit this role. In two 2 × 2 between-groups online experiments (Ns = 400 and 452), participants read about a fictitious woman or man lecturer, described in terms of stereotypically feminine or masculine behavior, and evaluated the lecturer on different SET outcomes. Results showed that women lecturers were not disfavored in general, but that the described feminine or masculine behaviors led to gendered evaluations of the lecturer. The results were especially pronounced in Experiment 2, where a lecturer described as displaying feminine behaviors was expected to be more approachable, was better liked, and was the lecturer whose course students would rather attend. However, a lecturer displaying masculine behaviors was instead perceived as more competent and as a better pedagogue and leader. Gender-incongruent behavior was therefore not sanctioned by lower SET. The results still support the view that SET should not be used as the sole indicator of a lecturer's pedagogic ability in promotion and hiring decisions, because they may be gender-biased.


INTRODUCTION
The purpose of this article was to test the impact of gender stereotypes on student evaluations of teaching (SET) in two online social psychological experiments. Previous research in this field indicates a gender bias in SET where women generally receive lower SET compared to men (e.g., MacNell et al., 2015; Boring, 2016; Mengel et al., 2018; Mitchell and Martin, 2018; Fan et al., 2019). With this article, we contribute to an ongoing discussion about the use of SET, both as formative and summative evaluations of teaching and teachers. We provide new insights into the mechanisms behind SET and how they relate to a lecturer's gender identity and gendered behavior.
Taking a social psychological perspective, gender biases may occur because gender stereotypes prescribe and proscribe certain behaviors for individuals of different genders. Specifically, when gender stereotypes and professional roles do not fit, the individual can be sanctioned with negative evaluations (Heilman, 2001; Heilman and Chen, 2005; Heilman and Haynes, 2005). In this article, we test to what extent women lecturers in higher education are sanctioned by low SET due to a tradeoff between the behaviors expected from the supposedly masculine-coded role as a university lecturer and the stereotypes about how women should and should not be.

Student Evaluations of Teaching
Originally, SET were introduced for formative purposes. That is, the evaluations were to be used in order to improve and shape the quality of teaching (Hornstein, 2016). Since then, SET have become a primary indicator in summative evaluations of a lecturer's performance. That is, SET are used as an overall summary of pedagogical competence, often as the sole indicator of this competence (Berk, 2005; Galbraith et al., 2012; Spooren et al., 2013). SET are now often used for promotion and hiring decisions (Cashin, 1999; Seldin, 1999; Clayson, 2009; Davis, 2009; Seldin et al., 2010), which makes it important to understand systematic variations in SET.
SET were first criticized by Adams (1997), who pointed out several problems concerning validity, reliability, gender bias, and a number of other related issues (Yunker and Yunker, 2003; Wright, 2006; Beecham, 2009; Hoefer et al., 2012; Spooren et al., 2013; Braga et al., 2014; Stark and Freishtat, 2014; Boring et al., 2016). It has been suggested that SET mainly reflect students' satisfaction with teaching after they have finished a course. As such, it is argued that SET should be seen as a measure of popularity rather than of teaching capability (Beecham, 2009; Spooren et al., 2013; Braga et al., 2014; Stark and Freishtat, 2014). This paves the way for both individual and contextual factors to influence whether evaluations are high or low, and leads to the aim of the present article: to test if gender stereotypes influence SET.
Several studies have shown a gender bias in SET, although the results are inconclusive. Many studies have shown that women receive lower evaluations than men (MacNell et al., 2015; Boring et al., 2016; Mengel et al., 2018; Mitchell and Martin, 2018). For instance, Boring et al. (2016) showed a systematic gender bias in SET where women lecturers received lower evaluations on seemingly objective aspects, such as how promptly assignments were graded. Likewise, Mitchell and Martin (2018) showed that a woman lecturer was rated lower on similar aspects, such as the course itself, the workload, and the technology. However, some studies show that women receive higher ratings than men (Rowden and Carlson, 1996; Bachen et al., 1999), and others have found no difference between evaluations of women and men (Feldman, 1993; Centra and Gaubatz, 2000). These results imply that the gender of a lecturer alone is not sufficient to explain variations in SET between women and men lecturers. One possible cause of the inconsistencies in earlier results may be that both individual and contextual factors interact with a lecturer's gender. For instance, Boring et al. (2016) found that the gender bias in SET varied with, for example, discipline. These results are supported by Mengel et al. (2018), who showed that the gender bias is magnified in mathematical courses and is particularly pronounced for younger women lecturers. One explanation might be that the STEM field (Science, Technology, Engineering, and Math) is heavily dominated by men (Makarova et al., 2019), so that (younger) women violate the gender norms, resulting in a lack of fit between the expectations of their gender role and the expectations of the role as a university lecturer, which could explain the bias (Heilman, 1983, 2012).
Such lack of fit, described in more detail below, indicates that a woman lecturer behaving in a "masculine" way may receive different SET than a woman lecturer acting in a "feminine" way, which essentially decreases the lack of fit. To better understand the complexity of how gender, stereotypes, and the fit between a lecturer's gender and their behavior operate to influence biases in SET, we now turn to social psychological theory.

Gender Stereotypes
Gender stereotypes are collective mental representations about what is typical regarding women and men when it comes to personality, behavior, and/or expression (Ellemers, 2018). This means that gender stereotypes are shared generalizations about women and men, and the consensus of these generalizations among the population is high (Hentschel et al., 2019). The content of gender stereotypes pertains to two core dimensions in social judgment, referred to as agency and communion (Abele and Wojciszke, 2014). Agency refers to goal-achievement, whereas communion refers to the maintenance of social relationships (Bakan, 1966). Women are more often perceived as communal (e.g., caring, sensitive, loyal, and understanding; Eagly and Wood, 2012), while men are more often perceived as agentic (e.g., independent, assertive, dominant, self-reliant, and determined). Hence, agentic traits are traditionally associated with masculinity, while communal traits are traditionally associated with femininity. Importantly, gender stereotypes function both prescriptively (what women and men should engage in, and how they should be) and proscriptively (what they should not engage in and be) (Gustafsson Sendén et al., 2019; Hentschel et al., 2019).
When gender stereotypes are fulfilled, that is, when women perform communal tasks and men perform agentic tasks, individuals are positively evaluated. Thus, lecturers who adhere to gendered expectations can be evaluated more favorably (Andersen and Miller, 1997). For example, Boring (2016) found that women lecturers received the highest ratings on availability and quality of contact, two characteristics typical of the stereotypes for women (Abele and Wojciszke, 2014). In relation to social perception and evaluation of others, the problem with stereotypes becomes evident when they are challenged, that is, when gender and role, or behavior, mismatch. When stereotypes regarding roles or behavior and gender are incongruent (i.e., lack of fit), individuals are likely to be sanctioned and negatively evaluated (Heilman, 1983, 2012; Eagly and Karau, 2002; Heilman and Okimoto, 2007; Brescoll et al., 2010). Rudman et al. (2012) discuss a gender backlash effect where women can reach higher positions through agentic behaviors, but are at the same time disliked and hence not viewed as hirable. This puts women in a situation where they are forced to choose between being liked and being respected, which undermines their ability to achieve positions of power (Rudman et al., 2012). For instance, when women engage in behaviors typically considered masculine, they are less liked and their behavior is found to be less socially accepted, compared to when men engage in the same behavior (Bartol and Butterfield, 1976; Jago and Vroom, 1982; Carli, 1990; Carli et al., 1995; Heilman and Okimoto, 2007). This seems to be true in students' perceptions of lecturers as well. When gender roles are violated by lecturers, students become critical (Chamberlin and Hickey, 2001; Sprague and Massoni, 2005). This suggests that if gender stereotypes are responsible for the variation in SET between women and men lecturers that has been observed in previous research, the role as a lecturer is coded as masculine.
Traditionally, higher education has been exclusively for men, which could still affect how the role as a university lecturer is perceived in terms of gender. Moreover, being a lecturer at a higher education institution is a leadership role, and because leadership and authority traditionally are associated with masculinity (see Heilman and Okimoto, 2007), women lecturers violate gender stereotypes and may face biases and criticism (Eagly and Karau, 2002). Hence, women lecturers must balance the demands of their gender role, as well as the demands of being an authority figure, which inevitably will lead to some sort of discrepancy. Taken together, theory and empirical studies highlight the difficulty that women lecturers have in balancing the tension between agentic demands from the leadership role and communal demands from the gender role (Zhen et al., 2018).

Overview of the Present Research
The present research zooms in on the discrepancy between gender stereotypes and the role as a university lecturer as a source of gender bias in SET. Specifically, we test if women lecturers are sanctioned if they do not engage in traditionally feminine behaviors, or lack traditionally feminine characteristics (Rudman, 1998; Rudman and Glick, 2001). The following hypotheses are formulated:
H1: Women lecturers receive lower SET on average, compared to men lecturers.
H2: A woman lecturer described as having traditionally masculine behavior and characteristics receives the lowest SET.
In two experiments, students were presented with a description of a fictitious lecturer. The descriptions varied with respect to the lecturer's gender (the lecturer was referred to as either "she" or "he" in the text). Moreover, the behavior and characteristics of the lecturer were described as either stereotypically feminine or stereotypically masculine. In Experiment 1, the description of the lecturer contained both positive and negative feminine/masculine behaviors and traits. In Experiment 2, the valence of feminine/masculine behaviors and traits (i.e., positive and negative) was even more balanced. Participants' task was to rate the lecturer on common SET items. Experiment 1 used a wide range of SET items, mainly from previous literature. In Experiment 2, the number of items was reduced due to semantic overlap.
The studies were carried out in accordance with the national guidelines on ethical research established by the Swedish Research Council, retrievable at: https://publikationer.vr.se/en/product/good-research-practice/.

EXPERIMENT 1
Because our hypotheses are formulated to test the potential mismatch between the role as a university lecturer and the female gender role, we first established that the role as a university lecturer was indeed coded as masculine. In a pilot study, 82 students read a description of a lecturer. The description varied with respect to gender-stereotypical (feminine and masculine) characteristics and behaviors of the lecturer, but no actual gender information was provided (i.e., we replaced the pronoun with X). After reading the description of the lecturer, participants indicated what gender they thought the lecturer had, as a free-text response. Across the feminine (n = 33) and masculine (n = 49) conditions, 74 (90%) participants indicated that the lecturer was a man, and only 8 (10%) indicated a woman (masculine condition: man = 44, woman = 5; feminine condition: man = 30, woman = 3). No other genders were suggested. Hence, the role as university lecturer is clearly associated with masculinity.
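The pilot counts can be checked with a few lines of arithmetic. The sketch below recomputes the shares reported in the text and, purely as an illustration that is not part of the reported analysis, adds an exact two-sided binomial test against an assumed 50/50 baseline. The function names `share_man` and `binomial_two_sided_p` and the choice of baseline are assumptions made for this example.

```python
from math import comb

# Counts reported for the pilot study (guessed gender of the lecturer).
counts = {
    "masculine": {"man": 44, "woman": 5},
    "feminine": {"man": 30, "woman": 3},
}


def share_man(cell):
    """Proportion of participants in one condition who guessed 'man'."""
    return cell["man"] / (cell["man"] + cell["woman"])


def binomial_two_sided_p(k, n):
    """Exact two-sided binomial p-value for k successes in n trials.

    Assumes a null probability of 0.5 (an illustrative baseline, not taken
    from the paper). Under that null the distribution is symmetric, so the
    two-sided p-value is twice the smaller tail, capped at 1.
    """
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    lower = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * min(upper, lower))


total_man = sum(c["man"] for c in counts.values())
total_n = sum(c["man"] + c["woman"] for c in counts.values())

print(f"overall share guessing 'man': {total_man / total_n:.3f}")
for condition, cell in counts.items():
    print(f"{condition} condition: {share_man(cell):.3f}")
print(f"two-sided binomial p (null = 0.5): {binomial_two_sided_p(total_man, total_n):.2e}")
```

The per-condition shares make it easy to see that the tendency to guess "man" was similar in the feminine and masculine conditions.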
To assess the impact of lack of fit between the lecturer role and gender role, we designed an experiment where the lecturer's gender and behavior varied between conditions. The design was a 2 (gender: she/he) × 2 (behavior: feminine/masculine) between-groups factorial design. For example, in the feminine version, the lecturer was described as supportive and caring, being available for students, and being responsive and empathic, while in the masculine version the lecturer was described as more focused on research, being assertive and demanding, expecting hard work, and being unavailable. The descriptions were balanced in that the feminine version also contained some negative feminine traits, such as being uncertain, whereas the masculine version contained some positive masculine traits, such as being certain. The descriptions are provided in the Supplementary Material. Participants were randomly assigned to one of the four conditions (ns: she/masculine = 119, she/feminine = 89, he/masculine = 99, he/feminine = 94).

Measures
To measure SET, a range of measures from previous research were included: the Professor Effectiveness scale (Goebel and Cashen, 1979; Wilson et al., 2014) and the Brief Professor-Student Rapport Scale (Ryan and Wilson, 2014) with two sub-scales (Perceptions of the teacher and Student Engagement). Personal characteristics of the lecturer were assessed by items suggested by MacNell et al. (2015) and Boring (2016). To assess perceptions of the lecturer's competence, we included items referring to more general perceptions of the course and the pedagogy, since these may better reflect competence compared to the evaluation of individual characteristics. These items were averaged into a mean index. Two items measured the difficulty level of the course, and two items measured the general impression of the course. Finally, participants rated warmth and competence (Fiske et al., 2007). Where indices were made of the scales, we averaged the items into a mean index. Cronbach's α's for these scales are shown in Table 1, where it is also detailed if the items were analyzed separately (i.e., not included in a scale). The questions are summarized in Table 1.

Table 1 (excerpt). SET items and response scales.
Analyzed separately: The lecturer encourages questions; The lecturer expects good work; The lecturer assigns too much work; The lecturer is organized; The lecturer can explain concepts; The lecturer behaves in a friendly manner; The lecturer is generally a good teacher.
The Brief Professor-Student Rapport Scale (Ryan and Wilson, 2014), 1 = Strongly disagree, 7 = Strongly agree: The lecturer is compassionate; The lecturer is enthusiastic; The lecturer is reliable; The lecturer is receptive; The lecturer cares about the class; The lecturer encourages questions and comments from students; The lecturer is caring; The lecturer is consistent; The lecturer is enthusiastic; The lecturer is fair; The lecturer is helpful; The lecturer is knowledgeable; The lecturer is professional; The lecturer is prompt; The lecturer is respectful; The lecturer provides praise; The lecturer provides feedback.
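As an illustration of how a mean index and its Cronbach's α can be computed from item ratings, here is a minimal sketch using only the Python standard library. The ratings are hypothetical and the function names `cronbach_alpha` and `mean_index` are assumptions for this example; nothing here reproduces the study's data or software.

```python
from statistics import mean, variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of sum scores)
    """
    k = len(responses[0])                  # number of items in the scale
    items = list(zip(*responses))          # one tuple of scores per item
    item_vars = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_vars / total_var)


def mean_index(responses):
    """Per-respondent mean across items (the 'mean index' of a scale)."""
    return [mean(row) for row in responses]


# Hypothetical 1-7 ratings: 5 respondents x 3 items (not data from the study).
ratings = [
    [6, 5, 6],
    [4, 4, 5],
    [7, 6, 7],
    [3, 2, 3],
    [5, 5, 4],
]

print(round(cronbach_alpha(ratings), 2))
print(mean_index(ratings))
```

Sample variances are used consistently for both the items and the sum score, so the ratio in the formula does not depend on the choice between sample and population variance.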

Results
For all of the outcome measures detailed in Table 1, we computed 2 × 2 ANOVAs with gender of the lecturer (she/he) and gendered behavior (feminine/masculine) as between-participant factors. We also included participant gender as a covariate. Means, standard deviations, and F-values for the main effects are shown in Table 2. Only the main effects are presented, because none of the interaction effects were significant. The first hypothesis stated that women lecturers overall should receive lower SET than men. The results showed no main effects of the lecturer's gender on any of the outcome variables (see Table 2). The second hypothesis stated that women lecturers described as having masculine characteristics and behavior should receive the lowest SET. This hypothesis implies that we would see interaction effects between gender of the lecturer and described behavior. However, none of the interactions were significant. Thus, the results indicate that the effect of feminine versus masculine behavior on ratings did not differ between a woman lecturer and a man lecturer. This means that neither of the hypotheses was supported. Interestingly, there were significant main effects of whether the lecturer was described as having feminine or masculine characteristics on all outcome variables. The means are shown in Table 2. For easier overview, significant differences in favor of the feminine description are marked in bold, while differences in favor of the masculine description are marked in gray.
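To make the logic of the two main effects and the interaction concrete, the following is a minimal sketch of a 2 × 2 between-groups ANOVA in plain Python. It rests on simplifying assumptions that differ from the reported analysis: the cells are balanced (the experiments had unequal cell sizes), the participant-gender covariate is omitted, and only F ratios are computed, since the standard library has no F distribution for p-values. The function name `two_way_anova_balanced` and the ratings are hypothetical.

```python
from statistics import mean


def two_way_anova_balanced(cells):
    """Sums of squares and F ratios for a balanced two-factor design.

    `cells` maps (level_a, level_b) -> list of ratings. Every cell must hold
    the same number of ratings (a simplifying assumption of this sketch).
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    if any(len(v) != n for v in cells.values()):
        raise ValueError("this sketch only handles equal cell sizes")

    cell_mean = {key: mean(v) for key, v in cells.items()}
    grand = mean(cell_mean.values())  # equals the grand mean when balanced
    a_mean = {a: mean(cell_mean[(a, b)] for b in b_levels) for a in a_levels}
    b_mean = {b: mean(cell_mean[(a, b)] for a in a_levels) for b in b_levels}

    ss = {
        "A": n * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels),
        "B": n * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels),
        "AxB": n * sum(
            (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
            for a in a_levels
            for b in b_levels
        ),
    }
    ss_error = sum((x - cell_mean[key]) ** 2 for key, v in cells.items() for x in v)

    df = {"A": len(a_levels) - 1, "B": len(b_levels) - 1}
    df["AxB"] = df["A"] * df["B"]
    df_error = len(a_levels) * len(b_levels) * (n - 1)
    ms_error = ss_error / df_error

    result = {
        effect: {"SS": ss[effect], "df": df[effect], "F": ss[effect] / df[effect] / ms_error}
        for effect in ss
    }
    result["error"] = {"SS": ss_error, "df": df_error}
    return result


# Hypothetical ratings: factor A = lecturer gender, factor B = described behavior.
example = {
    ("she", "feminine"): [5, 6, 7],
    ("he", "feminine"): [5, 6, 7],
    ("she", "masculine"): [3, 4, 5],
    ("he", "masculine"): [3, 4, 5],
}

for effect, stats in two_way_anova_balanced(example).items():
    print(effect, stats)
```

In this constructed example the ratings differ only by described behavior, so factor B carries all of the between-cell variance while the gender effect and the interaction are zero, which mirrors the qualitative pattern described for Experiment 1.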
In sum, participants rated the feminine behavior more positively than the masculine behavior on almost all the outcome measures. The differences on many items are unsurprising, since the text in the feminine condition described a lecturer who was more involved with the students and teaching; it can therefore be expected that students would prefer a lecturer with these characteristics. For instance, in the Professor Effectiveness scale, the items encourages questions, is organized, can explain concepts, behaves in a friendly manner, and is generally a good teacher should receive higher values based on the text in the feminine condition. An interesting finding was that the participants expected the masculine lecturer to expect good work and assign too much work to a higher degree than the feminine lecturer. Other results that are not easily explained by the descriptions of the lecturer are the items related to difficulty. The participants thought that the course had higher requirements and that students in the course studied more when the behavior of the lecturer was masculine.
Combined, the results indicate that the participants rated a lecturer described in feminine terms more positively, and would rather attend that lecturer's course, compared to a lecturer described in masculine terms. However, the participants thought that the masculine behavior implied higher demands and a more difficult course, in which students put in more hours. These are not unambiguously negative features from a learning perspective.
Finally, the lecturer with masculine behavior was rated as less competent than the lecturer with feminine behavior. Even though the effect was smaller compared to the other effects in this study, it was significant. This was surprising since competence has been strongly associated with masculinity (Fiske et al., 2007). However, recent research shows that competence is the aspect of gender stereotypes that has changed the most over the years, and that women now sometimes are perceived as more competent than men (Gustafsson Sendén et al., 2019; Eagly et al., 2020). Hence, the results do not contradict recent research. Also, in the masculine condition, the lecturer was described as more competent as a researcher than as a teacher, while the lecturer in the feminine condition was described as more competent in pedagogy. It is possible that this asymmetry between competence in different areas influenced the participants when they made the overall competence rating. From a student perspective, pedagogical competence should be more important in SET than research competence. One reason for the lack of main effects of the lecturer's gender, or interactions with description of behavior and characteristics, may be that the feminine version overall was seen as more positive from a student's perspective. Hence, in a second experiment, the descriptions of the lecturer were more ambiguous, so that the feminine condition also entailed more negative feminine traits and the masculine condition entailed more positive masculine traits. We also reduced the number of outcome variables, and focused on assessments of the course that were not directly related to the individual described.

EXPERIMENT 2
The design was the same as in Experiment 1, that is, a 2 (gender of lecturer: she/he) × 2 (description: feminine/masculine) between-groups factorial design. Participants were randomly assigned to one of the four conditions (ns: she/masculine = 112, she/feminine = 100, he/masculine = 122, he/feminine = 118). As mentioned, the feminine and masculine descriptions were now more balanced with respect to the valence of the described traits and behaviors. For instance, the feminine description detailed that the lecturer appeared afraid of students when criticized, and mentioned problems in the teaching team where the lecturer lacked leadership skills and confidence (Abele and Wojciszke, 2014). Because we still kept the positive aspects in the description, such as being considerate, sympathetic, and caring, the description was ambivalent on purpose. The masculine description underwent the same procedure, such that the lecturer was described as confident and convincing, ambitious, competent, and professional, and these traits were applied not only to research but also to teaching. By keeping some of the negative aspects from the previous description, such as being seen as unapproachable, research-focused, and rigid, this description also became ambivalent on purpose.

Measures
The outcome measures assessed pedagogy and evaluations of the course, rather than traits of the lecturer. The pedagogy items formed a scale with a mean index and were the same as in Experiment 1 (α = 0.85). The items measuring difficulty level of the course were also the same, except for the item measuring perceived amount of study hours. This time, perceived amount of study hours was assessed with a scale from 1 = Very little time to 7 = Very much time instead of a free-text response, so that the item could be included in the mean index of difficulty level instead of being analyzed separately. We kept the item "The lecturer assigns too much work" from the Professor Effectiveness scale (Goebel and Cashen, 1979; Wilson et al., 2014) as it fit well with the other difficulty level items. These three items were averaged into a mean index, α = 0.70. Also, the single items regarding overall impression of and interest in attending the course were the same as in Experiment 1. We added two items of general impression: What is your overall impression of the lecturer? and How does the lecturer seem to be as a leader of the teaching team? Answers ranged from 1 = Extremely bad to 7 = Extremely good. Three items asked about specific traits and engagement: Do you think of the lecturer as a serious person? Do you think that the lecturer is knowledgeable? and Do you think that the lecturer is engaged in the teaching? Answers ranged from 1 = No, not at all to 7 = Yes, definitely. We also kept the item measuring competence and the item "What is your impression that the students think of the lecturer?" Finally, we kept the questions by Boring et al. (2016) because they focused more on the lecturer's ability than individual traits (see Table 1).

Results
For all outcome measures, we computed 2 × 2 ANOVAs with gender of the lecturer (she/he) and description (feminine/masculine) as between-participants factors. Participant gender was again included as a covariate. Means, standard deviations, and F-values for the main effects are shown in Table 3. Only the main effects are included, because none of the interaction effects were significant. For easier overview, we again marked significant differences in favor of the feminine lecturer (or a woman lecturer) in bold, while differences in favor of the masculine lecturer are marked in gray. Table 3 shows a general pattern where type of behavior is significant on most outcome variables. For some outcomes, gender of the lecturer (she/he) was significant.
The first hypothesis stated that women should receive lower SET on average, compared to men. In contrast to Hypothesis 1, the effects were rather in favor of the woman. For instance, the overall impression of the course was higher for the woman, and she was also rated as better at pedagogy, compared to the man. Three items in the Boring (2016) scale were also significant in favor of a woman lecturer: preparation and organization, ability to relate to current issues and contribution to the students' intellectual development, which at least partly aligns with Boring's results. However, it should be noted that the effects were rather weak.
The second hypothesis focused on the interaction between gender of the lecturer (she/he) and description of behavior and characteristics (feminine/masculine), where we expected that a masculine woman would be rated lowest on SET. Because no interactions were significant, H2 was not supported. Hence, the results so far are largely in line with the results found in Experiment 1. This means that gender-incongruent behavior does not seem to lead to lower SET for either women or men lecturers.
Similar to Experiment 1, there were several main effects of description (i.e., feminine/masculine). However, in contrast to Experiment 1, the effects were not consistently in favor of the feminine behavior, which indicates that we managed to make the descriptions more ambiguous. First, the masculine behavior seemed to reflect perceptions of being a better pedagogue. The feminine behavior was seen as better when it comes to encouraging work, being available, quality of contact, and relating to current issues, again largely in line with Experiment 1 and Boring (2016), and also in line with a feminine gender stereotype (Abele and Wojciszke, 2014). As in Experiment 1, the masculine behavior was perceived as "tougher," such that ratings of the lecturer described as masculine were higher on difficulty compared to the feminine condition.
The masculine behavior was perceived as conforming to traditional male stereotypes of leadership and competence, such that the lecturer was seen as more serious, knowledgeable, and competent, as well as being a better leader of the teaching team and the class. The shift in competence ratings from the feminine behavior in Experiment 1 to the masculine behavior in Experiment 2 is most likely due to the masculine description this time including the competence to, for instance, respond to students' questions, and more involvement in the course in general.
While the participants rated masculine behavior higher on pedagogy, leadership, and learning, they still preferred the lecturer with the feminine behavior. The feminine behavior was rated higher on overall impression and engagement in teaching. The students rated the lecturer with feminine behavior as more liked, and they expressed more interest in attending a course with a lecturer acting more feminine rather than masculine. Other stereotypically feminine characteristics that were rated higher in the feminine condition were ability to encourage work and availability, both of which conform to a nurturing, caretaking feminine gender role (Abele and Wojciszke, 2014). Finally, the masculine lecturer received higher ratings on organization and preparation. It should, however, be noted that the feminine and masculine descriptions do not describe gender per se, but rather traits and behaviors associated with gender. This is interesting, because the behavior seemed to be more important than the lecturer's gender, and also more important than whether a lecturer engages in gender-congruent or gender-incongruent behavior. In short, behavior and characteristics seem to trump gender information regarding how the lecturers in our study were evaluated; however, the evaluations still follow stereotypical patterns of femininity and masculinity. Moreover, gender information and gender-stereotypical behavior and characteristics sometimes seem to clash, potentially leading to a very precarious situation for lecturers in general.

DISCUSSION
Two experiments tested whether the conflict between the gender role for women and the role of a university lecturer could be the reason that previous research has shown a general gender bias in SET. Previous research shows that women often receive lower SET compared to men, but also that SET follow gendered expectations (MacNell et al., 2015; Boring et al., 2016; Mengel et al., 2018; Mitchell and Martin, 2018). This article makes several important contributions. First, we use an experimental design that manipulates gender congruency in behavior. Second, even though our hypotheses were not supported, the results provide new knowledge about the gendered nature of SET, and thereby also contribute to the ongoing discussion about SET and their use. In two experiments, we found that evaluations of a target lecturer depended on the stereotypically gendered behavior and characteristics described, and that these evaluations closely followed gendered expectations.
Much research in social psychology shows that women and men are thought to possess different traits and characteristics that correspond to general behaviors displayed by their respective gender group on an aggregated level (Ellemers, 2018). When there is a lack of fit or incongruence between the stereotypical ideas of how someone should be or behave, with regard to gender, and the stereotypical associations to the role they hold, this incongruence may lead to biases and criticism (Heilman, 1983, 2001, 2012). The lack of fit can be driven by actual job segregation (such as in this case, where more men than women are observed in the role of university lecturers) or by stereotypical ideas that a university lecturer is a man, as we found in the pilot study. Hence, we expected that women lecturers overall would receive lower SET than men, because of a lack of fit between gender stereotypes and professional role. Second, we hypothesized that a woman lecturer described as masculine in terms of behavior and characteristics would be rated lowest on SET, because of the major violation of gender norms. However, neither hypothesis was supported.
Hence, in this situation, violations of gender roles and behavior do not seem to elicit negative perceptions of the lecturer. This points to a positive development within the context of higher education, since it implies that both women and men can engage in both gender-stereotypical and non-stereotypical behavior without being punished (or rewarded) through SET. This means that, based on this study, we cannot say that an inconsistency between women lecturers' behavior and their gender role has led to the generally lower SET for women that has previously been observed (MacNell et al., 2015; Boring et al., 2016; Mengel et al., 2018). We suggest that more studies should be performed to establish whether this is the case.
There was a fairly consistent and strong pattern in which the described behavior and characteristics influenced evaluations, although not in the hypothesized direction. Instead, the feminine behavior was by and large evaluated more positively than the masculine behavior. Nonetheless, the pattern makes sense from a gender stereotype perspective. Overall, the ratings conformed to gender stereotypes about femininity and masculinity, even though there were some differences between the experiments. In Experiment 1, the feminine condition led to more positive evaluations on almost all questions. However, higher workload, demands, and requirements were more strongly associated with the masculine behavior. These are not necessarily indicative of negativity, but are more clearly associated with a masculine stereotype of being stern, assertive, and demanding (Abele and Wojciszke, 2014). Still, the participants strongly preferred the lecturer with feminine behavior, regardless of the lecturer's gender. As mentioned, one reason for the overwhelmingly positive evaluations of the feminine behavior in Experiment 1 could be that the descriptions were asymmetric with respect to valence: the feminine version did not include many negative aspects, while the masculine version included few positive aspects, at least from a student perspective. For instance, in the masculine condition, the lecturer was presented as a leading researcher, which is not necessarily something that students care about. Hence, the results of Experiment 1 should be interpreted with caution.
Nevertheless, the tendencies identified in Experiment 1 were by and large confirmed in Experiment 2, where the stimulus material was more ambiguous in terms of valence. Because stereotypes are heuristics in impression formation (Heilman, 2012), evaluators may rely more heavily on them when there is little or ambiguous information. The results of the second experiment were accordingly slightly different, but the general pattern showed that evaluations largely conformed to gender stereotypes. The lecturer described as masculine was perceived as a better leader, more competent, a better pedagogue, and "tougher," and students expected to learn more from their course. Hence, evaluations of the masculine behavior followed mainly from stereotypically masculine attributes such as leadership skills, competence, and goal-orientation (Abele and Wojciszke, 2014). However, the feminine lecturer was perceived as being more approachable and was more liked. Moreover, and similar to Experiment 1, participants preferred to attend the course when the lecturer was described as feminine. Again, these features conform to a feminine gender stereotype, which is focused on the maintenance of relationships (Abele and Wojciszke, 2014).
These two experiments highlight the precarious situation that lecturers may face. While the feminine behavior increased liking, the masculine behavior increased competence ratings. Even though there were no interactions with the lecturer's gender, it is plausible to assume that this balance is more difficult for women lecturers, for whom the likable traits and behaviors are expected and cannot be bargained with (Heilman and Okimoto, 2007). It may be difficult for a lecturer to be rated highly on both liking (or warmth) and competence, which is in line with research on gender stereotypes (Fiske et al., 2007; Heilman, 2012). Given that SET form the basis of hiring and promotion decisions (Cashin, 1999; Seldin, 1999; Clayson, 2009; Davis, 2009; Seldin et al., 2010), the results of the present research contribute to the literature on how such evaluations should be interpreted.
Much of the international research on SET uses questions specifically about lecturers as individuals, and their traits (Goebel and Cashen, 1979; Ryan and Wilson, 2014; Wilson et al., 2014; MacNell et al., 2015). However, whether a person is seen as compassionate or caring does not reveal information about their ability to perform as a lecturer, or about their pedagogical skills, which should be the focus of SET, regardless of how SET are to be used. Therefore, other questions should be given space, such as questions relating to the set-up of the course, the organization, and the study materials. It is plausible to believe that such evaluations would better estimate a lecturer's pedagogical skills and abilities. However, as shown in the two experiments in this article, these judgments still conform to gendered expectations about behavior. These results line up with previous research by Boring et al. (2016) and Mitchell and Martin (2018), who found that gender bias affected judgments of seemingly objective aspects of teaching.

Limitations and Suggestions for the Future
To our knowledge, this is the first experimental design that tests gender bias in SET. The benefit of using experiments in research is also their drawback: the setting is sterile and context-free. The positive side is that the experiment allows for high control over potential confounds. In this first attempt, we aimed to have as little confounding information as possible. Hence, the stimulus material did not, for example, present what field the lecturer is active in, which is a factor previously shown to affect gender bias in SET (Mengel et al., 2018). This implies that the description may be too "clean" and generic, which might make it difficult for the participants to truly engage with the described lecturer. Because of the lack of substantial information to relate the lecturer to, responses may have been influenced by social desirability, that is, answers may have been colored by a desire to appear gender egalitarian. In line with this, the expected effects of the lecturer's gender were not found in either of the experiments, nor were the interactions between gender and incongruent behavior. One reason may be that the participants were aware of gender aspects in these kinds of situations, which could lead to socially desirable answers. Indicative of this interpretation is that when the participants were asked to indicate their thoughts regarding the purpose of the study, several suggested that the study concerned gender issues. Hence, we also suspect that the gender manipulation may have been more strongly influenced by social desirability than the behavior manipulation. Future studies may apply a more subtle way to manipulate the lecturer's gender, perhaps by using a photo of the lecturer.
There were some inconsistencies between the results found in Experiments 1 and 2, which were probably due to the non-balanced valence of the descriptions used in Experiment 1. From a student perspective, a lecturer who is engaged with the teaching, caring, and responsive should receive higher ratings. Therefore, the results from Experiment 2 are more informative. It would be beneficial to develop the descriptions further, for instance by describing a lecturer as having both feminine and masculine behaviors. We believe this to be important knowledge for all researchers conducting this kind of text-based experiment.

Conclusions
The present study showed that behavior and characteristics seem to trump the lecturer's gender in SET, at least in this kind of relatively artificial experimental setting. This result could be interpreted as a positive outcome, since evaluations are based on behavior rather than on the gender of the lecturer. Nonetheless, the evaluations of behavior follow gender stereotypes: a lecturer described as showing masculine behavior was also seen as possessing characteristics such as competence and professionalism, whereas a lecturer described as showing feminine behavior was also seen as possessing characteristics such as being caring and nurturing. In this way, the results of this research align with social psychological theory on gender stereotypes (Eagly and Wood, 2012; Abele and Wojciszke, 2014).
However, the participants displayed somewhat contradictory responses in that they liked the caring and nurturing (i.e., the feminine) lecturer better, although they gave the masculine lecturer higher ratings on work performance. This finding is problematic, because it leaves the individual lecturer in a difficult situation. Should a lecturer focus on being professional and making sure that students actually learn, or should they be accommodating and responsive, which results in being liked and increases students' desire to attend the course? Therefore, these kinds of results should be communicated not only to lecturers, but also to students, so they can be aware of their own biases. The finding contributes to the ongoing discussion about the validity of SET in judging individual lecturers' pedagogical skills (Yunker and Yunker, 2003; Wright, 2006; Beecham, 2009; Hoefer et al., 2012; Spooren et al., 2013; Braga et al., 2014; Stark and Freishtat, 2014; Boring et al., 2016). Given the results of the present study, there is an urgent need to develop reliable and valid measures of SET. To some extent, the ratings seem to fall along two dimensions, where, for instance, the lecturer's availability and ability to encourage may not necessarily go along with their pedagogical skills, such as course set-up, materials, and leadership. We therefore join the scholars before us in raising critical voices regarding the use of SET in their current form as the main tool for assessing lecturers' pedagogical skills and abilities, for instance in hiring or promotion decisions. If SET are to be used for such purposes, they should be further developed and validated to better capture the actual ability of a lecturer and not reflect popularity or biases. For instance, collegial evaluations, exam results, or performance in subsequent courses could be used to validate SET, and comprise part of the evaluation of a lecturer's competence.
However, it is important to remember that SET were introduced for formative purposes, that is, to improve teaching and student relations (Hornstein, 2016). SET may be better suited to that purpose. It is important that teachers and students share a common goal in the teaching process and that the student perspective is present when courses are developed.
We believe that two main outcomes of this article should be highlighted. First, this is, to our knowledge, the first attempt to make causal inferences regarding the mechanism behind gender biases in SET using a strict experimental paradigm. Second, we find that gender information does not seem to evoke negative evaluations of women lecturers on a general level. Moreover, gender-incongruent behavior is not sanctioned by lower SET. However, students' ratings are somewhat contradictory in that they prefer a lecturer whom they see as less competent and less pedagogically skilled. This could leave individual lecturers in a difficult position.

DATA AVAILABILITY STATEMENT
The original contributions presented in the study are publicly available. This data can be found at the Open Science Framework: https://osf.io/sfcym/.

ETHICS STATEMENT
Ethical review and approval was not required for the study on human participants in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study.

AUTHOR CONTRIBUTIONS
ER contributed to the general idea, first design, analyses, and manuscript drafts. MG and AL contributed to finalizing the design, continuous discussions about methods and results, and finalizing the manuscript. All authors contributed to the article and approved the submitted version.