What Does the Research Say about Student Course Evaluations?

Research on SETs is extensive. As early as 1987, higher education scholars noted that student evaluation of teaching was one of the topics most emphasized in American educational research.9 First, what do scholars tell us about why SETs are so widely used? Typically, SETs are used for two sometimes conflicting purposes: a formative purpose—to provide you, the instructor, with feedback to improve your teaching—and a summative one—to provide colleagues and administrators with data to inform decisions about reappointment, promotion, and tenure. Yet this single survey instrument is now often asked to serve even more varied roles for even more varied individuals: at institutions where the results are made available to students, SET data can help students with course selection. They also inform research on instruction in higher education.10 Additionally, a well-designed SET has the potential to prompt students’ own critical reflection on their engagement with learning, on what they have learned, and on their own contributions to the class. Perhaps most importantly—and often overlooked in the controversy surrounding the validity of SET data—is the point we’ve made previously: these recurring surveys provide a mechanism through which students can have their voices heard and their experiences in class recognized. In fact, SETs originated as a way for students to share their experiences in the classroom.11

Help! My Job Depends on My Student Ratings

For faculty who are not in a tenured or tenure-track role—whether on annual contracts or in part-time positions—SET data may determine whether your contract is renewed or whether you are assigned classes to teach. If you’re unsure how your institution uses SET data, we strongly advise you to ask. By distilling some of the research, we seek to inform and empower you to minimize the data’s negative consequences. In addition to the ideas in this unit, we’d offer three recommendations to all faculty who have a lot at stake and/or are anxious about their SET results: (1) work closely with your supervisor when making changes to your practice, (2) use the SET questions to survey your students a few weeks into the term, and (3) monitor your results carefully.

Because supervisors make and/or support personnel decisions, it’s helpful to let them know what adjustments you plan to make and why. Coauthor Isis Artze-Vega used the last two strategies when she was a new faculty member, knowing that a great deal depended on her student responses. At the midpoint of every term, she surveyed her students using the exact questions from her university’s SETs form. She discussed the results with her classes and identified refinements she could make long before her students received and completed the official SETs. Whether because of these adjustments or the message to students that their voices mattered, students consistently provided higher ratings in the official SETs.

Isis also plotted her results on a graph and monitored changes over time. This graph was helpful when her program director pointed out a low student response on a question; she was able to show that, over time, her scores on all of the SET questions had consistently increased, and she was able to describe how she planned to respond to the data.
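If you keep your ratings in a simple table or spreadsheet, a few lines of code can produce this kind of trend graph. The sketch below is a minimal example in Python using matplotlib; the terms, question labels, and scores are hypothetical placeholders, and you would substitute your own institution's SET questions and your actual results.

```python
# Minimal sketch: plot mean SET ratings per question across terms.
# All labels and numbers below are hypothetical placeholders.
import matplotlib.pyplot as plt

terms = ["Fall 2021", "Spring 2022", "Fall 2022", "Spring 2023"]
ratings = {
    "Organization and preparation": [3.8, 4.0, 4.2, 4.4],
    "Timeliness of feedback":       [3.5, 3.9, 4.1, 4.3],
    "Communication of ideas":       [4.0, 4.1, 4.2, 4.3],
}

fig, ax = plt.subplots(figsize=(7, 4))
for question, scores in ratings.items():
    ax.plot(terms, scores, marker="o", label=question)

ax.set_ylim(1, 5)              # assumes a 5-point rating scale
ax.set_ylabel("Mean rating")
ax.set_title("SET ratings by question, over time")
ax.legend(loc="lower right", fontsize=8)
fig.tight_layout()
fig.savefig("set_trends.png")  # or plt.show() to view interactively
```

A chart like this makes it easy to answer a pointed question about any single term by showing the longer trend, as Isis did with her program director.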

The SETs debate has primarily centered on the summative purpose of SETs, that is, institutions’ use of evaluation results, sometimes exclusively, in making decisions about faculty reappointment, promotion, and tenure. This singular focus on SET results likely explains why, for over forty years, researchers have questioned the use of SETs as a reliable measure of teaching effectiveness on the grounds that such ratings are biased by variables such as gender, race, and even academic discipline.12 Many faculty and scholars from across disciplines have been compelled to study these surveys because they’re a recurring, consequential part of their professional lives.

Given the intensity of this debate, we want to state directly that we find the use of SET data for summative decision making both problematic and inequitable. Teaching is a multifaceted and complex endeavor, such that teaching effectiveness cannot be reduced to any one data source. Instead, we concur with scholars and practitioners who argue for comprehensive teaching evaluation that uses multiple sources of data and perspectives. For instance, Justin Esarey and Natalie Valdes’s SETs analysis concludes that “evaluating instruction using multiple imperfect measures, including but not limited to SETs, can produce a fairer and more useful result compared to using SETs alone.”13 These multiple measures ideally include the perspectives of faculty themselves, of their peers, and of their students, all in alignment with an institutionally defined vision or framework of teaching excellence.

For the aims of this guide and research overview, however, we shift our attention away from the use of SET data for administrative decision-making. Instead, we focus on how each of us, as faculty, can engage with and learn from our students’ responses. Unpacking the results of SETs research, much of which reexamines previous studies, can be complex—almost as complex as the many factors that seem to influence the results of the SETs themselves. We therefore organize our research summary into two parts: (1) studies on the relationship between SET data and student learning and success, and (2) the most recent research on the topic of bias in SETs, including suggestions for how such bias can be mitigated and strategies for using such feedback ethically and effectively. Our brief summary of studies of bias in SETs isolates the variables of instructor gender, race, and ethnicity, as well as other confounding factors.14

Pause to Consider

  • What are some of the main reasons or motivations for you to make changes to your courses year over year (policy changes, student data/feedback, program assessment, professional development, etc.)?
  • Which of the questions in your institution’s SETs survey are most meaningful to you, and why?

Do SETs Give Us Insight into “Teaching Effectiveness” and Student Learning?

Educational theorists Thomas Angelo and Patricia Cross explain that when someone asks, “Are student ratings valid?” what they’re really asking is “Are students really good judges of effective teaching?”15 Our response, after engaging with the research, is yes and no. Yes, of course, students can provide us with great insights into their educational experiences that can be essential to improving our teaching. Michael Scriven, an expert in evaluation, writes that “students are the most frequent observers of many facets of teaching and learning and their collective opinions as witnesses can provide useful information, particularly when they are asked to observe specific behaviors or materials.”16 Weimer concurs: “[Students] are there for the course from start to finish; their experience is first-hand and fresh. They can say better than anyone else whether the course design and teacher actions motivated and expedited their learning.”17

Data about this front-row-seat experience may be even more important in online classes for at least two reasons: (1) students may have even fewer opportunities to describe their experiences, and (2) in asynchronous courses, most faculty lack access to the nonverbal feedback we rely on from students in in-person courses: their body language, the tone of their responses, where they sit, and so on.

Then again, as Scriven notes, students’ opinions can be useful “particularly when they are asked to observe specific behaviors or materials.” In other words, the quality of the data we gather depends considerably on the questions we ask in our SET surveys. Too often, SET questions focus on students’ evaluation of the instructor, not on their own experiences in the instructor’s class. For instance, in their SETs study, business professors Karen Loveland and John Loveland identify ten factors commonly measured by traditional student evaluation forms, including criteria such as the instructor’s knowledge of the subject, communication skills, enthusiasm, organization and preparation, timeliness of feedback, and fairness in grading. Almost all of these factors focus on a faculty member’s teaching and knowledge, which are only tangentially related to students’ own learning and their experience of being in that class.18

This disconnect between the questions we pose to students in SETs and students’ direct experiences suggests that a key challenge associated with SETs validity is the implication that students are “evaluating” faculty and their teaching. Students may or may not be able to judge the effectiveness of a teacher, or even assess their own learning. Recognizing that the term student evaluations of teaching is a misnomer, institutions like Florida International University rebranded their SETs as SPOTs: student perceptions of teaching surveys. This new wording makes it clearer to all parties involved that the data represent student views and perceptions, not their evaluations of teaching effectiveness, a much more complex task.

Turning to the relationship between SET results and learning, political science professors Rebecca Kreitzer and Jennie Sweet-Cushman’s extensive meta-analysis of bias in SETs leads them to affirm that “Student Evaluations of Teaching (SETs) have low or no correlation with learning,” and as such, “are poor metrics of student learning and are, at best, imperfect measures of instructor performance.”19 Although older studies had identified a correlation, Bob Uttl, Carmela White, and Daniela Wong Gonzalez’s recent meta-analysis of teaching effectiveness reanalyzes the data and finds no significant correlations between SET ratings and learning.20

Bias Affects SET Results

The most recent critiques of SETs focus on the evaluations’ inherent bias, particularly related to the personal or social identity of the instructor. Kreitzer and Sweet-Cushman, after reviewing more than one hundred articles on bias, state that there is little doubt that women and other historically marginalized groups face “significant biases in standard evaluations of teaching.”21 In addition, the effect of gender is conditional on a host of other factors, such as discipline, course characteristics, gender expectations, and the students’ political disposition, as well as the instructor’s sexual orientation, accent, and so on. Here, we synthesize findings associated with gender, race, and ethnicity, and a few additional factors (including course modality).

Sources of Bias: Gender. The research leaves little doubt that most SETs are subject to gender bias, and some critics suggest that such bias “can be large enough to cause more effective instructors to get lower SET than less effective instructors.”22 Study after study demonstrates “a multitude of ways that men benefit from evaluation, while women do not fare as positively.”23 Bias has also been identified in relation to how instructors “perform” their gender and meet students’ gender expectations. For example, in a comprehensive study of qualitative comments in SETs, researcher Sophie Adams and her colleagues argue that “student evaluations of teaching seem to measure conformity with gendered expectations rather than teaching quality, with particularly negative effects for women.”24 They conclude “that male-identified teachers are more likely to receive positive evaluations than female-identified teachers, with the ‘male effect’ being particularly strong in particular disciplines—greatest in the natural sciences, lowest in the humanities, with the social sciences being mixed—and stronger amongst male students.”25 Meanwhile, Kreitzer and Sweet-Cushman synthesize a variety of specific ways in which students’ perceptions of male- versus female-identified faculty differ:

Disparate research demonstrates that men are perceived as more accurate in their teaching, have higher levels of education, are less sexist, more enthusiastic, competent, organized, professional, effective, easier to understand, prompt in providing feedback, and are less-harshly penalized for being tough graders.26

This varied list suggests that for many students, the image of the “college professor” continues to be male.

Sources of Bias: Race and Ethnicity. Compared to the number of studies on SETs and gender, there are significantly fewer studies on bias in SETs related to the perceived race and ethnicity of the faculty member, but most of the research suggests that such bias exists. Using student feedback from RateMyProfessors.com for twenty-five of the top liberal arts colleges, Langdon Reid finds that in areas related to overall quality, helpfulness, and clarity, faculty perceived to be from racially minoritized groups—particularly Black and Asian—were evaluated more negatively than those perceived to be White. Minoritized faculty, however, were rated “easier” than White faculty. Reid does not find a strong gender effect but notes that “Black male faculty were rated more negatively than other faculty.”27

In another study, Mara Aruguete and colleagues examine the effects of race and clothing style on student evaluations and find that students—both Black and White—rated Black professors less favorably than White professors. Students also had more trust in Black professors who dressed more formally, whereas they had more trust in White professors who dressed more casually. The researchers suggest that Black professors have to “exert more personal effort to attain the favorable evaluations that seem to come more naturally to White professors.”28 Similarly, Kreitzer and Sweet-Cushman conclude that “SETs disproportionately penalize faculty who are already marginalized by their status as minority members of their disciplines.”29 Finally, psychologists Susan Basow, Stephanie Codos, and Julie Martin examined race, gender, student ratings, and learning by showing students an animated lecture given by both male- and female-appearing professors who appeared either White or Black. The students were then given a quiz to evaluate how much attention they had paid to the lecture. The data showed, among other things, that the animated Black “professors” were rated higher than their White counterparts on their hypothetical interactions with students. However, the quiz scores indicated that students who had the White “professors” scored higher, perhaps because the students paid closer attention to the lecture.30

Additional Sources of Bias and Variability, Including Course Modality. Kreitzer and Sweet-Cushman’s review of research suggests that evaluations are influenced by myriad intersecting factors, including “discipline, student interest, class level, class difficulty, meeting time, and other course-specific characteristics, but not generally instructor quality.”31 Citing gender, discipline, and other factors that affect bias, researchers Anne Boring, Kellie Ottoboni, and Philip B. Stark believe that “it is not possible to adjust for the bias, because it depends on so many factors.”32

An additional confounding factor is the modality of the course being taught—whether fully online (asynchronous) or in person. The limited research related to online instruction is inconclusive. For example, in 2006, Alfred Rovai and his educational research colleagues found a significant difference between how students evaluated online versus in-person courses.33 However, in a second study, published a year later in the same journal, Henry Kelly and his coresearchers found similar proportions of positive and negative open-ended comments in courses taught by the same instructor, one online and one in person. The topics of these comments, however, differed in proportion between the online and in-person courses: comments about online courses focused more on issues of organization and materials, whereas those about in-person courses focused more on the instructor’s knowledge.34

Professors Loveland and Loveland offer perhaps the most helpful and interesting analysis of potential bias in SETs for online relative to in-person courses. As noted previously, they identify ten general criteria for effective teaching that are common across many SETs, and they argue that the same criteria apply across modalities. They therefore attribute variations in student responses to the influence of the course modality. Loveland and Loveland find that the instructor’s writing is more important to students online than in person and influences the evaluation of criteria such as “teaching effectiveness,” “knowledge of the subject,” and “rapport with students.”35 You may want to refer back to Unit 3, where we explore the literature on warm tone in your syllabus; we also encourage you to use a warm tone in your written online class materials (e.g., instructions, mini-text lectures, rubric criteria, and discussion prompts).

At this point, you may be wondering how taking an equity-minded approach to teaching might shift your student ratings results. We want to be candid: although limited data exist, some have suggested that such teaching could, in fact, lead to a decline in SETs.36 We think it’s likely this effect will vary based on the specific changes you make to your practice. For instance, adding transparency to the design of your assignments or being more intentional about your students’ sense of belonging could reasonably result in improved SETs, whereas teaching topics related to social justice and other relevant issues could elicit mixed responses from students, especially if you’re teaching these topics for the first time.

As with any significant changes to your teaching, it’s a good idea to talk to whoever conducts your annual evaluation ahead of time, to let them know how you plan to adjust your teaching and why. In general, if you suspect your SETs will be lower because you are bringing new practices into the course, we encourage you to be more transparent with your students about your practices (i.e., clarifying to students why you are making a certain change/adjustment, and how you intend for it to help their learning and/or success). Relatedly, consider how you might cultivate a class culture in which student voices are sought and valued consistently, as opposed to only at the end of the term. The ideas in Unit 9 should help!

Endnotes

9. Herbert W. Marsh, “Students’ Evaluations of University Teaching: Research Findings, Methodological Issues and Directions for Future Research,” International Journal of Educational Research 11, no. 3 (1987): 253–388, https://doi.org/10.1016/0883-0355(87)90001-2.
10. Fadia Nasser and Knut Hagtvet, “Multilevel Analysis of the Effects of Student and Instructor/Course Characteristics on Student Ratings,” Research in Higher Education 47 (2006): 559–90, https://doi.org/10.1007/s11162-005-9007-y.
11. Jonathan Zimmerman, The Amateur Hour: A History of College Teaching in America (Baltimore: Johns Hopkins University Press, 2020).
12. Herbert W. Marsh, J. U. Overall, and Steven P. Kesler, “Validity of Student Evaluations of Instructional Effectiveness: A Comparison of Faculty Self-Evaluations and Evaluations by Their Students,” Journal of Educational Psychology 71, no. 2 (April 1979): 149–60, https://doi.org/10.1037/0022-0663.71.2.149.
13. Justin Esarey and Natalie Valdes, “Unbiased, Reliable, and Valid Student Evaluations Can Still Be Unfair,” Assessment & Evaluation in Higher Education 45, no. 8 (2020): 1106–20, https://doi.org/10.1080/02602938.2020.1724875.
14. For brevity’s sake, we are not attempting a comprehensive review of the research related to gender, race/ethnicity, and other identities of potential bias in SETs. Also, despite myriad studies, some potential areas for bias, such as ability/disability, gender conformity, sexuality, class, and nationality, need more exploration and study. A few of the more comprehensive surveys of research are Linse, “Interpreting and Using Student Ratings Data”; Kreitzer and Sweet-Cushman, “Evaluating Student Evaluations” (both cited later in this unit); and Stephen L. Benton and William E. Cashin, “Student Ratings of Teaching: A Summary of Research and Literature,” IDEA Paper, no. 50 (January 2, 2011).
15. Thomas A. Angelo and K. Patricia Cross, Classroom Assessment Techniques: A Handbook for College Teachers (San Francisco: Jossey-Bass, 1993), 317.
16. Michael Scriven, “Critical Issues in Faculty Evaluation: Valid Data and the Validity of Practice,” in Valid Faculty Evaluation Data: Are There Any? (Montreal: 2005 AERA Symposium, April 14, 2005), 7, http://www.cedanet.com/metA.
17. Weimer, Inspired College Teaching, 51.
18. Karen A. Loveland and John P. Loveland, “Student Evaluations of Online Classes versus On-Campus Classes,” Journal of Business and Economics Research 1, no. 4 (2003): 1–10.
19. Rebecca Kreitzer and Jennie Sweet-Cushman, “Evaluating Student Evaluations of Teaching: A Review of Measurement and Equity Bias in SETs and Recommendations for Ethical Reform,” Journal of Academic Ethics 20 (2022): 73–84, https://doi.org/10.1007/s10805-021-09400-w.
20. Bob Uttl, Carmela White, and Daniela Wong Gonzalez, “Meta-Analysis of Faculty’s Teaching Effectiveness: Student Evaluation of Teaching Ratings and Student Learning Are Not Related,” Studies in Educational Evaluation 54 (September 2017): 22–42.
21. Kreitzer and Sweet-Cushman, “Evaluating Student Evaluations,” 73.
22. Anne Boring, Kellie Ottoboni, and Philip Stark, “Student Evaluations of Teaching (Mostly) Do Not Measure Teaching Effectiveness,” ScienceOpen Research (January 2016): 10.
23. Kreitzer and Sweet-Cushman, “Evaluating Student Evaluations,” 76.
24. Sophie Adams et al., “Gender Bias in Student Evaluations of Teaching: ‘Punish[ing] Those Who Fail to Do Their Gender Right,’” Higher Education 83 (2022): 787–807, https://doi.org/10.1007/s10734-021-00704-9.
25. Adams et al., “Gender Bias in Student Evaluations,” 790.
26. Kreitzer and Sweet-Cushman, “Evaluating Student Evaluations,” 76.
27. Langdon D. Reid, “The Role of Perceived Race and Gender in the Evaluation of College Teaching on RateMyProfessors.com,” Journal of Diversity in Higher Education 3, no. 3 (September 2010): 137, https://doi.org/10.1037/a0019865.
28. Mara S. Aruguete, Joshua Slater, and Sekela R. Mwaikinda, “The Effects of Professors’ Race and Clothing Style on Student Evaluations,” Journal of Negro Education 86, no. 4 (Fall 2017): 499, https://doi.org/10.7709/jnegroeducation.86.4.0494.
29. Kreitzer and Sweet-Cushman, “Evaluating Student Evaluations,” 80.
30. Susan A. Basow, Stephanie Codos, and Julie L. Martin, “The Effects of Professors’ Race and Gender on Student Evaluations and Performance,” College Student Journal 47 (2013): 352–63.
31. Kreitzer and Sweet-Cushman, “Evaluating Student Evaluations,” 80.
32. Boring, Ottoboni, and Stark, “Student Evaluations of Teaching,” 1.
33. Alfred P. Rovai et al., “Student Evaluation of Teaching in the Virtual and Traditional Classrooms: A Comparative Analysis,” The Internet and Higher Education 9, no. 1 (2006): 23–35.
34. Henry F. Kelly, Michael K. Ponton, and Alfred P. Rovai, “A Comparison of Student Evaluations of Teaching between Online and Face-to-Face Courses,” The Internet and Higher Education 10, no. 2 (2007): 89–101.
35. Loveland and Loveland, “Student Evaluations of Online Classes,” 4.
36. Guy Boysen, “Student Evaluations of Teaching: Can Teaching Social Justice Negatively Affect One’s Career?” in Navigating Difficult Moments in Teaching Diversity and Social Justice, ed. Mary E. Kite, Kim A. Case, and Wendy R. Williams (Washington, DC: American Psychological Association, 2021), 235–46, https://doi.org/10.1037/0000216-017.