- 1 CENTRUM Católica Graduate Business School, Lima, Peru
- 2 Pontificia Universidad Católica del Perú, Lima, Peru
- 3 Institute for the Future of Education, Tecnologico de Monterrey, Monterrey, Mexico
This study examines the integration of Generative Artificial Intelligence (GenAI) into rubric-based qualitative assessment to strengthen authentic evaluation and complex thinking competencies in business education. Conducted at a graduate business school in Peru with 120 student portfolios, the research adopted a qualitative exploratory-documentary approach, complemented by a correlational analysis between human and AI evaluations to enhance the interpretive validity of the findings. A pilot subsample (n = 12) was used for the correlational analysis comparing human and AI-assisted assessments, while descriptive analyses of complex thinking levels were conducted on the full sample (n = 120) using the AI agent. The study employed the GPT-eComplex Assistant, a GenAI-based evaluator developed from ChatGPT and configured with the eComplex rubric, to identify recurring patterns, strengths, and areas for improvement without replacing human pedagogical judgment. The findings revealed that: (a) the GPT-eComplex Assistant closely mirrored human evaluators’ judgments, reinforcing consistency, traceability, and transparency, although still requiring teacher calibration; (b) the digital portfolio served as an authentic learning artifact to capture systemic and critical thinking, while showing limitations in the scientific dimension; (c) the use of GenAI for assessment in higher education remains incipient and under-researched, underscoring the need for empirical evidence and ethical guidelines; and (d) the responsible integration of GenAI demands active teacher mediation, ethical awareness, and institutional transparency, ensuring that technology complements rather than replaces pedagogical judgment. This study contributes a pedagogical strategy that combines formative portfolio assessment, AI-supported co-evaluation, and the AI-PROMPT Framework, offering a replicable model for embedding GenAI into authentic, reflective, and ethically grounded assessment practices in higher education. The results represent an innovative contribution by positioning GPT as a complementary tool in authentic assessment, reinforcing the central role of human judgment and opening new perspectives for AI-supported evaluation in the Ibero-American context and beyond.
1 Introduction
Cutting-edge Artificial Intelligence (AI) models, as noted by Giannini (2023), promise speed and efficiency but risk displacing the reflective pace essential for deep learning and human connection. Rather than fearing AI, educators must engage with it deliberately and ethically. As Horn (2023) observes, “AI is not coming for education—it is already here.” The key question is not whether we use it, but how and for whom. When guided by pedagogical intent, AI can help modernize classrooms and foster new and more inclusive models of learning.
In this context, Generative Artificial Intelligence (GenAI) has become an increasingly influential development in education. This technology relies on machine learning models that generate new data instead of merely predicting outcomes from existing sets. A generative AI system is designed to create new outputs that resemble the data on which it was trained (Zewe, 2023). Within this category fall Large Language Models (LLMs) such as ChatGPT, which generate natural language text in response to specific inputs—prompts, keywords, or questions. LLMs typically comprise millions or even billions of parameters trained on vast textual corpora, including books, scholarly articles, websites, and social media content. They can perform a wide range of linguistic tasks—answering questions, summarizing texts, composing essays, crafting captions, and generating narratives—while continuously refining their performance through iterative learning (Center for Teaching Innovation, Cornell University, 2025). Consequently, GenAI supports a remarkable diversity of applications, serving as a versatile aid across educational and professional domains.
The ability of GenAI to produce coherent and contextually relevant content carries profound implications across multiple sectors—particularly in education, media, and professional practice—and has sparked debate about its ethical use and its impact on human cognition and social structures (Çayir, 2023). One domain where its application holds notable promise is educational assessment. In particular, GenAI-supported online assessment can perform essential pedagogical functions in digital learning environments, such as scaffolding, real-time feedback, and adaptive learning (Ifenthaler and Schumacher, 2023).
This study seeks to demonstrate the potential of GenAI as a complementary tool for authentic assessment, while reaffirming the irreplaceable role of human judgment. Its ultimate goal is to inform the development of transparent and pedagogically sound guidelines for teacher training in this field. Specifically, the research presents a comparative analysis between evaluations generated by the GPT-eComplex Assistant—a chatbot developed from ChatGPT and configured with the eComplex rubric, designed to assess complex thinking competencies—and those conducted by human instructors using the same criteria. In doing so, the study addresses the current gap in research on the reliability, consistency, and pedagogical validity of AI-assisted qualitative evaluation, proposing a model that preserves the centrality of teacher judgment while encouraging the ethical and reflective use of technology in education.
This contribution is especially relevant to the Ibero-American context, where empirical studies on GenAI applications in authentic and formative assessment, particularly within higher education, remain scarce. By situating the analysis in this regional framework, the study aims to strengthen both empirical evidence and practical guidance for integrating AI in ways that are ethical, context-responsive, and pedagogically meaningful.
The specific objectives of the study are to:
• Examine the consistency of the GPT-eComplex Assistant in applying the eComplex rubric across the four dimensions of complex thinking—systemic, scientific, critical, and innovative.
• Determine the degree of agreement between human and AI-generated evaluations.
• Identify recurring strengths and shared areas for improvement that the GPT-eComplex Assistant detects in student portfolios.
• Reflect on the pedagogical value and ethical challenges of integrating GenAI into rubric-based qualitative assessment in Ibero-American higher education.
• Provide practical recommendations on transparency, data protection, and teacher training for the responsible use of AI-assisted assessment.
This study represents an innovation in the use of ChatGPT as a supportive evaluation agent in higher education settings across Ibero-America, employing digital portfolios as tools for formative and authentic assessment. Unlike earlier studies focused on essays or exams, this research applies the eComplex rubric to evaluate complex-thinking competencies while ensuring that final evaluative decisions remain under the educator’s responsibility. The findings are intended to assist teachers, instructional designers, and academic leaders by providing evidence-based and ethically grounded guidance for incorporating GenAI into the assessment of complex thinking—thereby enhancing teacher development, transparency in evaluation, and evidence-driven educational innovation.
Based on the objectives outlined above, this study is guided by the following research questions:
RQ1. To what extent does the GPT-eComplex Assistant apply the eComplex rubric consistently across the four dimensions of complex thinking (systemic, scientific, critical, and innovative)?
RQ2. What level of agreement exists between human evaluators and the GPT-eComplex Assistant when assessing student portfolios using the eComplex rubric?
RQ3. What recurring strengths and areas for improvement in complex thinking are identified in student portfolios through AI-assisted analysis?
RQ4. What pedagogical and ethical implications emerge from integrating generative AI into rubric-based qualitative assessment in higher education?
2 Literature review
2.1 Formative and authentic assessment in higher education
Assessment in higher education has evolved toward approaches that seek not only to measure outcomes but also to accompany the learning process. Within this framework, formative assessment is understood as a continuous process that monitors students’ progress and provides ongoing feedback to guide improvements in performance (Inman and Roberts, 2021). This approach involves the design of meaningful tasks that encourage reflective questioning, foster self-regulated learning, and are integrated throughout the teaching–learning process, thereby empowering students (Šteh and Šarić, 2020).
Recent literature highlights multiple benefits of formative assessment across diverse contexts: it enhances student engagement and the quality of feedback (Ma et al., 2023), supports self-regulation (Yan, 2024), and contributes to professional growth (Wylie and Lyon, 2015). However, relatively few studies have examined its effectiveness in higher education, a limitation often attributed to the ambiguous conceptualization of formative assessment (Šteh and Šarić, 2020). Moreover, the integration of digital tools requires rigorous implementation and clear alignment with educational objectives (Ma et al., 2023). Within this scenario, authentic assessment has emerged as a complementary approach that seeks to situate evaluation within more meaningful, real-world contexts.
Authentic assessment aims to replicate real-world tasks and performance standards. It engages students in activities that reflect professional environments (Sokhanvar et al., 2021), while enhancing learning experiences through the development of transversal skills such as communication, collaboration, critical thinking, and problem-solving (Villarroel et al., 2017). The integration of formative and authentic assessment can enrich learning by combining continuous feedback with engagement in realistic tasks (Gikandi, 2021). This approach promotes higher-order cognitive skills and prepares students for professional practice.
Authentic assessment also brings particular learning products to the fore, most notably portfolios. Digital portfolios have become effective tools within authentic assessment frameworks, as they foster critical thinking and ownership of learning (Sultana et al., 2020). Their authenticity is essential, as they require students to present a genuine account of their learning, context, and professional identity (Trevitt and Stocks, 2011), while providing evidence of the progressive development of competencies (Erumeda et al., 2024). Recent studies emphasize the potential of portfolios to become more meaningful and engaging strategies for students (Ajjawi et al., 2023; Wake et al., 2023). These findings reveal emerging trends and suggest opportunities for adaptation within Spanish-speaking higher education systems.
In this broader landscape, assessment has become an essential component for strengthening educational quality, particularly in developing countries (Leal Uhlig et al., 2023). The international literature presents contrasting perspectives: in Anglo-Saxon and non-Hispanic European contexts, portfolios are more institutionalized and digitized, with a strong emphasis on employability, professional accreditation, and the development of higher-order competencies (Dave and Mitchell, 2025; Mudau, 2022). In contrast, within the Ibero-American context, encompassing Spain and Latin American countries, portfolios are valued as instruments that promote reflection, self-regulation, and connection with real learning environments (Harada, 2020; Trejo González, 2024). Yet, their implementation in this region tends to be sporadic and dependent on individual faculty initiatives, while tensions persist between innovative practices and traditional certification-based models. Discrepancies between student and teacher perceptions also remain, underscoring the need to strengthen assessment literacy and institutional culture in the region (Gallardo-Fuentes et al., 2025).
These similarities and contrasts in the implementation of formative and authentic assessment—along with the use of portfolios—highlight the need to explore innovative alternatives that can address persistent gaps, such as limited feedback and the persistence of traditional practices. Among these alternatives, Artificial Intelligence emerges as a promising resource to expand, personalize, and make assessment processes more sustainable in higher education.
In this study, the evaluation process is grounded in competency-based education, an approach that guides teaching and learning toward the integrated development of knowledge, skills, attitudes, and values, with the aim of enabling students to mobilize and articulate various resources to act effectively in real and complex situations in specific contexts. From this perspective, a competency implies the ability to face problems through reflective and contextualized action, beyond the mere accumulation of content (Perrenoud, 2004). A competency rests on underlying characteristics of the individual that are causally related to effective or superior performance, integrating knowledge, know-how, and interpersonal skills (Spencer and Spencer, 1993), and is oriented toward relevant, ethical, and quality performance in resolving contextual problems, promoting comprehensive training linked to personal, social, and professional development (Tobón, 2013).
For this research, a rubric was considered the appropriate tool for assessing competencies. Rubrics are well suited to measuring competencies because they allow for the systematic, transparent, and criteria-based evaluation of complex performances, aligning evaluation criteria with the expected levels of achievement in a competency. In this sense, rubrics facilitate the assessment of authentic performance by explicitly describing quality indicators and degrees of mastery, which contributes both to the objectivity of the assessment and to formative feedback (Brookhart, 2013). Likewise, by focusing on observable performance and the integration of knowledge, skills, and attitudes, rubrics become a key tool for competency-based assessment, as they guide learning, promote self-regulation, and strengthen coherence between teaching, learning, and assessment (Tobón, 2013).
2.2 Generative artificial intelligence and authentic assessment
Since the release of ChatGPT in 2022, Generative Artificial Intelligence (GenAI) has become both a challenge and an opportunity for educators and institutions, particularly regarding learning assessment and competency development. One of the initial responses has been to return to traditional assessment methods—such as in-person exams or oral presentations—yet these strategies often provide only a false sense of security in the face of the growing use of GenAI tools by both students and teachers (Perkins and Roe, 2023). Moreover, assessments such as multiple-choice tests or written essays present significant limitations, as they tend to measure surface-level learning and lack authenticity and contextual relevance (Cope et al., 2025).
This situation calls for rethinking assessment, shifting the focus from products that AI can easily generate toward processes that evaluate self-regulated learning, connection to real contexts, and the student’s reflective intervention in an ethical and transparent manner (Cope et al., 2025). Beyond ensuring that teachers are prepared to enhance their evaluation methods, it is essential to guarantee that AI serves as a support tool rather than a substitute for pedagogical judgment, maintaining human responsibility in decisions related to learning outcomes (UNESCO, 2025). Consequently, there is a need to explore concrete experiences in which GenAI has been applied to assessment in higher education, in order to identify the scope and limitations of these emerging practices.
The integration of GenAI into authentic assessment has begun to be explored in various higher education contexts, revealing both significant progress and persistent challenges. In a postgraduate psychology program in the United Kingdom, Martin et al. (2025) designed an AI co-created assessment in which students developed academic blogs assisted by generative tools, combining automated production, critical review, and ethical reflection. This approach fostered AI literacy and the ability to discern between human and machine-generated content, although difficulties remained in maintaining academic integrity and identifying text provenance. In the field of economics, Nguyen Thanh et al. (2023) applied Bloom’s Taxonomy to examine how different models—ChatGPT, Bard, and Bing—performed authentic analytical tasks. Their results indicated that GenAI excels at lower cognitive levels (remembering and understanding) but performs less effectively at higher ones (analyzing, evaluating, and creating), thus exposing limitations in promoting critical and creative thinking.
Similarly, Salinas-Navarro et al. (2024) approached GenAI from an experiential learning perspective, proposing it as a pedagogical support agent capable of enhancing constructive alignment by assisting in the formulation of learning outcomes, the creation of experiential activities, and the facilitation of authentic tasks. Collectively, these studies highlight the potential of GenAI to improve feedback, personalization, and reflective learning, while also emphasizing ethical, equity, and reliability challenges related to authorship, quality of responses, teacher workload, and the need to design transparent evaluation frameworks centered on human agency.
Recent empirical studies have begun to explore the potential of ChatGPT as an evaluative agent for academic work in higher education, revealing notable advances alongside methodological and ethical limitations. In a university course context, Flodén (2025) compared grades assigned by ChatGPT-3.5 with those of human instructors in written exams, finding a high level of discrepancy (70%) and a tendency of the model to assign mid-range scores—raising concerns about reliability without precise rubrics and teacher supervision. In contrast, Jauhiainen and Garagorry Guerra (2025) implemented a structured evaluation process using ChatGPT-4 to analyze open-ended responses across multiple subjects. Through advanced prompting and a three-phase procedure (pre-evaluation, evaluation, and post-evaluation), the AI was able to replicate grading criteria and generate coherent feedback, reducing interpretive errors. However, the authors emphasized the need to verify model fidelity and prevent “hallucinations” before adopting it as a grading assistant.
Along similar lines, Manning et al. (2025) compared the assessments of ChatGPT-3.5 and GPT-4 with those of three human evaluators in annotated bibliography tasks, observing high internal consistency in the AI’s scoring but low agreement with human judgment, as ChatGPT tended to assign more generous grades. Taken together, these studies converge on the idea that AI can improve efficiency and consistency in assessment processes but cannot replace the non-delegable human responsibility inherent in educational judgment (Perkins and Roe, 2023).
According to UNESCO (2025), integrating AI into assessment requires promoting transparency, digital literacy, and teacher agency, encouraging a conscious and ethical use of technology. Within this framework, it becomes relevant to explore how the GPT-eComplex Assistant—an evaluation agent developed from ChatGPT and configured with the eComplex rubric—can support teachers in the assessment of digital portfolios, particularly in Ibero-American contexts where empirical evidence remains limited. This approach not only advances understanding of the evaluative potential of GenAI but also examines its contribution to the development of complex thinking as a core competency in higher education.
2.3 Complex thinking as a competency in higher education
Complex thinking is a fundamental competency in today’s world, where immediacy and fragmented perspectives often undermine students’ ability to reflect deeply. It functions as a metacompetency composed of four interrelated forms of thinking: systemic, scientific, critical, and innovative (Castillo-Martínez et al., 2024). Systemic thinking enables individuals to address problems by interpreting data from multiple scientific fields and determining the relevance of each element within a system through the analysis of interconnected datasets (Izvorska, 2016; Jaaron and Backhouse, 2018). Scientific thinking allows individuals to question reality objectively and solve problems using valid and reliable methodologies. It involves analyzing data to assess veracity and encompasses cognitive processes such as inductive and deductive reasoning, problem-solving, and hypothesis formulation and testing (Koerber et al., 2015; Suryansyah et al., 2021; Zimmerman and Croker, 2014).
Critical thinking enables individuals to analyze information and situations objectively, recognize multiple perspectives regarding the same issue, identify fallacies, and construct sound arguments based on evidence (Leighton et al., 2021). This capacity is essential for making informed and well-reasoned decisions (Hapsari, 2016). Innovative thinking supports the creation of creative and original solutions to identified problems (Qosimova, 2022). These four types of thinking should be addressed holistically to achieve a comprehensive understanding that fosters innovative solutions to complex problems through deep analysis of contextual characteristics and needs, always grounded in robust scientific evidence.
Over the past 5 years, studies focused on measuring complex thinking have increased considerably. Research has been conducted in business and economics (Portuguez-Castro and Ramírez-Montoya, 2025; Ramírez-Montoya and Portuguez-Castro, 2024; Elsner, 2025), entrepreneurship (Alonso-Galicia et al., 2025; George-Reyes and Oliva-Córdova, 2025), health sciences (Suárez-Brito et al., 2024), and gender studies (Suárez-Brito et al., 2025), as well as in comparative analyses across disciplines (Vázquez-Parra et al., 2024a) and in the evaluation of educational platforms and models designed to promote complex thinking (Alvarez-Icaza et al., 2025), among others. Collectively, these studies reaffirm complex thinking as a cross-cutting competency that is essential for addressing present and future societal challenges.
One instrument widely used in the aforementioned studies is the eComplexity Questionnaire, a Likert-scale tool that measures individuals’ self-perceived mastery of complex thinking across its four dimensions—systemic, scientific, critical, and innovative—through 25 items (Vázquez-Parra et al., 2024b). This instrument served as the foundation for designing the eComplex rubric, which in turn provided the analytical framework for the GPT-eComplex Assistant. This GenAI-based assistant was developed to evaluate portfolios of graduate students in business education programs. Unlike self-perception instruments, the eComplex rubric enables the assessment of actual demonstrated mastery of complex thinking, offering a more direct and evidence-based approach to measuring this metacompetency.
In this context, prompt engineering plays a critical methodological role in translating the eComplex rubric into an operational AI-assisted assessment process. Rather than functioning as a purely technical configuration, prompt engineering structures how evaluative criteria are interpreted, how evidence is extracted from student portfolios, and how formative feedback is generated in a transparent and traceable manner. Prior research highlights that well-designed prompts are essential for aligning generative AI systems with pedagogical objectives and for preserving human judgment in educational assessment (Lee and Palmer, 2025). In the present study, prompt engineering constitutes the mechanism that enables the systematic, auditable, and responsible application of the eComplex rubric through the GPT-eComplex Assistant.
3 Methodology
This study adopted a qualitative exploratory–documentary design focused on the analysis of student portfolios assessed using the eComplex rubric. This qualitative approach was complemented by a correlational statistical analysis comparing human expert scores with those generated by the GPT-eComplex Assistant, with the specific purpose of examining alignment and consistency between human and AI-assisted evaluations and strengthening the validity of the assessment process.
For the assessment of student work, the GPT-eComplex Assistant—a GenAI-based evaluative agent developed from ChatGPT and configured according to the criteria of the eComplex rubric—was employed. Its use made it possible to identify recurring patterns, strengths, and areas for improvement, without replacing human pedagogical judgment.
Inferential statistics were incorporated as a validation mechanism within the overall qualitative design, allowing for a systematic comparison between human and AI-generated assessments in a controlled subsample. This step served to assess the degree of agreement between evaluators before extending the AI-assisted analysis to the full dataset. Subsequently, descriptive analyses were applied to the complete sample to explore broader trends in the development of complex thinking competencies.
This methodological approach aligns with the authentic assessment paradigm, which emphasizes the evaluation of competencies in meaningful and contextually grounded situations. Finally, triangulation with human instructors’ evaluations sought to ensure the interpretive validity of the findings and to explore the potential and limitations of GenAI in formative assessment processes within business education in the Ibero-American context.
3.1 Research context
The study was conducted within an Ibero-American higher education context, specifically at a graduate business school in Peru. The institution operates under the Positive Business Impact Model, which integrates the dimensions of purpose, planet, people, and prosperity to guide business education toward inclusion and innovability—the integration of sustainability into innovation processes and new business models (Portuguez-Castro and Castillo Martínez, 2025). Within this framework, students develop thesis projects conceived as entrepreneurial initiatives with positive impact, aligned with the Sustainable Development Goals (SDGs).
As part of this learning process, an individual portfolio was selected as the primary learning artifact and used as an authentic assessment instrument. Its purpose was to promote critical reflection on each student’s thesis project, considering the proposed solution, its relationship with the SDGs, and the role of corporate responsibility in social transformation. The portfolio was structured into six sections: (1) Personal introduction; (2) problem description and its connection to the SDGs; (3) analysis of the proposed solution; (4) proposed adjustments; (5) self-reflection across three dimensions—personal, professional, and civic; and (6) critical commentary on corporate responsibility.
To complement the evaluation, the portfolios were analyzed using the GPT-eComplex Assistant, configured as an AI-based evaluation agent. Its role was to generate feedback grounded in textual evidence and aligned with the criteria of the eComplex rubric, which measures four dimensions of complex thinking: systemic, scientific, critical, and innovative. The GenAI did not assign final grades; instead, it provided evidence-based insights for comparison with human evaluations, allowing for the calibration of student performance and the exploration of consistency between automated and human judgment within the framework of formative and authentic assessment.
3.2 Procedure
The research procedure was structured into four phases.
In the configuration phase, the GPT-eComplex Assistant was programmed. At this stage, a corpus of 120 portfolios was identified, produced by students from 17 cohorts enrolled in the courses Innovation and Sustainability, Innovative and Sustainable Business Ventures, and Disruptive Technological Innovation and New Business Models, conducted between July 2024 and July 2025. From this corpus, a 10% sample (12 portfolios) was selected to be evaluated first by two instructors and subsequently by the GPT-eComplex Assistant in a single run. This selection served as an exploratory calibration phase, aimed at verifying the initial consistency of the agent before applying it to the full sample; therefore, the results of this phase were not considered generalizable.
During the testing phase, a quality-control checklist was applied to verify whether the agent’s outputs adhered to the prompt instructions and maintained alignment with the rubric criteria. In the comparison phase, the evaluations produced by the instructors and those generated by the GPT-eComplex Assistant were contrasted to identify convergences, discrepancies, and areas for improvement. Based on these results, preliminary adjustments were made to reinforce the agent’s consistency prior to its full-scale application.
Finally, in the implementation phase, the agent analyzed all 120 portfolios to systematize patterns of strengths and areas for development in students’ complex thinking, within a formative and authentic assessment framework. These phases were designed and documented to inform the proposed AI-PROMPT Framework, ensuring traceability of evaluative criteria and evidence use throughout the process.
3.3 Instruments
3.3.1 eComplex rubric
The eComplex rubric is derived from a prior instrument called eComplexity (Castillo-Martínez and Ramírez-Montoya, 2022a). While the eComplexity instrument measures students’ self-perceived level of complex thinking, the eComplex rubric assesses the demonstrated mastery level through actual learning evidence. The rubric comprises 27 items distributed across the four dimensions of complex thinking: systemic (6 items), scientific (7 items), critical (9 items), and innovative (5 items). Each dimension is evaluated on three performance levels—basic, intermediate, and advanced—which reflect the progression in the articulation, depth, and application of complex thinking in real-world contexts. The full version of the eComplex rubric is provided in Appendix A.
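For readers who wish to operationalize the rubric computationally, the sketch below shows one hypothetical way to encode its structure; the dimension names, item counts, and performance levels follow the description above, while the variable names, layout, and score-to-level mapping are illustrative assumptions rather than part of the published instrument.

```python
# Hypothetical encoding of the eComplex rubric structure described above;
# dimension names, item counts, and level labels follow the text, the rest is illustrative.
ECOMPLEX_RUBRIC = {
    "systemic":   {"items": 6, "levels": ("basic", "intermediate", "advanced")},
    "scientific": {"items": 7, "levels": ("basic", "intermediate", "advanced")},
    "critical":   {"items": 9, "levels": ("basic", "intermediate", "advanced")},
    "innovative": {"items": 5, "levels": ("basic", "intermediate", "advanced")},
}

# The four dimensions together comprise the 27 items of the rubric.
assert sum(d["items"] for d in ECOMPLEX_RUBRIC.values()) == 27

# Assumed mapping between the ordinal scores (1-3) used in the analyses
# and the rubric's three performance levels.
LEVEL_BY_SCORE = {1: "basic", 2: "intermediate", 3: "advanced"}
```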
3.3.2 GPT-eComplex assistant—prompt-based design and configuration
To ensure the systematic, consistent, and traceable application of the eComplex rubric to student portfolios, the GPT-eComplex Assistant was designed and implemented. This agent was conceptualized as a generative artificial intelligence conversational model based on ChatGPT (OpenAI). Its role was strictly analytical and supportive, without issuing final grades, in order to preserve the professional judgment of instructors and maintain coherence with the principles of authentic assessment.
The design followed a prompt-engineering approach grounded in the recommendations of White et al. (2023) and the Vanderbilt University resources on prompt pattern design (Vanderbilt University, 2024; Vanderbilt University Libraries, 2024), complemented by insights from the open-access training courses Prompt Engineering for Everyone and Agentic AI available on Coursera (Vanderbilt University and Coursera, 2023a; Vanderbilt University and Coursera, 2023b). These sources propose modular structures and reusable patterns that support the creation of transparent, traceable, and ethically responsible educational agents. Specifically, three key prompt patterns were applied: Persona, to define the agent’s role and ethical boundaries; Template/Flow (Input–Process–Output), to organize the analytical sequence (rubric → evidence → feedback); and Cite Sources, to anchor each evaluative judgment in verifiable textual citations from the portfolio and a corresponding list of factual observations.
This methodological approach aligns with Vanderbilt’s guidelines on Prompt Patterns for Generative AI and established best practices in responsible educational AI design. The design was operationalized through an explicit prompt that: (a) anchored the analysis in verifiable portfolio evidence (with textual citations); (b) linked each observation to the descriptors of the eComplex rubric; (c) requested tentative performance levels (1–3) with criterion-based justification; and (d) generated actionable formative recommendations. The complete final prompt used to configure the GPT-eComplex Assistant is provided in Appendix B. The original prompt was developed and implemented in Spanish, as the student portfolios analyzed in this study were written in Spanish; the English version is presented for transparency and replicability.
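To illustrate how the three prompt patterns described above (Persona, Template/Flow, and Cite Sources) fit together, the following is a minimal sketch of a system-prompt skeleton wired to the OpenAI Python SDK. It is not the study's actual prompt (the complete prompt is provided in Appendix B), and the study's agent was configured as a custom ChatGPT assistant rather than through API calls; the model name, wording, and function shown here are assumptions for illustration only.

```python
# Illustrative skeleton only: the actual GPT-eComplex prompt is in Appendix B, and the
# study's assistant was configured inside ChatGPT rather than via the API.
from openai import OpenAI  # assumes the OpenAI Python SDK and an OPENAI_API_KEY are available

SYSTEM_PROMPT = """
# Persona
You are an analytical assistant that supports (but never replaces) instructors applying
the eComplex rubric to graduate student portfolios. You do not assign final grades.

# Template / Flow (Input -> Process -> Output)
Input: the full text of one portfolio and the 27 eComplex rubric descriptors.
Process: for each item, locate relevant evidence, match it to a rubric descriptor, and
propose a tentative level (1-3) with a criterion-based justification.
Output: (1) a 27-item scoring table, (2) a narrative analysis per dimension,
(3) a synthesis with actionable formative recommendations, (4) an overall note for the instructor.

# Cite Sources
Every judgment must quote verifiable text from the portfolio; if evidence is ambiguous
or missing, state this explicitly and add a "Suggestion for the instructor" note.
"""

def evaluate_portfolio(portfolio_text: str, model: str = "gpt-4o") -> str:
    """Send one portfolio for rubric-based analysis and return the assistant's report."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": portfolio_text},
        ],
    )
    return response.choices[0].message.content
```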
The agent’s communication style was governed by the CAPITAL framework—Confidence, Amiability, Professionalism, Interactivity, Transparency, Adaptability, and Lexicography—to ensure clarity, consistency, and pedagogical usefulness. Table 1 summarizes the agent’s design architecture, including its role definition, alignment with the eComplex rubric, communication framework, operational sequence, and the calibration and quality-assurance mechanisms implemented to stabilize performance.
Table 2 details the minimum outputs required from the agent—including the 27-item scoring table, dimension-based analysis, synthesis summary, and overall note—as well as the rules for managing ambiguity and declaring insufficient evidence.
The GPT-eComplex Assistant operates as a supportive evaluator applying the eComplex rubric with full traceability (Tables 2, 3). Its configuration integrates quality-control mechanisms and triangulation procedures designed to preserve teacher judgment and the validity of the assessment process.
3.4 Data analysis
The data analysis was conducted in three complementary stages. In the first stage, the GPT-eComplex Assistant was calibrated in the application of the eComplex rubric using an initial sample of 12 portfolios. During this phase, the formal quality of the AI-generated reports was audited according to three criteria:
1. Coverage and structure, verifying the presence of the four required components—item table, narrative analysis, synthesis, and global score;
2. Quality and style, evaluating the inclusion of verifiable textual citations, explicit reference to the eComplex rubric descriptors in each item, and the formulation of specific formative recommendations; and
3. Numerical consistency, reviewing the accuracy of average computations by dimension and the correct classification of performance levels.
In the second stage, a comparative review was performed between human evaluations and those generated by the GenAI for the same 12 portfolios. First, a descriptive agreement analysis was conducted, which considered: (a) exact matches between the agent and each instructor, as well as with the rounded average of both; (b) practical matches within a margin of ±1 point (adjacent categories); and (c) wide discrepancies (±2 points), specifying the items and portfolios where these occurred. Subsequently, inferential analyses were applied to estimate inter-rater reliability more robustly: Kendall’s W for overall concordance, Fleiss’ κ for exact agreement adjusted for chance, the Intraclass Correlation Coefficient (ICC) for both individual and average ratings, and Spearman’s ρ for the association between the agent’s scores and the instructors’ mean ratings. Data processing was performed in Python (version 3.11.6).
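As an illustration of how these concordance metrics can be computed in Python, the sketch below uses hypothetical item-level scores in place of the pilot data; the exact preprocessing, data layout, and package choices (SciPy, statsmodels, and pingouin are assumed here) may differ from those used in the study.

```python
import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import rankdata, spearmanr
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical item-level scores: 27 items x 12 portfolios = 324 rows, three raters,
# ordinal levels 1-3 (stand-ins for the pilot data; column names are illustrative).
rng = np.random.default_rng(0)
scores = pd.DataFrame(rng.integers(1, 4, size=(324, 3)),
                      columns=["instructor_1", "instructor_2", "ai_agent"])

def kendalls_w(ratings: np.ndarray) -> float:
    """Kendall's W for n subjects rated by m raters (columns), with a tie correction
    (many ties are expected on a 1-3 scale)."""
    n, m = ratings.shape
    ranks = np.column_stack([rankdata(ratings[:, j]) for j in range(m)])
    rank_sums = ranks.sum(axis=1)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    ties = 0
    for j in range(m):
        _, counts = np.unique(ratings[:, j], return_counts=True)
        ties += (counts ** 3 - counts).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)

w = kendalls_w(scores.to_numpy())

# Fleiss' kappa: exact categorical agreement across the three raters, adjusted for chance.
table, _ = aggregate_raters(scores.to_numpy())
kappa = fleiss_kappa(table, method="fleiss")

# Two-way random-effects ICC: ICC2 (single rating) and ICC2k (average of the three raters).
long = scores.rename_axis("item").reset_index().melt(
    id_vars="item", var_name="rater", value_name="score")
icc = pg.intraclass_corr(data=long, targets="item", raters="rater", ratings="score")

# Spearman correlation between the agent's scores and the mean of the two human raters.
rho, p_value = spearmanr(scores["ai_agent"],
                         scores[["instructor_1", "instructor_2"]].mean(axis=1))

print(f"Kendall's W = {w:.3f}, Fleiss' kappa = {kappa:.3f}, Spearman rho = {rho:.3f}")
print(icc[["Type", "ICC"]])
```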
Finally, after calibration and validation, the tool was applied to the full corpus of 120 portfolios. The data were systematized to identify patterns in complex thinking competencies across the four dimensions—systemic, scientific, critical, and innovative—highlighting recurrent strengths, areas for improvement, and cross-cutting trends. To reinforce interpretive validity, the two human evaluators rated the portfolios blind to the GenAI results, thus avoiding contrast bias. Likewise, the agent was executed once on the 120 portfolios, using an adjusted prompt derived from the comparison phase, allowing for the incorporation of improvements suggested during the initial review prior to large-scale implementation.
4 Results
The results section reports two complementary analyses: (a) a correlational comparison between human and AI-assisted assessments conducted on a pilot subsample (n = 12), and (b) descriptive findings on levels of complex thinking derived from the AI agent’s analysis of the full dataset (n = 120).
4.1 Consistency of the GPT-eComplex assistant in the application of the eComplex rubric
The qualitative checklist was used to evaluate the degree of consistency demonstrated by the GPT-eComplex Assistant in applying the eComplex rubric during the calibration phase with the 12 selected portfolios. The analysis revealed an overall stable performance, with a predominance of high-consistency cases and a smaller number of partial-consistency cases, which indicated opportunities for prompt refinement and adjustment (Figure 1).
Figure 1. Consistency of the GPT-eComplex assistant in applying the eComplex rubric during the calibration phase (n = 12 portfolios).
The coverage and structure of the reports were complete in 100% of the portfolios, confirming that the GPT-eComplex Assistant systematically generated all required components—the item table, narrative analysis, and global synthesis. Regarding quality and style, eight reports (67%) achieved a high level by including verifiable textual citations, explicit reference to rubric descriptors, and actionable recommendations. In contrast, four reports (33%) presented more general justifications or lacked citations, reducing the traceability of the evaluation.
The numerical consistency was solid across all cases, as all average calculations and global classifications were accurate. Among the most frequent qualitative observations were the occasional omission of textual citations and the need to systematically include a “Suggestion for the instructor” block when evidence was ambiguous or when the assigned level was basic.
On a positive note, the agent demonstrated the ability to personalize analyses, justify intermediate level assignments, and recognize partial evidence, thereby providing useful input for feedback and reinforcing its role as a complementary analytical aid to teacher judgment within an authentic assessment framework.
4.2 Comparison between human and GPT-eComplex assistant evaluations
The following results correspond to the pilot calibration phase, conducted with a sample of 12 student portfolios. First, descriptive analyses were performed to identify preliminary patterns of agreement and divergence between human and AI-generated evaluations. These findings are subsequently complemented by inferential analyses (Section 4.2.3) to estimate inter-rater agreement more robustly.
4.2.1 Points of agreement between the agent and instructors
The descriptive analysis revealed a substantial number of matches between the evaluations produced by the GPT-eComplex Assistant and those of the human instructors. Of the 324 items evaluated (27 items × 12 portfolios), the agent’s scores matched those of Instructor 1 in 188 cases (58.0%) and Instructor 2 in 175 cases (54.0%). Moreover, in 119 items (36.7%), all three evaluators (GenAI and both instructors) assigned exactly the same score. When compared with the rounded mean of the human scores, the agent achieved the same rating in 188 items (58.0%), suggesting a consistent tendency toward alignment with the average of human judgments.
When the criterion of agreement was broadened to include a ± 1-point margin relative to each instructor and to the human mean, the level of coincidence exceeded 98% of cases, with maximum discrepancies limited to ±2 points. This indicates that, although exact matches were not always achieved, the GPT-eComplex Assistant’s scores tended to remain close to human ratings.
Given that the evaluation scale (1–3) is ordinal and relatively narrow, a one-point tolerance was also considered appropriate for identifying minimal differences that did not alter the pedagogical meaning of the rubric. For a more comprehensive interpretation, these descriptive findings are complemented by chance-adjusted and ordinal-sensitive metrics—including weighted kappa and intraclass correlation coefficients (ICC)—which are detailed in the inferential analysis section.
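A minimal sketch of how the exact-match and ±1-point agreement rates could be derived from an item-level score table is shown below; it mirrors the hypothetical layout used in the data-analysis sketch, and the rounding convention applied to the human mean is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical item-level scores (27 items x 12 portfolios = 324 rows, ordinal 1-3),
# standing in for the pilot data; column names are illustrative.
rng = np.random.default_rng(1)
scores = pd.DataFrame(rng.integers(1, 4, size=(324, 3)),
                      columns=["instructor_1", "instructor_2", "ai_agent"])

# Rounded mean of the two human raters (ties resolved by NumPy's round-half-to-even).
human_mean = scores[["instructor_1", "instructor_2"]].mean(axis=1).round().astype(int)

exact_vs_i1 = (scores["ai_agent"] == scores["instructor_1"]).mean()
exact_vs_i2 = (scores["ai_agent"] == scores["instructor_2"]).mean()
three_way = (scores.nunique(axis=1) == 1).mean()          # all three raters identical
exact_vs_mean = (scores["ai_agent"] == human_mean).mean()

# Practical agreement within one adjacent rubric level, and wide (+/-2) discrepancies.
within_one = ((scores["ai_agent"] - human_mean).abs() <= 1).mean()
wide_items = int(((scores["ai_agent"] - human_mean).abs() >= 2).sum())

print(f"exact vs. I1 {exact_vs_i1:.1%} | vs. I2 {exact_vs_i2:.1%} | "
      f"three-way {three_way:.1%} | vs. rounded mean {exact_vs_mean:.1%} | "
      f"within +/-1 {within_one:.1%} | +/-2 items {wide_items}")
```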
At the dimensional level, Critical Thinking showed the highest relative alignment, with 65.7% agreement between the agent and Instructor 1 and the same percentage when compared with the human mean. Systemic Thinking also exhibited consistently high agreement with both instructors (61.1%) and the highest proportion of full three-way matches (47.2%). In contrast, Scientific Thinking displayed lower agreement—48.8% with Instructor 1, 41.7% with Instructor 2, and only 25% three-way concordance. Finally, Innovative Thinking reached intermediate values, showing closer proximity to Instructor 2 (58.3%) than to Instructor 1 (53.3%).
Figure 2 visualizes these results, showing the percentages of exact agreement between the GPT-eComplex Assistant and each instructor, as well as the joint agreements and those corresponding to the rounded mean of the human scores.
Figure 2. Exact agreement percentages between the GPT-eComplex Assistant and human instructors by dimension.
Overall, the descriptive data suggest that agreement was strongest in the Critical and Systemic Thinking dimensions, whereas Scientific Thinking showed the lowest alignment and Innovative Thinking occupied an intermediate position. These preliminary results reveal distinct patterns of proximity between the GenAI agent and human evaluators, depending on the type of complex-thinking competency assessed.
4.2.2 Areas of discrepancy and divergence
The detailed analysis revealed that, although most discrepancies between the GPT-eComplex Assistant and the instructors remained within narrow margins (0 to ±1 point), a very limited number of cases showed differences of ±2 points. These instances were rare and concentrated in four specific items, particularly within the Scientific Thinking dimension (Portfolio P12, Items 8 and 9) and the Critical Thinking dimension (Portfolio P8, Item 15; Portfolio P9, Item 16), as shown in Table 3.
The observed differences may be explained by the fact that some performances fell at the boundaries between rubric categories—that is, at the thresholds separating one level of achievement from another. In such cases, the evidence presented by students may not align precisely with a single descriptor but instead exhibit intermediate characteristics. This situation increases the likelihood that small interpretive variations lead to different score assignments. Another possible factor concerns how evidence is weighted: while the GPT-eComplex Assistant tends to assign greater importance to explicit formulations, human instructors may recognize implicit achievements reflected in reasoning or idea integration.
To address the instances in which wider discrepancies were observed, a targeted review was conducted of the items corresponding to portfolios P12, P8, and P9, with the aim of verifying descriptor clarity and the consistency of the evidence presented.
In Portfolio P12 (Scientific Thinking), the absence of explicit indicators of logical or deductive reasoning suggested the need to reinforce the task instructions by incorporating hypothesis formulation and testing. In Portfolio P8 (Critical Thinking), the student proposed potential solutions without articulating hypotheses or contrastive reasoning; thus, it is recommended to maintain the task structure but make the requirement to support solutions with verifiable arguments or evidence more explicit. Finally, in Portfolio P9 (Innovative Thinking), the task lacked supporting theoretical references. It is therefore advisable to retain the current format while adding an instruction prompting students to link their proposals to relevant innovation theories or models within the analyzed sector.
The results showed that the rate of ±2 discrepancies dropped to 0%, while agreement within a ±1-point margin increased to 100%, accompanied by slight improvements across all concordance metrics (Kendall’s W from ~0.679 to ~0.684, Fleiss’ κ from ~0.28 to ~0.285, ICC from ~0.51 to ~0.528, and Spearman’s ρ from ~0.549 to ~0.581). These findings indicate that the detected discrepancies were limited to specific cases located at the boundaries between rubric categories and did not affect the overall consistency of the agent’s performance.
Therefore, the GPT-eComplex Assistant demonstrates a stable and coherent evaluation pattern aligned with human judgment, reinforcing its validity as a complementary support tool within a formative and authentic assessment framework. These observations also illustrate how the interaction between the AI agent and human reviewers can evolve into a bidirectional feedback process, strengthening both the validity of the assessment and the continuous improvement of teaching practices.
4.2.3 Inferential statistical analysis of concordance between the AI agent and human evaluators
To complement the descriptive analyses, inferential statistical tests were applied to estimate the degree of agreement between human evaluators and the AI-based evaluation agent. These analyses correspond to the pilot subsample of 12 portfolios (324 item-level ratings). Given that the scoring scale was ordinal and limited (1–3), metrics that account for chance agreement and respect the ordinal nature of the data were used. The results are presented in Table 4.
Table 4. Inferential results of concordance between human evaluators and the AI agent (pilot phase, n = 12 portfolios).
First, Kendall’s W was used to assess overall concordance among the three evaluators (Instructor 1, Instructor 2, and the AI agent). The result, W = 0.68 (p < 0.001), represents a substantial level of agreement (Landis and Koch, 1977). Since this coefficient ranges from 0 to 1, with higher values indicating stronger concordance, the finding suggests that the evaluators consistently ranked the portfolios similarly (i.e., agreeing on which performed better or worse), even if their exact categorical scores did not always match.
Second, Fleiss’ κ was computed to estimate exact agreement across all items. The value obtained, κ = 0.28 (P̄ = 0.57, Pe = 0.403), indicates that evaluators agreed on about 57% of the items, with the adjusted concordance corresponding to a fair agreement (Landis and Koch, 1977). This result reflects that the AI agent did not always assign the exact same category as the human evaluators but tended to select a neighboring level (e.g., 2 instead of 3).
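For clarity, the chance-adjusted value follows directly from the observed and expected agreement proportions reported above:

\[ \kappa = \frac{\bar{P} - P_e}{1 - P_e} = \frac{0.57 - 0.403}{1 - 0.403} \approx 0.28 \]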
Third, the Intraclass Correlation Coefficient (ICC) was calculated to assess inter-rater reliability. Results showed moderate reliability when considering a single rating (ICC(2,1) = 0.51) and good reliability when using the average of all three evaluators (ICC(2,k) = 0.76), according to Koo and Li (2016). These findings indicate that while individual ratings may vary, consistency improves significantly when multiple judgments are combined—suggesting that the AI agent adds more value as part of a multi-evaluator system than as a stand-alone rater.
Fourth, Spearman’s ρ correlation between the AI-generated and average human scores was ρ = 0.55 (p < 0.001), representing a moderate positive association (Akoglu, 2018). This means that when instructors assigned higher scores, the AI agent tended to assign higher scores as well, even if it did not always match the exact category.
Taken together, both descriptive and inferential analyses point in the same direction: the AI agent consistently followed the general evaluative trend established by the instructors and contributed to stabilizing results when multiple evaluations were considered. Although exact matches were not always achieved, over 98% of the differences remained within one point, with larger discrepancies being rare and localized. These statistical findings confirm that the evaluators tended to rank the portfolios similarly (substantial agreement via Kendall’s W), that the AI agent mirrored the general hierarchy of human judgments (moderate Spearman correlation), and that reliability improved considerably when averaging across raters (ICC).
In practical terms, these results suggest that the GPT-eComplex Assistant can serve as a valid co-evaluator, provided it operates under teacher supervision and calibration protocols that ensure validity, traceability, and fairness in assessment decisions.
Since this was a pilot phase with a relatively small sample of 12 portfolios, the results should be interpreted as exploratory rather than generalizable. Nonetheless, they provide a solid foundation for advancing from concordance analysis toward the identification of strengths and areas for improvement in student portfolios—an issue addressed in the next section.
4.3 Complex thinking in student portfolios
4.3.1 Characterization of the analyzed projects
Among the 120 evaluated portfolios, there was a notable thematic and sectoral diversity, with a predominance of projects related to circular economy, recycling, and sustainable materials (25 projects), followed by initiatives focused on social inclusion, employment, and community development (16), sustainable agriculture and agribusiness (13), and education and training (11). Other relevant areas included health and well-being (9), sustainable food systems (9), and sustainable construction and housing (9), as illustrated in Figure 3.
Regarding the Sustainable Development Goals (SDGs), SDG 12 (Responsible Consumption and Production) was the most frequently addressed, appearing in 57 projects, followed by SDG 9 (Industry, Innovation, and Infrastructure), SDG 13 (Climate Action), and SDG 8 (Decent Work and Economic Growth)—each represented in approximately 33 portfolios. SDG 10 (Reduced Inequalities) and SDG 11 (Sustainable Cities and Communities) also appeared recurrently, while SDG 3 (Good Health and Well-being), SDG 7 (Affordable and Clean Energy), SDG 2 (Zero Hunger), and SDG 15 (Life on Land) were represented to a moderate extent.
In terms of project types, web platforms and applications accounted for 23%, followed by training and educational programs (19%), communication or awareness campaigns (17%), and data panels or analytical dashboards (15%). This distribution highlights a practical orientation toward developing technological solutions applicable to real-world contexts, with a strong emphasis on environmental sustainability, digital innovation, and social inclusion.
The analysis of portfolios by level of mastery in complex thinking revealed that the highest average scores corresponded to projects related to technology and innovation (M = 2.63) and sustainable mining (M = 2.60), showing a solid integration of scientific and innovative dimensions—particularly through digital solutions, automation, and process optimization. Projects in sustainable agriculture (M = 2.51) and environmental management (M = 2.49–2.50) achieved strong results in the systemic dimension, demonstrating an understanding of the interrelationships among actors, resources, and ecosystems.
Meanwhile, projects in education (M = 2.45) and responsible retail (M = 2.40) showed greater development in the critical dimension, reflecting ethical awareness and the evaluation of social impacts. Finally, although projects in circular economy and sustainable materials (M = 2.44) achieved solid overall performance, there remains room to strengthen scientific argumentation and evidence-based decision-making, which would further enhance the rigor and impact of these proposals.
4.3.2 Strengths in the development of students’ complex thinking
The comprehensive analysis of the 120 portfolios made it possible to assess the level of development of complex thinking across its four dimensions—systemic, scientific, critical, and innovative—following the eComplex rubric (Castillo-Martínez and Ramírez-Montoya, 2022b). Quantitative results indicated an overall mean score of 2.40, corresponding to an intermediate–advanced level, with a clear tendency toward higher performance in the innovative (2.66) and systemic (2.56) dimensions, compared to lower values in the critical (2.31) and scientific (2.06) dimensions (Figure 4).
The radar chart shows that students demonstrated stronger performance in applied creativity and systems comprehension, while scientific rigor and critical argumentation remain areas for improvement.
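A minimal sketch of how a radar chart of this kind could be reproduced from the reported dimension means is given below; the plotting choices are illustrative and do not reproduce the article's Figure 4 exactly.

```python
import numpy as np
import matplotlib.pyplot as plt

dims = ["Systemic", "Scientific", "Critical", "Innovative"]
means = [2.56, 2.06, 2.31, 2.66]  # dimension means reported above (scale 1-3)

angles = np.linspace(0, 2 * np.pi, len(dims), endpoint=False).tolist()
values = means + means[:1]          # repeat the first value to close the polygon
angles = angles + angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, marker="o")
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(dims)
ax.set_ylim(1, 3)                   # eComplex performance levels range from 1 to 3
ax.set_title("Complex thinking dimension means (n = 120 portfolios)")
plt.show()
```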
The qualitative analysis revealed consistent strengths in the integration of complex thinking dimensions. In the systemic dimension, students demonstrated a deep understanding of the interconnection between social, environmental, and economic variables, articulating integrated and sustainable solutions. This is reflected in excerpts such as:
“Solving soil contamination from a holistic and sustainable perspective leads to environmental, economic, and social benefits, contributing to SDG 15 and fostering a more equitable and resilient society” (P1).
“Addressing the issue through a combination of organizational strengthening, sustainable production, and fair trade will generate sustainable social, economic, and environmental impacts” (P84).
Other reflections include:
“This project helped me become aware of the problem of food waste and its impact on climate change… promoting responsible consumption and food reuse practices that contribute to improving food security in the country” (P4).
“The implementation of smart meters is a step toward a more sustainable and consumption-conscious future… we must focus not only on technology but also on how it transforms customer relationships and contributes to a more responsible energy model” (P7).
These examples demonstrate applied systemic understanding and an orientation toward collaborative and multidimensional solutions.
In the scientific dimension, students showed progress in structuring evidence-based proposals and maintaining coherence between diagnosis, data, and solution. For instance:
“Our smart leasing proposal for small farmers in Sullana is based on a vulnerability analysis supported by SDGs 2 and 8, incorporating technologies such as sensors and drones to optimize resources and ensure sustainable solutions” (P112).
“Our project connects evidence on sustainability and food innovation with concrete solutions such as circular economy models, digital traceability, and support for small producers, ensuring coherence between diagnosis, data, and social impact proposal” (P22).
These examples reveal a growing capacity for evidence-based reasoning, a key component of mature complex thinking oriented toward informed decision-making.
The critical dimension emerged through ethical reflection, recognition of tensions, and the formulation of reasoned judgments about contemporary dilemmas. One student noted that
“The lack of trust in uncertified caregivers creates anxiety among parents… informal advertising and recommendations do not guarantee quality or safety” (P17), highlighting ethical tensions between informality and professionalization.
Another reflected that
“Corporate responsibility is a key driver… companies that prioritize social and environmental impact have the power to transform communities” (P33), demonstrating ethical awareness of the role of business in sustainability.
Similarly, another proposed that
“Implementing a community composting system is a sustainable and scalable solution aligned with SDGs 12 and 13” (P31), connecting critical thinking with transformative action.
Collectively, these reflections illustrate a transition from identifying dilemmas to formulating well-grounded positions integrating social responsibility, sustainability, and professional ethics.
Finally, the innovative dimension stood out for generating creative proposals with high impact potential. For example:
“The ‘EcoPavers’ proposal suggests paving stones made from 60% recycled construction waste, offering a sustainable and economically viable alternative for the construction industry” (P82).
Another student proposed:
“We aim to reuse whey powder through circular economy strategies, promoting its sustainable use and reducing its environmental impact” (P98).
A third project converted plastic waste into industrial igloo tents, noting that
“Producing these tents not only reduces plastic waste but also provides innovative housing and logistical solutions, strengthening the transition toward sustainable models” (P101).
These portfolios exemplify an advanced understanding of innovation as an integrative process that merges creativity, sustainability, and social value, translating the principles of complex thinking into applied solutions for sustainable development.
4.3.3 Areas for improvement identified through AI-assisted analysis
The automated review of 120 portfolios using the GPT-eComplex Assistant revealed cross-cutting patterns for improvement in the development of complex thinking. The results suggest that, while most students achieved an intermediate–advanced level in the systemic and innovative dimensions, there remain notable gaps in scientific rigor and depth of critical reasoning.
In particular, the model identified the need for greater methodological precision in the formulation and validation of proposals. Across multiple cases, portfolios contained descriptions of processes or hypotheses without an explicit method or evaluation indicators. The system clustered comments such as “needs stronger methodological rigor in research,” “apply scientific reasoning and justify process stages,” or “define impact metrics and validate results,” highlighting the opportunity to strengthen the scientific dimension of complex thinking—specifically, the ability to contrast theory and evidence, establish metrics, and communicate verifiable results.
Similarly, opportunities were observed in the critical dimension, especially regarding the articulation of well-grounded judgments and the ethical evaluation of proposed solutions. The model synthesized observations such as “should strengthen critical argumentation and interpretation of results” and “requires empirical validation of findings,” pointing to an emerging but still limited reflection on the social and environmental impacts of innovation.
Overall, the AI-assisted analysis suggests that students tend to prioritize creativity and feasibility—dimensions in which they show consistent strength—but need to deepen their empirical validation and critical-scientific reasoning to reach higher levels of cognitive integration. This finding confirms the value of generative AI tools as support mechanisms in formative assessment, enabling the identification of performance patterns and guiding pedagogical strategies toward more reflective, evidence-based, and ethically grounded learning.
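To illustrate how recurrent AI-generated comments of this kind can be aggregated by rubric dimension, the following minimal Python sketch groups free-text feedback fragments using simple keyword matching. It is an illustrative assumption about one possible aggregation step, not the actual GPT-eComplex Assistant pipeline; the keyword lists, function name, and sample comments are hypothetical.

```python
from collections import defaultdict

# Hypothetical keyword map for the four dimensions discussed in this study
# (illustrative only; the GPT-eComplex Assistant works from the full rubric,
# not from keyword matching).
DIMENSION_KEYWORDS = {
    "systemic": ["interconnection", "holistic", "stakeholder", "sustainab"],
    "scientific": ["method", "metric", "validate", "evidence", "hypothesis"],
    "critical": ["argument", "judgment", "ethical", "interpretation"],
    "innovative": ["creative", "novel", "prototype", "feasib"],
}

def group_feedback(comments: list[str]) -> dict[str, list[str]]:
    """Assign each feedback comment to every dimension whose keywords it
    mentions; comments matching nothing are collected under 'unclassified'."""
    grouped = defaultdict(list)
    for comment in comments:
        lowered = comment.lower()
        hits = [dim for dim, words in DIMENSION_KEYWORDS.items()
                if any(word in lowered for word in words)]
        for dim in hits or ["unclassified"]:
            grouped[dim].append(comment)
    return dict(grouped)

if __name__ == "__main__":
    sample = [  # hypothetical comments paraphrasing those reported above
        "Needs stronger methodological rigor and explicit impact metrics.",
        "Should strengthen critical argumentation and interpretation of results.",
    ]
    for dimension, items in group_feedback(sample).items():
        print(f"{dimension}: {len(items)} comment(s)")
```

In practice, such a rough grouping would serve only as a first pass that the teacher reviews and reinterprets, in line with the human-mediation principle emphasized throughout this study.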
The digital portfolio consolidated its role as an authentic assessment tool by integrating critical reflection, real-world contextualization, and progressive evidence of learning. It provided tangible evidence of situated learning and of the integration of the four dimensions of complex thinking—systemic, scientific, critical, and innovative—illustrating how students articulated theory, practice, and personal values in real proposals aligned with the Sustainable Development Goals (SDGs). Moreover, the portfolio fostered deep reflection, self-regulated learning, and traceable cognitive and ethical development throughout the course, thanks to its flexible digital format that supported personalization, continuous feedback, and iterative improvement.
The combined qualitative and automated analysis revealed that the portfolio effectively demonstrated coherence among diagnosis, method, action, and results, becoming tangible evidence of complex thinking in action. In this sense, it functioned not only as an evaluation tool but also as a reflective learning environment, where students connected academic knowledge with their civic, professional, and entrepreneurial roles, thereby strengthening their capacity to generate sustainable and innovative solutions from an ethical and contextualized perspective.
5 Discussion
The incorporation of AI-assisted assessment systems emerges as an effective complementary strategy to strengthen coherence and traceability in evaluating complex competencies within business education programs. As shown by the statistical analysis, the GPT-eComplex Assistant reproduced the overall evaluative trend of instructors, maintaining close alignment in ratings and performance hierarchies across portfolios. However, complete agreement was not achieved, indicating that the agent should be regarded as a supportive analytical tool rather than a replacement for human judgment. It enhances consistency and transparency, yet still depends on teacher calibration and oversight to ensure validity. This finding aligns with prior studies emphasizing AI’s potential to improve objectivity and reliability in formative assessment (Salinas-Navarro et al., 2024; Martin et al., 2025) while also reflecting concerns about automation without pedagogical mediation (Flodén, 2025). Consequently, human–AI co-assessment shows a positive trend toward greater stability and dependability, opening new research directions on optimizing calibration, feedback, and the ethical integration of AI in the assessment of complex thinking.
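As a concrete illustration of the kind of alignment check discussed above, the short sketch below computes Spearman's rank correlation between human and AI rubric scores. The score vectors are invented for illustration only and do not reproduce the study's data or results; the point is that a high rank correlation indicates the AI preserves the ordering of portfolios even when individual scores diverge.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical overall rubric scores (1-5 scale) for a small set of portfolios;
# these values are illustrative and are not the study's actual ratings.
human_scores = np.array([4.0, 3.5, 4.5, 3.0, 4.0, 5.0, 3.5, 4.5])
ai_scores = np.array([4.0, 3.0, 4.5, 3.5, 4.0, 4.5, 3.5, 4.0])

rho, p_value = spearmanr(human_scores, ai_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

# A high rho means the AI reproduces the ranking of portfolios, which is the
# sense of "alignment" discussed here; exact score agreement is a stricter
# criterion, which is why teacher calibration remains necessary.
```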
The digital portfolio serves as authentic evidence of complex thinking development among business students, enabling the integration of reflection, contextual analysis, and applied learning. The analysis of 120 portfolios showed that the instrument most clearly captures the systemic and critical dimensions—linked to holistic problem understanding and ethical reasoning—while it is less effective in representing the scientific dimension, particularly regarding data use and hypothesis formulation. This finding is consistent with previous studies indicating that portfolios are primarily used for students to present genuine learning narratives, contextual analyses, and professional reflections (Trevitt and Stocks, 2011). In the Ibero-American context, portfolios are typically implemented to foster reflection, self-regulation, and real-world learning connections (Harada, 2020; Trejo González, 2024), but are seldom used to cultivate scientific thinking skills, which are often developed through other types of learning products.
Accordingly, this study reaffirms the pedagogical value of portfolios as authentic assessment tools while underscoring the need to adjust task design and evidence requirements to better balance the representation of the critical and scientific dimensions. This has implications for both teaching practice—encouraging the redesign of activities that combine reflection with data-based analysis—and educational research, supporting the validation of hybrid instruments that integrate narrative and empirical evidence in GenAI-assisted evaluations of complex thinking.
The use of Generative AI (GenAI) in rubric-based qualitative assessment represents a pedagogical advancement, broadening opportunities for feedback, traceability, and consistency in analyzing learning evidence. The results indicate that its educational value lies not only in its technical capability but in how it amplifies teacher reflection and supports the design of authentic, learning-centered assessments focused on deeper understanding and continuous improvement. Within the Ibero-American context, although digital portfolios are recognized as valuable tools for fostering reflection and real-world engagement (Harada, 2020; Trejo González, 2024), their adoption remains limited and often dependent on individual initiative or traditional certification-oriented models (Gallardo-Fuentes et al., 2025). Moreover, few studies have systematically examined the use of GenAI in assessing complex learning processes, highlighting the need for more empirical evidence and practical guidance on its pedagogical application.
In this context, integrating GenAI introduces new ethical and pedagogical challenges, requiring that human responsibility and transparency remain central to all assessment processes (Cope et al., 2025). Proper application of GenAI therefore demands that technology function as a resource that enhances pedagogical judgment and promotes a reflective, accountable assessment culture.
In an educational landscape transformed by generative artificial intelligence, this study demonstrates that AI-assisted assessment can strengthen coherence, authenticity, and formative feedback, provided that teacher mediation and ethical integrity are maintained at every stage. These findings emphasize the need to develop practical frameworks ensuring transparency, data protection, and teacher preparation as essential conditions for the responsible use of GenAI in higher education (UNESCO, 2025). The following sections build on these directions through pedagogical and institutional implications, recommendations for designing authentic formative portfolios, and the introduction of the AI-PROMPT Framework—a methodological guide for the ethical and formative integration of AI into complex thinking assessment.
5.1 Implications for teaching practice and faculty development in the Ibero-American context
Complex thinking constitutes an essential metacompetence in higher education, as it integrates systemic understanding, scientific reasoning, critical reflection, and creativity to address real-world challenges. The findings of this study show that authentic assessment through portfolios provides an effective means to demonstrate mastery of this competence—especially when combined with a Generative AI–based evaluation agent that complements teachers’ work, strengthens formative feedback, and enhances the traceability of the assessment process. This integration demonstrates the potential of AI to support the evaluation of complex learning, provided that human judgment remains the ethical and pedagogical cornerstone.
5.1.1 Implications for teaching practice
Within the Ibero-American context, the findings underscore the need to strengthen faculty capacity to integrate Generative AI (GenAI) into the teaching and assessment of complex thinking. Educators must understand both the structure and purpose of this metacompetence and be able to design authentic learning tasks and assessments aligned with its four dimensions—systemic, scientific, critical, and innovative. This involves incorporating open-ended tasks, interdisciplinary projects, and guided reflection processes within student portfolios.
Moreover, AI literacy for educators is essential. Teachers must be equipped to design and refine prompts according to rubric criteria, interpret AI-generated reports, and use them as a basis for personalized feedback. In this regard, the study introduces, in the following sections, an illustrative portfolio design and the AI-PROMPT Framework—a methodological guide for constructing educational prompts consistent with the eComplex rubric and the principles of authentic assessment. These actions foster a teaching culture grounded in ethical and reflective use of AI, where the educator maintains interpretive control and ensures the pedagogical validity of the evaluative process.
5.1.2 Implications for institutional policy
In business schools, integrating AI into teaching and assessment processes represents an opportunity to strengthen the ethical, critical, and sustainability-oriented formation of future leaders. It is recommended that institutions promote faculty development programs focused on AI literacy, educational prompt design, and the use of digital rubrics to ensure more transparent, consistent, and formative evaluation processes. The AI-PROMPT Framework proposed in this study may serve as an institutional guideline for standardizing best practices and ensuring coherence among competencies, tools, and learning outcomes.
More broadly, across Ibero-American higher education, these initiatives can help align pedagogical innovation with international frameworks of quality and sustainability, fostering an academic culture that promotes the ethical, responsible, and human-centered use of AI in both learning and assessment.
5.2 Recommendations for designing a formative and authentic portfolio
Based on the findings of this study and prior literature, the following design principles are proposed for a formative and authentic portfolio aimed at developing complex thinking in higher education:
• Reflective—Encourages critical analysis and self-regulated learning (Harada, 2020; Sultana et al., 2020). Portfolios should help students articulate ethical and social awareness, progressing from description to critical argumentation and evidence-based decision-making. Structured spaces—such as guided questions, reflective journals, or self-assessments—can foster metacognition and ethical reflection throughout the learning process.
• Authentic—Links learning with real-world tasks and professional standards (Villarroel et al., 2017; Trevitt and Stocks, 2011). Students should apply theoretical knowledge to social, environmental, or professional challenges, validating their proposals with empirical evidence such as field data, stakeholder feedback, or pilot results that demonstrate the feasibility and impact of their solutions.
• Evidential and Progressive—Makes the development of competencies visible over time (Erumeda et al., 2024). Portfolios should trace the student’s reasoning from hypothesis → method → data → results → decisions, showing how knowledge is constructed through evidence, reflection, and iterative improvement.
• Interactive and Feedback-Oriented—Integrates teacher and peer feedback to enrich learning (Gallardo-Fuentes et al., 2025; Ajjawi et al., 2023). As a dialogic space, the portfolio should include midterm reviews, peer co-assessment, and reflective responses showing how students revise their work based on feedback—promoting collaboration and shared responsibility for learning.
• Personalized and Contextualized—Reflects each student’s voice, context, and trajectory (Trevitt and Stocks, 2011). A personal statement or learning narrative can highlight goals, motivations, and professional connections, reinforcing authorship and meaningful learning transfer to real contexts.
• Digital and Flexible—Leverages multimedia, remote access, and adaptability (Dave and Mitchell, 2025; Mudau, 2022). Digital platforms should support multiple evidence types—textual, visual, and audiovisual—allowing asynchronous feedback, progress analytics, and inclusive, sustainable learning environments that enhance reflective and collaborative digital culture.
In sum, a formative and authentic portfolio integrates reflection, evidence, feedback, personalization, and digital adaptability to foster complex thinking. Combined with GenAI tools, it becomes a dynamic environment that supports ethical, evidence-based, and continuous learning.
5.3 Methodological recommendation: the AI-PROMPT framework for designing GPT-based assessment agents
Based on the findings of this study, the AI-PROMPT Framework (Figure 5) is proposed as a flexible methodological guide for designing prompts that enable educators across disciplines to apply the eComplex rubric in authentic evaluations of complex thinking. The framework integrates principles of responsible prompt engineering (White et al., 2023; Vanderbilt University, 2024) with formative assessment foundations, emphasizing transparency, traceability (clear linkage between evidence and criteria), and teacher judgment as central components.
Figure 5. AI-PROMPT framework for designing GPT-based evaluative agents aligned with the eComplex rubric.
The framework supports the creation of AI evaluation agents that keep complex thinking as their core focus, allow for disciplinary adaptation, and strengthen transparency and traceability—that is, the explicit documentation of how each piece of student evidence connects to rubric criteria and evaluative decisions. Teachers retain the interpretive and ethical role, ensuring pedagogical validity and accountability in GenAI-assisted evaluation.
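To make rubric-aligned prompt design more tangible, the sketch below assembles an evaluative prompt that binds the model's judgments to explicit rubric criteria and to quoted evidence. The dimension names follow the four dimensions used in this study, but the function name, template wording, and placeholder descriptors are hypothetical; the sketch does not reproduce the AI-PROMPT Framework or the actual configuration of the GPT-eComplex Assistant.

```python
# Hypothetical builder for a rubric-aligned evaluative prompt. The dimension
# list mirrors the four dimensions discussed in this study; the wording of the
# template is illustrative, not the authors' AI-PROMPT content.
ECOMPLEX_DIMENSIONS = ["systemic", "scientific", "critical", "innovative"]

def build_evaluation_prompt(portfolio_excerpt: str,
                            rubric_descriptors: dict[str, str]) -> str:
    """Compose a prompt that asks the model to rate each dimension, quote the
    passages it used as evidence, and flag missing evidence, keeping every
    judgment traceable to the rubric."""
    criteria = "\n".join(
        f"- {dim}: {rubric_descriptors[dim]}" for dim in ECOMPLEX_DIMENSIONS
    )
    return (
        "You are an assessment assistant that supports, not replaces, a teacher.\n"
        "Evaluate the portfolio excerpt against each criterion below. For every "
        "dimension, return a level, quote the evidence you relied on, and state "
        "explicitly when the evidence is insufficient.\n\n"
        f"Rubric criteria:\n{criteria}\n\n"
        f"Portfolio excerpt:\n{portfolio_excerpt}\n"
    )

# Example usage with placeholder descriptors (not the actual eComplex wording).
prompt = build_evaluation_prompt(
    "Our proposal links circular-economy practices to SDG 12...",
    {dim: f"Placeholder descriptor for the {dim} dimension."
     for dim in ECOMPLEX_DIMENSIONS},
)
print(prompt[:200])
```

Keeping the rubric text, the evidence-quoting requirement, and the insufficiency flag inside the prompt is one way to operationalize the transparency and traceability principles described above, while the resulting report remains raw material for the teacher's own judgment.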
5.4 Limitations of the study
Although this study provides robust evidence on the pedagogical and technical feasibility of GenAI-assisted evaluation, it presents certain limitations regarding its disciplinary scope, sample size, and reliance on predefined prompts. Moreover, it did not explore teachers’ and students’ perceptions, nor did it address technological disparities specific to the Ibero-American context, which opens pathways for future research. Despite these constraints, the results offer a valuable empirical foundation for advancing toward hybrid assessment models in which artificial intelligence complements pedagogical judgment and fosters more transparent and formative feedback practices.
6 Conclusion
This study demonstrates the potential of integrating Generative Artificial Intelligence (GenAI) into rubric-based qualitative assessment to strengthen authentic evaluation and foster the development of complex thinking in business education. The findings revealed that: (a) the GPT-eComplex Assistant closely mirrored teachers’ judgments, enhancing consistency, traceability, and transparency in the evaluative process—though still requiring human calibration; (b) the digital portfolio proved to be a valuable form of authentic evidence for observing systemic and critical thinking, while showing limitations in the scientific dimension; (c) within the Ibero-American context, the incorporation of GenAI in educational assessment remains emergent and under-researched, emphasizing the need for empirical evidence and practical guidance; and (d) the responsible use of GenAI requires active teacher mediation, ethical awareness, and institutional transparency, ensuring that technology complements rather than replaces pedagogical judgment and promotes a more reflective, human-centered assessment culture.
These results reinforce the importance of combining pedagogical innovation with the ethical use of AI in higher education. Integrating such tools into academic programs enhances not only cognitive development but also ethical awareness, self-regulation, and social responsibility within assessment and learning processes. The quality of the prompt and the triangulation between the eComplex rubric, teacher judgment, and the AI agent emerged as key factors for ensuring the validity, coherence, and traceability of the process. Moreover, the analysis of 120 portfolios highlighted strong performance in the innovative and systemic dimensions—associated with creativity and holistic problem understanding—and opportunities for improvement in the scientific and critical dimensions, particularly regarding rigorous data use and evidence-based reasoning.
As an outcome of this research, an integrated pedagogical strategy is proposed, combining formative assessment through digital portfolios, the use of GenAI-based evaluative agents, and the AI-PROMPT Framework as a methodological guide for designing educational prompts aligned with the eComplex rubric. This model is grounded in teacher mediation, ethics, and process traceability, offering a replicable structure to develop learning experiences oriented toward complex thinking and the SDGs.
Conceptually, this study contributes by linking educational innovation, formative assessment, and artificial intelligence within an applied model; methodologically, by combining quantitative and qualitative analyses that validate alignment between AI and human judgment; practically, by providing design principles and a replicable framework to strengthen complex thinking competencies in AI-assisted educational environments; and contextually, by contributing empirical evidence from an Ibero-American setting, where GenAI-supported assessment remains limited yet essential for reducing digital gaps and fostering an ethical and sustainable evaluation culture.
Finally, future research should explore the GPT-eComplex approach and eComplex rubric across other disciplines and learning products, test the scalability and adaptability of the AI-PROMPT model in diverse educational levels and cultural contexts, and conduct comparative international studies to analyze how pedagogical and sociotechnical factors influence the efficacy and ethics of GenAI-assisted assessment.
Data availability statement
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.
Ethics statement
This study involved the analysis of digital portfolios produced by graduate students in business education programs in Peru and was conducted in accordance with internationally accepted ethical standards for educational research. Ethical approval was granted by the Research Ethics Committee for Social Sciences, Humanities, and Arts of the Pontificia Universidad Católica del Perú. Informed consent was obtained from all participants, authorizing the use of their academic work for research purposes. All data were anonymized prior to analysis, participation was voluntary, and the study had no impact on students’ academic evaluation.
Author contributions
MP-C: Conceptualization, Investigation, Methodology, Project administration, Resources, Supervision, Visualization, Writing – original draft, Writing – review & editing. IC-M: Conceptualization, Formal analysis, Funding acquisition, Validation, Writing – original draft, Writing – review & editing.
Funding
The author(s) declared that financial support was not received for this work and/or its publication.
Acknowledgments
The authors acknowledge the technical and financial support of Writing Lab, Institute for the Future of Education, Tecnologico de Monterrey, Mexico, in the production of this work.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that Generative AI was used in the creation of this manuscript. Generative AI was used as an evaluative assistant (GPT-eComplex Assistant) to analyze student portfolios based on the eComplex rubric, identifying patterns of complex thinking and providing formative feedback without replacing human judgment.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2026.1729156/full#supplementary-material
References
Ajjawi, R., Tai, J., Dollinger, M., Dawson, P., Boud, D., and Bearman, M. (2023). From authentic assessment to authenticity in assessment: broadening perspectives. Assess. Eval. High. Educ. 49, 499–510. doi: 10.1080/02602938.2023.2271193
Akoglu, H. (2018). User's guide to correlation coefficients. Turkish Journal of Emergency Medicine 18, 91–93. doi: 10.1016/j.tjem.2018.08.001
Alonso-Galicia, P. E., Vázquez-Parra, J. C., Castillo-Martínez, I. M., and Ramírez-Montoya, M. S. (2025). Complex thinking as a component in entrepreneurship education and engineering classes: an empirical study. J. Int. Educ. Bus. 18, 218–233. doi: 10.1108/JIEB-08-2024-0101
Alvarez-Icaza, I., González-Pérez, L. I., López-Caudana, E. O., Huerta, O., Muñoz-Casillas, F., and Ramírez-Montoya, M. S. (2025). “User satisfaction analysis of the OpenEDR4C platform: developing complex thinking competence” in Proceedings of TEEM 2024: The 12th international conference on technological ecosystems for enhancing Multiculturality. eds. R. Molina-Carmona, C. J. Villagrá-Arnedo, P. Compañ-Rosique, F. J. García-Peñalvo, and A. García-Holgado (Singapore: Springer), 853–862.
Brookhart, S. M. (2013). How to create and use rubrics for formative assessment and grading. Alexandria, VA: ASCD.
Castillo-Martínez, I. M., and Ramírez-Montoya, M. S. (2022a). eComplexity: Medición de la percepción de estudiantes de educación superior acerca de su competencia de razonamiento para la complejidad. Available online at: https://hdl.handle.net/11285/643622 (Accessed: 13 October 2025).
Castillo-Martínez, I. M., and Ramírez-Montoya, M. S. (2022b). eComplex: medición de los niveles de dominio de la competencia de razonamiento complejo en estudiantes universitarios. Available online at: https://hdl.handle.net/11285/650169 (Accessed: 13 October 2025).
Castillo-Martínez, I. M., Velarde Camaqui, D., Ramírez-Montoya, M. S., and Sanabria Zepeda, J. C. (2024). eComplexity: psychometric properties to test the validity and reliability of the instrument. Available online at: https://hdl.handle.net/11285/675764 (Accessed: 13 October 2025).
Çayir, A. (2023). A literature review on the effect of artificial intelligence on education. İnsan ve Sosyal Bilimler Dergisi 6, 276–288. doi: 10.53048/johass.1375684
Center for Teaching Innovation, Cornell University (2025). Generative Artificial Intelligence. Cornell University. Available online at: https://teaching.cornell.edu/generative-artificial-intelligence (Accessed: 13 October 2025).
Cope, B., Kalantzis, M., and Saini, A. K. (2025). “The ends of tests: possibilities for transformative assessment and learning with generative AI” in AI and the future of education: Disruptions, dilemmas and directions. ed. S. Isaacs (Paris: UNESCO), 81–88.
Dave, K., and Mitchell, K. (2025). Enhancing flexible assessment through eportfolios: a scholarly examination. J. Univ. Teach. Learn. Pract. 22, 1–17. doi: 10.53761/wearsf41
Elsner, W. (2025). “Complexity, sustainability, and the history of economic thought” in Routledge international handbook of complexity economics. eds. P. Chen, W. Elsner, and A. Pyka (London: Routledge).
Erumeda, N. J., George, A. Z., and Jenkins, L. S. (2024). Evidence of learning in workplace-based assessments in a family medicine training programme. S. Afr. Fam. Pract. 66:a5850. doi: 10.4102/safp.v66i1.5850
Flodén, J. (2025). Grading exams using large language models: a comparison between human and AI grading of exams in higher education using ChatGPT. Br. Educ. Res. J. 51, 201–224. doi: 10.1002/berj.4069
Gallardo-Fuentes, F., Carter-Thuillier, B., Peña-Troncoso, S., Pérez-Norambuena, S., and Gallardo-Fuentes, J. (2025). Perceptions of learning assessment in practicum students vs. initial teacher education faculty in Chilean physical education: a comparative study of two cohorts. Educ. Sci. 15:459. doi: 10.3390/educsci15040459
George-Reyes, C. E., and Oliva-Córdova, L. M. (2025). Pensamiento complejo como habilitador del emprendimiento científico: autovaloración desde la educación superior en Guatemala. Pixel-Bit. Revista de Medios y Educación 73:5. doi: 10.12795/pixelbit.111533
Giannini, S. (2023). Foreword Guidance for generative AI in education and research. Paris: UNESCO, 5–6.
Gikandi, J. W. (2021). Enhancing e-learning through integration of online formative assessment and teaching presence. Int. J. Online Pedagogy Course Des. 11, 48–61. doi: 10.4018/IJOPCD.2021040104
Hapsari, S. (2016). A descriptive study of the critical thinking skills of social science at junior high school. J. Educ. Learn. 10, 228–234. doi: 10.11591/edulearn.v10i3.3791
Harada, A. S. (2020). Avaliação formativa: o portfólio como instrumento de avaliação para o desenvolvimento do aprendizado reflexivo. Revista Meta Avaliação 12, 826–826. doi: 10.22347/2175-2753v12i37.2880
Horn, A. (2023). Rethinking education in the age of artificial intelligence guidance for generative AI in education and research. Paris: UNESCO, 38–40.
Ifenthaler, D., and Schumacher, C. (2023). Reciprocal issues of artificial and human intelligence in education. J. Res. Technol. Educ. 55, 1–6. doi: 10.1080/15391523.2022.2154511
Inman, T. F., and Roberts, J. L. (2021). Authentic, formative, and informative. Routledge EBooks, 205–236. doi: 10.4324/9781003236696-13
Izvorska, D. (2016). A model for development of students’ professional competency in technical universities. Educ. Res. 45, 961–974. doi: 10.1109/ELMA.2019.8771682
Jaaron, A., and Backhouse, C. (2018). Operationalisation of service innovation: a systems thinking approach. Serv. Ind. J. 38, 561–583. doi: 10.1080/02642069.2017.1411480
Jauhiainen, J. S., and Garagorry Guerra, A. B. (2025). Educational evaluation with large language models (LLMs): chatGPT-4 in recalling and evaluating students’ written responses. J. Inf. Technol. Educ. Innov. Pract. 24:2. doi: 10.28945/5433
Koerber, S., Mayer, D., Osterhaus, C., Schwippert, K., and Beate, S. (2015). The development of scientific thinking in elementary school: a comprehensive inventory. Child Dev. 86, 327–336. doi: 10.1111/cdev.12298
Koo, T. K., and Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J. Chiropr. Med. 15, 155–163. doi: 10.1016/j.jcm.2016.02.012
Landis, J. R., and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics 33, 159–174.
Leal Uhlig, E. F., Garza León, C., Cruz Vargas, X., Hernández Franco, S., and Portuguez-Castro, M. (2023). Lëttëra web platform: a game-based learning approach with the use of technology for reading competence. Front. Educ. 8:1180283. doi: 10.3389/feduc.2023.1180283
Leighton, J. P., Cui, Y., and Cutumisu, M. (2021). Key information processes for thinking critically in data-rich environments. Front. Educ. 6, 1–15. doi: 10.3389/feduc.2021.561847
Lee, D., and Palmer, E. (2025). Prompt engineering in higher education: a systematic review to help inform curricula. Int. J. Educ. Technol. High. Educ. 22:7. doi: 10.1186/s41239-025-00503-7
Ma, T., Yuan, H., Yang, X., Li, Y., Yao, J., and Mu, D. (2023). Design of online formative assessment of nursing humanities curriculum during the COVID-19 pandemic: a teaching practice research. Nurse Educ. Today 128:105874. doi: 10.1016/j.nedt.2023.105874
Manning, J., Baldwin, J., and Powell, N. (2025). Human versus machine: the effectiveness of ChatGPT in automated essay scoring. Innov. Educ. Teach. Int. 62, 1500–1513. doi: 10.1080/14703297.2025.2469089
Martin, A. F., Tubaltseva, S., Harrison, A., and Rubin, G. J. (2025). Participatory co-design and evaluation of a novel approach to generative AI-integrated coursework assessment in higher education. Behav. Sci. 15:808. doi: 10.3390/bs15060808
Mudau, P. K. (2022). Lecturers’ views on the functionality of e-portfolio as alternative assessment in an open distance e-learning. Int. J. Educ. Methodol. 8, 81–90. doi: 10.12973/ijem.8.1.81
Nguyen Thanh, B., Vo, D. T. H., Nguyen Nhat, M., Pham, T. T. T., Thai Trung, H., and Ha Xuan, S. (2023). Race with the machines: assessing the capability of generative AI in solving authentic assessments. Australas. J. Educ. Technol. 39, 59–81. doi: 10.14742/ajet.8902
Perkins, M., and Roe, J. (2023). “The end of assessment as we know it: GenAI, inequality and the future of knowing” in AI and the future of education: Disruptions, dilemmas and directions (Paris: UNESCO), 76–80. doi: 10.54675/KECK1261
Portuguez-Castro, M., and Castillo Martínez, I. M. (2025). Leadership competencies for innovability: bridging theory and practice for sustainable development. J. Entrep. Manag. Innov. 21, 15–32. doi: 10.7341/20252122
Portuguez-Castro, M., and Ramírez-Montoya, M. S. (2025). Transformative economies and complex thinking: enhancing sustainability competencies in business education. Int. J. Manag. Educ. 23:101223. doi: 10.1016/j.ijme.2025.101223
Qosimova, Z. A. (2022). Traditional and innovative models of teaching. Galaxy int. interdiscip. res. J. 10, 248–260. Available online at: https://internationaljournals.co.in/index.php/giirj/article/view/2125
Ramírez-Montoya, M. S., and Portuguez-Castro, M. (2024). Expanding horizons for the future with an open educational model for complex thinking: external and internal validation. On the Horizon: The International Journal of Learning Futures 32, 32–48. doi: 10.1108/OTH-12-2023-0042
Salinas-Navarro, D. E., Vilalta-Perdomo, E., Michel-Villarreal, R., and Montesinos, L. (2024). Using generative artificial intelligence tools to explain and enhance experiential learning for authentic assessment. Educ. Sci. 14:83. doi: 10.3390/educsci14010083
Sokhanvar, Z., Salehi, K., and Sokhanvar, F. (2021). Advantages of authentic assessment for improving the learning experience and employability skills of higher education students: a systematic literature review. Stud. Educ. Eval. 70:101030. doi: 10.1016/j.stueduc.2021.101030
Spencer, L. M., and Spencer, S. M. (1993). Competence at work: Models for superior performance. New York, NY: John Wiley & Sons.
Šteh, B., and Šarić, M. (2020). Implementation of formative assessment in higher education. Horiz. Psychol. 29, 79–86. doi: 10.20419/2020.29.515
Suárez-Brito, P., Elizondo-Noriega, A., Lis-Gutiérrez, J. P., Henao-Rodríguez, C., Forte-Celaya, M. R., and Vázquez-Parra, J. C. (2025). Differential impact of gender and academic background on complex thinking development in engineering students: a machine learning perspective. On Horiz. 33, 14–31. doi: 10.1108/OTH-11-2023-0036
Suárez-Brito, P., Vázquez-Parra, J. C., López-Caudana, E. O., and Buenestado-Fernández, M. (2024). Examining the level of perceived achievement of complex thinking competency in health sciences students and its relevance to the graduate profile. Int. J. Educ. Res. Open 6, 1–8. doi: 10.1016/j.ijedro.2023.100314
Sultana, F., Lim, C. P., and Liang, M. (2020). E-portfolios and the development of students’ reflective thinking at a Hong Kong university. J. Comput. Educ. 7, 277–294. doi: 10.1007/s40692-020-00157-6
Suryansyah, A., Kastolani, W., and Somantri, L. (2021). Scientific thinking skills in solving global warming problems. IOP Conference Series: Earth and Environmental Science 683:012025. doi: 10.1088/1755-1315/683/1/012025
Tobón, S. (2013). Formación integral y competencias: Pensamiento complejo, currículo, didáctica y evaluación. 4th Edn. Bogotá: Ecoe Ediciones.
Trejo González, H. (2024). Avaliação autêntica em contextos universitários através de portefólios eletrónicos para aprendizagem. Rev. Port. Educ. 37:e24028. doi: 10.21814/rpe.29644
Trevitt, C., and Stocks, C. (2011). Signifying authenticity in academic practice: a framework for better understanding and harnessing portfolio assessment. Assess. Eval. High. Educ. 37, 245–257. doi: 10.1080/02602938.2010.527916
Vanderbilt University (2024). Prompt patterns for generative AI. Vanderbilt University Center for generative AI. Available online at: https://www.vanderbilt.edu/generative-ai/prompt-patterns/ (Accessed January 12, 2026).
Vanderbilt University Libraries (2024). AI for business: prompting and responsible AI use. Vanderbilt University. Available online at: https://researchguides.library.vanderbilt.edu/AI4Biz/Prompting (Accessed January 12, 2026).
Vázquez-Parra, J. C., Henao-Rodríguez, L. C., Lis-Gutiérrez, J. P., Castillo-Martínez, I. M., and Suárez-Brito, P. (2024a). Ecomplexity: validation of a complex thinking instrument from a structural equation model. Front. Educ. 9, 1–15. doi: 10.3389/feduc.2024.1334834
Vázquez-Parra, J. C., Tariq, R., Castillo-Martínez, I. M., and Naseer, F. (2024b). Perceived competency in complex thinking skills among university community members in Pakistan: insights across disciplines. Cogent Educ. 12, 1–15. doi: 10.1080/2331186X.2024.2445366
Villarroel, V., Bloxham, S., Bruna, D., Bruna, C., and Herrera-Seda, C. (2017). Authentic assessment: creating a blueprint for course design. Assess. Eval. High. Educ. 43, 840–854. doi: 10.1080/02602938.2017.1412396
Wake, S., Pownall, M., Harris, R., and Birtill, P. (2023). Balancing pedagogical innovation with psychological safety? Student perceptions of authentic assessment. Assess. Eval. High. Educ. 49, 511–522. doi: 10.1080/02602938.2023.2275519
White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., et al. (2023). A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv, 1–19. doi: 10.48550/arXiv.2302.11382
Wylie, E. C., and Lyon, C. J. (2015). The fidelity of formative assessment implementation: issues of breadth and quality. Assess. Educ. 22, 140–160. doi: 10.1080/0969594X.2014.990416
Yan, Q. (2024). Exploring Chinese university EFL students’ perceptions of formative assessment: a qualitative study. System 125:103391. doi: 10.1016/j.system.2024.103391
Zewe, A. (2023). Explained: generative AI. MIT News. Available online at: https://news.mit.edu/2023/explained-generative-ai-1109 (Accessed January 12, 2026).
Keywords: authentic assessment, business education, complex thinking, educational innovation, formative evaluation, generative AI, higher education, Ibero-America
Citation: Portuguez-Castro M and Castillo-Martínez IM (2026) GenAI-supported portfolio assessment for complex thinking: a GPT-based innovation in business education. Front. Educ. 11:1729156. doi: 10.3389/feduc.2026.1729156
Edited by:
Yan Liu, Carleton University, Canada
Reviewed by:
Vanessa Scherman, International Baccalaureate (IBO), Netherlands
Selcuk Kilinc, Texas A&M University, United States
Copyright © 2026 Portuguez-Castro and Castillo-Martínez. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: May Portuguez-Castro, may.portuguez@pucp.edu.pe; Isolda Margarita Castillo-Martínez, isolda.castillo@tec.mx