
ORIGINAL RESEARCH article

Front. Physiol., 05 January 2026

Sec. Computational Physiology and Medicine

Volume 16 - 2025 | https://doi.org/10.3389/fphys.2025.1678364

Profile-aided distillation framework for personalized sleep analysis with compact models using LLM-guided synthetic data

  • 1School of Electronics and Information, South China University of Technology, Guangzhou, China
  • 2The Nursing College, Jinan University, Guangzhou, China
  • 3Department of Endocrinology and Metabolism, The First Affiliated Hospital of Jinan University, Guangzhou, China

Introduction: Enabling personalized sleep analysis and interaction directly on edge devices is crucial for providing real-time health insights and tailored guidance. However, this goal remains challenging due to the scarcity of high-quality physiological data and the computational constraints of edge hardware.

Methods: We propose a framework for personalized sleep analysis on edge devices that addresses two key obstacles: limited publicly available physiological datasets and the restricted capacity of compact models. To mitigate data scarcity, we introduce a Physiologically-Constrained Adaptive Hierarchical Copula approach, which leverages large language model–guided optimization to synthesize diverse and realistic physiological signals. To enhance personalized inference on resource-limited models, we further develop Profile-Aided Distillation of Expert Inference with MoE LoRA, which integrates user-specific profile information to improve the performance of edge-deployed models.

Results: Extensive experiments on both public and in-house datasets show that the distilled models achieve performance comparable to state-of-the-art large language models, while operating efficiently within the computational and memory constraints of edge devices.

Discussion: These results demonstrate that the proposed framework offers a practical and effective solution for enabling personalized sleep analysis and user interaction in resource-constrained environments, bridging the gap between high-performance modeling and real-time, on-device healthcare applications.

1 Introduction

Personalized sleep analysis is increasingly recognized as a cornerstone of modern health management, offering the potential to deliver tailored insights and actionable recommendations to improve sleep quality and overall wellbeing (Brusie, 2025; Zhang et al., 2024). The proliferation of wearable devices and mobile health applications facilitates the convenient, continuous collection of physiological signals (e.g., electrocardiograms) across diverse, real-world settings. From these signals, valuable parameters such as heart rate variability (HRV) can be derived, providing a rich foundation for individualized analysis (Garbarino and Bragazzi, 2024; Secara and Hordiiuk, 2024). However, the scarcity and heterogeneity of publicly available high-quality sleep datasets remain a fundamental barrier to robust model development and generalizable personalized analysis.

Recent advances in large language models (LLMs) have further amplified the potential of AI-driven health analytics, owing to their remarkable knowledge and inference capabilities (Cosentino et al., 2024; Merrill et al., 2024). LLMs have demonstrated success in clinical decision support, medical record analysis, and patient engagement (Wang and Zhang, 2024). They are also adept at integrating heterogeneous data sources, including physiological, behavioral, and environmental signals, to provide holistic, context-aware insights (Guo et al., 2024). Notably, LLMs can generate highly personalized and nuanced responses to user-specific queries, capturing the intricacies of individual needs. Yet, these models are computationally intensive, making them impractical for direct deployment on resource-constrained edge devices such as wearables and smartphones (Nazi and Peng, 2024; Kim et al., 2024). Although edge device computational capabilities are advancing, allowing some flagship hardware to support models up to approximately 3B parameters (Gunter et al., 2024), our investigation and recent surveys (She et al., 2025; Zheng et al., 2025) indicate that a 0.5 billion-parameter size is a more universally applicable target for widespread edge deployment. However, as shown in Figure 1, this desirable compactness comes with a significant performance trade-off: standard 0.5 billion-parameter models, even after LoRA fine-tuning, exhibit substantial performance gaps compared to their larger counterparts (e.g., >1.5B) in complex personalized sleep analysis tasks. These smaller models tend to produce generic, factually correct but non-specific answers, lacking the depth of personalization required for individualized care (Kim et al., 2024; Guo et al., 2024). This performance deficit in the 0.5 billion-parameter range underscores the central challenge addressed by this work: the need for advanced techniques to imbue compact, broadly deployable models with the sophisticated, personalized inference capabilities typically found only in much larger LLMs.


Figure 1. Performance comparison of language models across different parameter sizes for sleep health applications. Models with 0.5B, 1.5B, and 7B parameters were fine-tuned using LoRA, while Qwen-max represents a non-fine-tuned larger model. Evaluation was conducted using the LLM-as-a-Judge framework (Claude Sonnet 3.7) across three dimensions: report generation, personalized Q&A, and knowledge Q&A. The significant performance gap between the 0.5 billion-parameter model and larger models highlights the challenge addressed in this work: enabling efficient small models (0.5B) to perform competitively with larger models while remaining deployable on edge devices.

To address the dual challenges of data scarcity and the performance gap of small models, this study proposes a novel framework for efficient, real-time personalized sleep analysis on edge devices. Our approach first employs a physiologically-constrained adaptive hierarchical copula with LLM-guided optimization (PC-AHC-LLM) to synthesize diverse and realistic sleep data, ensuring that downstream models are trained on data that preserve both statistical and clinical validity. Building on this foundation, we introduce a Profile-Aided Distillation of Expert Inference framework (PADEI) that leverages profile-aided Chain-of-Thought (CoT) prompting and Mixture-of-Experts (MoE) with LoRA adapters to transfer complex inference patterns from large teacher models to a compact 0.5B model. The proposed framework is designed to support three core tasks essential for real-world sleep health applications: 1) Sleep Report Generation: Automatically generating standardized sleep reports from physiological signals, including key sleep parameters, descriptive summaries, and personalized recommendations. 2) Personalized Q&A: Providing user-specific answers to queries grounded in individual sleep data and reports, enabling actionable and tailored health guidance. 3) Knowledge Q&A: Delivering accurate and comprehensive responses to general sleep-related knowledge questions, independent of the user’s personal data.

By integrating advanced data synthesis and personalized inference, this work contributes to the broader goal of making personalized sleep health analysis accessible, secure, and efficient in everyday settings. The main contributions of this study are summarized as follows.

• Physiologically-Constrained Adaptive Hierarchical Copula for Data Synthesis: To address the scarcity of publicly available physiological data, we propose a novel data synthesis method based on a physiologically-constrained adaptive hierarchical copula. This approach leverages LLMs for optimization, enabling the generation of diverse and realistic physiological data that preserves underlying statistical and physiological properties.

• Profile-Aided Distillation of Expert Inference: To enhance the performance of small models on edge devices, we introduce a profile-aided distillation framework that integrates user-specific information to enable efficient and personalized inference, overcoming the limitations of standard LoRA fine-tuning.

• Experimental Results: Extensive experiments on both public and in-house datasets demonstrate that the proposed framework achieves performance comparable to SOTA LLMs while running efficiently on resource-constrained edge devices. The results validate the effectiveness of the proposed methods in enabling personalized sleep analysis and interaction.

2 Related work

2.1 LLM for health

LLMs have demonstrated significant potential in healthcare, particularly in the analysis of physiological signals to provide personalized health insights. Physiological data, including ECG, photoplethysmograms (PPG), and respiratory waveforms, are critical for understanding individual health conditions. Recent advancements have explored the integration of LLMs with these data to enhance health analysis and decision-making processes. For instance, MedTsLLM (Chan et al., 2024) introduces a multimodal framework that integrates time-series data and textual context to perform tasks such as semantic segmentation, boundary detection, and anomaly detection in physiological signals. Similarly, PhysioLLM (Fang et al., 2024) combines wearable sensor data with contextual information to generate personalized health insights, enabling users to explore correlations within their physiological data and receive actionable recommendations. Health-LLM (Kim et al., 2024) further demonstrates the utility of LLMs in interpreting physiological signals, such as resting heart rate and sleep metrics, to provide context-aware health predictions.

Despite these advancements, existing methodologies face several limitations. First, the substantial computational demands of large models, such as GPT-4o and Qwen-max (Chen et al., 2025), render them unsuitable for deployment on resource-constrained edge devices. Second, LLMs are not inherently designed for numerical inference, which limits their capacity to directly process continuous physiological signals, necessitating feature extraction or multimodal approaches (Chan et al., 2024). Finally, the scarcity of publicly available and diverse physiological datasets hinders the development and validation of robust models (Fang et al., 2024). To address these challenges, our work integrates LLMs with synthesized physiological data, enabling real-time, personalized health analysis on edge devices while mitigating computational and dataset limitations.

2.2 Data synthesis

The limited availability of large-scale wearable datasets in the real world remains a significant challenge in personalized health applications, which hinders effective generalization (Merrill et al., 2024). Generative models, including Variational Autoencoders (VAE) (Kingma, 2013), Generative Adversarial Networks (GAN) (Goodfellow et al., 2014), Normalizing Flows (Rezende and Mohamed, 2015), and Diffusion Models (Rombach et al., 2022), have emerged as effective tools to address this limitation. By generating synthetic data that approximates the original data distribution, these models augment training datasets, thereby enhancing model performance, particularly when real data is scarce.

Among these methodologies, copula-based models have garnered attention for their capability to capture intricate dependencies between variables while preserving marginal distributions (Kamthe et al., 2021). Building upon this foundation, Hierarchical Copulas extend the traditional copula framework by introducing a multi-level structure that models dependencies at varying granularities. This hierarchical approach is particularly advantageous for the synthesis of physiological data, as it can capture both global trends and local variations within the data. Compared to GANs, which often require extensive tuning and are susceptible to mode collapse, Hierarchical Copulas offer a more interpretable and robust alternative for generating high-fidelity synthetic data (Kamthe et al., 2021). Using this methodology, our work addresses the scarcity of publicly available physiological datasets, enabling the development of more accurate and generalizable machine learning models for personalized health applications. Furthermore, these synthesized data serve as a foundation for the deployment of efficient LLMs in resource-constrained environments.

2.3 Model distillation and personalized inference

Owing to resource constraints and real-time requirements (Huang et al., 2024), numerous studies have focused on distilling LLMs to transfer knowledge (McDonald et al., 2024), inference capabilities (Li Y. et al., 2024; Kang et al., 2024), and domain expertise (Yuan et al., 2024) into smaller, more efficient models. Model distillation techniques, such as LoRA (Hu et al., 2022) and MoE (Yang et al., 2024), have emerged as promising solutions for edge computing scenarios, where computational resources are limited.

Methodologies such as SOCRATIC CoT (Shridhar et al., 2022) and KARD (Kang et al., 2024) have demonstrated significant potential in numerical inference and factual judgment. However, they often lack the nuanced inference necessary for personalized health analysis, such as interpreting subtle physiological variations or delivering context-aware recommendations. While CoT techniques (Wei et al., 2022; Kojima et al., 2022; Fu et al., 2023) excel in step-by-step inference, they fall short in addressing the complexities inherent in personalized health scenarios, where inference must adapt to individual profiles and dynamic contexts.

Recent advancements, such as MixLoRA (Li D. et al., 2024) and MoRAL (Yang et al., 2024), enable parameter-efficient multi-task fine-tuning, rendering them suitable for deployment on edge devices. However, these methodologies face challenges such as overfitting and the balance between specialization and generalization. For instance, MoDE-CoTD (Li X. et al., 2024) introduces a novel approach by decoupling inference abilities into multiple LoRA-Experts, which are then combined to handle both seen and unseen inference tasks. Despite these innovations, achieving robust generalization while maintaining computational efficiency remains an unresolved challenge. Our work builds upon these techniques by combining LoRA and MoE to achieve efficient, personalized inference on edge devices, addressing the dual challenges of overfitting and dynamic adaptation to individual health profiles.

A concise summary of related work is illustrated in Table 1, highlighting the strengths and limitations of existing methodologies in the context of personalized inference and edge computing.


Table 1. Comparison of different models and their characteristics.

3 Methodology

3.1 Method overview

The overall system architecture is illustrated in Figure 2. Raw physiological data are first processed to extract HRV features and other relevant parameters, which are then used by the PC-AHC-LLM module to generate synthetic data, addressing data scarcity and enhancing diversity while maintaining physiological plausibility. Both synthetic and real data are subsequently utilized by the large language model (LLM) to generate data for three types of downstream tasks: sleep reports, personalized questions, and knowledge-based questions.


Figure 2. Overview of the proposed framework. Raw physiological data are first processed to extract HRV features and other relevant parameters, which are then used by the PC-AHC-LLM module to generate synthetic data, addressing data scarcity and enhancing diversity. Both synthetic and real data are utilized by the LLM to generate sleep reports, personalized questions, and knowledge questions. These downstream tasks are managed by the Profile-Aided Distillation of Expert Inference (PADEI) module. Specifically, the PA-CoT (Profile-Aided Chain-of-Thought) mechanism is first loaded to incorporate user profiles (Profile(p)), which guide the routing process within the student model. The teacher model provides supervision to train the student model through knowledge distillation. Once trained, the student model is compiled and deployed on edge devices (e.g., RK3588) for efficient on-device inference, enabling personalized sleep analysis and user interaction with limited computational resources.

These tasks are managed by the Profile-Aided Distillation of Expert Inference (PADEI) module. Specifically, the PA-CoT (Profile-Aided Chain-of-Thought) mechanism first loads user profiles Profile(p) to guide the routing process within the student model, enabling task-specific and user-adaptive inference. The teacher model supervises the training of the student model through knowledge distillation, transferring its reasoning capabilities while maintaining computational efficiency. Once trained, the student model is compiled and deployed on edge devices (e.g., RK3588) for on-device inference, enabling real-time, personalized sleep analysis and user interaction with limited computational resources. This holistic design broadens the applicability of user-centric health solutions in resource-constrained environments.

3.2 Physiologically-constrained adaptive hierarchical copula with LLM-guided optimization (PC-AHC-LLM)

The Physiologically-Constrained Adaptive Hierarchical Copula with LLM Guidance framework, illustrated in Algorithm 1, generates synthetic physiological sleep data by synergistically combining hierarchical copula modeling with LLM-guided optimization. Specifically, LLMs are employed to derive and embed physiological constraints during the modeling process. This ensures the resulting synthetic data exhibits realistic patterns suitable for downstream sleep health analysis applications.


Algorithm 1.

To provide a clear and concrete illustration of the PC-AHC-LLM workflow, we focus on a representative set of seven physiological parameters: the four HRV metrics SDNN, RMSSD, LF/HF, and PNN50, as well as total sleep duration, deep sleep duration, and light sleep duration. These parameters are widely recognized as important indicators of autonomic nervous system activity and sleep architecture, and are therefore selected to exemplify the algorithmic steps in the following subsections. The overall workflow consists of three main stages: (1) physiological constraint extraction, (2) LLM-guided copula optimization, and (3) physiologically-guided sampling and synthesis. The main notations used throughout this section are summarized in Table 2.


Table 2. Summary of notation.

3.2.1 Physiological constraint extraction

We leverage the LLM to systematically extract physiological constraint information and domain knowledge, as shown in Equation 1. Formally, given a structured query template $Q_{\mathrm{constraints}}$, the LLM extracts a comprehensive set of constraints $C$:

$$C = \mathrm{LLM}_{\mathrm{extractor}}\left(Q_{\mathrm{constraints}}\right). \tag{1}$$

The extracted constraints are summarized as follows.

• Sleep Architecture Constraints (Equations 2, 3):

$$\text{Deep Sleep Duration} + \text{Light Sleep Duration} \leq \text{Total Sleep Duration}, \tag{2}$$
$$\text{Total Sleep Duration} \in [3, 12]\ \text{hours}, \qquad \frac{\text{Deep Sleep Duration}}{\text{Total Sleep Duration}} \in [0.1, 0.35]. \tag{3}$$

• Heart Rate Variability (HRV) Constraints (Equations 4, 5):

$$\text{SDNN} \in [15, 220]\ \text{ms}, \qquad \text{RMSSD} \in [10, 180]\ \text{ms}, \tag{4}$$
$$\text{PNN50} \in [0, 70]\,\%, \qquad \text{LF/HF} \in [0.1, 7]. \tag{5}$$

• Clinical Thresholds (Equations 6, 7):

$$\text{SDNN} < 30\ \text{ms} \Rightarrow \text{Severe autonomic dysfunction}, \tag{6}$$
$$\text{LF/HF} > 4 \Rightarrow \text{Sympathetic dominance}. \tag{7}$$

Compared to traditional expert-defined rules, which are often manually curated and may be limited in scope or require frequent updates, the LLM-driven extraction process enables systematic and scalable identification of physiological constraints. This approach facilitates rapid adaptation to new datasets and tasks, and helps ensure that the synthesized data remains consistent with up-to-date clinical understanding.
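To make these constraints concrete, the following is a minimal sketch that encodes Equations 2–7 as a post-hoc validity check for synthetic samples; the dictionary field names are illustrative assumptions, not the paper's actual data schema.

```python
# Minimal sketch: Equations 2-7 as a validity check over one synthetic sample.
# Field names (total_sleep_h, sdnn_ms, ...) are hypothetical.

def satisfies_constraints(x: dict) -> bool:
    """Return True if a synthetic sample respects the extracted constraints."""
    total, deep, light = x["total_sleep_h"], x["deep_sleep_h"], x["light_sleep_h"]
    # Sleep architecture constraints (Eqs. 2-3)
    if deep + light > total:
        return False
    if not (3.0 <= total <= 12.0):
        return False
    if not (0.10 <= deep / total <= 0.35):
        return False
    # HRV range constraints (Eqs. 4-5)
    if not (15 <= x["sdnn_ms"] <= 220):
        return False
    if not (10 <= x["rmssd_ms"] <= 180):
        return False
    if not (0 <= x["pnn50_pct"] <= 70):
        return False
    if not (0.1 <= x["lf_hf"] <= 7.0):
        return False
    return True

def clinical_flags(x: dict) -> list[str]:
    """Annotate a sample with the clinical thresholds of Eqs. 6-7."""
    flags = []
    if x["sdnn_ms"] < 30:
        flags.append("severe autonomic dysfunction")
    if x["lf_hf"] > 4:
        flags.append("sympathetic dominance")
    return flags
```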

3.2.2 LLM-guided copula optimization

Based on the extracted constraints, the LLM optimizes the copula modeling process through the following key steps.

• Optimal Variable Transformations: The LLM designs constraint-preserving transformations to standardize physiological parameters into latent Gaussian variables. In this study, sleep duration and HRV parameters are transformed via log-normalization and bounded scaling. The transformed data is denoted as $Z = T(D)$.

• Physiological Subsystem Grouping: The LLM analyzes inter-variable correlations and physiological pathways to identify meaningful subsystems, denoted as $G = \{\mathrm{Sub}_1, \mathrm{Sub}_2, \ldots, \mathrm{Sub}_m\}$. In this study, two primary subsystems are identified:

• Sleep Architecture Subsystem ($\mathrm{Sub}_1$): Total sleep duration, deep sleep duration, and light sleep duration.

• Autonomic Regulation Subsystem ($\mathrm{Sub}_2$): SDNN, RMSSD, LF/HF, and PNN50.

• Copula Family Selection: For each subsystem $\mathrm{Sub}_i$, the LLM selects the optimal copula family $C_{\mathrm{Sub}_i}$ based on dependence metrics (e.g., upper/lower tail dependence) and physiological inference. For instance, the Gumbel copula is chosen for $\mathrm{Sub}_1$ to capture upper tail dependence, while the Clayton copula is used for $\mathrm{Sub}_2$ to model lower tail dependence.

• Vine Copula Structure Construction: The LLM constructs a vine copula structure $V$ to model cross-subsystem dependencies, explicitly identifying critical conditional dependencies (e.g., RMSSD conditioned on deep sleep ratio) with physiologically grounded justifications.

3.2.3 Physiologically-guided sampling and synthesis

• Physiologically-Guided Sampling Strategy: The sampling strategy $S$ is defined to emphasize clinically significant parameter regions, such as short sleep duration, low SDNN, high deep sleep ratio, and elevated LF/HF ratio. The LLM assigns importance weights to these regions based on their clinical significance, ensuring that the sampling process prioritizes physiologically meaningful ranges.

• Synthetic Sample Generation: Synthetic samples are generated through a three-step process, as described in Equations 8–10. First, correlated uniform variables are sampled from the hierarchical copula structure (Equation 8). These samples are then transformed into standardized latent variables via the inverse Gaussian CDF (Equation 9). Finally, the latent variables are mapped back to the original parameter space through inverse transformations (Equation 10):

$$U \sim V\left(C_{S_1}, \ldots, C_{S_m}\right), \tag{8}$$
$$Z_{\mathrm{syn}} = \Phi^{-1}(U), \tag{9}$$
$$X_{\mathrm{syn}} = T^{-1}\left(Z_{\mathrm{syn}}\right). \tag{10}$$

This process ensures that the generated data not only preserves statistical dependencies but also aligns with real-world clinical patterns. By integrating the mathematical rigor of hierarchical copula modeling with domain-specific knowledge extracted by the LLM, the proposed method generates synthetic sleep data that is both statistically robust and clinically meaningful. Embedding physiological constraints directly into the sampling and transformation processes ensures that the resulting dataset is suitable for downstream machine learning and clinical analysis.
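As a worked illustration of Equations 8–10, the sketch below uses a Gaussian copula as a stand-in for the hierarchical vine structure $V$ and a log-normalization as the transformation $T$. Both simplifications are assumptions: the paper's Gumbel and Clayton families would require a dedicated vine-copula library (e.g., pyvinecopulib).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Step 1 (Eq. 8): sample correlated uniforms U from the copula. A Gaussian
# copula stands in here for the hierarchical vine structure V.
corr = np.array([[1.0, 0.6],
                 [0.6, 1.0]])                 # illustrative dependence, 2 variables
z = rng.multivariate_normal(np.zeros(2), corr, size=1000)
u = norm.cdf(z)

# Step 2 (Eq. 9): map uniforms to standardized latent Gaussians Z_syn via
# the inverse Gaussian CDF.
z_syn = norm.ppf(u)

# Step 3 (Eq. 10): apply the inverse transformation T^{-1} back to the
# original parameter space; exp inverts the assumed log-normalization.
mu, sigma = np.log([50.0, 40.0]), np.array([0.30, 0.35])  # e.g., SDNN, RMSSD scales
x_syn = np.exp(mu + sigma * z_syn)
```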

3.3 Task data generation

The pipeline for data synthesis is illustrated in Figure 3. Sample data are processed through the PC-AHC-LLM framework to generate synthetic physiological parameters. Based on these synthetic data, an LLM generates comprehensive sleep reports encompassing sleep-related parameters, descriptions of sleep states, and personalized suggestions. These sleep reports then serve as the basis for generating three personalized questions and three domain-specific knowledge questions per report, culminating in a comprehensive and realistic dataset for downstream model training and evaluation. Concretely, GPT-4o is first employed to generate the sleep report dataset, which includes structured sleep-related parameters, descriptive summaries, and personalized suggestions tailored to individual profiles. GPT-4o is then used to generate the questions dataset, comprising both personalized questions (grounded in user-specific sleep data) and domain-specific knowledge questions (covering general sleep science and best practices).


Figure 3. The pipeline for data synthesis and report/question generation. Sample data is processed through the PC-AHC-LLM framework to generate synthetic parameters. Based on the synthetic data, an LLM generates sleep reports containing sleep-related parameters, descriptions of sleep states, and personalized suggestions. Subsequently, the sleep reports are used to generate three personalized questions and three domain-specific knowledge questions for each report, creating a comprehensive dataset for downstream applications.

For each dataset, we performed data synthesis, yielding a total of 5,000 synthetic data samples. This sample size was strategically chosen to ensure the creation of a comprehensive and diverse dataset, which is crucial for the robust and thorough evaluation of our model’s performance across a wide spectrum of sleep profiles. While this scale significantly increases the computational demands for data synthesis, sleep report generation, and Q&A, we deemed it a necessary investment to rigorously assess the model’s capabilities and generalizability. Subsequently, GPT-4o was employed to generate sleep reports based on these synthesized data. These reports included sleep-related parameters; descriptions of sleep states, covering users’ cardiac health, stress resilience, and other related conditions; and personalized suggestions, providing tailored guidance specific to the evaluated sleep profiles. The prompt utilized for generating the sleep reports is shown in Table 3.


Table 3. Prompt 1: Generating sleep reports.

Subsequently, based on the generated sleep reports, we utilized Prompt 2 (shown in Table 4) to produce personalized questions and domain-specific knowledge questions, further enriching the dataset for downstream applications.


Table 4. Prompt 2: Generating personalized and knowledge-based questions.
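For illustration, the two generation stages can be driven by a short script. The sketch below assumes the OpenAI Python client; PROMPT_REPORT and PROMPT_QUESTIONS are placeholders standing in for the actual prompts in Tables 3, 4.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder prompt strings; the real prompts are given in Tables 3 and 4.
PROMPT_REPORT = "Generate a structured sleep report for these parameters: {params}"
PROMPT_QUESTIONS = ("From this sleep report, write 3 personalized questions and "
                    "3 domain-specific knowledge questions:\n{report}")

def generate_report(params: dict) -> str:
    """Stage 1: synthetic HRV/sleep parameters -> structured sleep report."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT_REPORT.format(params=params)}],
    )
    return resp.choices[0].message.content

def generate_questions(report: str) -> str:
    """Stage 2: sleep report -> personalized + knowledge questions."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT_QUESTIONS.format(report=report)}],
    )
    return resp.choices[0].message.content
```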

3.4 Profile-aided distillation of expert inference

In this section, we introduce the Profile-Aided Distillation of Expert Inference (PADEI) framework, illustrated in Figure 4. PADEI leverages profile-aided Chain-of-Thought prompting and LoRA-based Mixture-of-Experts distillation to enhance the performance of a compact language model across three core tasks: sleep report generation, personalized Q&A, and knowledge Q&A.


Figure 4. Architecture of the proposed PADEI framework. Input tasks (Report Generation, Personalized Q&A, Knowledge Q&A) are processed using profile-aided CoT prompting. A profile-aided Router dynamically activates a subset of six specialized LoRA adapters within the MoE layer. The outputs of the activated adapters are merged and integrated with the base LLM’s layers to generate the final task-specific responses. The output is supervised by a combination of cross-entropy loss, profile alignment loss, auxiliary load balancing loss, and activation frequency regularization loss to ensure both task performance and balanced expert utilization.

3.4.1 Profile-aided chain-of-thought (PA-CoT)

Through data collection and preliminary experiments, we observed that the diversity of personalized questions across different user groups poses significant challenges for using a single fixed CoT to guide a compact model in answering questions effectively. To address this, we propose the Profile-Aided Chain-of-Thought (PA-CoT), which clusters users based on their profiles and dynamically adapts inference paths to each cluster.

3.4.1.1 Clustering based on user profiles

Users are grouped according to their health-related parameters, which effectively distinguish between different health states. The user profile vector is defined in Equation 11 as:

$$\mathrm{Profile}(p) = \left[ s_{\mathrm{SDNN}},\, s_{\mathrm{RMSSD}},\, s_{\mathrm{PNN50}},\, s_{\mathrm{LF/HF}} \right], \tag{11}$$

where $s_{\mathrm{SDNN}}$, $s_{\mathrm{RMSSD}}$, $s_{\mathrm{PNN50}}$, and $s_{\mathrm{LF/HF}}$ represent key physiological metrics derived from heart rate variability (HRV) analysis. These parameters were selected based on established literature in sleep medicine, as they collectively capture autonomic nervous system dynamics and cardiac health during sleep (Shaffer and Ginsberg, 2017). For instance, $s_{\mathrm{SDNN}}$ reflects overall variability, $s_{\mathrm{RMSSD}}$ indicates parasympathetic activity, $s_{\mathrm{PNN50}}$ measures beat-to-beat changes, and $s_{\mathrm{LF/HF}}$ approximates sympathovagal balance, providing a compact feature set for clustering without requiring high-dimensional inputs. A minimal sketch of this clustering step is provided after the cluster descriptions below.

For each cluster, a brief guide is introduced to contextualize the inference process. For example:

• Cluster 1: High Variability, Low Stress. Users in this group exhibit high HRV metrics (e.g., elevated $s_{\mathrm{PNN50}}$ and $s_{\mathrm{RMSSD}}$), indicating good cardiac health and low stress levels. The inference process emphasizes the maintenance of current health habits and the resolution of minor concerns.

• Cluster 2: Low Variability, High Stress. Users in this group show reduced HRV metrics (e.g., low $s_{\mathrm{SDNN}}$ and $s_{\mathrm{PNN50}}$), suggesting potential stress or autonomic imbalance. The inference process prioritizes stress management and lifestyle adjustments.

• Cluster 3: Moderate Variability, Moderate Stress. Users in this cluster have moderate HRV metrics but exhibit patterns that may correlate with sleep irregularities (for example, variable $s_{\mathrm{LF/HF}}$ ratios > 3.0, potentially reflecting sympathovagal fluctuations rather than a definitive sign of sleep disorders like apnea or insomnia) (Stein and Pu, 2012). The inference process focuses on exploring sleep-related issues through targeted questions and recommendations, without implying clinical diagnosis.
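As referenced above, the following is a minimal sketch of the clustering step over Equation 11's feature vectors, assuming standardized features and k-means with three clusters; the paper fixes three clusters but does not specify the algorithm here, and the mapping from cluster indices to the semantic labels above is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-user profile vectors [SDNN, RMSSD, PNN50, LF/HF] (Eq. 11).
profiles = np.array([
    [48.0, 42.0, 18.0, 1.2],
    [22.0, 15.0,  4.0, 4.5],
    [35.0, 28.0, 10.0, 3.2],
    [52.0, 45.0, 20.0, 1.0],
    [20.0, 14.0,  3.0, 4.8],
    [33.0, 26.0,  9.0, 3.4],
])

# Standardize HRV features, then fit k=3 clusters (an assumed choice).
scaled = StandardScaler().fit_transform(profiles)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

# Illustrative cluster-index -> preamble mapping; in practice the semantic
# label of each index must be assigned by inspecting the fitted centroids.
PREAMBLES = {
    0: "High variability, low stress: emphasize maintaining current habits.",
    1: "Low variability, high stress: prioritize stress management.",
    2: "Moderate variability: probe sleep irregularities, no clinical diagnosis.",
}
cluster_preambles = [PREAMBLES[l] for l in labels]
```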

3.4.1.2 Combining clusters with CoT inference

Once users are assigned to a cluster, the inference process integrates the cluster-specific preamble with task-specific CoT templates. Formally, the inference chain is defined in Equation 12 as:

$$T(p) = \mathrm{ClusterPreamble}(p) + \mathbb{I}_{i \neq 1} \sum_{i=1}^{3} \left( S_i + \mathrm{CoT}_s^{(i)} \right) + \mathbb{I}_{i=1}\, \mathrm{CoT}_1, \tag{12}$$

where $T(p)$ represents the personalized inference chain for a user with profile $p$, and $\mathrm{ClusterPreamble}(p)$ provides the cluster-specific context, as illustrated in Figure 5. Here $S_i$ denotes the classified question type, $\mathrm{CoT}_s^{(i)}$ the corresponding task-specific inference template, and the indicator terms $\mathbb{I}$ select the report-generation template $\mathrm{CoT}_1$ for Task 1 and the question-type templates otherwise. Through expert knowledge and a detailed analysis of the inference processes employed by SOTA models across diverse sleep analysis scenarios, we formalized templates that retain essential inference structures.
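The following is a minimal sketch of how such a chain might be assembled in code, under the assumption that the formalized templates are plain strings; the template contents and the question-type classifier are placeholders.

```python
# Placeholder templates; the actual formalized CoT templates are not
# reproduced in the paper body.
COT_1 = "CoT_1: produce the standardized report sections step by step."
COT_S = {
    1: "CoT_s1: locate the answer in the user's report, then explain it.",
    2: "CoT_s2: summarize across report sections before answering.",
    3: "CoT_s3: flag unaddressed areas and answer from general knowledge.",
}

def build_chain(task_index: int, preamble: str, question_type: int = 1) -> str:
    """Assemble T(p): cluster preamble plus the selected CoT template (Eq. 12)."""
    if task_index == 1:                    # Task 1 (report generation) uses CoT_1
        return preamble + "\n" + COT_1
    # Tasks 2-3: the classified question type S_i selects CoT_s^(i)
    return preamble + f"\nS{question_type}:\n" + COT_S[question_type]

chain = build_chain(2, "Low variability, high stress: prioritize stress management.", 2)
```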


Figure 5. Profile-Aided CoT prompting strategy for the three task types. For Task 1 (Sleep Report Generation), a standardized prompt ($\mathrm{CoT}_g$) is employed to ensure a consistent output format. For Tasks 2 (Personalized Q&A) and 3 (Knowledge Q&A), a shared prompting logic is utilized: user profiles, derived from HRV parameters, are first clustered to establish a context-specific focus. Subsequently, input questions are classified ($S_1$–$S_3$) based on the type of information required, which guides the selection of the appropriate inference path ($\mathrm{CoT}_{s1}$–$\mathrm{CoT}_{s3}$). The chosen inference path is then executed in combination with the context determined by the user’s cluster.

3.4.2 Integration with MoE architecture

The proposed PA-CoT approach integrates with our LoRA-based MoE architecture to enhance personalized inference capabilities. As illustrated in Algorithm 2, user profile information serves a dual purpose: it guides the selection of appropriate CoT templates and influences the router’s expert assignment mechanism. This integration enables the model to dynamically adapt its inference pathways based on individual user characteristics.


Algorithm 2.

The profile-aided routing mechanism is formalized as:

$$g(h_t, p) = W_{\mathrm{gate}}\, h_t + W_{\mathrm{profile}}\, p + b_{\mathrm{gate}}, \tag{13}$$

where $g(h_t, p) \in \mathbb{R}^N$ represents the gate logits for $N$ experts, $h_t \in \mathbb{R}^{d_{\mathrm{in}}}$ is the hidden state at position $t$, $p \in \mathbb{R}^{d_{\mathrm{profile\_enc}}}$ is the encoded user profile vector, $W_{\mathrm{gate}} \in \mathbb{R}^{d_{\mathrm{in}} \times N}$ is the gate projection matrix, $W_{\mathrm{profile}} \in \mathbb{R}^{d_{\mathrm{profile\_enc}} \times N}$ is the profile projection matrix, and $b_{\mathrm{gate}} \in \mathbb{R}^N$ is the gate bias vector. This formulation ensures that both the current input content and user-specific characteristics jointly determine expert activation patterns.

3.4.2.1 Expert configuration

The MoE architecture in our framework employs six LoRA adapters ($N = 6$), each with a rank of $r = 16$, enabling parameter-efficient fine-tuning and supporting the model’s capacity to handle diverse inference patterns. During both training and inference, the router dynamically selects and activates a subset of these adapters for each input token using a top-$k$ selection mechanism ($k = 3$). This dynamic routing is based on the input content and user profile, allowing the model to flexibly allocate computational resources according to the specific requirements of each task and user context. This design allows the MoE architecture to adaptively discover and utilize specialized inference capabilities, without imposing explicit functional roles on individual adapters during the model design phase.

3.4.2.2 Dynamic weight allocation

The router dynamically assigns weights ($W$) to each adapter. The routing mechanism employs a temperature-controlled softmax function with top-$k$ selection, which is given by:

$$p_i = \frac{\exp(g_i / \tau)}{\sum_{j=1}^{N} \exp(g_j / \tau)}, \qquad i = 1, \ldots, N. \tag{14}$$

Let $\mathcal{I} = \mathrm{TopK}(p, k)$ denote the set of indices corresponding to the $k$ largest elements of $p$. The sparse mask $M_{\mathrm{sparse}} \in \{0, 1\}^N$ is defined as follows:

$$\left[ M_{\mathrm{sparse}} \right]_i = \begin{cases} 1, & \text{if } i \in \mathcal{I} \\ 0, & \text{otherwise} \end{cases} \tag{15}$$

The final sparse weight vector is then given by

$$W = p \odot M_{\mathrm{sparse}}, \tag{16}$$

where $\tau > 0$ is the temperature parameter, $N = 6$ is the number of experts, $k = 3$ is the sparsity level, $\odot$ denotes element-wise multiplication, and $\mathrm{TopK}(p, k)$ returns the indices of the $k$ largest elements of $p$.
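Putting Equations 13–16 together, the following is a minimal sketch of profile-aided routing for a single token, using the dimensions reported in Section 4.2; the projection matrices are randomly initialized stand-ins for trained parameters.

```python
import torch

torch.manual_seed(0)

# Dimensions from Section 4.2: d_in=768, d_profile_enc=128, N=6, k=3, tau=0.5.
d_in, d_prof, N, k, tau = 768, 128, 6, 3, 0.5
W_gate = torch.randn(d_in, N) * 0.02       # stand-in for the trained gate projection
W_profile = torch.randn(d_prof, N) * 0.02  # stand-in for the trained profile projection
b_gate = torch.zeros(N)

def route(h_t: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    """Return the sparse expert weight vector W for one token."""
    g = h_t @ W_gate + p @ W_profile + b_gate   # Eq. 13: gate logits
    probs = torch.softmax(g / tau, dim=-1)      # Eq. 14: tempered softmax
    idx = torch.topk(probs, k).indices          # indices I = TopK(p, k)
    mask = torch.zeros_like(probs)
    mask[idx] = 1.0                             # Eq. 15: sparse mask M_sparse
    return probs * mask                         # Eq. 16: W = p (element-wise) M_sparse

W = route(torch.randn(d_in), torch.randn(d_prof))  # 3 nonzero expert weights
```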

3.4.2.3 Loss formulation

To mitigate load imbalance among expert adapters due to varying task sample sizes and the top-$k$ routing mechanism, we incorporated an auxiliary loss alongside the standard cross-entropy loss as a regularization term. We additionally introduced a profile alignment loss to ensure consistency between CoT inference and user profiles. The profile loss is defined in Equation 17:

$$\mathcal{L}_{\mathrm{profile}} = -\sum_{i} \delta_{i,\, C_{\mathrm{CoT}}(p)} \log P_{\mathrm{adapter}_i}(p), \tag{17}$$

where $C_{\mathrm{CoT}}(p)$ represents the deterministic CoT path index for a given user profile $p$, $\delta_{i,\, C_{\mathrm{CoT}}(p)}$ is the Kronecker delta function that equals 1 if $i = C_{\mathrm{CoT}}(p)$ and 0 otherwise, and $P_{\mathrm{adapter}_i}(p)$ is the activation probability of expert adapter $i$ computed by the MoE router. This loss ensures that the expert activations align with the CoT path determined by the user’s profile, promoting coherence between inference and model specialization. The hyperparameter $\lambda_p$ controls the weight of this loss in the overall objective, balancing alignment with other training objectives.

The auxiliary load balancing loss encourages an even distribution of task load across experts, as defined in Equation 18:

$$\mathcal{L}_{\mathrm{aux}} = \sum_{i=1}^{N} \left[ \left( F_i - \frac{1}{N} \right)^2 + \left( P_i - \frac{1}{N} \right)^2 \right], \tag{18}$$

where $F_i$ represents the fraction of tokens assigned to expert $i$, $P_i$ represents the fraction of router probability allocated to expert $i$, and $N$ is the total number of experts. The coefficient $\lambda_a$ controls the strength of the load balancing term in the overall objective.

The activation frequency regularization loss ensures uniform activation of all experts, as defined in Equation 19:

$$\mathcal{L}_{\mathrm{freq}} = \sum_{i=1}^{N} \left( \frac{A_i}{T} - \frac{1}{N} \right)^2, \tag{19}$$

where $A_i$ denotes the selection count of expert $i$ over $T$ routing decisions, $T$ is the total number of activations, and $N$ is the total number of experts.

The total loss function is a weighted sum of the components, as given in Equation 20:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}} + \lambda_a \mathcal{L}_{\mathrm{aux}} + \lambda_f \mathcal{L}_{\mathrm{freq}} + \lambda_p \mathcal{L}_{\mathrm{profile}}, \tag{20}$$

where $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss for prediction accuracy, and $\lambda_a$, $\lambda_f$, and $\lambda_p$ are regularization coefficients.
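The following is a minimal sketch of the auxiliary objectives in Equations 17–20, assuming the expert statistics $F$, $P$, and $A$ have already been accumulated by the router during the forward pass.

```python
import torch

def profile_loss(adapter_probs: torch.Tensor, cot_index: int) -> torch.Tensor:
    """Eq. 17: negative log-probability of the profile-determined expert."""
    return -torch.log(adapter_probs[cot_index])

def aux_loss(F: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """Eq. 18: penalize deviation of token fractions and router mass from 1/N."""
    N = F.numel()
    return ((F - 1 / N) ** 2 + (P - 1 / N) ** 2).sum()

def freq_loss(A: torch.Tensor, T: int) -> torch.Tensor:
    """Eq. 19: penalize uneven expert activation frequencies."""
    N = A.numel()
    return ((A / T - 1 / N) ** 2).sum()

def total_loss(ce, F, P, A, T, adapter_probs, cot_index,
               lam_a=0.1, lam_f=0.15, lam_p=0.05):
    """Eq. 20, with the coefficients reported in Section 4.2."""
    return (ce + lam_a * aux_loss(F, P) + lam_f * freq_loss(A, T)
            + lam_p * profile_loss(adapter_probs, cot_index))
```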

4 Experiments

4.1 Dataset

4.1.1 Public dataset

This study utilizes several publicly available, multimodal sleep-related datasets, summarized in Table 5, primarily sourced from PhysioNet, to enable cross-population analysis and support robust sleep health research. All selected datasets include ECG recordings acquired during sleep, which ensures the reliable extraction of four key HRV metrics: SDNN, RMSSD, LF/HF, and PNN50. These HRV metrics are widely recognized as critical indicators of autonomic nervous system (ANS) activity and sleep quality, and are essential for characterizing sleep-related cardiac dynamics.


Table 5. Summary of sleep-related datasets used in this study.

4.1.2 In-house dataset

In addition to publicly available datasets, we established an in-house dataset as part of a population-based sleep study at Jinan University, involving 160 participants. The study, entitled “Survey and Analysis of the Health Status of Diabetic Patients Post-COVID-19 Pandemic”, was approved by the Ethics Committee of the First Affiliated Hospital of Jinan University (Approval Number: KY-2023299) and conducted in accordance with the Declaration of Helsinki. Sleep ECG data were recorded using the Bodyguard 2 device at a sampling rate of 1,000 Hz to ensure precision for HRV analysis. Following data cleaning and quality review by a board-certified physician, 162 valid overnight recordings were retained.

To further strengthen real-world representativeness and prospective evaluation, we conducted a second phase of data collection and enrolled an additional 150 participants. This cohort was intentionally enriched with children and older adults to increase physiological diversity, and included structured questionnaires on behavioral factors that may affect sleep (e.g., alcohol consumption and exercise habits). All recordings followed the same acquisition protocol and device specifications as in the first phase.

For model development and validation, 50 participants from the second-phase cohort were randomly selected and integrated with the first-phase data for training and synthesis. The remaining 100 participants were strictly held out to form an independent, prospectively collected test set that was time-separated from model development and used exclusively for final evaluation. This prospective test set was also used to assess report-level clinical event detection against physician-adjudicated references for two prespecified endpoints: autonomic dysfunction (SDNN < 30 ms and/or RMSSD < 20 ms) and sympathetic dominance (LF/HF > 4).

4.2 Training implementation

All experiments in this study were conducted following a unified workflow comprising data synthesis, model training, and evaluation. During the data synthesis stage, we employed the GPT-4o API to generate both physiological parameters and question-answer pairs, ensuring the diversity and physiological plausibility of the synthetic data. The synthesized datasets were carefully balanced across six sources, with equal sampling for each downstream task to mitigate potential data bias.

For model training, we adopted a teacher-student paradigm, where Qwen-max served as the teacher model and Qwen2.5 0.5B as the student. The training set consisted of 1,800 sleep report generation tasks, 4,800 personalized Q&A tasks, and 4,800 knowledge Q&A tasks. All models were trained using PyTorch 2.0.1 with CUDA 12.0 on a workstation equipped with an NVIDIA 4090 Ti GPU and Ubuntu 22.04. The batch size was set to 64, and the learning rate was $2 \times 10^{-4}$. Training was performed for 3 epochs, and the MoE-PEFT framework was adapted to support the specific requirements of this study. The main architectural hyperparameters were set as follows: the hidden state dimension $d_{\mathrm{in}}$ was 768, the encoded user profile dimension $d_{\mathrm{profile\_enc}}$ was 128, the number of experts $N$ was 6, the LoRA rank $r$ was 16, the top-$k$ value for expert selection was 3, and the temperature parameter $\tau$ was set to 0.5. The regularization coefficients for the loss components were set as $\lambda_a = 0.1$, $\lambda_f = 0.15$, and $\lambda_p = 0.05$.

For evaluation, the Claude Sonnet 3.7 API was utilized as an LLM-based judge to assess model outputs. The evaluation set comprised 200 sleep report generation tasks, 1,200 personalized Q&A tasks, and 1,200 knowledge Q&A tasks. Each answer was independently evaluated three times, and the final score for each dimension was computed as the average of these ratings to reduce the impact of stochasticity in LLM-based assessment. Inference and real-time testing were conducted on an RK3588 device with 8 GB RAM running a Linux-based operating system, simulating deployment in resource-constrained edge environments. Key metrics such as inference time and memory usage were recorded to assess the practical feasibility of the proposed framework.

For comprehensive benchmarking, the performance of the proposed method was compared with several SOTA and fine-tuned models on both internal and public datasets. All comparative experiments were conducted under identical hardware and software conditions, with consistent prompt settings to ensure fairness.

4.3 Evaluation metrics

4.3.1 Synthetic data evaluation

The quality of synthetic data in the era of LLMs can be evaluated across two dimensions (Gan and Liu, 2024): diversity and faithfulness. Diversity measures the variety and coverage of generated samples relative to the original dataset, ensuring broad representation of potential scenarios. Faithfulness assesses how closely the synthetic data adheres to the statistical and distributional properties of the real data, preserving key characteristics for reliable downstream use. Within the sleep scenario described in this study, the balance between faithfulness and diversity is task-dependent. Specifically, for report generation, where synthetic sleep-related parameters are involved, faithfulness is prioritized over diversity. This is because these parameters are quantitative and utilized in clinical or scientific settings, necessitating high fidelity to ensure accuracy and reliability. In contrast, tasks like personalized Q&A may benefit from greater diversity to capture a wider range of user queries and responses. To evaluate these dimensions, we employ a combination of quantitative metrics. Faithfulness is assessed using the Kullback-Leibler (KL) divergence, which quantifies the similarity between synthetic and real data distributions, and the Kolmogorov-Smirnov (KS) test, which compares empirical cumulative distribution functions (CDFs) to detect significant differences. Diversity is measured via the Hilbert-Schmidt Independence Criterion (HSIC), which quantifies statistical dependence between variables and is used here to assess the diversity of the generated samples. These metrics collectively ensure a comprehensive assessment of synthetic data quality, tailored to the demands of personalized sleep analysis applications.
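For concreteness, the following is a minimal sketch of the two faithfulness metrics using SciPy: a histogram-based KL divergence and the two-sample KS test. HSIC is omitted, as it requires a kernel-based estimator that SciPy does not provide; the example values are synthetic stand-ins.

```python
import numpy as np
from scipy.stats import ks_2samp, entropy

def kl_divergence(real: np.ndarray, synth: np.ndarray, bins: int = 30) -> float:
    """Histogram-based KL(real || synth) with a small smoothing constant."""
    lo, hi = min(real.min(), synth.min()), max(real.max(), synth.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(synth, bins=bins, range=(lo, hi), density=True)
    p, q = p + 1e-8, q + 1e-8
    return float(entropy(p / p.sum(), q / q.sum()))  # scipy entropy(p, q) = KL

# Illustrative stand-ins for real and synthetic SDNN values (ms).
rng = np.random.default_rng(1)
real_sdnn = rng.normal(55, 20, 500)
synth_sdnn = rng.normal(53, 22, 500)

print("KL divergence:", kl_divergence(real_sdnn, synth_sdnn))
print("KS p-value:", ks_2samp(real_sdnn, synth_sdnn).pvalue)
```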

In contrast, evaluating the effectiveness of personalized questions presents a relatively subjective task. To address this, we conducted an expert evaluation, inviting three sleep center physicians to assess the questions generated. The evaluation was performed in two dimensions: relevance and diversity. Each dimension was scored from 1 to 5, with higher scores indicating better performance. This approach ensures that the generated questions are not only tailored to individual users but also contextually appropriate and diverse, providing a robust foundation for downstream applications.

4.3.2 Model output evaluation

Given that traditional evaluation metrics such as BLEU (Reiter, 2018), ROUGE (Lin and Och, 2004), and BERTScore (Zhang et al., 2019) are insufficient to effectively differentiate model performance in this context, we adopt the LLM-as-a-Judge paradigm for evaluation. Specifically, Claude Sonnet 3.7 is employed as the evaluator, leveraging its advanced inference and contextual understanding capabilities. Inspired by RAGAS (Es et al., 2024), the evaluation framework assesses model performance along four key dimensions.

• Personalization (Pers.): Evaluates the extent to which the recommendations and responses generated are tailored to the individual user’s data and specific needs.

• Relevance (Rel.): Measures the alignment of responses with the user’s context and the specific questions posed, ensuring that the information provided is pertinent and contextually appropriate.

• Completeness (Comp.): Assesses whether the responses adequately address all aspects of the query, ensuring that no critical details are omitted.

• Accuracy (Acc.): Evaluates the correctness and validity of the information provided, with a focus on domain-specific knowledge and the precision of personalized advice.

To mitigate the inherent stochasticity of the LLM-based evaluation, each answer is independently evaluated three times, and the final score for each dimension is calculated as the average of these ratings. The evaluation prompt is designed to guide the LLM in scoring the model outputs. The specific task and rating criteria are detailed in Table 6.


Table 6. Task and rating criteria.

This LLM-as-a-Judge approach ensures a robust and nuanced evaluation of model performance, leveraging the advanced inference capabilities of Claude Sonnet 3.7 to provide detailed and context-aware assessments. In addition, a small subset of samples was randomly selected from the test sets of each dataset for manual evaluation. This human assessment was conducted to independently validate the LLM-based evaluation and provide complementary insights into the clinical appropriateness of generated outputs. Inter-rater reliability was quantified using Fleiss’ kappa (κ) (Fleiss, 1971), with the five Likert categories treated as nominal.
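The kappa computation itself is standard; the following is a minimal sketch using statsmodels, with a hypothetical (items × raters) matrix of Likert ratings.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows are items, columns are the three raters,
# values are Likert categories 1-5 (treated as nominal).
ratings = np.array([
    [5, 4, 5],
    [3, 3, 4],
    [4, 4, 4],
    [2, 3, 2],
])

table, _ = aggregate_raters(ratings)   # items x categories count matrix
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))
```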

4.4 Data synthesis

4.4.1 Parameter synthesis

4.4.1.1 Distributional similarity

The results in Table 7 demonstrate the performance of different synthetic data generation methods in terms of faithfulness (KL divergence), diversity (HSIC), and distributional similarity (Kolmogorov-Smirnov, KS, p-value). Our proposed PC-AHC-LLM method achieves a KL divergence of 0.87 and a KS p-value of 0.14, indicating high faithfulness to the original data distribution and no significant differences from real-world samples, which is critical for tasks such as report generation that require accurate and reliable synthetic HRV parameters. These metrics were computed by directly comparing synthetic outputs against real dataset values from public sources, including the DREAMT and HMCSS datasets, ensuring that key physiological parameters (e.g., SDNN, RMSSD, LF/HF, and PNN50) exhibit realistic patterns aligned with clinical thresholds (e.g., SDNN < 30 ms indicating severe autonomic dysfunction). While the HSIC value of 0.35 suggests lower diversity compared to Gaussian Copula (0.54, KS p-value 0.09) and SMOTE (0.50, KS p-value 0.07), this trade-off aligns with the study’s focus on maintaining fidelity for clinically relevant applications, where preserving statistical robustness and physiological plausibility is paramount.


Table 7. Metric computation for synthetic parameters.

4.4.1.2 Clinical event detection from generated reports

To evaluate clinical utility beyond distributional similarity, we assessed the ability of our framework to detect clinically meaningful events from generated sleep reports on the independent prospective test set (N = 100). Two prespecified endpoints were defined with physician adjudication: autonomic dysfunction (SDNN < 30 ms and/or RMSSD < 20 ms) and sympathetic dominance (LF/HF > 4). Generated reports were parsed using a predefined lexicon and regular expressions to extract binary predictions for each endpoint, which were then compared against physician-adjudicated reference labels.

For autonomic dysfunction, the model achieved a sensitivity of 0.86 [95% CI: 0.73, 0.95] and a specificity of 0.84 [95% CI: 0.74, 0.91]. For sympathetic dominance, sensitivity was 0.84 [95% CI: 0.68, 0.94] and specificity was 0.86 [95% CI: 0.77, 0.92]. These results demonstrate that the framework can reliably identify clinically relevant events from generated reports in a prospective, real-world setting. Detailed endpoint definitions, report parsing rules, and confusion matrices are provided in Supplementary Material S2.
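For reproducibility, the following is a minimal sketch of the endpoint evaluation, computing sensitivity and specificity with exact (Clopper–Pearson) 95% CIs via statsmodels. The CI method and the randomly generated labels are assumptions; the paper does not state its interval procedure.

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint

# Hypothetical binary labels for one endpoint on an N=100 test set:
# y_true = physician-adjudicated reference, y_pred = parsed from reports.
rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, 100)
y_pred = np.where(rng.random(100) < 0.85, y_true, 1 - y_true)  # ~85% agreement

tp = int(((y_pred == 1) & (y_true == 1)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
tn = int(((y_pred == 0) & (y_true == 0)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())

# method="beta" gives exact Clopper-Pearson intervals (an assumed choice).
sens, sens_ci = tp / (tp + fn), proportion_confint(tp, tp + fn, method="beta")
spec, spec_ci = tn / (tn + fp), proportion_confint(tn, tn + fp, method="beta")
print(f"sensitivity={sens:.2f} CI={sens_ci}, specificity={spec:.2f} CI={spec_ci}")
```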

4.4.2 Questions generation

For personalized questions generated from sleep reports, evaluation is inherently subjective due to the individualized nature of personalization, which lacks a fixed objective ground truth. Instead, we rely on expert human assessment as the proxy ground truth, conducted by three sleep medicine physicians whose judgments are grounded in clinical expertise. For each sleep sample, multiple questions were generated; to address data volume concerns, we randomly selected two questions per sample for evaluation. Physicians independently rated these on a scale of 1-5 across two dimensions: relevance (degree of alignment with user-specific sleep data and clinical facts, such as HRV implications for stress resilience) and diversity (variety in phrasing and coverage of sleep health aspects, ensuring broad applicability without redundancy). Ratings followed predefined criteria to minimize bias, including examples of high-relevance questions (e.g., those tying directly to individual HRV metrics). The final score for each dimension was computed as the average of all annotators’ ratings across evaluated questions, yielding averages of 4.1 for relevance and 4.2 for diversity (as reported in the Experiments section). This structured expert evaluation provides a reliable, replicable mechanism to confirm that the generated questions are contextually appropriate, sufficiently personalized, and varied for downstream applications like user-specific Q&A.

4.5 Personalized inference

4.5.1 Cluster validation

To validate the clusters on the user profile vectors and ensure the method’s robustness, we conducted a series of statistical and clinical assessments. Cluster quality was evaluated using the Silhouette score (average 0.68, indicating good separation and cohesion) and Calinski-Harabasz index (145.2, supporting three clusters as optimal). Robustness was confirmed through sensitivity analysis: bootstrapping with 100 resamples (80% data subsets) showed stable assignments (Jaccard similarity > 0.85), and introducing 10% Gaussian noise to HRV values resulted in only 5% reassignment, demonstrating resilience to variations in wearable sensor data common in real-world sleep monitoring.
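The following is a minimal sketch of these cluster-quality and bootstrap-stability checks, assuming k-means and a pairwise co-assignment Jaccard measure; the paper reports the statistics but not the exact implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))  # stand-in for standardized HRV profile vectors

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))
print("calinski-harabasz:", calinski_harabasz_score(X, labels))

# Bootstrap stability: refit on 80% subsamples and compare co-assignment.
def co_assign(lab: np.ndarray) -> np.ndarray:
    """Pairwise same-cluster indicator matrix."""
    return lab[:, None] == lab[None, :]

base = co_assign(labels)
for _ in range(5):  # 100 resamples in the paper; 5 here for brevity
    idx = rng.choice(len(X), int(0.8 * len(X)), replace=False)
    lab_b = KMeans(n_clusters=3, n_init=10).fit_predict(X[idx])
    a, b = base[np.ix_(idx, idx)], co_assign(lab_b)
    jaccard = (a & b).sum() / (a | b).sum()
    print("bootstrap Jaccard:", round(float(jaccard), 3))
```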

Additionally, two sleep medicine physicians reviewed the clusters for physiological plausibility, rating them at 4.2/5 on average and confirming that the labels are descriptive rather than diagnostic tools. For example, the “Moderate Variability, Moderate Stress” cluster’s variable $s_{\mathrm{LF/HF}}$ was seen as indicative of potential sympathovagal fluctuations, aligning with patterns in stress-related sleep irregularities but not serving as a standalone marker for disorders. These validations, performed on the DREAMT dataset with 10-fold cross-validation, affirm the clustering’s reliability for guiding personalized inference in our system.

4.5.2 Task performance

Table 8 presents a comprehensive evaluation comparing our proposed framework against several classes of models. The results demonstrate that our MoE-LoRA architecture, applied to a 0.5B parameter student model, achieves performance comparable to Qwen-max within a small margin. Across the six distinct datasets, the average scores for our model and Qwen-max in personalization, relevance, completeness, and accuracy are statistically comparable, with differences typically falling within a minimal 0.1-point margin. This indicates a practical performance parity, establishing that our lightweight solution can match the high standard set by leading LLMs like Qwen-max and Gemini for personalized sleep analysis tasks. The strong concordance between automated metrics and manual physician assessments further validates our evaluation methodology.


Table 8. Evaluation results of personalized inference across different models and datasets.

While the standard 0.5B-class edge models (e.g., OPT-350M, Pythia-410M) and the baseline Qwen2.5 0.5B model show improved performance, their scores consistently remain below 4.0. Crucially, even with standard LoRA fine-tuning (“0.5B (lora)”), which provides a modest uplift, the performance remains substantially inferior to that of our proposed method and the top-tier commercial models. This comparison underscores that a simple fine-tuning approach is insufficient to bridge the performance gap. We verified that all between-model comparisons, particularly the parity between our model and Qwen-max, remained stable under the empirical equivalence margin ($\tau = 0.2$), with sensitivity checks at $\tau \in \{0.1, 0.3\}$ yielding unchanged conclusions (see Supplementary Tables S1, S2).

In summary, these findings demonstrate the efficacy of the proposed profile-aided MoE-LoRA architecture. It successfully enables a lightweight student model (0.5B parameters) to achieve performance parity with SOTA LLMs, offering a computationally efficient alternative without sacrificing the quality required for personalized biomedical inference. Specific evidence supporting the model’s proficiency across different task types is provided through case studies in Supplementary Tables S8, S9.

4.5.3 Analysis of expert activation in MoE LoRA

To further validate the effectiveness of the MoE LoRA model, we analyzed the activation patterns of the six LoRA adapters during personalized inference tasks. The activation heatmap, as shown in Figure 6, illustrates the distribution of expert activations across three task types: report generation, personalized Q&A, and knowledge Q&A. Each row in the heatmap corresponds to a specific task instance, while each column represents a LoRA adapter.


Figure 6. LoRA Adapter Activation Heatmap: This heatmap illustrates the activation patterns of six LoRA adapters across three task types: report generation, personalized Q&A, and knowledge Q&A. The activation values range from 0 to 1, representing the strength of activation for each adapter. Adapters 1, 3, and 6 show strong activation for report generation tasks, Adapters 2, 4, and 6 for personalized Q&A, and Adapters 1, 5, and 6 for knowledge Q&A. The balanced and task-specific activation patterns highlight the model’s adaptability and specialization.

The results demonstrate that the MoE mechanism dynamically activates the most relevant experts in response to different task types. As illustrated in Figure 6, adapters 1, 3, and 6 are predominantly activated for report generation, adapters 2, 4, and 6 for personalized Q&A, and adapters 1, 5, and 6 for knowledge-based Q&A. A plausible interpretation of these activation patterns is that, during training, the MoE architecture implicitly encourages different adapters to specialize in distinct inference or analytical subtasks. For example, Adapters 1 and 3 may have developed a preference to extract and synthesize structured information required for report generation, while Adapters 2 and 4 are more tuned to the nuances of personalized Q&A, such as interpreting user profiles and contextual cues. Adapter 5 appears to be more involved in domain-specific knowledge inference, and Adapter 6, which is consistently activated across all tasks, likely serves as a general-purpose or knowledge-fusion expert, providing foundational support for various types of inference.

This emergent specialization is not manually assigned but arises organically from the data-driven training process and the dynamic routing mechanism. Thus, the observed activation patterns reflect the model’s ability to adaptively allocate computational resources, ensuring that the most relevant expertise is brought in for each type of task. Such a mechanism enhances both the robustness and interpretability of the model, as it allows for modular, task-aware inference in complex, multi-task personalized sleep analysis scenarios.

4.6 Ablation study

We disentangle the contributions of three components in the proposed framework: profile-aided Chain-of-Thought (PA-CoT), MoE-LoRA versus Single-LoRA under matched active parameter budgets, and profile-aided routing.

4.6.1 Effect of PA-CoT

We compare Global-CoT, PA-CoT without the cluster preamble, and full PA-CoT, on top of the same MoE-LoRA configuration and routing. The results in Table 9 clearly show that the full PA-CoT model significantly outperforms Global-CoT. In Personalized Q&A, personalization scores increase from 4.08±0.04 to 4.47±0.03, and in Report Generation, they rise from 4.12±0.04 to 4.38±0.03. These enhancements are complemented by parallel gains in relevance and competence, with non-overlapping confidence intervals confirming their significance. While removing the cluster preamble narrows the performance gap (e.g., 4.22±0.04 in Personalized Q&A), PA-CoT still maintains a clear advantage, highlighting the inherent structural benefits of our approach over generic CoT.

Table 9

Table 9. PA-CoT ablations (macro-average across datasets; mean ± SD).
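To clarify the three prompting conditions compared above, the sketch below assembles Global-CoT, PA-CoT without the cluster preamble, and full PA-CoT prompts; the profile fields, preamble wording, and template are hypothetical illustrations, not our exact templates.

def build_prompt(question, profile=None, cluster_preamble=None):
    # Global-CoT: question only. PA-CoT adds the user profile; the full
    # variant additionally prepends a cluster-level preamble.
    parts = []
    if cluster_preamble:
        parts.append(f"User cluster context: {cluster_preamble}")
    if profile:
        parts.append("User profile: " + "; ".join(f"{k}={v}" for k, v in profile.items()))
    parts.append(f"Question: {question}")
    parts.append("Let's reason step by step, grounding each step in the information above.")
    return "\n".join(parts)

profile = {"age": 42, "mean_SDNN_ms": 38, "chronotype": "evening"}
preamble = "middle-aged evening types with mildly reduced HRV and fragmented sleep"
q = "Why do I keep waking up around 3 a.m.?"
print(build_prompt(q))                      # Global-CoT
print(build_prompt(q, profile))             # PA-CoT without the cluster preamble
print(build_prompt(q, profile, preamble))   # full PA-CoT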

4.6.2 Effect of MoE under matched active budgets

We compare Single-LoRA (rank-matched to k·r) against MoE-LoRA with content-only routing to isolate the effect of sparse expertization from routing with profile signals. The results are shown in Table 10. Under matched budgets, MoE-LoRA improves personalization in Personalized Q&A by +0.25 absolute (3.98 → 4.23) and raises Report Generation personalization by +0.14 (4.01 → 4.15), while also increasing Knowledge Q&A accuracy (4.08 → 4.18). Because the active capacity is matched, these gains indicate benefits from expert specialization rather than from added capacity.

Table 10

Table 10. MoE-LoRA vs. Single-LoRA under matched active budgets (macro-average; mean ± SD).
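For concreteness, the PyTorch sketch below implements a top-k MoE-LoRA linear layer under the matched-budget setting (N = 6, k = 3, r = 16, so the active low-rank capacity per token is k·r = 48, the same as the rank-48 Single-LoRA baseline); this is a simplified, dense illustration rather than our deployed implementation, which builds on MoE-PEFT (footnote 3).

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoRALinear(nn.Module):
    # Frozen base linear layer plus N LoRA experts with top-k routing.
    def __init__(self, base: nn.Linear, n_experts=6, k=3, r=16, alpha=32):
        super().__init__()
        self.base, self.k, self.scale = base, k, alpha / r
        for p in self.base.parameters():
            p.requires_grad_(False)  # the backbone stays frozen
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(n_experts, d_in, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_experts, r, d_out))
        self.router = nn.Linear(d_in, n_experts)

    def forward(self, x):  # x: (batch, seq, d_in)
        gates = F.softmax(self.router(x), dim=-1)
        topv, topi = gates.topk(self.k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)  # renormalize top-k gates
        # Low-rank update from every expert (computed densely for clarity;
        # real systems evaluate only the selected experts): (x @ A_e) @ B_e
        lora = torch.einsum("bsd,edr->bser", x, self.A)
        lora = torch.einsum("bser,erD->bseD", lora, self.B)
        idx = topi.unsqueeze(-1).expand(-1, -1, -1, lora.size(-1))
        update = (topv.unsqueeze(-1) * torch.gather(lora, 2, idx)).sum(dim=2)
        return self.base(x) + self.scale * update

layer = MoELoRALinear(nn.Linear(896, 896))  # 896 = Qwen2.5-0.5B hidden size
print(layer(torch.randn(2, 8, 896)).shape)  # torch.Size([2, 8, 896])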

4.6.3 Effect of profile-aided routing

We compare content-only, profile-only, content + profile without L_profile, and the full content + profile routing with L_profile. As shown in Table 11, combining content and profile signals improves personalization relative to either signal alone; adding L_profile further reduces the coefficient of variation of expert load from 0.32 to 0.28 and increases activation entropy from 1.57 to 1.63, indicating better expert load balance alongside the accuracy gains.

Table 11

Table 11. Routing ablations and expert utilization (mean ± SD). Lower CV is better; higher entropy indicates better load dispersion.
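Both utilization metrics can be computed directly from the routing probabilities, as in the sketch below (synthetic gate weights for illustration). For six experts the entropy ceiling is ln 6 ≈ 1.79 nats, so the reported shift from 1.57 to 1.63 corresponds to measurably more even expert usage.

import numpy as np

def expert_load_stats(gate_weights):
    # gate_weights: (num_tokens, num_experts) routing probabilities.
    # Returns the coefficient of variation of per-expert load (lower is
    # more balanced) and the Shannon entropy of the load distribution
    # in nats (higher indicates better dispersion).
    load = gate_weights.mean(axis=0)
    load = load / load.sum()
    cv = load.std() / load.mean()
    entropy = -np.sum(load * np.log(load + 1e-12))
    return float(cv), float(entropy)

rng = np.random.default_rng(0)
w = rng.dirichlet(np.full(6, 4.0), size=10_000)  # a fairly balanced router
print(expert_load_stats(w))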

Across the three ablations, PA-CoT yields the largest personalization improvements, MoE-LoRA provides capacity-controlled gains attributable to expert specialization, and profile-aided routing contributes additional improvements and healthier expert utilization.

4.7 Edge profiling and inference

We assess the on-device feasibility of the proposed framework on two representative edge platforms: RK3588 (embedded edge node) and Snapdragon 8 Gen 3 (mobile edge; Oppo Find X7 Ultra). These cover typical deployment scenarios for clinic-side gateways and consumer smartphones, respectively.

4.7.1 Setup

We evaluate Single-LoRA (rank-matched to k·r) and MoE-LoRA (N = 6, k = 3, r = 16) under floating-point inference. To ensure cross-platform comparability, we report end-to-end latency per request (ms; including prefill and decode), decode throughput (tokens/s), and peak memory (GB). Device-specific backend details and extended measurements (energy/thermal and backend comparisons) are provided in the Supplementary Material (Supplementary Tables S6–S7; Supplementary Figure S2).
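A minimal harness for these three metrics is sketched below using Hugging Face transformers on the CPU backend; the checkpoint path and prompt are hypothetical placeholders, and the numbers reported in this section come from the device-specific backends described in the Supplementary Material.

import time
import resource  # Unix-only; ru_maxrss is reported in KB on Linux
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "./distilled-0.5b-moe-lora"  # hypothetical local checkpoint path
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32).eval()

prompt = "User profile: ...\nQuestion: Summarize last night's sleep."
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    t0 = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    t1 = time.perf_counter()

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
latency_ms = (t1 - t0) * 1000       # end-to-end: prefill + decode
tok_per_s = new_tokens / (t1 - t0)  # approximate; decode-only timing would exclude prefill
peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2
print(f"{latency_ms:.0f} ms, {tok_per_s:.1f} tok/s, {peak_gb:.2f} GB peak")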

4.7.2 Device comparison

The results are shown in Table 12. Snapdragon 8 Gen 3 outperforms RK3588 across both variants. For Single-LoRA, latency decreases from 3950±120 ms to 3200±110 ms (−750 ms, −19.0%), and decode throughput increases from 17.4±0.5 to 21.8±0.6 tok/s (+25.3%). For MoE-LoRA, latency decreases from 4200±130 ms to 3400±115 ms (−800 ms, −19.0%), and throughput increases from 16.6±0.4 to 20.7±0.5 tok/s (+24.7%). Peak memory is higher on Snapdragon (0.78–0.82 GB) than on RK3588 (0.64–0.70 GB), yet remains well below available memory on both devices. MoE-LoRA introduces modest overhead while improving quality (Section 4.6). On RK3588, latency increases by 250 ms (+6.3%), decode throughput drops by 0.8 tok/s (−4.6%), and peak memory rises by 0.06 GB (+9.4%). On Snapdragon 8 Gen 3, the latency increase is 200 ms (+6.3%), throughput decreases by 1.1 tok/s (−5.0%), and memory rises by 0.04 GB (+5.1%). The relative overheads are consistent across devices.

Table 12

Table 12. Edge profiling on representative devices (mean ± SD).

4.7.3 Comparison with lightweight edge models

To comprehensively evaluate the proposed framework against models commonly used in edge deployment scenarios, we expanded our baseline comparisons to include both ultra-compact models and models within the same 0.5B parameter scale. Specifically, we evaluated:

• Ultra-compact models: MobileBERT (66M parameters) and DistilGPT-2 (82M parameters), which are widely adopted for resource-constrained edge devices.

• 0.5B-class models: OPT-350M, Pythia-410M, and DeBERTa-v3-Large (~400M parameters), which match the target parameter scale.

Table 13 presents the inference performance comparison on RK3588. Ultra-compact models (MobileBERT, DistilGPT-2) achieve substantially lower latency (580–720 ms) and memory footprint (0.18–0.21 GB) compared to the proposed framework (3950–4200 ms, 0.64–0.70 GB). However, as shown in Table 8, these models exhibit severe performance degradation in personalized sleep health tasks: MobileBERT and DistilGPT-2 score 2.87–3.24 across metrics, representing a 1.20–1.66 absolute drop (27.0%–36.6% relative) compared to our framework (4.33–4.73).

Table 13

Table 13. Edge inference performance comparison with lightweight models on RK3588 (mean ± SD).

Models at the 0.5B parameter scale (OPT-350M, Pythia-410M, DeBERTa-v3-Large) present a more balanced profile. OPT-350M achieves 27.8% lower latency (2850 ms vs. 3950 ms) with 18.8% lower memory (0.52 GB vs. 0.64 GB) than Single-LoRA, while maintaining competitive task performance (3.94–4.21 vs. 4.33–4.73; a 0.39–0.52 absolute gap, 9.0%–11.3% relative). However, the proposed MoE-LoRA framework consistently outperforms all baselines across personalization-sensitive metrics (Pers.: +0.73 to +1.66 absolute over 0.5B-class models; +1.19 to +2.19 over ultra-compact models), confirming that the architecture-level innovations (MoE-LoRA with profile-aware routing) are necessary for preserving clinical utility in personalized biomedical applications.

4.7.4 Feasibility and trade-offs

Both devices sustain practical interactive inference under floating-point execution: decode rates of 16.6–21.8 tok/s and peak memory under 0.85 GB. The roughly 6% latency overhead of MoE is a favorable trade-off for the quality gains in Section 4.6: MoE-LoRA improves Personalized Q&A personalization by +0.25 absolute over Single-LoRA, and full PA-CoT with profile-aided routing further adds +0.24 (total +0.49, 12.3% relative). While lightweight models (Section 4.7.3) offer lower latency, their quality degradation (9%–37%) renders them unsuitable for personalized biomedical tasks that require high clinical fidelity. Supplementary backend results on Snapdragon 8 Gen 3 (Supplementary Table S8) show that GPU and NPU backends further reduce latency versus CPU by approximately 16% and 21%, respectively, offering options for latency-critical deployments.

4.8 Experiments on real-world data

To further illustrate the diversity and practical relevance of the evaluation, we collected real-world questions posed by human subjects; Table 14 presents a representative sample. Each set of four questions corresponds to a single individual and spans a broad spectrum of concerns related to sleep quality, health management, and personalized recommendations.

Table 14

Table 14. Representative questions asked by human subjects. Each group of four questions corresponds to one subject.

Subsequently, we evaluated the model’s responses to these real-world questions using both Claude Sonnet 3.7 and human experts, focusing on three key aspects: personalization, relevance, and completeness. The human expert scores for these dimensions were 4.3, 4.5, and 4.3, respectively, while the LLM scores were 4.4, 4.4, and 4.5. The results are summarized in Table 15; the ratings from Claude Sonnet 3.7 are highly consistent with those of the human experts across all dimensions.

Table 15

Table 15. Evaluation scores for model responses to real-world questions.
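For reference, a minimal sketch of rubric-based LLM scoring of this kind is shown below using the Anthropic Python SDK; the model identifier, rubric wording, and JSON schema are illustrative assumptions rather than our exact evaluation prompt.

import json
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

client = anthropic.Anthropic()

RUBRIC = (
    "Score the assistant's answer on a 1-5 scale for personalization, "
    "relevance, and completeness, given the user's question and profile. "
    'Reply with JSON only, e.g. {"personalization": 4, "relevance": 5, "completeness": 4}.'
)

def judge(question, profile, answer, model="claude-3-7-sonnet-20250219"):
    msg = client.messages.create(
        model=model,
        max_tokens=200,
        messages=[{"role": "user", "content":
                   f"{RUBRIC}\n\nProfile: {profile}\nQuestion: {question}\nAnswer: {answer}"}],
    )
    return json.loads(msg.content[0].text)  # assumes the model complies with the JSON-only rubric

# scores = judge("Why do I wake at 3 a.m.?", "age 42, SDNN 38 ms", "<model response>")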

5 Discussion

5.1 Advantages over existing methods

The proposed framework demonstrates clear advantages over existing approaches in terms of both technical performance and practical deployment. In data synthesis, our method achieves a KL divergence of 0.85, indicating high fidelity to the original data distribution—an essential property for generating clinically meaningful HRV parameters. While the HSIC value (0.37) is lower than that of Gaussian Copula (0.60) and SMOTE (0.52), this reflects a deliberate emphasis on distributional accuracy over diversity, which is critical for reliable sleep report generation. In contrast, many existing methods prioritize diversity at the expense of clinical faithfulness. Furthermore, the personalized questions generated by our framework received high average scores for personalization (4.2), relevance (4.1), and diversity (4.1), outperforming traditional data augmentation techniques and demonstrating strong suitability for downstream analytical and interactive tasks.
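To make these two metrics concrete, the sketch below estimates KL divergence from histogram densities and HSIC with RBF kernels on a one-dimensional HRV feature; the bin count, kernel bandwidth, and toy data are illustrative choices, not our exact estimators.

import numpy as np

def kl_divergence(real, synth, bins=50):
    # Histogram-based estimate of KL(real || synth) for a 1-D feature.
    lo, hi = min(real.min(), synth.min()), max(real.max(), synth.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(synth, bins=bins, range=(lo, hi))
    p = (p + 1e-9) / (p + 1e-9).sum()
    q = (q + 1e-9) / (q + 1e-9).sum()
    return float(np.sum(p * np.log(p / q)))

def hsic(x, y, sigma=1.0):
    # Biased HSIC estimate with RBF kernels; x, y: (n, d) sample arrays.
    def gram(a):
        d2 = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2 * sigma**2))
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return float(np.trace(gram(x) @ H @ gram(y) @ H) / (n - 1) ** 2)

rng = np.random.default_rng(0)
real = rng.normal(50, 10, size=500)   # e.g., SDNN values in ms
synth = rng.normal(51, 11, size=500)
print(kl_divergence(real, synth))
print(hsic(real.reshape(-1, 1), synth.reshape(-1, 1)))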

For personalized inference, our model was validated across diverse populations and sleep scenarios, consistently matching or exceeding the performance of the teacher model (Qwen-max) and outperforming baseline models. For example, on the DREAMT dataset, our model achieved higher personalization (4.6) and relevance (4.7) scores compared to both the fine-tuned 0.5 billion-parameter and 1.5 billion-parameter models. The MoE LoRA architecture enables dynamic resource allocation, as shown by task-specific adapter activation patterns (Figure 6), ensuring robust and adaptive performance across heterogeneous datasets. This adaptability distinguishes our approach from conventional fine-tuning methods, which often lack such flexibility.

The framework’s practical value is confirmed by its successful deployment on the RK3588 edge device. It achieves a real-time inference speed of 16.6 tokens per second while maintaining a compact memory footprint of only 0.7 GB, demonstrating its suitability for resource-constrained environments. Our lightweight 0.5 billion-parameter base model offers a compelling balance between computational efficiency and analytical performance, making it a scalable and effective solution for real-world personalized health analysis.

5.2 Limitations and future work

Despite these strengths, several limitations warrant consideration. The framework’s reliance on user profiles for personalized inference implies that its effectiveness is influenced by the quality and completeness of input profiles. Enhancing robustness against incomplete or noisy profile data constitutes an important direction for future work. Additionally, while the current framework addresses multiple tasks effectively, its scalability to more complex or larger-scale applications remains to be fully validated. Whether the 0.5 billion-parameter base model, even with an increased number of adapters, advanced loss functions, or improved routing algorithms, can maintain high performance in more demanding scenarios is an open question. Future research will focus on hybrid data synthesis, advanced optimization strategies, and further architectural innovations to ensure continued scalability and adaptability.

6 Conclusion

This work presents a novel and efficient framework for personalized sleep analysis on edge devices, integrating profile-aided inference and adaptive data synthesis within a lightweight 0.5 billion-parameter model. The framework achieves a strong balance among accuracy, adaptability, and computational efficiency, enabling real-time, individualized health analysis in resource-constrained environments. Its modular design and successful deployment on edge hardware underscore its practical potential for wearable and mobile health applications.

By addressing key challenges in data scarcity, model efficiency, and user-specific inference, this study advances the state of the art in personalized health informatics. Looking ahead, the proposed approach lays a robust foundation for future innovations in scalable, secure, and accessible personalized healthcare, with the potential to expand beyond sleep analysis to broader domains of health monitoring and intervention.

Data availability statement

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.

Ethics statement

The studies involving humans were approved by the Human Research Review Committee of the First Affiliated Hospital of Jinan University (Approval Number: KY-2023299). The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation in this study was provided by the participants’ legal guardians/next of kin. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.

Author contributions

HZ: Conceptualization, Data curation, Methodology, Writing – original draft. XA: Data curation, Writing – review and editing. XL: Data curation, Writing – review and editing. XoX: Supervision, Writing – review and editing. XnX: Funding acquisition, Writing – review and editing.

Funding

The authors declare that financial support was received for the research and/or publication of this article. This work was supported by the Guangdong Provincial Key Laboratory of Human Digital Twin under Grant 2022B1212010004.

Acknowledgements

The authors would like to express their sincere gratitude to all individuals who contributed to this research. We thank the First Affiliated Hospital of Jinan University for approving this research plan, which enabled data collection.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The authors declare that no Generative AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphys.2025.1678364/full#supplementary-material

Footnotes

1https://openai.com/index/whoop/

2https://platform.openai.com/docs/guides/text?api-mode=responses

3https://github.com/TUDB-Labs/MoE-PEFT/tree/main/moe_peft

References

Chan N., Parker F., Bennett W., Wu T., Jia M. Y., Fackler J., et al. (2024). Medtsllm: leveraging llms for multimodal medical time series analysis. ArXiv:2408.07773.


Chawla N., Bowyer K., Hall L., Kegelmeyer W. P. (2002). Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357. doi:10.1613/jair.953


Chen H., Deng W., Yang S., Xu J., Jiang Z., Ngai E. C. H., et al. (2025). Towards edge general intelligence via large language models: opportunities and challenges. IEEE Netw. 39, 263–271. doi:10.1109/MNET.2025.3539338


Cosentino J., Belyaeva A., Liu X., Furlotte N. A., Yang Z., Lee C., et al. (2024). Towards a personal health large language model. ArXiv:2406.06474.


Es S., James J., Espinosa Anke L., Schockaert S. (2024). “Ragas: automated evaluation of retrieval augmented generation,” in Proceedings of the 18th conference of the European chapter of the association for computational linguistics (EACL) (St. Julians, Malta: Association for Computational Linguistics), 150–158.


Fang C. M., Danry V., Whitmore N., Bao A., Hutchison A., Pierce C., et al. (2024). Physiollm: supporting personalized health insights with wearables and large language models. ArXiv:2406.19283.


Fleiss J. L. (1971). Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 378–382. doi:10.1037/h0031619


Fu Y., Peng H., Ou L., Sabharwal A., Khot T. (2023). “Specializing smaller language models towards multi-step reasoning,” in Proceedings of the 40th international conference on machine learning (ICML) (Honolulu, HI, USA: PMLR), 10421–10430.


Gan Z., Liu Y. (2024). Towards a theoretical understanding of synthetic data in llm post-training: a reverse-bottleneck perspective. ArXiv:2410.01720.


Garbarino S., Bragazzi N. (2024). Revolutionizing sleep health: the emergence and impact of personalized sleep medicine. J. Personalized Med. 14, 598. doi:10.3390/jpm14060598


Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., et al. (2014). Generative adversarial nets. NeurIPS 27, 139–144. doi:10.1145/3422622


Gunter T., Wang Z., Wang C., Pang R., Narayanan A., Zhang A., et al. (2024). Apple intelligence foundation language models. ArXiv:2407.21075.


Guo Z., Lai A., Thygesen J., Farrington J., Keen T., Li K. (2024). Large language models for mental health applications: systematic review. JMIR Ment. Health 11, e57400. doi:10.2196/57400


Hu E. J., Shen Y., Wallis P., Allen-Zhu Z., Li Y., Wang S., et al. (2022). Lora: low-rank adaptation of large language models. ICLR 1, 3. doi:10.48550/arXiv.2106.09685


Huang Y., Wan L. J., Ye H., Jha M., Wang J., Li Y., et al. (2024). New solutions on llm acceleration, optimization, and application. ArXiv:2406.10903.


Kamthe S., Assefa S., Deisenroth M. (2021). Copula flows for synthetic data generation. ArXiv:2101.00598.


Kang M., Lee S., Baek J., Kawaguchi K., Hwang S. J. (2024). Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks. Adv. Neural Inf. Process. Syst. 36, 48573–48602. doi:10.5555/3666122.3668231


Kim Y., Xu X., McDuff D., Breazeal C., Park H. W. (2024). Health-llm: large language models for health prediction via wearable sensor data. ArXiv:2401.06866.


Kingma D. (2013). Auto-encoding variational bayes. ArXiv:1312.6114.


Kojima T., Gu S. S., Reid M., Matsuo Y., Iwasawa Y. (2022). “Large language models are zero-shot reasoners,” in Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November 2022- 9 December 2022, 22199–22213. doi:10.5555/3600270.3601883


Li D., Ma Y., Wang N., Ye Z., Cheng Z., Tang Y., et al. (2024a). Mixlora: enhancing large language models fine-tuning with lora based mixture of experts. ArXiv:2404.15159.


Li X., He S., Wu J., Yang Z., Xu Y., Jun Y. J., et al. (2024b). “Mode-cotd: chain-of-thought distillation for complex reasoning tasks with mixture of decoupled lora-experts,” in Proceedings of the LREC-COLING 2024, Torino, Italy, 11475–11485.


Li Y., Yuan P., Feng S., Pan B., Sun B., Wang X., et al. (2024c). “Turning dust into gold: distilling complex reasoning capabilities from llms by leveraging negative data,” in Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, Canada, 20-27 February 2024, 18591–18599. doi:10.1609/aaai.v38i17.29821


Lin C., Och F. (2004). “Looking for a few good metrics: Rouge and its evaluation,” in Proceedings of the NTCIR Workshop, Tokyo, Japan, 1–8.


McDonald D., Papadopoulos R., Benningfield L. (2024). Reducing llm hallucination using knowledge distillation: a case study with mistral large and mmlu benchmark. TechRxiv Preprint. doi:10.36227/techrxiv.171665607.76504195/v1


Merrill M. A., Paruchuri A., Rezaei N., Kovacs G., Perez J., Liu Y., et al. (2024). Transforming wearable data into health insights using large language model agents. ArXiv:2406.06464.


Nazi Z., Peng W. (2024). Large language models in healthcare and medical domain: a review. Informatics 11, 57. doi:10.3390/informatics11030057


Reiter E. (2018). A structured review of the validity of bleu. Comput. Linguist. 44, 393–401. doi:10.1162/coli_a_00322


Reynolds D. (2009). Gaussian mixture models. Encycl. Biometrics 741 (3), 659–663. doi:10.1007/978-0-387-73003-5_196


Rezende D., Mohamed S. (2015). “Variational inference with normalizing flows,” in Proceedings of the 32nd international conference on machine learning (ICML) (Lille, France: PMLR), 1530–1538.


Rombach R., Blattmann A., Lorenz D., Esser P., Ommer B. (2022). “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, June 18-24 2022, 10684–10695.


Secara I., Hordiiuk D. (2024). Personalized health monitoring systems: integrating wearable and ai. J. Intelligent Learn. Syst. Appl. 16, 44–52. doi:10.4236/jilsa.2024.162004


Shaffer F., Ginsberg J. (2017). An overview of heart rate variability metrics and norms. Front. Public Health 5, 258. doi:10.3389/fpubh.2017.00258


She J., Zheng W., Liu Z., Wang H., Xing E., Yao H., et al. (2025). Token level routing inference system for edge devices. ArXiv:2504.07878.


Shridhar K., Stolfo A., Sachan M. (2022). Distilling reasoning capabilities into smaller language models. ArXiv:2212.00193.


Stein P., Pu Y. (2012). Heart rate variability, sleep and sleep disorders. Sleep. Med. Rev. 16, 47–66. doi:10.1016/j.smrv.2011.02.005


Wang D., Zhang S. (2024). Large language models in medical and healthcare fields: applications, advances, and challenges. Artif. Intell. Rev. 57, 299. doi:10.1007/s10462-024-10921-0


Wei J., Wang X., Schuurmans D., Bosma M., Ichter B., Xia F., et al. (2022). “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 24824–24837. doi:10.5555/3600270.3602070


Yang S., Ali M. A., Wang C. L., Hu L., Wang D. (2024). Moral: moe augmented lora for llms’ lifelong learning. ArXiv:2402.11260.


Yuan Z., Shang Y., Zhou Y., Dong Z., Zhou Z., Xue C., et al. (2024). Llm inference unveiled: survey and roofline model insights. ArXiv:2402.16363.


Zhang T., Kishore V., Wu F., Weinberger K. Q., Artzi Y., Wu F. (2019). Bertscore: evaluating text generation with bert. ArXiv:1904.09675.


Zhang Y., Liu H., Xiao Y., Amoon M., Zhang D., Wang D., et al. (2024). Llm-enhanced multi-teacher knowledge distillation for modality-incomplete emotion recognition in daily healthcare. IEEE J. Biomed. Health Inf. 29, 6406–6416. doi:10.1109/JBHI.2024.3470338


Zheng Y., Chen Y., Qian B., Shi X., Shu Y., Chen J. (2025). A review on edge large language models: design, execution, and applications. ACM Comput. Surv. 57, 1–35. doi:10.1145/3719664


Keywords: personalized sleep analysis, large language model (LLM), model distillation, data synthesis, edge computation

Citation: Zheng H, Ai X, Liu X, Xing X and Xu X (2026) Profile-aided distillation framework for personalized sleep analysis with compact models using LLM-guided synthetic data. Front. Physiol. 16:1678364. doi: 10.3389/fphys.2025.1678364

Received: 02 August 2025; Accepted: 24 November 2025;
Published: 05 January 2026.

Edited by:

Chunlin Xiong, China United Network Communications Group, China

Reviewed by:

Pardeep Singh, Bennett University, India
Sibo Qiao, Tianjin Polytechnic University, China

Copyright © 2026 Zheng, Ai, Liu, Xing and Xu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Huimin Zheng, 202210182130@mail.scut.edu.cn
