- 1School of Electronics and Information, South China University of Technology, Guangzhou, China
- 2The Nursing College, Jinan University, Guangzhou, China
- 3Department of Endocrinology and Metabolism, The First Affiliated Hospital of Jinan University, Guangzhou, China
Introduction: Enabling personalized sleep analysis and interaction directly on edge devices is crucial for providing real-time health insights and tailored guidance. However, this goal remains challenging due to the scarcity of high-quality physiological data and the computational constraints of edge hardware.
Methods: We propose a framework for personalized sleep analysis on edge devices that addresses two key obstacles: limited publicly available physiological datasets and the restricted capacity of compact models. To mitigate data scarcity, we introduce a Physiologically-Constrained Adaptive Hierarchical Copula approach, which leverages large language model–guided optimization to synthesize diverse and realistic physiological signals. To enhance personalized inference on resource-limited models, we further develop Profile-Aided Distillation of Expert Inference with MoE LoRA, which integrates user-specific profile information to improve the performance of edge-deployed models.
Results: Extensive experiments on both public and in-house datasets show that the distilled models achieve performance comparable to state-of-the-art large language models, while operating efficiently within the computational and memory constraints of edge devices.
Discussion: These results demonstrate that the proposed framework offers a practical and effective solution for enabling personalized sleep analysis and user interaction in resource-constrained environments, bridging the gap between high-performance modeling and real-time, on-device healthcare applications.
1 Introduction
Personalized sleep analysis is increasingly recognized as a cornerstone of modern health management, offering the potential to deliver tailored insights and actionable recommendations to improve sleep quality and overall wellbeing (Brusie, 2025; Zhang et al., 2024). The proliferation of wearable devices and mobile health applications facilitates the convenient, continuous collection of physiological signals (e.g., electrocardiograms) across diverse, real-world settings. From these signals, valuable parameters such as heart rate variability (HRV) can be derived, providing a rich foundation for individualized analysis (Garbarino and Bragazzi, 2024; Secara and Hordiiuk, 2024). However, the scarcity and heterogeneity of publicly available high-quality sleep data sets remain a fundamental barrier to robust model development and generalizable personalized analysis.
Recent advances in large language models (LLMs) have further amplified the potential of AI-driven health analytics, owing to their remarkable capabilities in knowledge and inference (Cosentino et al., 2024; Merrill et al., 2024). LLMs have demonstrated success in clinical decision support, medical record analysis, and patient engagement (Wang and Zhang, 2024). They are also adept at integrating heterogeneous data sources, including physiological, behavioral, and environmental signals, to provide holistic, context-aware insights (Guo et al., 2024). Notably, LLMs can generate highly personalized and nuanced responses to user-specific queries, capturing the intricacies of individual needs. Yet, these models are computationally intensive, making them impractical for direct deployment on resource-constrained edge devices such as wearables and smartphones (Nazi and Peng, 2024; Kim et al., 2024). Although edge device computational capabilities are advancing, allowing some flagship hardware to support models up to approximately 3B parameters (Gunter et al., 2024), our investigation and recent surveys (She et al., 2025; Zheng et al., 2025) indicate that a 0.5 billion-parameter size is a more universally applicable target for widespread edge deployment. However, as shown in Figure 1, this desirable compactness comes with a significant performance trade-off: standard 0.5 billion-parameter models, even after LoRA fine-tuning, exhibit substantial performance gaps compared to their larger counterparts (e.g., 1.5B and 7B models).
Figure 1. Performance comparison of language models across different parameter sizes for sleep health applications. Models with 0.5B, 1.5B, and 7B parameters were fine-tuned using LoRA, while Qwen-max represents a non-fine-tuned larger model. Evaluation was conducted using the LLM-as-a-Judge framework (Claude Sonnet 3.7) across three dimensions: report generation, personalized Q&A, and knowledge Q&A. The significant performance gap between the 0.5 billion-parameter model and larger models highlights the challenge addressed in this work: enabling efficient small models (0.5B) to perform competitively with larger models while remaining deployable on edge devices.
To address the dual challenges of data scarcity and the performance gap of small models, this study proposes a novel framework for efficient, real-time personalized sleep analysis on edge devices. Our approach first employs a physiologically-constrained adaptive hierarchical copula (PC-AHC-LLM) to synthesize diverse and realistic sleep data, ensuring that downstream models are trained on data that preserve both statistical and clinical validity. Building on this foundation, we introduce a Profile-Aided Distillation of Expert Inference framework (PADEI) that leverages profile-aided Chain-of-Thought (CoT) prompting and Mixture-of-Experts (MoE) with LoRA adapters to transfer complex inference patterns from large teacher models to a compact 0.5B model. The proposed framework is designed to support three core tasks essential for real-world sleep health applications: 1) Sleep Report Generation: Automatically generating standardized sleep reports from physiological signals, including key sleep parameters, descriptive summaries, and personalized recommendations. 2) Personalized Q&A: Providing user-specific answers to queries grounded in individual sleep data and reports, enabling actionable and tailored health guidance. 3) Knowledge Q&A: Delivering accurate and comprehensive responses to general sleep-related knowledge questions, independent of the user’s personal data.
By integrating advanced data synthesis and personalized inference, this work contributes to the broader goal of making personalized sleep health analysis accessible, secure, and efficient in everyday settings. The main contributions of this study are summarized as follows.
• Physiologically-Constrained Adaptive Hierarchical Copula for Data Synthesis: To address the scarcity of publicly available physiological data, we propose a novel data synthesis method based on a physiologically-constrained adaptive hierarchical copula. This approach leverages LLMs for optimization, enabling the generation of diverse and realistic physiological data that preserves underlying statistical and physiological properties.
• Profile-Aided Distillation of Expert Inference: To enhance the performance of small models on edge devices, we introduce a profile-aided distillation framework which integrates user-specific information to enable efficient and personalized inference, overcoming the limitations of standard LoRA finetuning.
• Experimental Results: Extensive experiments on both public and in-house datasets demonstrate that the proposed framework achieves performance comparable to SOTA LLMs while running efficiently on resource-constrained edge devices. The results validate the effectiveness of the proposed methods in enabling personalized sleep analysis and interaction.
2 Related work
2.1 LLM for health
LLMs have demonstrated significant potential in healthcare, particularly in the analysis of physiological signals to provide personalized health insights. Physiological data, including ECG, photoplethysmograms (PPG), and respiratory waveforms, are critical for understanding individual health conditions. Recent advancements have explored the integration of LLMs with these data to enhance health analysis and decision-making processes. For instance, MedTsLLM (Chan et al., 2024) introduces a multimodal framework that integrates time-series data and textual context to perform tasks such as semantic segmentation, boundary detection, and anomaly detection in physiological signals. Similarly, PhysioLLM (Fang et al., 2024) combines wearable sensor data with contextual information to generate personalized health insights, enabling users to explore correlations within their physiological data and receive actionable recommendations. Health-LLM (Kim et al., 2024) further demonstrates the utility of LLMs in interpreting physiological signals, such as resting heart rate and sleep metrics, to provide context-aware health predictions.
Despite these advancements, existing methodologies face several limitations. First, the substantial computational demands of large models, such as GPT-4o and Qwen-max (Chen et al., 2025), render them unsuitable for deployment on resource-constrained edge devices. Second, LLMs are not inherently designed for numerical inference, which limits their capacity to directly process continuous physiological signals, necessitating feature extraction or multimodal approaches (Chan et al., 2024). Finally, the scarcity of publicly available and diverse physiological data sets hinders the development and validation of robust models (Fang et al., 2024). To address these challenges, our work integrates LLMs with synthesized physiological data, enabling real-time, personalized health analysis on edge devices while mitigating computational and dataset limitations.
2.2 Data synthesis
The limited availability of large-scale wearable datasets in the real world remains a significant challenge in personalized health applications, which hinders effective generalization (Merrill et al., 2024). Generative models, including Variational Autoencoders (VAE) (Kingma, 2013), Generative Adversarial Networks (GAN) (Goodfellow et al., 2014), Normalizing Flows (Rezende and Mohamed, 2015), and Diffusion Models (Rombach et al., 2022), have emerged as effective tools to address this limitation. By generating synthetic data that approximates the original data distribution, these models augment training datasets, thereby enhancing model performance, particularly when real data is scarce.
Among these methodologies, copula-based models have garnered attention for their capability to capture intricate dependencies between variables while preserving marginal distributions (Kamthe et al., 2021). Building upon this foundation, Hierarchical Copulas extend the traditional copula framework by introducing a multi-level structure that models dependencies at varying granularities. This hierarchical approach is particularly advantageous for the synthesis of physiological data, as it can capture both global trends and local variations within the data. Compared to GANs, which often require extensive tuning and are susceptible to mode collapse, Hierarchical Copulas offer a more interpretable and robust alternative to generate high-fidelity synthetic data (Kamthe et al., 2021). Using this methodology, our work addresses the scarcity of publicly available physiological datasets, enabling the development of more accurate and generalizable machine learning models for personalized health applications. Furthermore, these synthesized data serve as a foundation for the deployment of efficient LLMs in resource-constrained environments.
2.3 Model distillation and personalized inference
Owing to resource constraints and real-time requirements (Huang et al., 2024), numerous studies have focused on distilling LLMs to transfer knowledge (McDonald et al., 2024), inference capabilities (Li Y. et al., 2024; Kang et al., 2024), and domain expertise (Yuan et al., 2024) into smaller, more efficient models. Model distillation techniques, such as LoRA (Hu et al., 2022) and MoE (Yang et al., 2024), have emerged as promising solutions for edge computing scenarios, where computational resources are limited.
Methodologies such as SOCRATIC CoT (Shridhar et al., 2022) and KARD (Kang et al., 2024) have demonstrated significant potential in numerical inference and factual judgment. However, they often lack the nuanced inference necessary for personalized health analysis, such as interpreting subtle physiological variations or delivering context-aware recommendations. While CoT techniques (Wei et al., 2022; Kojima et al., 2022; Fu et al., 2023) excel in step-by-step inference, they fall short in addressing the complexities inherent in personalized health scenarios, where inference must adapt to individual profiles and dynamic contexts.
Recent advancements, such as MixLoRA (Li D. et al., 2024) and MoRAL (Yang et al., 2024), enable parameter-efficient multi-task fine-tuning, rendering them suitable for deployment on edge devices. However, these methodologies face challenges such as overfitting and the balance between specialization and generalization. For instance, MoDE-CoTD (Li X. et al., 2024) introduces a novel approach by decoupling inference abilities into multiple LoRA-Experts, which are then combined to handle both seen and unseen inference tasks. Despite these innovations, achieving robust generalization while maintaining computational efficiency remains an unresolved challenge. Our work builds upon these techniques by combining LoRA and MoE to achieve efficient, personalized inference on edge devices, addressing the dual challenges of overfitting and dynamic adaptation to individual health profiles.
A concise summary of related work is illustrated in Table 1, highlighting the strengths and limitations of existing methodologies in the context of personalized inference and edge computing.
3 Methodology
3.1 Method overview
The overall system architecture is illustrated in Figure 2. Raw physiological data are first processed to extract HRV features and other relevant parameters, which are then used by the PC-AHC-LLM module to generate synthetic data, addressing data scarcity and enhancing diversity while maintaining physiological plausibility. Both synthetic and real data are subsequently used by the large language model (LLM) to generate data for three downstream tasks: sleep reports, personalized questions, and knowledge-based questions.
Figure 2. Overview of the proposed framework. Raw physiological data are first processed to extract HRV features and other relevant parameters, which are then used by the PC-AHC-LLM module to generate synthetic data, addressing data scarcity and enhancing diversity. Both synthetic and real data are utilized by the LLM to generate sleep reports, personalized questions, and knowledge questions. These downstream tasks are managed by the Profile-Aided Distillation of Expert Inference (PADEI) module. Specifically, the PA-CoT (Profile-Aided Chain-of-Thought) mechanism is first loaded to incorporate user profiles (Profile(p)), which guide the routing process within the student model. The teacher model provides supervision to train the student model through knowledge distillation. Once trained, the student model is compiled and deployed on edge devices (e.g., RK3588) for efficient on-device inference, enabling personalized sleep analysis and user interaction with limited computational resources.
These tasks are managed by the Profile-Aided Distillation of Expert Inference (PADEI) module. Specifically, the PA-CoT (Profile-Aided Chain-of-Thought) mechanism first loads user profiles, which guide both the selection of inference templates and the routing process within the student model; the teacher model then supervises the student through knowledge distillation before the trained student is deployed on edge devices.
3.2 Physiologically-constrained adaptive hierarchical copula with LLM-guided optimization (PC-AHC-LLM)
The Physiologically-Constrained Adaptive Hierarchical Copula with LLM Guidance framework, illustrated in Algorithm 1, generates synthetic physiological sleep data by synergistically combining hierarchical copula modeling with LLM-guided optimization. Specifically, LLMs are employed to derive and embed physiological constraints during the modeling process. This ensures the resulting synthetic data exhibits realistic patterns suitable for downstream sleep health analysis applications.
To provide a clear and concrete illustration of the PC-AHC-LLM workflow, we focus on a representative set of seven physiological parameters: the four HRV metrics SDNN, RMSSD, LF/HF, and PNN50, as well as total sleep duration, deep sleep duration, and light sleep duration. These parameters are widely recognized as important indicators of autonomic nervous system activity and sleep architecture, and are therefore selected to exemplify the algorithmic steps in the following subsections. The overall workflow consists of three main stages: (1) physiological constraint extraction, (2) LLM-guided copula optimization, and (3) physiologically-guided sampling and synthesis. The main notations used throughout this section are summarized in Table 2.
3.2.1 Physiological constraint extraction
We leverage the LLM to systematically extract physiological constraint information and domain knowledge, as shown in Equation 1. Formally, given a structured query template
The extracted constraints are summarized in Equations 2, 3.
• Sleep Architecture Constraints:
• Heart Rate Variability (HRV) Constraints (Equations 4, 5):
• Clinical Thresholds (Equations 6, 7):
Compared to traditional expert-defined rules, which are often manually curated and may be limited in scope or require frequent updates, the LLM-driven extraction process enables systematic and scalable identification of physiological constraints. This approach facilitates rapid adaptation to new datasets and tasks, and helps ensure that the synthesized data remains consistent with up-to-date clinical understanding.
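To illustrate how extracted constraints can be enforced during synthesis, the sketch below encodes hypothetical range and structural rules for the seven example parameters. The numeric bounds are placeholders for illustration only, not the constraints actually derived by the LLM in this work.

```python
# Hypothetical constraint set for the seven example parameters.
# All ranges are illustrative placeholders, not the paper's
# LLM-extracted values.
CONSTRAINTS = {
    "sdnn_ms":       (20.0, 200.0),  # HRV: overall variability
    "rmssd_ms":      (10.0, 150.0),  # HRV: short-term variability
    "lf_hf":         (0.5, 4.0),     # sympathovagal balance
    "pnn50_pct":     (0.0, 60.0),
    "total_sleep_h": (3.0, 12.0),
    "deep_sleep_h":  (0.0, 4.0),
    "light_sleep_h": (0.0, 8.0),
}

def satisfies_constraints(sample: dict) -> bool:
    """Check marginal ranges plus one structural rule:
    sleep-stage durations cannot exceed total sleep time."""
    for name, (lo, hi) in CONSTRAINTS.items():
        if not (lo <= sample[name] <= hi):
            return False
    return sample["deep_sleep_h"] + sample["light_sleep_h"] <= sample["total_sleep_h"]
```

Rejection of out-of-range or structurally impossible samples at synthesis time is one simple way such constraints can be operationalized.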
3.2.2 LLM-guided copula optimization
Based on the extracted constraints, the LLM optimizes the copula modeling process through three key steps.
• Optimal Variable Transformations: The LLM designs constraint-preserving transformations to standardize physiological parameters into latent Gaussian variables. In this study, sleep duration and HRV parameters are transformed via log-normalization and bounded scaling. The transformed data is denoted as
• Physiological Subsystem Grouping: The LLM analyzes inter-variable correlations and physiological pathways to identify meaningful subsystems, denoted as
• Sleep Architecture Subsystem (
• Autonomic Regulation Subsystem (
• Copula Family Selection: For each subsystem
• Vine Copula Structure Construction: The LLM constructs a vine copula structure
3.2.3 Physiologically-guided sampling and synthesis
• Physiologically-Guided Sampling Strategy: The sampling strategy
• Synthetic samples are generated through a three-step process, as described in Equations 8–10. First, correlated uniform variables are sampled from the hierarchical copula structure (Equation 8). These samples are then transformed into standardized latent variables via the inverse Gaussian CDF (Equation 9). Finally, the latent variables are mapped back to the original parameter space through inverse transformations (Equation 10):
This process ensures that the generated data not only preserves statistical dependencies but also aligns with real-world clinical patterns. By integrating the mathematical rigor of hierarchical copula modeling with domain-specific knowledge extracted by the LLM, the proposed method generates synthetic sleep data that is both statistically robust and clinically meaningful. Embedding physiological constraints directly into the sampling and transformation processes ensures that the resulting dataset is suitable for downstream machine learning and clinical analysis.
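The three sampling steps can be sketched with a plain Gaussian copula standing in for the hierarchical/vine structure. The correlation value and log-normal marginals below are illustrative assumptions, not parameters fitted in the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Step 1: sample correlated uniforms from a Gaussian copula
# (stand-in for the hierarchical/vine structure in the paper).
corr = np.array([[1.0, 0.6], [0.6, 1.0]])     # e.g., SDNN vs. RMSSD
chol = np.linalg.cholesky(corr)
z = rng.standard_normal((1000, 2)) @ chol.T
u = stats.norm.cdf(z)                          # correlated U(0,1) samples

# Step 2: inverse Gaussian CDF maps uniforms to standardized latents.
latent = stats.norm.ppf(u)

# Step 3: inverse marginal transforms back to the original scale.
# Log-normal marginals are an illustrative choice for HRV metrics.
sdnn = np.exp(3.9 + 0.4 * latent[:, 0])        # ~ SDNN in ms
rmssd = np.exp(3.5 + 0.5 * latent[:, 1])       # ~ RMSSD in ms

# Rank correlation survives the monotone marginal transforms.
rho, _ = stats.spearmanr(sdnn, rmssd)
```

Because each marginal transform is monotone, the dependence structure imposed in Step 1 is preserved in the final samples.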
3.3 Task data generation
The pipeline for data synthesis is illustrated in Figure 3. The sample data is processed via the PC-AHC-LLM framework to generate synthetic physiological parameters. Based on this synthetic data, an LLM generates comprehensive sleep reports encompassing sleep-related parameters, descriptions of sleep states, and personalized suggestions. Finally, these sleep reports serve as the basis for generating three personalized questions and three domain-specific knowledge questions for each report, culminating in a comprehensive and realistic dataset for downstream model training and evaluation. Concretely, GPT-4o is employed to generate both the sleep report dataset, which includes structured sleep-related parameters, descriptive summaries, and personalized suggestions tailored to individual profiles, and the question dataset, comprising personalized questions grounded in user-specific sleep data and domain-specific knowledge questions covering general sleep science and best practices.
Figure 3. The pipeline for data synthesis and report/question generation. Sample data is processed through the PC-AHC-LLM framework to generate synthetic parameters. Based on the synthetic data, an LLM generates sleep reports containing sleep-related parameters, descriptions of sleep states, and personalized suggestions. Subsequently, the sleep reports are used to generate three personalized questions and three domain-specific knowledge questions for each report, creating a comprehensive dataset for downstream applications.
For each dataset, we performed data synthesis, yielding a total of 5,000 synthetic data samples. This sample size was strategically chosen to ensure the creation of a comprehensive and diverse dataset, which is crucial for the robust and thorough evaluation of our model’s performance across a wide spectrum of sleep profiles. While this scale significantly increases the computational demands for data synthesis, sleep report generation, and Q&A, we deemed it a necessary investment to rigorously assess the model’s capabilities and generalizability. Subsequently, GPT-4o was employed to generate sleep reports based on these synthesized data. These reports included sleep-related parameters, descriptions of sleep states, such as users’ cardiac health, stress resilience, and other related conditions, and personalized suggestions, providing tailored guidance specific to the evaluated sleep profiles. The prompt utilized for generating the sleep reports is shown in Table 3.
Subsequently, based on the generated sleep reports, we utilized Prompt 2 (shown in Table 4) to produce personalized questions and domain-specific knowledge questions, further enriching the dataset for downstream applications.
3.4 Profile-aided distillation of expert inference
In this section, we introduce the Profile-Aided Distillation of Expert Inference (PADEI) framework, illustrated in Figure 4. PADEI leverages profile-aided Chain-of-Thought prompting and LoRA-based Mixture of Experts distillation to enhance the performance of a compact language model across three core tasks: sleep report generation, personalized Q&A, and knowledge Q&A.
Figure 4. Architecture of the proposed PADEI framework. Input tasks (Report Generation, Personalized Q&A, Knowledge Q&A) are processed using profile-aided CoT prompting. A profile-aided Router dynamically activates a subset of six specialized LoRA adapters within the MoE layer. The outputs of the activated adapters are merged and integrated with the base LLM’s layers to generate the final task-specific responses. The output is supervised by a combination of cross-entropy loss, profile alignment loss, auxiliary load balancing loss, and activation frequency regularization loss to ensure both task performance and balanced expert utilization.
3.4.1 Profile-aided chain-of-thought (PA-CoT)
Through data collection and preliminary experiments, we observed that the diversity of personalized questions across different user groups poses significant challenges for using a single fixed CoT to guide a compact model in answering questions effectively. To address this, we propose the Profile-Aided Chain-of-Thought (PA-CoT), which clusters users based on their profiles and dynamically adapts inference paths to each cluster.
3.4.1.1 Clustering based on user profiles
Users are grouped into clusters according to their health-related parameters, which effectively distinguish between different health states. The user profile vector is defined in Equation 11 as:
where
For each cluster, a brief guide is introduced to contextualize the inference process. For example:
• Cluster 1: High Variability, Low Stress. Users in this group exhibit high HRV metrics (e.g., elevated
• Cluster 2: Low Variability, High Stress. Users in this group show reduced HRV metrics (e.g., low
• Cluster 3: Moderate Variability, Moderate Stress. Users in this cluster have moderate HRV metrics but exhibit patterns that may correlate with sleep irregularities (for example, variable
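The profile clustering step can be illustrated with a minimal k-means (Lloyd's algorithm) over toy HRV profile vectors. The feature values and the choice of three clusters mirror the example groups above but are otherwise invented for this sketch.

```python
import numpy as np

# Toy profile vectors [SDNN, RMSSD, LF/HF, PNN50]; the values are
# synthetic placeholders, not measurements from the study.
profiles = np.array([
    [160.0, 90.0, 1.0, 45.0],   # high variability, low stress
    [150.0, 85.0, 1.2, 40.0],
    [35.0, 15.0, 3.5, 5.0],     # low variability, high stress
    [40.0, 18.0, 3.2, 8.0],
    [80.0, 45.0, 2.0, 20.0],    # moderate variability, moderate stress
    [85.0, 50.0, 1.9, 22.0],
])

X = (profiles - profiles.mean(0)) / profiles.std(0)  # standardize features

centroids = X[[0, 2, 4]].copy()     # one seed per expected group
for _ in range(10):                 # plain Lloyd's algorithm
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(3)])
```

Each resulting label would then select the cluster-specific preamble used to contextualize inference.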
3.4.1.2 Combining clusters with CoT inference
Once users are assigned to a cluster, the inference process integrates the cluster-specific preamble with task-specific CoT templates. Formally, the inference chain is defined in Equation 12 as:
where
Figure 5. Profile-Aided CoT prompting strategy for the three task types. For Task 1 (Sleep Report Generation), a standardized prompt (CoTg) is employed to ensure consistent output format. For Tasks 2 (Personalized Q&A) and 3 (Knowledge Q&A), a shared prompting logic is utilized: user profiles, derived from HRV parameters, are first clustered to establish context-specific focus. Subsequently, input questions are classified (S1-S3) based on the type of information required, which guides the selection of the appropriate inference path (CoTs1-CoTs3). The chosen inference path is then executed combined with the context determined by the user’s cluster.
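Following the logic of Figure 5, prompt assembly can be sketched as a lookup that pairs a cluster-specific preamble with the selected CoT template. The template names (CoTg, CoTs1–CoTs3) come from the figure, but all string contents here are illustrative placeholders.

```python
# Hypothetical preambles and CoT templates; the wording is invented
# for illustration and does not reproduce the paper's prompts.
PREAMBLES = {
    0: "User shows high HRV and low stress; focus on maintenance advice.",
    1: "User shows low HRV and high stress; focus on recovery strategies.",
    2: "User shows moderate HRV; focus on sleep-regularity patterns.",
}
COT_TEMPLATES = {
    "report": "CoTg: extract parameters -> summarize state -> recommend.",
    "S1": "CoTs1: locate value in report -> interpret -> answer.",
    "S2": "CoTs2: compare to clinical range -> explain -> answer.",
    "S3": "CoTs3: recall domain knowledge -> ground in context -> answer.",
}

def build_prompt(task: str, question_type: str, cluster: int) -> str:
    """Combine the cluster-specific preamble with the task's CoT
    template, analogous to the composition in Equation 12."""
    if task == "report":                 # Task 1 uses a fixed template
        return COT_TEMPLATES["report"]
    return PREAMBLES[cluster] + "\n" + COT_TEMPLATES[question_type]
```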
3.4.2 Integration with MoE architecture
The proposed PA-CoT approach integrates with our LoRA-based MoE architecture to enhance personalized inference capabilities. As illustrated in Algorithm 2, user profile information serves a dual purpose: it guides the selection of appropriate CoT templates and influences the router’s expert assignment mechanism. This integration enables the model to dynamically adapt its inference pathways based on individual user characteristics.
The profile-aided routing mechanism is formalized as:
where
3.4.2.1 Expert configuration
The MoE architecture in our framework employs six LoRA adapters (
3.4.2.2 Dynamic weight allocation
The router dynamically assigns weights (
Let
The final sparse weight vector is then given by
where
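Assuming the standard softmax-plus-top-k gating used in sparse MoE layers, the router's sparse weight vector can be sketched as follows; the paper's exact gating formula may differ in detail.

```python
import numpy as np

def topk_router(gate_logits: np.ndarray, k: int = 2) -> np.ndarray:
    """Softmax gating with top-k sparsification and renormalization:
    only the k largest gates stay active, and their weights are
    rescaled to sum to one."""
    shifted = gate_logits - gate_logits.max()     # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    top = np.argsort(probs)[-k:]                  # indices of k largest gates
    sparse = np.zeros_like(probs)
    sparse[top] = probs[top] / probs[top].sum()   # renormalize survivors
    return sparse

# Six logits, one per LoRA expert adapter (values are illustrative).
w = topk_router(np.array([2.0, 0.5, 1.0, -1.0, 0.0, 0.3]), k=2)
```

With k = 2 of six experts active per token, only two adapters contribute to the merged output, which keeps inference cost low on edge hardware.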
3.4.2.3 Loss formulation
To mitigate load imbalance among expert adapters due to varying task sample sizes and the top-k routing mechanism, we incorporate an auxiliary loss alongside the standard cross-entropy loss as a regularization term. In addition, we introduce a profile alignment loss to ensure consistency between CoT inference and user profiles. The profile loss is defined in Equation 17:
where
The auxiliary load balancing loss encourages an even distribution of task load across experts, as defined in Equation 18:
where
The activation frequency regularization loss ensures uniform activation of all experts, as defined in Equation 19:
where
The total loss function is a weighted sum of the components, as given in Equation 20:
where
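A minimal sketch of the composite objective, assuming a Switch-Transformer-style form for the load-balancing term and illustrative loss weights; the paper's Equations 17–20 may use different formulations.

```python
import numpy as np

def load_balance_loss(router_probs: np.ndarray, expert_mask: np.ndarray) -> float:
    """Switch-Transformer-style auxiliary loss N * sum_i f_i * P_i,
    where f_i is the fraction of tokens routed to expert i and P_i
    the mean gate probability. One common form; the paper's Eq. 18
    may differ. Minimum value 1.0 is reached at uniform routing."""
    n_experts = router_probs.shape[1]
    f = expert_mask.mean(axis=0)      # realized load per expert
    p = router_probs.mean(axis=0)     # average gate probability per expert
    return float(n_experts * (f * p).sum())

def total_loss(ce, l_profile, l_aux, l_freq, lambdas=(1.0, 0.1, 0.01, 0.01)):
    """Weighted sum of the four components (cf. Equation 20);
    the lambda values here are illustrative defaults only."""
    w0, w1, w2, w3 = lambdas
    return w0 * ce + w1 * l_profile + w2 * l_aux + w3 * l_freq
```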
4 Experiments
4.1 Dataset
4.1.1 Public dataset
This study utilizes several publicly available, multimodal sleep-related datasets, summarized in Table 5, primarily sourced from PhysioNet, to enable cross-population analysis and support robust sleep health research. All selected datasets include ECG recordings acquired during sleep, which ensures the reliable extraction of four key HRV metrics: SDNN, RMSSD, LF/HF, and PNN50. These HRV metrics are widely recognized as critical indicators of autonomic nervous system (ANS) activity and sleep quality, and are essential for characterizing sleep-related cardiac dynamics.
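The time-domain HRV metrics named above follow standard definitions and can be computed directly from RR intervals; LF/HF additionally requires spectral analysis of the RR series and is omitted from this sketch.

```python
import numpy as np

def hrv_time_domain(rr_ms: np.ndarray) -> dict:
    """Time-domain HRV metrics from RR intervals in milliseconds,
    using the standard textbook definitions."""
    diffs = np.diff(rr_ms)
    return {
        "SDNN": float(np.std(rr_ms, ddof=1)),              # overall variability
        "RMSSD": float(np.sqrt(np.mean(diffs ** 2))),      # beat-to-beat variability
        "PNN50": float(np.mean(np.abs(diffs) > 50) * 100), # % successive diffs > 50 ms
    }

# Toy RR series for illustration (not data from any study cohort).
rr = np.array([800, 810, 790, 850, 820, 880, 795, 805], dtype=float)
metrics = hrv_time_domain(rr)
```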
4.1.2 In-house dataset
In addition to publicly available datasets, we established an in-house dataset as part of a population-based sleep study at Jinan University, involving 160 participants. The study, entitled “Survey and Analysis of the Health Status of Diabetic Patients Post-COVID-19 Pandemic”, was approved by the Ethics Committee of the First Affiliated Hospital of Jinan University (Approval Number: KY-2023299) and conducted in accordance with the Declaration of Helsinki. Sleep ECG data were recorded using the Bodyguard 2 device at a sampling rate of 1,000 Hz to ensure precision for HRV analysis. Following data cleaning and quality review by a board-certified physician, 162 valid overnight recordings were retained.
To further strengthen real-world representativeness and prospective evaluation, we conducted a second phase of data collection and enrolled an additional 150 participants. This cohort was intentionally enriched with children and older adults to increase physiological diversity, and included structured questionnaires on behavioral factors that may affect sleep (e.g., alcohol consumption and exercise habits). All recordings followed the same acquisition protocol and device specifications as in the first phase.
For model development and validation, 50 participants from the second-phase cohort were randomly selected and integrated with the first-phase data for training and synthesis. The remaining 100 participants were strictly held out to form an independent, prospectively collected test set that was time-separated from model development and used exclusively for final evaluation. This prospective test set was also used to assess report-level clinical event detection against physician-adjudicated references for two prespecified endpoints: autonomic dysfunction (SDNN
4.2 Training implementation
All experiments in this study were conducted following a unified workflow comprising data synthesis, model training, and evaluation. During the data synthesis stage, we employed the GPT-4o API to generate both physiological parameters and question-answer pairs, ensuring the diversity and physiological plausibility of the synthetic data. The synthesized datasets were carefully balanced across six sources, with equal sampling for each downstream task to mitigate potential data bias.
For model training, we adopted a teacher-student paradigm, where Qwen-max served as the teacher model and Qwen2.5 0.5B as the student. The training set consisted of 1,800 sleep report generation tasks, 4,800 personalized Q&A tasks, and 4,800 knowledge Q&A tasks. All models were trained using PyTorch 2.0.1 with CUDA 12.0 on a workstation equipped with an NVIDIA 4090 Ti GPU and Ubuntu 22.04. The batch size was set to 64, and the learning rate was
For evaluation, the Claude Sonnet 3.7 API was utilized as an LLM-based judge to assess model outputs. The evaluation set comprised 200 sleep report generation tasks, 1,200 personalized Q&A tasks, and 1,200 knowledge Q&A tasks. Each answer was independently evaluated three times, and the final score for each dimension was computed as the average of these ratings to reduce the impact of stochasticity in LLM-based assessment. Inference and real-time testing were conducted on an RK 3588 device with 8 GB RAM running a Linux-based operating system, simulating deployment in resource-constrained edge environments. Key metrics such as inference time and memory usage were recorded to assess the practical feasibility of the proposed framework.
For comprehensive benchmarking, the performance of the proposed method was compared with several SOTA and finetuned models on both internal and public datasets. All comparative experiments were conducted under identical hardware and software conditions, with consistent prompt settings to ensure fairness.
4.3 Evaluation metrics
4.3.1 Synthetic data evaluation
The quality of synthetic data in the era of LLMs can be evaluated across two dimensions (Gan and Liu, 2024): diversity and faithfulness. Diversity measures the variety and coverage of generated samples relative to the original dataset, ensuring broad representation of potential scenarios. Faithfulness assesses how closely the synthetic data adheres to the statistical and distributional properties of the real data, preserving key characteristics for reliable downstream use. Within the sleep scenario described in this study, the balance between faithfulness and diversity is task-dependent. Specifically, for report generation, where synthetic sleep-related parameters are involved, faithfulness is prioritized over diversity. This is because these parameters are quantitative and utilized in clinical or scientific settings, necessitating high fidelity to ensure accuracy and reliability. In contrast, tasks like personalized Q&A may benefit from greater diversity to capture a wider range of user queries and responses. To evaluate these dimensions, we employ a combination of quantitative metrics. Faithfulness is assessed using the Kullback-Leibler (KL) divergence, which quantifies the similarity between synthetic and real data distributions, and the Kolmogorov-Smirnov (KS) test, which compares empirical cumulative distribution functions (CDFs) to detect significant differences. Diversity is measured via the Hilbert-Schmidt Independence Criterion (HSIC), which quantifies the dependence structure among variables in the data. These metrics collectively ensure a comprehensive assessment of synthetic data quality, tailored to the demands of personalized sleep analysis applications.
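The three metrics can be sketched on synthetic one-dimensional data as follows. The Gaussian stand-ins, histogram binning, and RBF kernel bandwidth are illustrative choices, not the study's actual data or settings.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
real = rng.normal(50.0, 10.0, 500)    # stand-in for a real SDNN sample
synth = rng.normal(51.0, 10.0, 500)   # stand-in for its synthetic counterpart

# Faithfulness (1): KL divergence between shared-bin histograms.
edges = np.histogram_bin_edges(np.concatenate([real, synth]), bins=30)
p, _ = np.histogram(real, bins=edges)
q, _ = np.histogram(synth, bins=edges)
p = p / p.sum()
q = q / q.sum()
eps = 1e-10                           # guards against empty bins
kl = float(np.sum(p * np.log((p + eps) / (q + eps))))

# Faithfulness (2): two-sample Kolmogorov-Smirnov test on empirical CDFs.
ks_stat, ks_p = stats.ks_2samp(real, synth)

# Dependence structure: a biased HSIC estimate with RBF kernels.
def hsic_rbf(x, y, sigma=10.0):
    x, y = x[:, None], y[:, None]
    K = np.exp(-(x - x.T) ** 2 / (2 * sigma ** 2))
    L = np.exp(-(y - y.T) ** 2 / (2 * sigma ** 2))
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return float(np.trace(K @ H @ L @ H) / (n - 1) ** 2)

h = hsic_rbf(real[:200], synth[:200])     # near 0 for independent samples
```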
Evaluating the effectiveness of personalized questions is, by contrast, an inherently subjective task. To address this, we conducted an expert evaluation, inviting three sleep center physicians to assess the generated questions along two dimensions: relevance and diversity. Each dimension was scored from 1 to 5, with higher scores indicating better performance. This approach ensures that the generated questions are not only tailored to individual users but also contextually appropriate and diverse, providing a robust foundation for downstream applications.
4.3.2 Model output evaluation
Given that traditional evaluation metrics such as BLEU (Reiter, 2018), ROUGE (Lin and Och, 2004), and BERTScore (Zhang et al., 2019) are insufficient to differentiate model performance effectively in this context, we adopt the LLM-as-a-Judge paradigm for evaluation. Specifically, Claude 3.7 Sonnet is employed as the evaluator, leveraging its advanced inference and contextual understanding capabilities. Inspired by RAGAS (Es et al., 2024), the evaluation framework assesses model performance along four key dimensions.
• Personalization (Pers.): Evaluates the extent to which the recommendations and responses generated are tailored to the individual user’s data and specific needs.
• Relevance (Rel.): Measures the alignment of responses with the user’s context and the specific questions posed, ensuring that the information provided is pertinent and contextually appropriate.
• Completeness (Comp.): Assesses whether the responses adequately address all aspects of the query, ensuring that no critical details are omitted.
• Accuracy (Acc.): Evaluates the correctness and validity of the information provided, with a focus on domain-specific knowledge and the precision of personalized advice.
To mitigate the inherent stochasticity of the LLM-based evaluation, each answer is independently evaluated three times, and the final score for each dimension is calculated as the average of these ratings. The evaluation prompt is designed to guide the LLM in scoring the model outputs. The specific task and rating criteria are detailed in Table 6.
This LLM-as-a-Judge approach ensures a robust and nuanced evaluation of model performance, leveraging the advanced inference capabilities of Claude 3.7 Sonnet to provide detailed and context-aware assessments. In addition, a small subset of samples was randomly selected from the test sets of each dataset for manual evaluation; this human assessment independently validates the LLM-based evaluation and provides complementary insight into the clinical appropriateness of generated outputs. Inter-rater reliability among the human annotators was quantified using Fleiss' kappa.
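Fleiss' kappa for the three annotators can be computed from an items-by-categories count matrix; the sketch below follows the standard formula (Fleiss, 1971) and uses invented toy ratings, not the study's actual annotations.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an (items x categories) matrix of rating counts;
    each row sums to the number of raters (here, three physicians)."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    p_j = counts.sum(axis=0) / (n_items * n_raters)       # category prevalence
    P_i = ((counts**2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), (p_j**2).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy data: 6 items rated by 3 raters into {low, mid, high} score bins
ratings = [[0, 0, 3], [0, 1, 2], [0, 0, 3], [1, 2, 0], [0, 3, 0], [0, 0, 3]]
kappa = fleiss_kappa(ratings)
print(f"Fleiss' kappa = {kappa:.3f}")
```

Perfect agreement yields kappa = 1, chance-level agreement yields values near 0.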
4.4 Data synthesis
4.4.1 Parameter synthesis
4.4.1.1 Distributional similarity
The results in Table 7 demonstrate the performance of different synthetic data generation methods in terms of faithfulness (KL divergence), diversity (HSIC), and distributional similarity (Kolmogorov-Smirnov, KS, p-value). Our proposed PC-AHC-LLM method achieves a KL divergence of 0.87 and a KS p-value of 0.14, indicating high faithfulness to the original data distribution and no significant differences from real-world samples, which is critical for tasks such as report generation that require accurate and reliable synthetic HRV parameters. These metrics were computed by directly comparing synthetic outputs against real values from public sources, including the DREAMT and HMCSS datasets, ensuring that key physiological parameters (e.g., SDNN, RMSSD, LF/HF, and PNN50) exhibit realistic patterns aligned with established clinical thresholds.
4.4.1.2 Clinical event detection from generated reports
To evaluate clinical utility beyond distributional similarity, we assessed the ability of our framework to detect clinically meaningful events from generated sleep reports on the independent prospective test set (N = 100). Two prespecified endpoints were defined with physician adjudication: autonomic dysfunction (based on a prespecified SDNN threshold) and sympathetic dominance (based on a prespecified LF/HF threshold).
For autonomic dysfunction, the model achieved a sensitivity of 0.86 [95% CI: 0.73, 0.95] and a specificity of 0.84 [95% CI: 0.74, 0.91]. For sympathetic dominance, sensitivity was 0.84 [95% CI: 0.68, 0.94] and specificity was 0.86 [95% CI: 0.77, 0.92]. These results demonstrate that the framework can reliably identify clinically relevant events from generated reports in a prospective, real-world setting. Detailed endpoint definitions, report parsing rules, and confusion matrices are provided in Supplementary Material S2.
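For reference, sensitivity, specificity, and their 95% confidence intervals can be derived from confusion counts as sketched below. The counts are hypothetical, and the Wilson score interval is one common construction; the study's exact interval method may differ.

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% with z=1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical confusion counts for one endpoint on an N = 100 test set
tp, fn, tn, fp = 36, 6, 49, 9
sens, spec = tp / (tp + fn), tn / (tn + fp)
print(f"sensitivity = {sens:.2f}, 95% CI {wilson_ci(tp, tp + fn)}")
print(f"specificity = {spec:.2f}, 95% CI {wilson_ci(tn, tn + fp)}")
```

The same two calls, applied to each endpoint's adjudicated confusion matrix, reproduce interval estimates of the form reported above.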
4.4.2 Questions generation
For personalized questions generated from sleep reports, evaluation is inherently subjective due to the individualized nature of personalization, which lacks a fixed objective ground truth. Instead, we rely on expert human assessment as the proxy ground truth, conducted by three sleep medicine physicians whose judgments are grounded in clinical expertise. For each sleep sample, multiple questions were generated; to address data volume concerns, we randomly selected two questions per sample for evaluation. Physicians independently rated these on a scale of 1-5 across two dimensions: relevance (degree of alignment with user-specific sleep data and clinical facts, such as HRV implications for stress resilience) and diversity (variety in phrasing and coverage of sleep health aspects, ensuring broad applicability without redundancy). Ratings followed predefined criteria to minimize bias, including examples of high-relevance questions (e.g., those tying directly to individual HRV metrics). The final score for each dimension was computed as the average of all annotators’ ratings across evaluated questions, yielding averages of 4.1 for relevance and 4.2 for diversity (as reported in the Experiments section). This structured expert evaluation provides a reliable, replicable mechanism to confirm that the generated questions are contextually appropriate, sufficiently personalized, and varied for downstream applications like user-specific Q&A.
4.5 Personalized inference
4.5.1 Cluster validation
To validate the clusters derived from the user profile vectors and ensure the method's robustness, we conducted a series of statistical and clinical assessments. Cluster quality was evaluated using the Silhouette score (average 0.68, indicating good separation and cohesion) and the Calinski-Harabasz index (145.2, supporting three clusters as optimal). Robustness was confirmed through sensitivity analysis: bootstrapping with 100 resamples (80% data subsets) showed stable cluster assignments, with consistently high Jaccard similarity between resampled and full-data clusterings.
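The cluster-quality and stability checks described above can be sketched as follows, using synthetic stand-in profile vectors and 10 bootstrap rounds instead of 100 for brevity; the Jaccard similarity here is computed over co-clustered pairs, one common convention, which may differ from the study's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

rng = np.random.default_rng(1)
# Stand-in user-profile vectors: three separated groups in a 4-D feature space
X = np.vstack([rng.normal(m, 0.5, (60, 4)) for m in (0.0, 2.5, 5.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
sil = silhouette_score(X, km.labels_)
ch = calinski_harabasz_score(X, km.labels_)

# Bootstrap stability: refit on 80% subsets, compare pairwise co-assignment
jaccards = []
for _ in range(10):
    idx = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
    sub = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[idx])
    orig = km.labels_[idx]
    iu = np.triu_indices(len(idx), k=1)
    same_orig = (orig[:, None] == orig[None, :])[iu]
    same_sub = (sub[:, None] == sub[None, :])[iu]
    jaccards.append((same_orig & same_sub).sum() / (same_orig | same_sub).sum())

print(f"silhouette={sil:.2f}  CH={ch:.1f}  mean Jaccard={np.mean(jaccards):.2f}")
```

Comparing co-clustered pairs rather than raw labels avoids the label-permutation problem across bootstrap refits.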
Additionally, two sleep medicine physicians reviewed the clusters for physiological plausibility, rating them at 4.2/5 on average and confirming that the cluster labels are descriptive summaries rather than diagnostic tools. For example, the physicians judged the “Moderate Variability, Moderate Stress” cluster to represent a plausible intermediate physiological profile rather than a clinical category.
4.5.2 Task performance
Table 8 presents a comprehensive evaluation comparing our proposed framework against several classes of models. The results demonstrate that our MoE-LoRA architecture, applied to a 0.5B parameter student model, achieves performance comparable to Qwen-max within a small margin. Across the six distinct datasets, the average scores for our model and Qwen-max in personalization, relevance, completeness, and accuracy are statistically comparable, with differences typically falling within a minimal 0.1-point margin. This indicates a practical performance parity, establishing that our lightweight solution can match the high standard set by leading LLMs like Qwen-max and Gemini for personalized sleep analysis tasks. The strong concordance between automated metrics and manual physician assessments further validates our evaluation methodology.
While the standard 0.5B-class edge models (e.g., OPT-350M, Pythia-410M) and the baseline Qwen2.5 0.5B model show improved performance, their scores consistently remain below 4.0. Crucially, even standard LoRA fine-tuning (“0.5B (lora)”), which provides a modest uplift, remains substantially inferior to our proposed method and the top-tier commercial models. This comparison underscores that a simple fine-tuning approach is insufficient to bridge the performance gap. We verified that all between-model comparisons, particularly the parity between our model and Qwen-max, remained stable under the empirical equivalence margin.
In summary, these findings demonstrate the efficacy of the proposed profile-aided MoE-LoRA architecture. It successfully enables a lightweight student model (0.5B parameters) to achieve performance parity with SOTA LLMs, offering a computationally efficient alternative without sacrificing the quality required for personalized biomedical inference. Specific evidence supporting the model’s proficiency across different task types is provided through case studies in Supplementary Tables S8, S9.
4.5.3 Analysis of expert activation in MoE LoRA
To further validate the effectiveness of the MoE LoRA model, we analyzed the activation patterns of the six LoRA adapters during personalized inference tasks. The activation heatmap, as shown in Figure 6, illustrates the distribution of expert activations across three task types: report generation, personalized Q&A, and knowledge Q&A. Each row in the heatmap corresponds to a specific task instance, while each column represents a LoRA adapter.
Figure 6. LoRA Adapter Activation Heatmap: This heatmap illustrates the activation patterns of six LoRA adapters across three task types: report generation, personalized Q&A, and knowledge Q&A. The activation values range from 0 to 1, representing the strength of activation for each adapter. Adapters 1, 3, and 6 show strong activation for report generation tasks, Adapters 2, 4, and 6 for personalized Q&A, and Adapters 1, 5, and 6 for knowledge Q&A. The balanced and task-specific activation patterns highlight the model’s adaptability and specialization.
The results demonstrate that the MoE mechanism dynamically activates the most relevant experts in response to different task types. As illustrated in Figure 6, adapters 1, 3, and 6 are predominantly activated for report generation, adapters 2, 4, and 6 for personalized Q&A, and adapters 1, 5, and 6 for knowledge-based Q&A. A plausible interpretation of these activation patterns is that, during training, the MoE architecture implicitly encourages different adapters to specialize in distinct inference or analytical subtasks. For example, Adapters 1 and 3 may have developed a preference to extract and synthesize structured information required for report generation, while Adapters 2 and 4 are more tuned to the nuances of personalized Q&A, such as interpreting user profiles and contextual cues. Adapter 5 appears to be more involved in domain-specific knowledge inference, and Adapter 6, which is consistently activated across all tasks, likely serves as a general-purpose or knowledge-fusion expert, providing foundational support for various types of inference.
This emergent specialization is not manually assigned but arises organically from the data-driven training process and the dynamic routing mechanism. Thus, the observed activation patterns reflect the model’s ability to adaptively allocate computational resources, ensuring that the most relevant expertise is brought in for each type of task. Such a mechanism enhances both the robustness and interpretability of the model, as it allows for modular, task-aware inference in complex, multi-task personalized sleep analysis scenarios.
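The aggregation behind Figure 6 can be reproduced in spirit by averaging per-instance gate weights per task. The sketch below uses simulated gate logits biased toward the expert groups reported above; the bias magnitude, noise scale, and instance counts are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Expert groups reported in the text (adapter numbers 1-6 -> indices 0-5);
# the +2.5 gate bias and 0.5 noise scale are invented for illustration
pref = {"report": [0, 2, 5], "personalized_qa": [1, 3, 5], "knowledge_qa": [0, 4, 5]}
heat = {}
for task, idx in pref.items():
    logits = rng.normal(0.0, 0.5, (40, 6))     # 40 simulated task instances
    logits[:, idx] += 2.5
    heat[task] = softmax(logits).mean(axis=0)  # mean activation per adapter

for task, row in heat.items():
    top = sorted(int(i) + 1 for i in np.argsort(row)[-3:])
    print(f"{task}: top adapters {top}")
```

Each row of the resulting task-by-adapter matrix corresponds to one band of the heatmap, and the top-3 adapters per task recover the groupings described in the text.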
4.6 Ablation study
We disentangle the contributions of three components in the proposed framework: profile-aided Chain-of-Thought (PA-CoT), MoE-LoRA versus Single-LoRA under matched active parameter budgets, and profile-aided routing.
4.6.1 Effect of PA-CoT
We compare Global-CoT, PA-CoT without the cluster preamble, and full PA-CoT, on top of the same MoE-LoRA configuration and routing. The results in Table 9 show that the full PA-CoT model significantly outperforms Global-CoT, with the largest gains in the personalization scores for Personalized Q&A.
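A minimal sketch of how a PA-CoT prompt might be assembled, with the cluster preamble and profile fields as optional components mirroring the three ablation variants; the field names and wording below are hypothetical, not the study's actual template.

```python
def build_prompt(question, profile=None, cluster_label=None):
    """Assemble a chain-of-thought prompt; the cluster preamble and profile
    fields are optional, mirroring the Global-CoT / PA-CoT ablation variants."""
    parts = []
    if cluster_label is not None:            # PA-CoT cluster preamble
        parts.append(f"User group: {cluster_label}.")
    if profile is not None:                  # user-specific context
        parts.append("User profile: "
                     + "; ".join(f"{k}={v}" for k, v in profile.items()))
    parts.append(f"Question: {question}")
    parts.append("Let's reason step by step, grounding each step in the user's data.")
    return "\n".join(parts)

profile = {"SDNN": "48 ms", "LF/HF": "2.1", "sleep_efficiency": "86%"}
print(build_prompt("Why do I wake up tired?", profile,
                   "Moderate Variability, Moderate Stress"))
```

Omitting both optional arguments yields the Global-CoT variant; omitting only the cluster label yields PA-CoT without the preamble.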
4.6.2 Effect of MoE under matched active budgets
We compare Single-LoRA (rank-matched so that its active parameter count equals that of the MoE-LoRA configuration) against the full MoE-LoRA under otherwise identical training settings.
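The rank-matching logic can be made concrete with a small parameter count: if the router activates top-k of E rank-r adapters per layer, a single adapter of rank k·r has the same active LoRA parameter count. The hidden size, expert count, and ranks below are illustrative, not the study's actual configuration.

```python
def lora_params(d_in, d_out, rank):
    """Parameters of one LoRA adapter: A (d_in x r) plus B (r x d_out)."""
    return rank * (d_in + d_out)

d = 896                      # illustrative hidden size for a 0.5B-class model
n_experts, top_k, r_expert = 6, 2, 8

# Active parameters per layer when the router selects top_k of n_experts
active_moe = top_k * lora_params(d, d, r_expert)

# Rank-matching: a single adapter of rank top_k * r_expert has the same
# active parameter count, isolating the effect of expert specialization
r_single = top_k * r_expert
active_single = lora_params(d, d, r_single)

print(f"single rank={r_single}, active params: {active_single} vs {active_moe}")
```

Under this matching, any performance difference is attributable to expert specialization and routing rather than to extra active capacity.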
4.6.3 Effect of profile-aided routing
We compare content-only routing, profile-only routing, and the combined content + profile routing used in the full model.
Table 11. Routing ablations and expert utilization (mean ± SD).
Across the three ablations, PA-CoT yields the largest personalization improvements, MoE-LoRA provides capacity-controlled gains attributable to expert specialization, and profile-aided routing contributes additional improvements and healthier expert utilization.
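The routing variants compared above can be sketched as a gating function that optionally adds a profile term to the content logits; the dimensions and random projections below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d_tok, d_prof, n_experts = 32, 8, 6
W_c = rng.normal(0, 0.1, (d_tok, n_experts))    # content-routing projection
W_p = rng.normal(0, 0.1, (d_prof, n_experts))   # profile-routing projection

def route(h_token, u_profile=None, top_k=2):
    """Select top_k experts; the profile term enters only in the
    profile-aided variant."""
    logits = h_token @ W_c
    if u_profile is not None:
        logits = logits + u_profile @ W_p
    top = np.argsort(logits)[-top_k:]
    w = np.exp(logits[top] - logits[top].max())
    return top, w / w.sum()                      # expert ids + gate weights

h, u = rng.normal(size=d_tok), rng.normal(size=d_prof)
print("content-only      :", route(h))
print("content + profile :", route(h, u))
```

Because the profile term shifts the logits before the top-k selection, users in different profile clusters can be steered to different experts even for similar token content.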
4.7 Edge profiling and inference
We assess the on-device feasibility of the proposed framework on two representative edge platforms: RK3588 (embedded edge node) and Snapdragon 8 Gen 3 (mobile edge; Oppo Find X7 Ultra). These cover typical deployment scenarios for clinic-side gateways and consumer smartphones, respectively.
4.7.1 Setup
We evaluate Single-LoRA (rank-matched to the active parameter budget of MoE-LoRA) and the full MoE-LoRA variant on both platforms.
4.7.2 Device comparison
The results are shown in Table 12. Snapdragon 8 Gen 3 outperforms RK3588 across both variants; for Single-LoRA, latency decreases substantially on the mobile platform.
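Decode throughput of the kind reported in Table 12 is typically measured by timing an autoregressive loop. Below is a minimal, device-agnostic sketch; the stand-in step function merely burns CPU in place of a real runtime's decode call.

```python
import time

def profile_decode(step_fn, n_tokens=64):
    """Time an autoregressive decode loop; returns (total_latency_s, tok/s)."""
    t0 = time.perf_counter()
    for _ in range(n_tokens):
        step_fn()          # one forward pass emitting one token
    dt = time.perf_counter() - t0
    return dt, n_tokens / dt

# Stand-in step: a real deployment would invoke the on-device runtime here
lat, tps = profile_decode(lambda: sum(i * i for i in range(20_000)))
print(f"latency={lat:.3f}s  throughput={tps:.1f} tok/s")
```

On actual hardware, peak memory would be read from the runtime or OS alongside the timing loop.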
4.7.3 Comparison with lightweight edge models
To comprehensively evaluate the proposed framework against models commonly used in edge deployment scenarios, we expanded our baseline comparisons to include both ultra-compact models and models within the same 0.5B parameter scale. Specifically, we evaluated:
• Ultra-compact models: MobileBERT (66M parameters) and DistilGPT-2 (82M parameters), which are widely adopted for resource-constrained edge devices.
• 0.5B-class models: OPT-350M, Pythia-410M, and DeBERTa-v3-Large (~400M parameters), which match the target parameter scale.
Table 13 presents the inference performance comparison on RK3588. Ultra-compact models (MobileBERT, DistilGPT-2) achieve substantially lower latency (580–720 ms) and memory footprint (0.18–0.21 GB) compared to the proposed framework (3950–4200 ms, 0.64–0.70 GB). However, as shown in Table 8, these models exhibit severe performance degradation in personalized sleep health tasks: MobileBERT and DistilGPT-2 score 2.87–3.24 across metrics, representing a substantial quality deficit relative to the proposed framework.
Models at the 0.5B parameter scale (OPT-350M, Pythia-410M, DeBERTa-v3-Large) present a more balanced latency-quality profile, yet they still fall short of the proposed framework on the personalized sleep health tasks (Tables 8, 13).
4.7.4 Feasibility and trade-offs
Both devices sustain practical interactive inference under floating-point execution: decode rates of 16.6–21.8 tok/s and peak memory under 0.85 GB. The principal trade-off relative to ultra-compact models is higher latency, which is exchanged for the substantially better personalization quality documented above.
4.8 Experiments on real-world data
To further illustrate the diversity and practical relevance of the evaluation, Table 14 presents a set of representative real-world questions posed by human subjects. Each group of four questions corresponds to a single individual and encompasses a broad spectrum of concerns related to sleep quality, health management, and personalized recommendations.
Table 14. Representative questions asked by human subjects. Each group of four questions corresponds to one subject.
Subsequently, we evaluated the model’s responses to these real-world questions using Claude 3.7 Sonnet and human experts. The evaluations focused on three key aspects: personalization, relevance, and completeness. The human expert scores for these dimensions were 4.3, 4.5, and 4.3, respectively, while the LLM scores were 4.4, 4.4, and 4.5. The results are summarized in Table 15. The evaluations by Claude 3.7 Sonnet are highly consistent with those of the human experts across all dimensions.
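The reported agreement can be quantified directly from the published dimension scores: the per-dimension absolute differences between the human and LLM ratings are at most 0.2 points.

```python
# Dimension-level scores reported in Table 15 (human experts vs. LLM judge)
human = {"personalization": 4.3, "relevance": 4.5, "completeness": 4.3}
llm   = {"personalization": 4.4, "relevance": 4.4, "completeness": 4.5}

diffs = {k: abs(human[k] - llm[k]) for k in human}
mad = sum(diffs.values()) / len(diffs)
print("per-dimension |difference|: "
      + ", ".join(f"{k}={v:.1f}" for k, v in diffs.items())
      + f"; mean = {mad:.2f}")
```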
5 Discussion
5.1 Advantages over existing methods
The proposed framework demonstrates clear advantages over existing approaches in terms of both technical performance and practical deployment. In data synthesis, our method achieves a KL divergence of 0.85, indicating high fidelity to the original data distribution—an essential property for generating clinically meaningful HRV parameters. While the HSIC value (0.37) is lower than that of Gaussian Copula (0.60) and SMOTE (0.52), this reflects a deliberate emphasis on distributional accuracy over diversity, which is critical for reliable sleep report generation. In contrast, many existing methods prioritize diversity at the expense of clinical faithfulness. Furthermore, the personalized questions generated by our framework received high average scores for personalization (4.2), relevance (4.1), and diversity (4.1), outperforming traditional data augmentation techniques and demonstrating strong suitability for downstream analytical and interactive tasks.
For personalized inference, our model was validated across diverse populations and sleep scenarios, consistently matching or exceeding the performance of the teacher model (Qwen-max) and outperforming baseline models. For example, on the DREAMT dataset, our model achieved higher personalization (4.6) and relevance (4.7) scores compared to both the fine-tuned 0.5 billion-parameter and 1.5 billion-parameter models. The MoE LoRA architecture enables dynamic resource allocation, as shown by task-specific adapter activation patterns (Figure 6), ensuring robust and adaptive performance across heterogeneous datasets. This adaptability distinguishes our approach from conventional fine-tuning methods, which often lack such flexibility.
The framework’s practical value is confirmed by its successful deployment on the RK3588 edge device. It achieves a real-time inference speed of 16.6 tokens per second while maintaining a compact memory footprint of only 0.7 GB, demonstrating its suitability for resource-constrained environments. Our lightweight 0.5 billion-parameter base model offers a compelling balance between computational efficiency and analytical performance, making it a scalable and effective solution for real-world personalized health analysis.
5.2 Limitations and future work
Despite these strengths, several limitations warrant consideration. The framework’s reliance on user profiles for personalized inference implies that its effectiveness is influenced by the quality and completeness of input profiles. Enhancing robustness against incomplete or noisy profile data constitutes an important direction for future work. Additionally, while the current framework addresses multiple tasks effectively, its scalability to more complex or larger-scale applications remains to be fully validated. Whether the 0.5 billion-parameter base model, even with an increased number of adapters, advanced loss functions, or improved routing algorithms, can maintain high performance in more demanding scenarios is an open question. Future research will focus on hybrid data synthesis, advanced optimization strategies, and further architectural innovations to ensure continued scalability and adaptability.
6 Conclusion
This work presents a novel and efficient framework for personalized sleep analysis on edge devices, integrating profile-aided inference and adaptive data synthesis within a lightweight 0.5 billion-parameter model. The framework achieves a strong balance among accuracy, adaptability, and computational efficiency, enabling real-time, individualized health analysis in resource-constrained environments. Its modular design and successful deployment on edge hardware underscore its practical potential for wearable and mobile health applications.
By addressing key challenges in data scarcity, model efficiency, and user-specific inference, this study advances the state of the art in personalized health informatics. Looking ahead, the proposed approach lays a robust foundation for future innovations in scalable, secure, and accessible personalized healthcare, with the potential to expand beyond sleep analysis to broader domains of health monitoring and intervention.
Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/Supplementary Material.
Ethics statement
The studies involving humans were approved by the Human Research Review Committee of the First Affiliated Hospital of Jinan University (Approval Number: KY-2023299). The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation in this study was provided by the participants’ legal guardians/next of kin. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.
Author contributions
HZ: Conceptualization, Data curation, Methodology, Writing – original draft. XA: Data curation, Writing – review and editing. XL: Data curation, Writing – review and editing. XoX: Supervision, Writing – review and editing. XnX: Funding acquisition, Writing – review and editing.
Funding
The authors declare that financial support was received for the research and/or publication of this article. The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Guangdong Provincial Key Laboratory of Human Digital Twin under Grant 2022B1212010004.
Acknowledgements
The authors would like to express their sincere gratitude to all individuals who contributed to this research. We thank the First Affiliated Hospital of Jinan University for approving this research plan, which enabled data collection.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The authors declare that no Generative AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphys.2025.1678364/full#supplementary-material
Footnotes
1https://openai.com/index/whoop/
2https://platform.openai.com/docs/guides/text?api-mode=responses
3https://github.com/TUDB-Labs/MoE-PEFT/tree/main/moe_peft
References
Brusie C. (2025). Personalizing population health. Available online at: https://sleepreviewmag.com/sleep-health/demographics/personalizing-population-health-azizi-a-seixas/.
Chan N., Parker F., Bennett W., Wu T., Jia M. Y., Fackler J., et al. (2024). Medtsllm: leveraging llms for multimodal medical time series analysis. ArXiv:2408.07773.
Chawla N., Bowyer K., Hall L., Kegelmeyer W. P. (2002). Smote: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357. doi:10.1613/jair.953
Chen H., Deng W., Yang S., Xu J., Jiang Z., Ngai E. C. H., et al. (2025). Towards edge general intelligence via large language models: opportunities and challenges. IEEE Netw. 39, 263–271. doi:10.1109/MNET.2025.3539338
Cosentino J., Belyaeva A., Liu X., Furlotte N. A., Yang Z., Lee C., et al. (2024). Towards a personal health large language model. ArXiv:2406.06474.
Es S., James J., Espinosa Anke L., Schockaert S. (2024). “Ragas: automated evaluation of retrieval augmented generation,” in Proceedings of the 18th conference of the European chapter of the association for computational linguistics (EACL) (St. Julians, Malta: Association for Computational Linguistics), 150–158.
Fang C. M., Danry V., Whitmore N., Bao A., Hutchison A., Pierce C., et al. (2024). Physiollm: supporting personalized health insights with wearables and large language models. ArXiv:2406.19283.
Fleiss J. L. (1971). Measuring nominal scale agreement among many raters. Psychol. Bull. 76, 378–382. doi:10.1037/h0031619
Fu Y., Peng H., Ou L., Sabharwal A., Khot T. (2023). “Specializing smaller language models towards multi-step reasoning,” in Proceedings of the 40th international conference on machine learning (ICML) (Honolulu, HI, USA: PMLR), 10421–10430.
Gan Z., Liu Y. (2024). Towards a theoretical understanding of synthetic data in llm post-training: a reverse-bottleneck perspective. ArXiv:2410.01720.
Garbarino S., Bragazzi N. (2024). Revolutionizing sleep health: the emergence and impact of personalized sleep medicine. J. Personalized Med. 14, 598. doi:10.3390/jpm14060598
Goodfellow I. J., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., et al. (2014). Generative adversarial nets. NeurIPS 27, 139–144. doi:10.1145/3422622
Gunter T., Wang Z., Wang C., Pang R., Narayanan A., Zhang A., et al. (2024). Apple intelligence foundation language models. ArXiv:2407.21075.
Guo Z., Lai A., Thygesen J., Farrington J., Keen T., Li K. (2024). Large language models for mental health applications: systematic review. JMIR Ment. Health 11, e57400. doi:10.2196/57400
Hu E. J., Shen Y., Wallis P., Allen-Zhu Z., Li Y., Wang S., et al. (2022). Lora: low-rank adaptation of large language models. ICLR 1, 3. doi:10.48550/arXiv.2106.09685
Huang Y., Wan L. J., Ye H., Jha M., Wang J., Li Y., et al. (2024). New solutions on llm acceleration, optimization, and application. ArXiv:2406.10903.
Kamthe S., Assefa S., Deisenroth M. (2021). Copula flows for synthetic data generation. ArXiv:2101.00598.
Kang M., Lee S., Baek J., Kawaguchi K., Hwang S. J. (2024). Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks. Adv. Neural Inf. Process. Syst. 36, 48573–48602. doi:10.5555/3666122.3668231
Kim Y., Xu X., McDuff D., Breazeal C., Park H. W. (2024). Health-llm: large language models for health prediction via wearable sensor data. ArXiv:2401.06866.
Kojima T., Gu S. S., Reid M., Matsuo Y., Iwasawa Y. (2022). “Large language models are zero-shot reasoners,” in Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November 2022- 9 December 2022, 22199–22213. doi:10.5555/3600270.3601883
Li D., Ma Y., Wang N., Ye Z., Cheng Z., Tang Y., et al. (2024a). Mixlora: enhancing large language models fine-tuning with lora based mixture of experts. ArXiv:2404.15159.
Li X., He S., Wu J., Yang Z., Xu Y., Jun Y. J., et al. (2024b). “Mode-cotd: chain-of-thought distillation for complex reasoning tasks with mixture of decoupled lora-experts,” in Proceedings of the LREC-COLING 2024, Torino, Italy, 11475–11485.
Li Y., Yuan P., Feng S., Pan B., Sun B., Wang X., et al. (2024c). “Turning dust into gold: distilling complex reasoning capabilities from llms by leveraging negative data,” in Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, Canada, 20-27 February 2024, 18591–18599. doi:10.1609/aaai.v38i17.29821
Lin C., Och F. (2004). “Looking for a few good metrics: Rouge and its evaluation,” in Proceedings of the NTCIR Workshop, Tokyo, Japan, 1–8.
McDonald D., Papadopoulos R., Benningfield L. (2024). Reducing llm hallucination using knowledge distillation: a case study with mistral large and mmlu benchmark. TechRxiv Preprint. doi:10.36227/techrxiv.171665607.76504195/v1
Merrill M. A., Paruchuri A., Rezaei N., Kovacs G., Perez J., Liu Y., et al. (2024). Transforming wearable data into health insights using large language model agents. ArXiv:2406.06464.
Nazi Z., Peng W. (2024). Large language models in healthcare and medical domain: a review. Information 11, 57. doi:10.3390/informatics11030057
Reiter E. (2018). A structured review of the validity of bleu. Comput. Linguist. 44, 393–401. doi:10.1162/coli_a_00322
Reynolds D. (2009). Gaussian mixture models. Encycl. Biometrics 741 (3), 659–663. doi:10.1007/978-0-387-73003-5_196
Rezende D., Mohamed S. (2015). “Variational inference with normalizing flows,” in Proceedings of the 32nd international conference on machine learning (ICML) (Lille, France: PMLR), 1530–1538.
Rombach R., Blattmann A., Lorenz D., Esser P., Ommer B. (2022). “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, June 18-24 2022, 10684–10695.
Secara I., Hordiiuk D. (2024). Personalized health monitoring systems: integrating wearable and ai. J. Intelligent Learn. Syst. Appl. 16, 44–52. doi:10.4236/jilsa.2024.162004
Shaffer F., Ginsberg J. (2017). An overview of heart rate variability metrics and norms. Front. Public Health 5, 258. doi:10.3389/fpubh.2017.00258
She J., Zheng W., Liu Z., Wang H., Xing E., Yao H., et al. (2025). Token level routing inference system for edge devices. ArXiv:2504.07878.
Shridhar K., Stolfo A., Sachan M. (2022). Distilling reasoning capabilities into smaller language models. ArXiv:2212.00193.
Stein P., Pu Y. (2012). Heart rate variability, sleep and sleep disorders. Sleep. Med. Rev. 16, 47–66. doi:10.1016/j.smrv.2011.02.005
Wang D., Zhang S. (2024). Large language models in medical and healthcare fields: applications, advances, and challenges. Artif. Intell. Rev. 57, 299. doi:10.1007/s10462-024-10921-0
Wei J., Wang X., Schuurmans D., Bosma M., Ichter B., Xia F., et al. (2022). “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 24824–24837. doi:10.5555/3600270.3602070
Yang S., Ali M. A., Wang C. L., Hu L., Wang D. (2024). Moral: moe augmented lora for llms’ lifelong learning. ArXiv:2402.11260.
Yuan Z., Shang Y., Zhou Y., Dong Z., Zhou Z., Xue C., et al. (2024). Llm inference unveiled: survey and roofline model insights. ArXiv:2402.16363.
Zhang T., Kishore V., Wu F., Weinberger K. Q., Artzi Y., Wu F. (2019). Bertscore: evaluating text generation with bert. ArXiv:1904.09675.
Zhang Y., Liu H., Xiao Y., Amoon M., Zhang D., Wang D., et al. (2024). Llm-enhanced multi-teacher knowledge distillation for modality-incomplete emotion recognition in daily healthcare. IEEE J. Biomed. Health Inf. 29, 6406–6416. doi:10.1109/JBHI.2024.3470338
Keywords: personalized sleep analysis, large language model (LLM), model distillation, data synthesis, edge computation
Citation: Zheng H, Ai X, Liu X, Xing X and Xu X (2026) Profile-aided distillation framework for personalized sleep analysis with compact models using LLM-guided synthetic data. Front. Physiol. 16:1678364. doi: 10.3389/fphys.2025.1678364
Received: 02 August 2025; Accepted: 24 November 2025;
Published: 05 January 2026.
Edited by:
Chunlin Xiong, China United Network Communications Group, China
Reviewed by:
Pardeep Singh, Bennett University, India
Sibo Qiao, Tianjin Polytechnic University, China
Copyright © 2026 Zheng, Ai, Liu, Xing and Xu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Huimin Zheng, 202210182130@mail.scut.edu.cn