ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Natural Language Processing

Volume 8 - 2025 | doi: 10.3389/frai.2025.1663877

Liver cancer knowledge graph construction based on dynamic entity replacement and masking strategies RoBERTa-wwm-large-BiLSTM-CRF model with clinical Chinese EMRs

Provisionally accepted
Yichi Zhang1, Xiaojun Hu2, Hailing Wang1*, Ke Liu3, Yongbin Gao1, Xiaoyan Jiang1, Yingfang Fan2*, Zhijun Fang1
  • 1Shanghai University of Engineering Sciences, Shanghai, China
  • 2The Third Affiliated Hospital of Guangzhou Medical University, Guangzhou, China
  • 3Beijing Anding Hospital Capital Medical University, Beijing, China

The final, formatted version of the article will be published soon.

Introduction: Liver cancer is a leading cause of cancer-related mortality worldwide, necessitating advanced tools for diagnosis and management. Knowledge graphs (KGs) are crucial for advancing smart healthcare, but existing liver cancer-specific KGs are mostly derived from literature or public databases and lack integration with real-world clinical data such as electronic medical records (EMRs). Furthermore, no liver cancer-specific KG is currently publicly available, which leaves a significant gap in structured clinical knowledge resources.

Methods: This study proposes a novel framework to construct the first Chinese liver cancer KG from Real-World Liver Cancer Electronic Medical Records (RLC-EMRs). A new named entity recognition (NER) model, DERM-RoBERTa-wwm-large-BiLSTM-CRF, was developed; it uses a Dynamic Entity Replacement and Masking (DERM) strategy to address data scarcity. Knowledge fusion was performed using the TF-IDF algorithm to standardize and integrate entities from clinical records, the professional medical website XYWY.com, and the CCMT-2019 terminology standard.

Results: The final liver cancer KG contained 46,364 entities and 296,655 semantic relationships. The proposed NER model achieved a state-of-the-art F1 score of 68.84% on the public CMeEE-v2 dataset. On the proprietary RLC-EMRs dataset, the model achieved a precision of 93.23%, a recall of 94.69%, and an F1 score of 93.96%. In addition, a KG-based retrieval system was developed to query complications, medications, and other related information.

Discussion: The findings demonstrate the effectiveness of the proposed framework in constructing a comprehensive and clinically relevant liver cancer KG. The novel DERM-based NER model significantly improved entity extraction from complex medical texts. By integrating real-world clinical data, this study addresses the gap left by existing liver cancer-specific KGs, which rely mostly on literature or public databases.
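The abstract states that knowledge fusion used the TF-IDF algorithm to standardize entities against a terminology standard, but it does not give the procedure. The following is a minimal sketch of one common way this can be done, not the authors' implementation: each term is represented as a TF-IDF vector over character bigrams, and an extracted mention is mapped to the standard term with the highest cosine similarity. The function names, the bigram choice, the 0.3 threshold, and the four-term vocabulary are all assumptions made for illustration; the vocabulary is not taken from CCMT-2019.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Split a term into overlapping character n-grams (suits Chinese text)."""
    text = text.strip().lower()
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_idf(terms, n=2):
    """Smoothed inverse document frequency, treating each standard term as a document."""
    doc_freq = Counter()
    for term in terms:
        doc_freq.update(set(char_ngrams(term, n)))
    total = len(terms)
    idf = {g: math.log((1 + total) / (1 + df)) + 1.0 for g, df in doc_freq.items()}
    # Weight for n-grams never seen in the vocabulary (document frequency 0).
    unseen_idf = math.log(1 + total) + 1.0
    return idf, unseen_idf


def tfidf_vector(text, idf, unseen_idf, n=2):
    """Sparse TF-IDF vector stored as {ngram: weight}."""
    counts = Counter(char_ngrams(text, n))
    return {g: c * idf.get(g, unseen_idf) for g, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def normalize_entity(mention, standard_terms, threshold=0.3, n=2):
    """Return (best standard term, score), or (None, score) below the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    idf, unseen_idf = build_idf(standard_terms, n)
    query = tfidf_vector(mention, idf, unseen_idf, n)
    best_term, best_score = None, 0.0
    for term in standard_terms:
        score = cosine(query, tfidf_vector(term, idf, unseen_idf, n))
        if score > best_score:
            best_term, best_score = term, score
    if best_score < threshold:
        return None, best_score
    return best_term, best_score


if __name__ == "__main__":
    # Hypothetical mini vocabulary standing in for a terminology standard.
    vocabulary = ["原发性肝癌", "肝硬化", "门静脉高压", "乙型病毒性肝炎"]
    for mention in ["原发肝癌", "糖尿病"]:
        print(mention, "->", normalize_entity(mention, vocabulary))
```

In this sketch an abbreviated mention such as 原发肝癌 shares two bigrams with 原发性肝癌 and is mapped to it, while a mention with no overlapping bigrams is left unmatched. A real pipeline would likely need a tuned threshold and manual review of low-scoring matches.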

Keywords: knowledge graph, named entity recognition, liver cancer, knowledge fusion, knowledge graph application

Received: 11 Jul 2025; Accepted: 19 Sep 2025.

Copyright: © 2025 Zhang, Hu, Wang, Liu, Gao, Jiang, Fan and Fang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence:
Hailing Wang, wanghailing@sues.edu.cn
Yingfang Fan, fanyf068700@sina.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.