Research on the Construction and Application of Breast Cancer-Specific Database System Based on Full Data Lifecycle

Relying on the Biomedical Big Data Center of West China Hospital, this paper presents an in-depth study of the construction method and application of a breast cancer-specific database system based on the full data lifecycle, including the establishment of data standards, data fusion and governance, a multi-modal knowledge graph, secure data sharing, and value application of the breast cancer-specific database. The research proceeded by establishing breast cancer master data and metadata standards; then collecting, mapping and governing the structured and unstructured clinical data; parsing and processing the electronic medical records with natural language processing (NLP) or other applicable methods; and constructing the breast cancer-specific database system to support the application of data in clinical practice, scientific research, and teaching in hospitals, giving full play to the value of the medical big data of the Biomedical Big Data Center of West China Hospital.


INTRODUCTION
With the rapid development of new technologies such as big data and artificial intelligence, medicine increasingly overlaps with disciplines such as information technology, computer science, and cyber security. In particular, thanks to the constant advancement of medical technology, the process of screening, diagnosing and treating diseases keeps expanding and generates various new data. Based on different data modalities, artificial intelligence technology has been widely applied in the field of medicine (1)(2)(3)(4)(5). This research focuses on breast cancer, which causes the second highest cancer mortality in women (6). Its screening, diagnosis and treatment strategies have developed from surgical therapy alone to a comprehensive treatment mode that combines surgical therapy, chemotherapy, radiotherapy, endocrine therapy, and targeted therapy, delivered by a multi-disciplinary team (MDT) for breast cancer. Multi-source heterogeneous data, such as electronic medical record data, image data and gene data, are generated throughout the diagnosis and treatment process, driving disease diagnosis, treatment and research into a big data era of disease-specific research (7).
Currently, the most fundamental challenge confronting medical research institutions in disease-specific data research is how to integrate multi-source heterogeneous data and build a disease-specific database, so as to support the discovery of potential diagnosis and treatment knowledge patterns from massive medical data. For example, the data are too voluminous to be screened manually by clinicians, and the attributes of the same data element are described in different ways across hospitals or even across information systems of the same hospital, e.g., the same drug may be identified by different codes. Unstructured data such as electronic medical records contain medical information of important value, and the difficulty lies in extracting such information effectively and accurately. Based on the medical big data governance activities in China, Li et al. proposed a big data governance framework for medical data in China through literature review, expert consulting and structural modeling, providing an important reference for the data governance framework in this research (8). This paper aims to integrate two heterogeneous clinical data sources, i.e., unstructured medical records and structured clinical data, through clinical text analysis and knowledge extraction; to break the information barriers within the organization and between clinical departments and to promote data sharing among medical centers by combining patient information from multiple clinical data sources; to establish disease-specific data standards in accordance with international industry standards and then to construct a multi-modal knowledge graph specific to breast cancer; and finally to build a disease-specific database system for analyzing disease characteristics, thus supporting clinicians in clinical decision-making and rational drug use in the diagnosis and treatment of breast cancer.

THE INVOLVEMENT OF BREAST CANCER PATIENTS
Patients pathologically diagnosed with breast cancer have been prospectively registered in the Breast Cancer-specific Database System at West China Hospital, Sichuan University since 2008 (9,10). Medical records, diagnostic pathology reports, and treatment records are recorded by oncologists. All patients are followed up by outpatient visit or telephone at 3-4-month intervals within 3 years after diagnosis, at 6-month intervals in years 4-5, and annually thereafter. The characteristics of the breast cancer patients included in the database are described in Table 1.

OVERALL DESIGN SCHEME FOR BREAST CANCER-SPECIFIC DATABASE SYSTEM
The overall design concept of the breast cancer-specific database system is shown in Figure 1.
Several important oncology patient data systems in China and abroad (e.g., the cancer registration software CanReg, designed and developed by the Descriptive Epidemiology Unit of IARC, and the cancer screening database of the Chinese Anti-Cancer Association) were consulted in the overall design and construction of the breast cancer-specific database system (11,12). The database is composed of four parts: a patient standard database, a breast cancer malignancy-specific database, a diagnostic imaging database and a breast cancer patient follow-up visit database. Data governance is based on these four parts. The construction of the disease-specific database system involves governance, extraction and application. Governance is performed first to collect the current disease-specific data assets of West China Hospital and sort out their meanings, ownership, etc. The next step is to conduct data classification and quality control to ensure the accuracy of data processing. The last step is to provide unified standard interface services based on the governed and integrated disease-specific data center. In addition, the main considerations in the design of the disease-specific database include the application of the standardized and governed data in clinical diagnosis and treatment assistance and in scientific research. A patient's medical record data can be viewed as a time series that captures the entire clinical process of collecting the patient's medical history, analyzing the condition, diagnosing, and treating the patient. Different data sources have different time spans, resulting in complex timing dependencies between events (13). Therefore, in line with the actual application scenarios and to better support scientific research, the overall design and final data presentation of the disease-specific database are logically linked in chronological sequence through the processes of admission registration, inpatient treatment, checkout and discharge, etc., so as to establish a view of diagnosis and treatment based on the full data lifecycle.

Establishment of Dataset Standards
A regional breast cancer-specific database should include a complete range of datasets in the uniform format and meeting the normative standards. In addition, the database should incorporate the national and medical industry standards and all system datasets of medical institutions. A synonym database of the dataset names should be created. A standard database of breast cancer datasets should be created for the Center to provide a dataset graph for its application and corpus support for automatic identification of dataset names. The Center's database includes the scope, normative references, term abbreviations, dataset metadata attributes, and data element attributes. Standard information for each dataset includes the dataset name, identifier, classification, field description, definition, etc. The datasets are saved in four different databases based on data types, and subdivided into four modules and about 20 submodules. The oncology datasets and breast cancer-specific datasets are created accordingly with reference to different health industry standards. Some referenced dataset standards are shown in Table 2, and some collected fields of the breast cancer-specific dataset are shown in Table 3.

Establishment of Data Element Standards
A local database of data element standards is established based on the national and industrial standards and in combination with the specific situation of the hospital. The local database includes data element indicators, normative references, term abbreviations, and data element directory. Data element standards specify the Chinese name, English name, identifier, definition, classification, data type, representation format, data threshold value, allowable value type and allowable value of data fields in the data dictionary, which are used to ensure the data quality. In the management of metadata, data elements may be classified and labeled, so as to establish a synonym database of the data elements. The local database describes the attributes of each data element, including Chinese field name, English field name, field name abbreviation, field type, field length, required or not, range or reference standards, notes and remarks. If there are any relevant international standards for the range of data elements, they can be referenced directly; otherwise, the range will be set by physicians and other professionals in combination with clinical experience. The set range standards will be saved together with other collected standards in the local database for the convenience of version management and subsequent updates. Table 4 shows partial attributes of some data elements.
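As an illustration only, the way a data element dictionary can drive quality checks may be sketched as follows. The element names, lengths and allowed values below are hypothetical and are not taken from Table 4; the sketch only shows how attributes such as field type, field length, required flag and allowable values could be checked against a record.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class DataElement:
    """One entry of a hypothetical local data element dictionary."""
    english_name: str
    field_type: type
    max_length: Optional[int] = None
    required: bool = False
    allowed_values: Optional[FrozenSet[str]] = None


# Illustrative entries only; names, lengths and value sets are assumptions.
DICTIONARY = [
    DataElement("patient_id", str, max_length=20, required=True),
    DataElement("gender", str,
                allowed_values=frozenset({"male", "female", "unknown"})),
    DataElement("tumor_size_mm", float),
]


def validate(record: dict) -> List[str]:
    """Return a list of violations of the data element standards."""
    errors = []
    for element in DICTIONARY:
        name = element.english_name
        value = record.get(name)
        # Missing values are only an error for required elements.
        if value is None or value == "":
            if element.required:
                errors.append(f"{name}: required value is missing")
            continue
        if not isinstance(value, element.field_type):
            errors.append(f"{name}: expected {element.field_type.__name__}")
            continue
        # Length limits apply to text fields only.
        if (element.max_length is not None and isinstance(value, str)
                and len(value) > element.max_length):
            errors.append(f"{name}: longer than {element.max_length} characters")
        if (element.allowed_values is not None
                and value not in element.allowed_values):
            errors.append(f"{name}: value not in allowable values")
    return errors
```

Keeping the rules in a dictionary of this kind, rather than in the checking code, is one way to support the version management and subsequent updates mentioned above.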

Quality Control-Fusion and Governance of Multi-Source Heterogeneous Data
The structured medical data from the HIS (Hospital Information System), the LIS (Laboratory Information Management System), and the follow-up visit system are integrated with the image data from the PACS (Picture Archiving and Communication System). These data are acquired by building an ETL (Extract-Transform-Load) automation platform that performs incremental extraction at regular daily intervals and completes data standardization and other processing during extraction. The unstructured data in the electronic medical records are structured through natural language processing and machine learning after data source access, and then saved in the disease-specific database. Afterwards, the data in the four module databases are linked primarily by patient ID, thus breaking the information barriers within the organization and between clinical departments. Finally, the front-end application is supported by the breast cancer-specific data, so that full-lifecycle data on a single disease can be fully mined and data analysis for multi-center joint scientific research projects can be supported. Specific processing methods are described below. The data governance framework is shown in Figure 2.
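The incremental extraction mentioned above can be illustrated with a minimal sketch. It assumes that each source row carries a last-modified timestamp (the hypothetical field `updated_at`) and that the platform stores the timestamp of the last extracted row as a watermark; the text does not state how the ETL platform actually detects changes.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def extract_increment(source_rows: List[Row],
                      watermark: datetime) -> Tuple[List[Row], datetime]:
    """Return the rows changed after `watermark` and the new watermark.

    `updated_at` is a hypothetical last-modified column; a real source
    system may expose changes differently (e.g., through a change log).
    """
    new_rows = [row for row in source_rows if row["updated_at"] > watermark]
    # If nothing changed, the watermark stays where it was.
    new_watermark = max((row["updated_at"] for row in new_rows),
                        default=watermark)
    return new_rows, new_watermark
```

Each scheduled run would pass in the watermark returned by the previous run, so that only new or changed rows are transformed and loaded.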

Structured Data Processing
Data acquisition (data reception or data capture) is performed through the data fusion platform for breast cancer patient data that are structured but reside in different systems. The source data are extracted, integrated and saved in the target database in the following steps: (1) Establish a data source directory, and determine the connection mode, access permission, data storage directory, and interfaces of each data source; (2) Clean and filter the data: establish data review rules, e.g., the gender can only be male, female, or unknown, the ID number can only be 18 digits, the patient ID cannot be blank, etc. Then filter the data according to these rules, and save the unqualified data in a temporary database, excluding them from data fusion. The cleaned data should not contain missing or incomplete data, repeated data or nonstandard data. See Figure 3 for the statistics of some cleaned data; (3) Map the original data in the data source database to the standard datasets in accordance with the specified data standards, and convert the value ranges of data elements at the same time to standardize the breast cancer data, so as to complete the collection and collation of multi-source data; (4) During the timed automatic incremental extraction of medical data, monitor the log of each extraction, and count the number of extracted records and the completion status for later failure rollback (14).
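The review rules in step (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's implementation: the field names (`patient_id`, `gender`, `id_number`) are hypothetical, the ID rule follows the text literally (18 digits, whereas a production rule may also need to accept a trailing check character), and a repeated record is taken to mean an exact duplicate.

```python
import re
from typing import Dict, List, Tuple

Record = Dict[str, str]

ALLOWED_GENDERS = {"male", "female", "unknown"}
ID_NUMBER_PATTERN = re.compile(r"[0-9]{18}")  # rule as stated in the text


def review(record: Record) -> List[str]:
    """Apply the review rules to one record and list the violations."""
    problems = []
    if not record.get("patient_id", "").strip():
        problems.append("patient ID is blank")
    if record.get("gender") not in ALLOWED_GENDERS:
        problems.append("gender is not male, female or unknown")
    if not ID_NUMBER_PATTERN.fullmatch(record.get("id_number", "")):
        problems.append("ID number is not 18 digits")
    return problems


def clean(records: List[Record]
          ) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into qualified data and data held back for review."""
    qualified, held_back, seen = [], [], set()
    for record in records:
        problems = review(record)
        key = tuple(sorted(record.items()))  # exact-duplicate detection
        if not problems and key in seen:
            problems.append("repeated record")
        if problems:
            # Unqualified data go to a temporary store, not into fusion.
            held_back.append((record, problems))
        else:
            seen.add(key)
            qualified.append(record)
    return qualified, held_back
```

Recording the reason alongside each held-back record makes it easier to produce statistics of the kind shown in Figure 3.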

Unstructured Data Processing
Electronic medical records contain highly valuable medical data. The unstructured breast cancer data are parsed into a standard medical structure from natural language text, and tools for correcting, annotating and associating the structured data are provided so that clinicians can manage the annotation tasks (by either manual or automatic annotation) for the text data to be processed. In this way, the unstructured text in breast cancer pathology reports and present medical history is transformed into analyzable structured data, providing a data basis for the subsequent construction of a consensus data link mining engine, an analysis tool for self-defined data link risk factors and a breast disease knowledge graph. Natural language texts (such as present medical history, color Doppler ultrasound descriptions and pathological descriptions) are annotated for entities and relations by professional physicians until 200 annotations are available, and a preliminary model is trained; the trained model is then used for back-annotation to assist the physicians in annotating subsequent samples. At present, 1,000 samples have been annotated and used for NLP model training and evaluation, with the model parameters tuned accordingly. In evaluation, the model achieved a recognition accuracy of 80-85%, reaching the level of manual recognition by general physicians. The parsing results of some electronic medical records are shown in Figure 4.

Image Data Processing
Medical imaging technology has increasingly become an indispensable means of disease diagnosis, providing quick and accurate support for clinical practice. The imaging information about a patient's tumor site and the examination report issued by the radiologist are saved in the medical image data. In the process of image data governance, unqualified image data must be eliminated. Using deep neural network-based machine learning, unqualified image data can be automatically identified and annotated (15). The unqualified images filtered out are saved in a temporary cache database and manually verified later. The qualified image data are extracted and saved together with basic patient information and report results in the diagnostic imaging database, and finally linked with the patient standard database and the disease-specific database by patient ID to connect the whole treatment process. The image data governance process is shown in Figure 5.

Model Building-Multi-Modal Breast Cancer-Specific Knowledge Graph
On the basis of the breast cancer-specific database system, a multi-modal breast cancer-specific knowledge graph is constructed to integrate texts, medical images, and even videos, voice and other rich media information, and to represent, as a node network graph, the hierarchical relations among breast cancer-related entities such as pathogenesis, symptom characteristics, complications, treatment means, medical history, and medication. Such a centralized and clear structure can help researchers quickly clarify the relations and differences among numerous and complex knowledge points (16). An AI mining engine is constructed to identify valuable hidden relations in the huge breast cancer-specific database and analyze such relations through clustering, attribute comparison and active learning; the results are reviewed by experts and incorporated into the knowledge graph if they pass the review. After deep knowledge mining, cross-departmental and even cross-hospital knowledge relations can be established in dynamic knowledge graphs built on the multivariate knowledge graph, thus expanding the entity set, relation set and triple set of the knowledge graph. Meanwhile, an entity is no longer limited to a single representation in breast cancer-related terms. In contrast to a traditional knowledge graph, the current knowledge graph integrates multi-modal knowledge, and displays, represents and utilizes medical data in various forms to the greatest extent for the convenience of learning and understanding by researchers (17,18). Part of the constructed knowledge graph is shown in Figure 6.
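The entity set, relation set and triple set, together with the expert review step, can be illustrated with a small sketch. The class, method and relation names are hypothetical and the example triples are placeholders; the sketch shows only that mined candidates enter the graph after passing review, and does not reproduce the mining engine or the multi-modal content.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


class KnowledgeGraph:
    """A minimal triple store with an expert review gate (illustrative)."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()   # accepted knowledge
        self.pending: Set[Triple] = set()   # mined, awaiting review

    @property
    def entities(self) -> Set[str]:
        return {t.head for t in self.triples} | {t.tail for t in self.triples}

    @property
    def relations(self) -> Set[str]:
        return {t.relation for t in self.triples}

    def propose(self, triple: Triple) -> None:
        """Register a mined candidate relation for expert review."""
        if triple not in self.triples:
            self.pending.add(triple)

    def review(self, triple: Triple, approved: bool) -> None:
        """Record the expert decision; only approved triples are added."""
        self.pending.remove(triple)  # raises KeyError if never proposed
        if approved:
            self.triples.add(triple)
```

For example, a candidate such as `Triple("breast cancer", "treated_by", "targeted therapy")` would first be passed to `propose`, and would extend the entity and triple sets only after `review(..., approved=True)`.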

MAINTENANCE OF DATA SECURITY AND SHARING OF THE ACHIEVEMENTS IN DISEASE-SPECIFIC DATABASE
After the breast cancer-specific database system is constructed, the data should be shared with multiple parties in order to maximize its value in clinical practice and scientific research, rather than being limited to inquiry and use within the Center and the Hospital. However, medical data are particularly sensitive, so the following solutions will be adopted to share the data while ensuring data security and patient privacy. Given the high security requirements of medical information and the small volume of data in a single institution, federated learning or secure multi-party computation can be considered to achieve joint use of data by multiple parties while ensuring data privacy and security. Essentially, both approaches limit data use to a specified scope, which effectively avoids data leakage and abuse. Federated learning is a distributed machine learning technique in which multiple parties collaborate on data modeling without exchanging data (19). This approach protects the privacy of medical data while still allowing scientific research. Secure multi-party computation technically ensures that the participating parties cannot obtain each other's original data and realizes collaborative computation without data leakage: multiple parties jointly run a computing task, machine learning task or data retrieval to obtain the final results based on their combined data, but the data and intermediate calculation results are not disclosed to any party in the process.
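To make the idea of secure multi-party computation concrete, the following sketch uses additive secret sharing, one common building block, to compute a joint total (e.g., a case count across hospitals) without any hospital revealing its own number. It is a single-process simulation under simplifying assumptions (honest participants, secure channels between them), and the paper does not specify which protocol is intended; the function names and the modulus are illustrative.

```python
import secrets
from typing import List

# Public modulus; it must be larger than any possible true total.
MODULUS = 2**61 - 1


def make_shares(value: int, n_parties: int) -> List[int]:
    """Split `value` into n random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_values: List[int]) -> int:
    """Simulate the protocol: each party shares its value with the others."""
    n = len(private_values)
    # Party i splits its private value and sends share j to party j.
    all_shares = [make_shares(v, n) for v in private_values]
    # Party j adds the shares it received and publishes only that sum.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % MODULUS
                    for j in range(n)]
    # The published partial sums reveal the total, not the inputs.
    return sum(partial_sums) % MODULUS
```

Each individual share is a uniformly random number, so a party that sees only the shares sent to it learns nothing about another party's value; only the final total is disclosed.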

DEVELOPING DATA VALUE TO SUPPORT CLINICAL RESEARCH
Support scientific research. The disease-specific database system will provide massive datasets for scientific research. Researchers can precisely filter target data according to their different research needs. Data analysis tools are also provided to improve data processing capabilities of the Hospital. Researchers can customize screening modes from the aspects of data sources, data time periods, data label types, verified knowledge, etc. to efficiently and accurately retrieve target data from massive data, thus indirectly improving clinical scientific research capabilities.
Support clinical practice. The diagnosis and treatment data of a breast cancer patient can be synchronized in real time through the "Breast Cancer-specific Database System" to form a data file that includes diagnostic imaging data, clinical pathology data, basic patient information, medical advice information, medication, surgery, radiotherapy, chemotherapy, cost settlement, etc., as well as the associated breast cancer knowledge database. Combined with the "multi-modal breast cancer-specific knowledge graph" and based on the database-wide medical big data, various quantitative or qualitative machine learning algorithms are used for data analysis (20)(21)(22) to output holographic knowledge portrait analysis reports on the patient's breast cancer risk profile, disease trend, clinical protocol, etc., such as the likelihood of a certain conclusion and the proportion of a certain therapeutic regimen. These reports provide physicians with rich, multi-dimensional reference information, improve the ability of junior physicians in identification, diagnosis and treatment, and reduce the probability of missed diagnosis and misdiagnosis. Physicians can intuitively view, analyze and integrate the multi-dimensional and multi-level holographic knowledge portrait, which provides reference knowledge for accurate diagnosis and treatment of breast cancer based on the full-volume data. With the help of the auxiliary diagnostic system, physicians can provide a more accurate therapeutic regimen based on the stage of the cancer and the patient's physical condition. Meanwhile, the intelligent auxiliary diagnosis system for breast cancer also provides a whole-course management tool covering examination, treatment and follow-up visits, so that physicians can optimize the therapeutic regimen as appropriate in a timely manner, improve the treatment effect, and, through its use, provide more valuable data for the normalization and standardization of breast cancer treatment (23).
Support teaching. Based on the whole-process therapeutic regimens in the breast cancer-specific database and the real physiological data of patients, theoretical learning and practice are carried out simultaneously for teachers and students. Most importantly, the data are real and updated in real time, so they are more instructive.

CONCLUSION
By integrating the data and processes of existing clinical data systems, a breast cancer-specific database system based on the full data lifecycle accumulates a knowledge database, provides a standard access interface and feeds back into business integration, promoting the optimization and transformation of existing disease-specific research processes and forming a closed loop for sustainable development. The disease-specific database system covers several disease-specific databases for conveniently saving and managing patient data in a systematic, standardized and accurate manner, so as to realize the tracking of breast cancer cases and to effectively support teaching, scientific research and evaluation of the effects of various therapies for breast cancer. A scientific platform is created for research on breast cancer pathogenesis and etiology through comprehensive long-term longitudinal tracking and data comparison and analysis.
Clinical text analysis and knowledge extraction are conducted to integrate two heterogeneous clinical data sources, that is, unstructured medical record data and structured clinical data. New-generation information technologies, such as big data, NLP text parsing, data mining and knowledge graphs, are deeply fused and applied to build a disease-specific database system based on the full data lifecycle for analyzing the disease characteristics of breast cancer, so as to effectively support teaching, scientific research and evaluation of the diagnosis and treatment of breast cancer and the follow-up tracking of cases, conduct comprehensive long-term longitudinal tracking and data comparison and analysis, and create a scientific platform for research on cancer pathogenesis and etiology. Big data and AI technology are used to provide continuous support for breast cancer as a single disease before, during and after surgery, enable physicians to participate deeply in the whole path of disease diagnosis and treatment, achieve accurate diagnosis and treatment planning, and break the data barriers between clinical departments. The governance and application of image data are emphasized in order to explore image optimization algorithms and image recognition tools through database feedback and iterative optimization. The occurrence and development patterns of relevant diseases are analyzed by population category to provide big data-based analysis and services for better clinical diagnosis and treatment, health management and clinical evidence-based medical research. Specialized research and the disease-specific database are the focus of the connotation construction of the Hospital. Only by strengthening the construction of disciplines can comprehensive hospitals in China win a competitive advantage and better meet the health service requirements of society and the country.

DATA AVAILABILITY STATEMENT
The datasets for this study are available from the corresponding author on reasonable request. Requests to access these datasets should be directed to yinjin@wchscu.cn.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the biomedical research ethics committee of West China Hospital (reference number: 20200427). Written informed consent for participation was not required for this study in accordance with the national legislation and the institutional requirements.