ORIGINAL RESEARCH article

Front. Artif. Intell., 26 June 2025

Sec. Medicine and Public Health

Volume 8 - 2025 | https://doi.org/10.3389/frai.2025.1544200

Construction of medical scientific data repositories in China: analysis of survey and recommendations

  • 1School of Management, Shenzhen Polytechnic University, Shenzhen, China
  • 2Centre for Modern Industry and SME, Shenzhen, China
  • 3School of Government, Beijing Normal University, Beijing, China
  • 4Department of Information Science, Faculty of Humanities and Social Sciences, Khon Kaen University, Khon Kaen, Thailand

Background: In the context of global open science trends, medical open-access repositories (OARs) promote transparency in research and facilitate the sharing of scientific data. The increase in scientific output necessitates a robust infrastructure to enhance OARs in China.

Objectives: This study aimed to evaluate medical OARs in China that are indexed in re3data.org and OpenDOAR.org. It analyzed the data classification, description, retrieval, and utilization of the selected repositories.

Methods: This study ascertained the current status of the Chinese medical OARs by visiting their respective websites and attempted to identify the disciplinary orientation of each OAR. A content analysis approach was utilized to achieve this study’s objective. Twelve Chinese medical open-access repositories were selected from re3data.org and OpenDOAR.org to examine how their information is organized. The data were collected manually from May 1 to 30, 2023, and analyzed using various quantitative techniques to understand the current status of medical scientific repositories in China.

Results: Based on the results, this study proposed the following recommendations: (1) implement multi-dimensional data classification, (2) use persistent data identifiers, (3) formalize the description metadata, (4) enhance advanced retrieval and result set filtering functions, and (5) optimize the preview and interaction features of data repositories.

Conclusion: The scope of this study is restricted to the medical open-access repositories in China as listed on re3data.org and OpenDOAR.org. Therefore, the results of this study are only generalizable within China. The primary focus of research output in China is on medical open-access repositories. This study is essential for assessing China’s current status in research data management within the medical field and its distribution infrastructure in global open science trends.

1 Introduction

Scientific data are a vital infrastructure for research, serving as objects, tools, and resources for scientific exploration; effectively managing and enhancing their utilization is essential (1). Medical scientific data are derived from medical research, experiments, clinical services, and health management practices (Guan, 2022); open-access medical scientific data can accelerate the process of scientific research (Liu and Xu, 2014). The International Science Council (ISC) promotes the open sharing of research findings (ISC, 2019). Developing medical big data and promoting new applications are described in the Healthy China 2030 Plan released in 2016 (CPC Central Committee and Council State, 2016).

Open-access repositories (OARs) are vital for freely sharing intellectual knowledge on the Internet and ensuring it is accessible to the public (Sofi et al., 2024). They are considered a new avenue for scholarly communication, allowing researchers to share their work more rapidly with a larger audience (Tmava, 2023). The Directory of Open Access Repositories (OpenDOAR) is the primary directory for searching OARs and their contents (Ibrahim and Beigh, 2019). Since the inception of OpenDOAR in 2005, the number of open-access repositories on the platform has grown significantly, highlighting the influence of the global open-access movement (Sofi and Mir, 2023). The records are reviewed and updated periodically, providing an up-to-date snapshot of the global academic repository landscape (Pinfield et al., 2014). In addition, re3data features a rich and diverse collection of data repositories from numerous countries around the world (Khan et al., 2024), aiming to promote the sharing, access, and better visibility of scientific data (Gurikar and Hadagali, 2020). Among the available sources, re3data.org is the most comprehensive registry for searching and identifying data repositories (Gohain, 2021). Medical OARs are essential for promoting data openness (Zuo et al., 2018), providing convenience for adequate access and utilization of medical research data (Ma et al., 2024). Data classification and description serve as the fundamental basis for data services related to medical OARs (Bond and Sheta, 2021). Data retrieval and utilization are integral for promoting OA (open-access) scientific data (Tao and Ye, 2023) and are essential for medical OARs. The standardization of the description of data in China’s national science and technology repositories needs improvement, particularly regarding the granularity and depth of data content description (Si and Wang, 2018).

The quality of data retrieval functions and data utilization services in data repositories has attracted the attention of scholars (Wu, 2018). Enhancing the classification, description, retrieval, and utilization of China’s medical scientific data can facilitate better querying, access, and openness of these data. In terms of medical OARs, relevant studies primarily focus on metadata (Archana and Padmakumar, 2023), data organization (Si and Wang, 2018; Wu, 2018), data reuse (Lin et al., 2024), operational status (Bashir et al., 2019), and data sharing (Elvas et al., 2023). However, the use of OARs in medicine and health is a relatively unexplored area of research that requires further investigation (Loan and Sheikh, 2016), and the research on China’s medical OARs is still fragmented and not systematically described. Based on this, the current study aims to advance the research on medical OARs in China.

2 Construction of repositories and literature review

2.1 Construction of medical OARs

With the support of policy and regulatory frameworks, the development of medical OARs has matured, and there has been progress in managing medical scientific data. The United States and the United Kingdom are the leading countries in the OA movement (Moskovkin et al., 2021) and have prioritized scientific data and formulated relevant policies and laws to regulate the open sharing of medical scientific data (Li et al., 2009). These two countries are primarily involved in developing institutional open-access repositories that form significant components of OpenDOAR (Sofi et al., 2024). In 2011, the UK Research Council issued the “Common Principles on Data Policy,” and this was updated in 2018 (Research Councils UK, 2018). The US National Institutes of Health released the “Data Management and Sharing Policy” in 2003, with an updated version taking effect on 25 January 2023 (NIH, 2020). The UK’s Data Regulation Center utilizes standardized templates designed by DMPonline to create a general medical scientific data management plan, offering personalized health and medical scientific data management services (Donnelly et al., 2010). Since the initial release of the “Protein Sequence and Structure Atlas” bioinformatics database in 1965 (Dayhoff, 1969), databases such as GenBank, EMBL, and DDBJ have emerged (Sindhu and Sindhu, 2023), and repositories for population health data, such as NCBI, CT, and UpToDate, have been established (Fu et al., 2017). The number of scientific data repositories available on re3data.org (Archana and Padmakumar, 2023; Cho, 2019; Sizhu et al., 2018) and OpenDOAR (Maharana and Chakrabarti, 2019; Mir, 2022; Rinehart and Cunningham, 2017) is increasing, and the related research continues to expand in various contexts.

China is actively promoting policies and establishing repositories for open access to medical scientific data. In 2018, the “Measures for the Management of Research Data” were implemented (General Office of the State Council, 2018). In May 2019, the State Council issued the “Regulation of the People’s Republic of China on the Administration of Human Genetic Resources” to restrict the sharing of human genetic data; this regulation was revised in 2024 (State Council of the People’s Republic of China, 2024). In June 2019, the Ministry of Science and Technology added the National Center for Genome Sciences Data and other medical repositories to the “List of National Science and Technology Resource Sharing Service Platforms” (Ministry of Science and Technology and Ministry of Finance China, 2019). In April 2021, China implemented the “Biosafety Law,” which aims to establish a comprehensive framework for biosafety management (Standing Committee of the National People’s Congress, 2020). In 2004, construction of the National Center for Population Health Science Data began (Li and Sun, 2015), and in 2015, the BIG Data Center (BIGD) was launched (BIG Data Center (BIGD), 2015). In 2019, the National Microbial Science Data Center (NMDC) was established (NMDC, 2019). Other repositories registered on re3data.org include DEG, GWH, and MiCroKitS, while those registered on OpenDOAR.org include HF-IR, CIB OpenIR, and PSYCH OpenIR.

2.2 Review of the related literature

Numerous studies have focused on OARs; a few are highlighted here, especially from the metadata perspective. Gu et al. (2019) investigated and analyzed the data collection, organization, and analysis processes related to the tranSMART knowledge management repositories. Based on MEDLINE journal selection criteria, Kirkham et al. (2020) identified the key features and policies of data organization in preprint repositories in the biomedical and medical fields. Several studies highlighted the importance of metadata quality and standards for enhancing research data interoperability, including the ways in which different data sources should use appropriate models for storing data and metadata (Kondylakis et al., 2022). The significance of standardized data formats for public resources in the biomedical sector has been highlighted (Swedlow et al., 2021). Removing ambiguity and eliminating duplicate data enhances repository metadata quality (Arencibia et al., 2022). Furthermore, the FAIR (findable, accessible, interoperable, and reusable) principles of scientific data have been implemented in mature natural science fields. A survey was conducted on the global implementation of FAIR principles in research data management (Reisen MVan Stokmans et al., 2019). To support the FAIR Data requirements of the research community, the Chemistry Implementation Network (ChIN) focuses on providing chemical-related data (Coles et al., 2020). Similarly, the Go FAIR Metric Group developed a general indicator framework based on the principle that “one indicator corresponds to only one principle.” This framework provides a template for designing indicators (Azevedo and Dumontier, 2020). The FAIR principles outline requirements, such as findability and interoperability, for data retrieval and metadata description.

Chinese scholars have investigated the current status and experiences of OAR construction in the United States, the United Kingdom, Australia, and Canada. For example, Li and Xin (2014) investigated and summarized the characteristics of government data portal websites in the US and UK regarding scientific data organization, browsing, and retrieval. Dengdeng and Feng (2016) analyzed the current status of research data storage and sharing repositories in the US, UK, Australia, and Canada from five aspects: policy and legal support, funding sources, construction models, data collection, and data services. Dandan and Jingyuan (2023) summarized the experiences of the Canadian Federal Science Data Repository regarding data submission, management, integration, and discovery and offered suggestions for creating a national scientific data repository in China.

Chinese scholars have also investigated the problems in China’s medical OARs and proposed suggestions for improvement. Si and Wang (2018) investigated and analyzed issues related to data description and retrieval across six research data-sharing repositories. Jiang et al. (2012) examined the deficiencies in describing and retrieving information from the National Population and Health Science Data Sharing Repositories. Mengxue (2020) compared and analyzed the data description and metadata aspects of multiple domestic and international medical scientific data repositories. Chunqiu et al. (2022) investigated the application of the FAIR principles in China’s medical OARs. Although existing research on China’s open repositories of research data has addressed issues related to data description and retrieval, there remains a lack of comprehensive investigation specifically focused on the classification and utilization of medical scientific data. These studies have provided valuable insights into the practices and lessons learned from constructing scientific data repositories in major developed countries, and their findings can help guide the development of similar repositories and initiatives in China.

To address this gap, a survey of 12 medical OARs in China was conducted through online registration and visits to their official websites. Based on the survey results, the current status and characteristics of data classification, description, retrieval, and utilization in the sample repositories were analyzed. This study is expected to promote research by providing suggestions for constructing open medical scientific data repositories.

This study aimed to achieve two primary objectives. First, it sought to examine the classification, description, retrieval, and utilization of data in China’s open scientific data repositories. Second, it explored potential recommendations that could enhance the development of these repositories. This paper aims to provide a current and comprehensive understanding of these repositories. We begin by describing a survey of 12 repositories, analyzing their benefits and conceptualizing their implications. In the concluding sections, we offer suggestions for future successful implementation of open-access data repositories, discuss the limitations of this study, and outline directions for further research.

3 Selection of medical OARs in China and survey design

3.1 Selection of repositories

3.1.1 Selection approach and principles

This study focused on medical repositories indexed in re3data.org and OpenDOAR, as well as on publications from the Chinese literature extracted from CNKI (China National Knowledge Infrastructure). The proposed methodology for the analysis of twelve different OARs was divided into two parts, preliminary research and formal research, as shown in Figure 1. re3data.org is a global directory of research data repositories that provides accessible datasets for researchers, funding agencies, publishers, and academic institutions (Sizhu et al., 2018). China possesses 37 of the largest data repositories, accounting for 24% of all data repositories in Asian countries (Cho, 2019). The search criteria on re3data.org were as follows: “countries = China,” “subject = medicine,” and “database access = open.” OpenDOAR.org is an authoritative directory of open-access repositories. The search criteria on OpenDOAR.org were as follows: “Subject = Health and Medicine + Biology and Biochemistry,” and “Country = China + Hong Kong + Macao + Taiwan.” Relevant databases of the Chinese literature were searched using the term “title or abstract contains ‘medical scientific data repositories’,” and the results were de-duplicated. The necessary data regarding these repositories were collected manually and entered into a Microsoft Excel file for tabulation. By drawing on multiple sources, this approach ensured that the sample encompassed a representative set of medical OARs operated by Chinese institutions or containing Chinese-related data. This provided a solid foundation for the subsequent in-depth investigation.

Figure 1. The survey design for this research.

3.1.2 Selection of sample repositories

After removing duplicate entries and consolidating the repositories from re3data.org and OpenDOAR.org with the relevant literature, we identified a list of twelve accessible open-access medical scientific data repositories for this study. The identified repositories were manually checked. Table 1 provides the respective URLs (uniform resource locators) for these repositories. The types and characteristics of these medical OARs were analyzed for a deeper understanding.
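
The consolidation step described above can be sketched in Python. This is only an illustration, not the procedure used in the study, which collected and checked entries manually in Microsoft Excel. The `normalize` and `consolidate` helpers and the three candidate lists are hypothetical; the names are repositories mentioned in this article, and the overlap between sources is invented purely to show how duplicates would collapse.

```python
def normalize(name: str) -> str:
    """Build a case- and punctuation-insensitive key for a repository name."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def consolidate(*sources: list[str]) -> list[str]:
    """Merge candidate lists, keeping the first spelling seen for each repository."""
    seen: dict[str, str] = {}
    for source in sources:
        for name in source:
            seen.setdefault(normalize(name), name)
    return sorted(seen.values(), key=str.lower)


# Hypothetical candidate lists (illustrative only, not the study's raw data).
re3data_hits = ["DEG", "GWH", "MiCroKitS"]
opendoar_hits = ["HF-IR", "CIB OpenIR", "PSYCH OpenIR"]
literature_hits = ["gwh", "NCMI", "MicroKits"]  # two entries duplicate re3data hits

candidates = consolidate(re3data_hits, opendoar_hits, literature_hits)
print(candidates)
```

Matching on a normalized name is a deliberately simple choice; in practice, a manual check of each repository website, as performed in this study, would still be needed to confirm that two entries refer to the same repository.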

Table 1. The research samples of medical OARs in China.

3.2 Research methods

The methodology used in this study involved extracting metadata elements from the selected repositories on re3data.org and OpenDOAR.org. Additionally, the authors examined the websites of registered medical OARs in China to further explore the research questions. The study’s methodology is illustrated in Figure 1.

To gain a more in-depth understanding of the status of medical OARs, the data were collected from re3data.org and OpenDOAR.org, along with the relevant URLs of medical scientific data repositories, from 1 May to 30 May 2023. The URLs of these repositories were accessed manually to fulfill the research objectives by investigating and analyzing four dimensions: data classification, data description, data retrieval, and data utilization. The extracted data were analyzed using Microsoft Excel. This formed the foundation of the research approach (see Figure 1). Based on the preliminary investigation, the final research content of the study is outlined in Tables 2 and 3.

Table 2. Survey content regarding data classification and description dimensions.

Table 3. Survey content regarding data retrieval and utilization dimensions.

4 Analysis of survey results from sample repositories

The data were analyzed to determine the various characteristics of the retrieved repositories from OpenDOAR and re3data. This section of the article summarizes the findings of the collected data across four aspects: data classification, data description, data retrieval, and data utilization. These findings are displayed in tables.

4.1 Current status of repositories concerning the data classification dimension

The data classification dimension was investigated from two aspects: classification basis and classification hierarchy. Table 4 shows the survey results.

Table 4. Survey results on data classification dimension.

4.1.1 Integrating characteristics of medical scientific data through multivariate classification

The NCMI comprises various classification bases, including species, data sharing methods, source units, source types, human organ distribution, keywords, data formats, and methods for generating research data. Additionally, eight repositories, including GWH, MTD, and NONCODE, are categorized according to species. MiCroKitS classifies based on the distribution of cell locations, MTD classifies based on the distribution of chromosome locations and human body organs, and NCMI classifies based on the distribution of human body organs. NCMI also provides keyword classification, presenting medical research data corresponding to the top 20 keywords. CNGBdb and NCMI categorize research data based on the method of scientific data generation, which includes observation records, digital processing, instrumental measurements, survey interviews, and simulation analysis. The sample repositories utilize various data classification methods based on the types and characteristics of their open data.

4.1.2 The coexistence of single-level and multi-level classification hierarchies

The research results indicate that among the twelve examined repositories, four OARs (GWH, GSA, MTD, and BIGD) have only one level of classification hierarchy. The remaining eight have two or more levels of classification. Specifically, MiCroKitS has four classification levels, while CNGBdb and NCMI have three. NGDC, NONCODE, CSDB, DEG, and CPLM have two classification levels. A greater number of detailed classification levels helps in identifying and selecting specific data.
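
The counts in this subsection can be cross-checked with a short tally. The sketch below simply encodes the classification depths stated in the paragraph above; the variable names are illustrative and not part of the study's method, which relied on Microsoft Excel.

```python
from collections import Counter

# Classification depth per repository, as reported in Section 4.1.2.
classification_levels = {
    "GWH": 1, "GSA": 1, "MTD": 1, "BIGD": 1,
    "NGDC": 2, "NONCODE": 2, "CSDB": 2, "DEG": 2, "CPLM": 2,
    "CNGBdb": 3, "NCMI": 3,
    "MiCroKitS": 4,
}

# How many repositories share each depth.
level_counts = Counter(classification_levels.values())

# Split the sample into single-level and multi-level hierarchies.
single_level = [name for name, depth in classification_levels.items() if depth == 1]
multi_level = [name for name, depth in classification_levels.items() if depth >= 2]

for depth in sorted(level_counts):
    print(f"{depth} level(s): {level_counts[depth]} repositories")
print(f"single-level: {len(single_level)}, multi-level: {len(multi_level)}")
```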

4.2 Data description dimension

The data description dimension was investigated from three aspects: data type, identifier, and metadata description. The survey results are shown in Table 5.

Table 5. Survey results on data description dimension.

4.2.1 Variety of data types and predominantly structured data

The twelve sample repositories include various types of medical scientific data, such as research and statistical formats, raw data, archived data, software applications, structured text, structured graphics, databases, plain text, standard office documents, configuration data, and other data types. Structured data are the primary data types in all twelve repositories except for NGDC, CSDB, and BIGD. The remaining nine sample repositories mainly contain research data, statistical data, raw data, and archived data. NONCODE, NCMI, MiCroKitS, and CPLM include semi-structured software applications, and NCMI includes unstructured graphic and audiovisual data. Medical scientific data repositories can therefore also store and serve semi-structured and unstructured data.

4.2.2 The adoption of three types of data identifiers

This research found that the sample repositories mainly adopt three types of data identifiers. First, internationally recognized persistent identifiers for distinguishing unique objects, such as the Digital Object Identifier (DOI), are used by CNGBdb, CSDB, and BIGD. Second, the persistent identifier commonly used in China, the China Science and Technology Resource (CSTR) identifier, is adopted by NCMI, CSDB, and BIGD. Third, eight repositories, including NGDC, GWH, GSA, and MTD, use customized internal identifiers, which are less persistent and compatible than the previous two types.
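
As a rough illustration of how these three identifier types might be told apart programmatically, the sketch below classifies an identifier string by its shape. It rests on stated assumptions: the DOI check uses the common "10.", registrant code, "/", suffix form; the CSTR check assumes only a `CSTR:` prefix and does not validate the rest of the identifier; anything else is treated as a repository-internal identifier. The function name and the sample identifiers are hypothetical and do not come from the surveyed repositories.

```python
import re

# Common DOI shape, optionally preceded by "doi:" or a doi.org resolver URL.
DOI_PATTERN = re.compile(r"^(?:doi:|https?://doi\.org/)?10\.\d{4,9}/\S+$", re.IGNORECASE)

# Assumption: a CSTR identifier is recognized only by its "CSTR:" prefix.
CSTR_PATTERN = re.compile(r"^CSTR:\S+$", re.IGNORECASE)


def identifier_type(identifier: str) -> str:
    """Return 'DOI', 'CSTR', or 'internal' based on the identifier's shape."""
    value = identifier.strip()
    if DOI_PATTERN.match(value):
        return "DOI"
    if CSTR_PATTERN.match(value):
        return "CSTR"
    return "internal"


# Hypothetical identifiers, for illustration only.
samples = ["10.1234/example.dataset", "CSTR:00000.00.example", "SAMPLE000001"]
for sample in samples:
    print(sample, "->", identifier_type(sample))
```

A shape check of this kind says nothing about whether an identifier actually resolves; persistence depends on the registration and resolution services behind DOI and CSTR, not on the string format.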

4.2.3 Metadata descriptions need to be standardized

This research found that the sample repositories provide metadata through web forms, text on web pages, and XLSX files. Out of the twelve repositories, eight provide metadata in the web form, two present them as web text, and two share them in the XLSX format. The twelve repositories include various metadata elements, with the number of elements ranging from ten to twenty. The NGDC, GWH, and GSA repositories in the sample adhere to the group standard titled “Metadata Standard for Human Genetic Sequencing Raw Data Repertoire” (T/CHIA20-2021), which provides a detailed description of the attributes associated with gene sequencing raw data. The standard was developed based on the data standards set by the International Nucleic Acid Sequence Sharing Consortium and guidelines for database construction. It serves as a valuable reference for the metadata description of data related to health, disease, and other biological projects.

4.3 Data retrieval dimension

The data retrieval dimension was investigated from three aspects: search method, popular searches, and search results display. The survey results are shown in Table 6.

Table 6. Survey results on data retrieval dimensions.

4.3.1 Support for multiple search methods

The twelve sample repositories provide fairly comprehensive query components, and every repository except MTD supports a simple search. All twelve repositories feature an on-site search engine and drop-down menus that allow users to refine their search conditions and terms. Additionally, MTD, BIGD, DEG, MiCroKitS, and CPLM include checkboxes to further broaden or narrow the scope of searches based on specific needs. The GSA, MTD, CNGBdb, NCMI, CSDB, BIGD, MiCroKitS, and CPLM sample repositories all support advanced search functions, although their methods vary. CNGBdb and NCMI offer dedicated advanced search portals along with Boolean, faceted, and visualized search options. However, none of these repositories support searches that are restricted to specific fields. In addition, all twelve repositories offer data retrieval capabilities and can be searched to obtain the required information based on user needs. CNGBdb, NCMI, CSDB, and BIGD facilitate factual retrieval and allow for natural language queries, making them user-friendly. Additionally, eight of these repositories feature BLAST (basic local alignment search tool) retrieval, which connects medical scientific data with specialized bioinformatics tools. The sample repositories can be further developed to enhance the advanced search function, improving the accessibility of medical scientific data.

4.3.2 Providing popular search functions

Seven of the twelve sample repositories offer a popular search function, presented in various formats. The repositories NGDC, GWH, GSA, and CNGBdb provide popular search terms, while NCMI, CSDB, and BIGD present popular search rankings. Additionally, NCMI includes a dynamic word cloud that highlights popular search terms. These top search terms allow users to access the latest and most relevant repository data quickly. They present innovative ideas for data retrieval, while the word cloud visually illustrates the popularity of various repository topics. Trending searches based on high-frequency search terms and research hotspots can help identify cutting-edge trends in the field.

4.3.3 Search results display supports multiple sorting

The twelve sample repositories vary significantly in how they present search results and how they provide support for secondary searches. The repositories can be classified into two display types: scrolling full display and page-limited display. MTD, NONCODE, CNGBdb, MiCroKitS, and CPLM only support a scrolling full display of results. In contrast, seven repositories, including NGDC, support a page-limited display. However, NCMI, BIGD, and DEG do not allow users to select the number of search results displayed on each page; instead, they present a fixed number of results. On the other hand, NGDC, GWH, GSA, and CSDB enable users to choose the number of entries shown on each page, with options for 10, 25, or 50 entries. MTD, NONCODE, and CNGBdb currently do not support sorting of the result set. In contrast, NGDC, GWH, and GSA allow sorting by number and by description. NCMI, CSDB, and BIGD provide the option to sort by access, and both NCMI and BIGD allow sorting by release time. BIGD also supports sorting by downloads and relevance, while DEG supports sorting by access control functions, gene names, parameter information, and more.
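
The page-limited display and sortable result sets described above can be sketched as follows. This is a generic illustration under stated assumptions, not the implementation of any surveyed repository: the `Record` fields, the `result_page` function, and the sample records are hypothetical, and only the page-size options (10, 25, or 50) are taken from the survey findings.

```python
from dataclasses import dataclass
from operator import attrgetter

# Page-size options reported for NGDC, GWH, GSA, and CSDB.
ALLOWED_PAGE_SIZES = (10, 25, 50)


@dataclass(frozen=True)
class Record:
    accession: str    # hypothetical internal identifier
    description: str
    downloads: int


def result_page(records, sort_by="accession", descending=False, page=1, page_size=10):
    """Sort the full result set by one field, then return a single page of it."""
    if page_size not in ALLOWED_PAGE_SIZES:
        raise ValueError(f"page_size must be one of {ALLOWED_PAGE_SIZES}")
    if page < 1:
        raise ValueError("page numbers start at 1")
    ordered = sorted(records, key=attrgetter(sort_by), reverse=descending)
    start = (page - 1) * page_size
    return ordered[start:start + page_size]


# Hypothetical result set of 27 records, for illustration only.
records = [Record(f"R{i:03d}", f"dataset {i}", downloads=(i * 7) % 30) for i in range(1, 28)]

most_downloaded = result_page(records, sort_by="downloads", descending=True, page_size=10)
last_page = result_page(records, sort_by="accession", page=3, page_size=10)
print(len(most_downloaded), len(last_page))
```

Sorting before slicing keeps the ordering consistent across pages, which is the behavior a user would expect when moving from one page of results to the next.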

4.4 Data utilization dimension

The data utilization dimension was investigated from two aspects: platform statistics and utilization of search results. The survey results are shown in Table 7.

Table 7. Survey results on data utilization dimension.

4.4.1 Providing basic statistical functions

Except for MTD, all sample repositories provide statistical data, although the statistics and their presentation are quite basic. Seven of these repositories focus on the number of statistical datasets, while ten are dedicated to the total count of repositories. The NGDC and CPLM repositories track the core data, whereas CNGBdb and DEG monitor the statistical data source units. Additionally, GWH, GSA, NONCODE, and CSDB visually represent the statistical repository data using charts and graphs. Only GWH, NONCODE, and BIGD support the visual presentation of medical scientific data.

4.4.2 Enables users to browse and download search results

Only BIGD supports online previewing of files; the other repositories present data details only after the files have been downloaded. Out of the twelve sample repositories, nine allow data downloads. The NGDC, CSDB, and CPLM do not support data downloads. GWH offers partial data downloads, while CSDB requires an application to access data downloads. The data download formats vary by repository: GWH, MTD, NONCODE, and CNGBdb support downloading GZ files; GSA, NCMI, and BIGD support downloading XLSX files; MTD also supports TXT files; and DEG supports ZIP and DAT files.

4.4.3 Some repositories emphasize user interaction

Data interactions primarily involve actions such as liking, favoriting, and sharing. The CSDB platform allows users to share content through microblogging, while NCMI enables sharing via WeChat, QQ, and microblogging. Both NCMI and BIGD allow users to favorite data and view their favorites. Additionally, NCMI, CSDB, and BIGD support data liking. NCMI also offers tools for analyzing data utilization cases.

5 Discussion

OARs provide long-term sustainable storage, preservation, and open access to resources and serve as tangible indicators of an institution’s productivity, thereby increasing the institution’s visibility, prestige, and value (Tmava, 2023). In Asian nations, the open-access movement is expanding (Sofi and Mir, 2023). Data management is a complex issue; many efforts are being made to address privacy, ethical, and intellectual property issues through initiatives such as Creative Commons and similar activities (Gurikar and Hadagali, 2020). By analyzing the survey results, the analysis framework devised in this research suggests some solutions for selecting OARs. Based on the research results and analysis, China should consider the following recommendations to enhance the openness of medical scientific data and the development of repositories.

5.1 Utilization of multi-dimensional data classification

The structure of the twelve medical OARs varies in terms of their specific characteristics and the classification levels of the associated research data. This indicates that medical OARs in China can implement a multi-dimensional data classification system that aligns with the characteristics of the repositories’ data. A detailed classification of medical research data by type and characteristics can enable quicker and more accurate retrieval.

5.2 Selection of persistent data identifiers

Out of the twelve sample repositories, eight utilize identifiers defined by the repositories. However, these identifiers lack persistence, stability, and compatibility. Persistent data identifiers are crucial because they enable stable aggregation and effective utilization of datasets. China’s medical OARs can refer to the national mandatory standards for health information and select persistent and stable data identifiers, such as CSTR and DOI, which are widely used both domestically and internationally.

5.3 Implementing standardized descriptions for metadata

Currently, the sample repositories have adopted various descriptive specifications for their metadata elements. Only NGDC, GWH, and GSA have selected metadata elements that comply with relevant group standards. The sample repositories can develop appropriate metadata specifications based on published standards in China’s medical field. They can reference both national and international metadata description standards to enhance the integration and utilization of varied medical research data from multiple sources.

5.4 Improving the advanced search features and refining the filtering of search results

The sample repositories can enhance the advanced search portal by incorporating various search forms, such as faceted search and visualization search, to improve the user experience when searching for medical scientific data. Among the twelve sample repositories, some lack secondary search functionality, and three do not allow search results to be sorted. Additionally, the logic behind how search results are displayed is unclear, making it difficult to assess and choose relevant results quickly. Therefore, it is advisable for sample repositories to offer a secondary search function and enhance the sorting of result sets, allowing users to rapidly locate desired data based on specific criteria.

5.5 Optimizing repository data to improve preview and interaction

The repositories should present their core data, total data volume, data sources, and diverse datasets in visual formats such as charts. Out of the twelve sample repositories, only BIGD provides a preview of data files, and NGDC, CSDB, and CPLM do not permit data downloads. It is suggested that these repositories enhance their data preview functionality, provide data download services, and improve interactivity by enabling users to like, bookmark, and share data on external websites. This will help promote the circulation and utilization of medical scientific data.

This study presents an in-depth analysis of medical scientific data repositories in China, emphasizing their developmental trajectory and areas requiring enhancement. The findings illuminate the interplay between progress and persistent challenges, particularly regarding data classification, description, retrieval, and utilization. The Analysis of Survey Results from the Sample Repositories section reveals noteworthy alignments and critical discrepancies in repository implementation, underlining achievements and opportunities for improvement. The results of the study indicate that current medical scientific data resources primarily focus on genomic phenotypes, while brain imaging datasets, such as those of the Chinese Color Nest Data Community (CCNDC) of Science Data Bank and the National Basic Science Data Center (Chinese Color Nest Data Community, 2025), have not been fully integrated. Therefore, the study recommends broadening the types of data to include additional categories of medical scientific data, such as brain imaging data (Gong and Zuo, 2025), to support more comprehensive research practices.

This research illustrates how China’s repositories are underpinned by robust policy frameworks, such as the “Measures for the Management of Scientific Data” and the “Biosecurity Law” (General Office of the State Council, 2018; State Council of the People’s Republic of China, 2024). These initiatives have catalyzed the development of repositories with a focus on open access. However, disparities in metadata standardization, advanced retrieval functions, and user interaction mechanisms persist, as the survey findings demonstrate. While the repositories align with global trends in repository development (Sofi et al., 2024; Gohain, 2021), challenges specific to China’s context, such as uneven adherence to metadata standards and limited functionality for advanced searches, hinder their full potential.

From a data classification and description perspective, the literature highlights the importance of adopting multi-dimensional classification and adhering to metadata standards and guiding frameworks such as the FAIR principles (Rinehart and Cunningham, 2017; Swedlow et al., 2021). Survey results validate these findings, revealing that repositories such as MiCroKitS and CNGBdb utilize advanced multi-level classification systems. However, inconsistencies in metadata adoption still need to be addressed, with only a subset of repositories integrating international standards such as DOI and CSTR (Azevedo and Dumontier, 2020). In data retrieval, the literature emphasizes the necessity of advanced features, including Boolean and faceted searches (Tmava, 2023; Gohain, 2021). Repositories such as CNGBdb and NCMI demonstrate partial implementation of these capabilities, yet gaps in field-specific searches and sorting functionalities suggest room for improvement to align with global benchmarks (Moskovkin et al., 2021). Furthermore, data utilization needs further development, with more interactive features and visualization tools. Although repositories such as BIGD and NCMI exhibit partial adoption of user-friendly interfaces, these features need to be implemented more consistently and widely (Gu et al., 2019).

The growth of repositories in China, as detailed in the Construction of Repositories, reflects the influence of policy-driven initiatives. National frameworks have expedited the establishment of repositories, but their impact varies across institutions, resulting in uneven adoption of advanced functionalities and standardized practices (Ministry of Science and Technology and Ministry of Finance China, 2019). This highlights a critical need for harmonization across repositories to realize the full potential of open-access infrastructure.

The synthesis of findings reveals a dual narrative: China’s repositories are advancing with global trends, yet they must catch up in key areas such as metadata standardization and interactive functionalities. Aligning these repositories with international standards (Chunqiu et al., 2022) and focusing on user-centric improvements will significantly enhance their functionality and accessibility. By leveraging insights from global repositories and addressing the identified gaps, China’s medical data repositories can better facilitate research, improve user engagement, and contribute meaningfully to global scientific collaboration.

In conclusion, this research underscores the imperative to build upon the existing frameworks while addressing the deficiencies that hinder the full realization of the potential of China’s medical data repositories. With a commitment to adopting global best practices and enhancing user-focused functionalities, these repositories can evolve into essential nodes in the international open-access ecosystem, supporting broader scientific innovation and discovery.

6 Conclusion

The data from the re3data and OpenDOAR services presented herein provide an important perspective on developing medical OARs in China. The open-access trend in medical scientific data that is described in this article can be explained by data classification, description, retrieval, and utilization. Among these, data description and data retrieval play an important role. In this regard, our study also suggests the construction of medical OARs in China to enhance the management of China’s medical scientific data. Nonetheless, the study results have certain constraints due to limitations in the number of sample repositories, the research duration, and the study’s content. Furthermore, only medical OARs are addressed in this study, and other types of repositories may require different ways of organizing information; an exploration of this matter is not within the scope of this article. Our study also suggests that data-sharing platforms in China need to be more user-friendly. The management of repositories and content can be enhanced for both current and future needs. In the future, China’s medical OARs should enhance data classification, data identifiers, metadata descriptions, and repository functions to promote open sharing and interconnection of research data. Furthermore, topics such as remittance management, the metadata description specifications of medical scientific data, security risk management, and the privacy protection of medical OARs merit further study. Researchers’ willingness to share medical scientific data will also be taken into consideration. Exploring ways to enhance researchers’ enthusiasm for data sharing is important because that enthusiasm is a key factor influencing the level of data openness and the effectiveness of sharing. Establishing a reasonable incentive mechanism for data sharing and clarifying the rights and regulations surrounding data use are essential for addressing the concerns and obstacles researchers encounter during the data-sharing process.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding authors.

Author contributions

JS: Conceptualization, Funding acquisition, Investigation, Writing – original draft. CL: Conceptualization, Investigation, Writing – original draft. WC: Writing – original draft.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This research was funded by the Research Foundation of Shenzhen Polytechnic University (Grant number 6022312032S); Shenzhen Philosophy and Social Sciences Planning Project, Research on Shenzhen’s implementation of the national cultural digitalization strategy (Grant number SZ2023B017); Guangdong Provincial Philosophy and Social Science Planning 2024 Greater Bay Area Research Project (Grant number GD24DWQGL05).

Acknowledgments

We appreciate the contributions of Shuyu Pan at the Guangxi Hospital Division of The First Affiliated Hospital, Sun Yat-sen University, and Yuan Liu at Capital Medical University Yanjing Medical College during the data survey and data collection stage of this study.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The authors declare that no Gen AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Archana, S. N., and Padmakumar, P. K. (2023). The status quo of Indian data repositories indexed in re3data registry. Digit. Libr. Perspect. 39, 496–516. doi: 10.1108/DLP-02-2023-0017

Arencibia, E., Martinez, R., Marti-Lahera, Y., and Goovaerts, M. (2022). “On metadata quality in Sceiba, a platform for quality control and monitoring of Cuban scientific publications” in Communications in Computer and Information Science. eds. S. K. Bhatia and S. Tiwari (Berlin: Springer).

Azevedo, R. M., and Dumontier, M. (2020). Considerations for the conduction and interpretation of fairness evaluations. Data Intell. 2, 285–292. doi: 10.1162/dint_a_00051

Bashir, A., Ahmad Mir, A., and Ahmad Sofi, D. (2019). Global landscape of open access repositories. Libr. Philos. Pract. 2445, 1–21.

BIG Data Center (BIGD). (2015). The mission of National Genomics Data Center. Available online at: https://ngdc.cncb.ac.cn/mission?lang=en (Accessed May 17, 2023).

Bond, K., and Sheta, A. (2021). Medical data classification using machine learning techniques. Int. J. Comput. Appl. 183, 1–8. doi: 10.5120/ijca2021921339

Chinese Color Nest Data Community. (2025). Available online at: https://ccndc.scidb.cn (Accessed June 7, 2025).

Cho, J. (2019). Study of Asian RDR based on re3data. Electron. Libr. 37, 302–313. doi: 10.1108/EL-01-2019-0016

Chunqiu, L., Boya, D., Qian, G., Ningyuan, S., Zixuan, L., and Shunmei, Y. (2022). Application assessment and survey analysis of FAIR principle in medical scientific data open platforms. Libr. Inf. Serv. 2, 72–82. doi: 10.21608/buhuth.2022.271139

Coles, S. J., Frey, J. G., Willighagen, E. L., and Chalk, S. J. (2020). Taking fair on the chin: the chemistry implementation network. Data Intell. 2, 131–138. doi: 10.1162/dint_a_00035

CPC Central Committee and Council State (2016). Healthy China 2030 plan. Available online at: https://www.gov.cn/zhengce/2016-10/25/content_5124174.htm (Accessed May 16, 2023).

Dandan, W., and Jingyuan, R. (2023). Function analysis of Canadian federated research data repository and its inspiration. Libr. Dev. 319, 170–177. doi: 10.19764/j.cnki.tsgjs.20211859

Dayhoff, M. O. (1969). Atlas of protein sequence and structure. Silver Spring, MD: National Biomedical Research Foundation.

Dengdeng, W., and Feng, G. (2016). Study on the status and enlightenment of scientific data storage and sharing platform in England. Libr. Dev. 3, 29–34. doi: 10.3969/j.issn.1004-325X.2016.03.006

Donnelly, M., Jones, S., and Pattenden-Fail, J. W. (2010). DMP online: The digital curation Centre’s web-based tool for creating, maintaining and exporting data management plans. Int. J. Digit. Curation 5, 187–193. doi: 10.1007/978-3-642-15464-5_74

Elvas, B., Ferreira, C., and Dias, M. S. (2023). Health data sharing towards knowledge creation. Systems 11:435. doi: 10.3390/systems11080435

Fu, L., Li, J., Wang, R., Shang, X., and Yin, L. (2017). Constructing characteristics about scientific data sharing platform in the field of population health in foreign countries and its inspiration to us. China Sci. Technol. Resour. Rev. 49, 89–94. doi: 10.3772/j.issn.1674-1544.2017.05.012

General Office of the State Council (2018). Measures for the management of scientific data. Beijing: General Office of the State Council.

Gohain, R. R. (2021). Status of global research data repository: an exploratory study. Libr. Philos. Pract. 5193, 1–12. Available at: https://digitalcommons.unl.edu/libphilprac/5193

Gong, Z. Q., and Zuo, X. N. (2025). Dark brain energy: toward an integrative model of spontaneous slow oscillations. Phys Life Rev 52, 278–297. doi: 10.1016/j.plrev.2025.02.001

Gu, W., Yildirimman, R., Van der Stuyft, E., Verbeeck, D., Herzinger, S., Satagopam, V., et al. (2019). Data and knowledge management in translational research: implementation of the eTRIKS platform for the IMI oncotrack consortium. BMC Bioinfor. 20:164. doi: 10.1186/s12859-019-2748-y

Guan, J. (2022). Ethical requirements and management standards for the sharing and re-use of scientific data in health care and medicine (I) preface. Chin. Med. Ethics. 33, 144–147. doi: 10.12026/j.issn.1001-8565.2020.02.03

Gurikar, R., and Hadagali, G. S. (2020). The re3data.Org: reservoir of open access to research data. Int. J. Inf. Stud. Libr. 5, 16–26.

Ibrahim, S., and Beigh, I. N. (2019). Contribution of the UK open access repositories to OpenDOAR. Libr. Philos. Pract. 21:2592.

ISC (2019). Science as a global public good: ISC action plan, 2019–2021. Paris: ISC.

Jiang, J., Zhao, H., and Liu, R. (2012). Discussion on the information organization of the scientific data sharing data sharing platform-take the National Scientific Data Sharing Platform for population and health as an example. J. Inf. Resour. Manag. 2, 52–56. doi: 10.13365/j.jirm.2012.04.009

Khan, A. M., Loan, F. A., Parray, U. Y., and Rashid, S. (2024). Global overview of research data repositories: an analysis of re3data registry. Inf. Discov. Deliv. 52, 53–61. doi: 10.1108/IDD-07-2022-0069

Kirkham, J. J., Penfold, N. C., Murphy, F., Boutron, I., Ioannidis, J. P., Polka, J., et al. (2020). Systematic examination of preprint platforms for use in the medical and biomedical sciences setting. BMJ Open 10:e041849. doi: 10.1136/bmjopen-2020-041849

Kondylakis, H., Ciarrocchi, E., Cerda-Alberich, L., Chouvarda, I., Fromont, L. A., Garcia-Aznar, J. M., et al. (2022). Position of the AI for health imaging (AI4HI) network on metadata models for imaging biobanks. Eur. Radiol. Exp. 6:29. doi: 10.1186/s41747-022-00281-1

Li, J., Liu, D. H., and Jiang, H. (2009). Research on international scientific data sharing. Libr. Dev. 2, 19–22.

Li, Z., and Sun, H. (2015). Analysis of the resources construction mode of the National Scientific Data Sharing Platform for population and health. J. Med. Inform. 36, 72–76. doi: 10.3969/j.issn.1673-6036.2015.10.016

Li, S., and Xin, L. (2014). A study on scientific data organization and retrieval of government open data portals in UK and USA. Libr. Trib. 17, 110–114.

Lin, J., Jiang, Y., and Chen, Y. (2024). Research on the generation mechanism and action mechanism of scientific data reuse behavior. J. Acad. Librariansh. 50:102921. doi: 10.1016/j.acalib.2024.102921

Liu, C., and Xu, Y. (2014). New roles of the professional library in the environment of open science and open data. Libr. Dev. 83–88. doi: 10.3969/j.issn.1004-325X.2014.02.023

Loan, F. A., and Sheikh, S. (2016). Analytical study of open access health and medical repositories. Electron. Libr. 34, 419–434. doi: 10.1108/EL-01-2015-0012

Ma, X., Jiao, H., Zhao, Y., Huang, S., and Yang, B. (2024). Does open data have the potential to improve the response of science to public health emergencies? J. Informetr. 18:101505. doi: 10.1016/j.joi.2024.101505

Maharana, B., and Chakrabarti, i. A. (2019). LIS open access institutional digital repositories in OpenDOAR: an apprasal. Libr. Philos. Pract. 1:2757.

Mengxue, Y. (2020). Comparative analysis of health medical scientific data management platforms from domestic and abroad. Digit. Libr. Forum 22, 11–19. doi: 10.3772/j.issn.1673-2286.2020.01.002

Ministry of Science and Technology and Ministry of Finance China (2019). Notice of optimizing and adjusting the list of National Science and Technology Resource Sharing Service Platforms. Beijing: Ministry of Science and Technology and Ministry of Finance China.

Mir, A. A. (2022). Growth and development of open access institutional repositories in Africa. Int. J. Inf. Sci. Manag. 20, 41–53.

Moskovkin, V. M., Saprykina, T. V., Sadovski, M. V., and Serkina, O. V. (2021). International movement of open access to scientific knowledge: a quantitative analysis of country involvement. J. Acad. Librariansh. 47:102296. doi: 10.1016/j.acalib.2020.102296

NIH (2020). NIH policy for data management and sharing. Bethesda, MD: National Institutes of Health (OD).

NMDC. (2019). Introduction of National Microbiology Data Center, N.M.D.C. Available online at: https://nmdc.cn/introduction (Accessed May 18, 2023).

Pinfield, S., Salter, J., Bath, P. A., Hubbard, B., Millington, P., Anders, J. H. S., et al. (2014). Open-access repositories worldwide, 2005–2012: past growth, current characteristics, and future possibilities. J. Assoc. Inf. Sci. Technol. 65, 2404–2421. doi: 10.1002/asi.23131

van Reisen, M., Stokmans, M., Basajja, M., Ong’ayo, A., Kirkpatrick, C., and Mons, B. (2019). Towards the tipping point for FAIR implementation. Data Intell. Spec. Issue FAIR Best Pract. 2, 264–275. doi: 10.1162/dint_a_00049

Research Councils UK (2018). Research councils (RCUK) common principles on data policy. Swindon: Research Councils (RCUK).

Rinehart, A., and Cunningham, J. (2017). Breaking it down: a brief exploration of institutional repository submission agreements. J. Acad. Librariansh. 43, 39–48. doi: 10.1016/j.acalib.2016.10.002

Si, L., and Wang, Y. (2018). Current situation and suggestions of data organization on domestic scientific data sharing platforms-a study based on National Science and technology infrastructure platform. Libr. Dev. 10, 52–58.

Sindhu, D., and Sindhu, S. (2023). Biological databases and resources: their engineering and applications in synthetic biology. Int. J. Adv. Sci. Eng. 9, 3085–3098. doi: 10.29294/ijase.9.4.2023.3085-3098

Sizhu, W., Zanmei, L., Jiawei, C., and Qing, Q. (2018). Development of medical data repositories based on Re3data.Org. Chin. J. Med. Libr. Inf. Sci. 27, 20–31. doi: 10.3969/j.issn.1671-3982.2018.09.005

Sofi, I. A., Bhat, A., and Gulzar, R. (2024). Global status of dataset repositories at a glance: study based on OpenDOAR. Digit. Libr. Perspect. 40, 330–347. doi: 10.1108/DLP-11-2023-0094

Sofi, I. A., and Mir, A. A. (2023). Status of patent archives in Asian continent: a vivid picture from OpenDOAR. Glob. Knowl. Mem. Commun. doi: 10.1108/GKMC-07-2023-0241 (Epub ahead of print).

Standing Committee of the National People’s Congress (2020). Biosecurity law of the People’s Republic of China. Beijing: Standing Committee of the National People’s Congress.

State Council of the People’s Republic of China (2024). Regulation of the People’s Republic of China on the administration of human genetic resources. Beijing: State Council of the People’s Republic of China.

Swedlow, J. R., Kankaanpää, P., Sarkans, U., Goscinski, W., Galloway, G., Malacrida, L., et al. (2021). A global view of standards for open image data formats and repositories. Nat. Methods 18, 1440–1446. doi: 10.1038/s41592-021-01113-7

Tao, R., and Ye, J. (2023). Investigation and analysis of data resource development in university libraries under the open science environment. Libr. Tribut. 11, 75–83. doi: 10.3969/j.issn.1002-1167.2023.07.010

Tmava, A. M. (2023). Faculty perceptions of open access repositories: a qualitative analysis. New Rev. Acad. Librariansh. 29, 123–151. doi: 10.1080/13614533.2022.2082991

Wu, S. (2018). Research on data organization and utilization of open government data platform in Asian countries. Library 11, 80–84. doi: 10.3969/j.issn.1002-1558.2018.12.013

Zuo, S., Zhu, J., and Liang, Y. (2018). Role-transformation of university librarian driven by open research data. Libr. Dev. 12, 23–27.

Keywords: open access, scientific data, data management, medical open-access repositories, OpenDOAR, re3data

Citation: Song J, Li C and Chansanam W (2025) Construction of medical scientific data repositories in China: analysis of survey and recommendations. Front. Artif. Intell. 8:1544200. doi: 10.3389/frai.2025.1544200

Received: 10 April 2025; Accepted: 10 June 2025;
Published: 26 June 2025.

Edited by:

L. J. Muhammad, Bayero University Kano, Nigeria

Reviewed by:

Xiu-Xia Xing, Beijing University of Technology, China
Siquan Wang, Columbia University, United States

Copyright © 2025 Song, Li and Chansanam. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Chunqiu Li, lichunqiu@bnu.edu.cn; Wirapong Chansanam, wirach@kku.ac.th
