A Map of the Initiatives That Harmonize Patient Cohorts Across the World

1 Thematic Area for Frailty and Healthy Ageing of the Network of Biomedical Research Centers (CIBERFES), Instituto de Salud Carlos III, Madrid, Spain, 2 Centro de Investigación Biomédica en Red de Salud Mental, Instituto de Salud Carlos III, Madrid, Spain, 3 European Clinical Research Infrastructure Network (ECRIN-ERIC), Paris, France, 4 Parc Sanitari Sant Joan de Déu, Barcelona, Spain, 5 Biomedical Research Foundation, Hospital Universitario de Getafe, Madrid, Spain, 6 Geriatric Department, Hospital Universitario de Getafe, Madrid, Spain, 7 Department of Psychiatry, Universidad Autónoma de Madrid, Madrid, Spain, 8 Instituto de Investigación Sanitaria Princesa, Hospital Universitario de La Princesa, Madrid, Spain


INTRODUCTION
Integrating cohort studies makes it possible to exploit information that has already been collected, increasing the sample size available to study uncommon exposures, rare diseases, weak associations, or very restricted populations (personalized medicine). It also allows standardized analyses to be carried out and avoids the publication bias that affects analyses of published data (1)(2)(3)(4)(5). Nevertheless, the growing effort devoted to conducting cohort studies across the world in recent decades has not been matched by an effort to make them accessible to the scientific community and to harmonize their data. This limitation moved the European Commission to fund the SYNergies for Cohorts in Health: integrating the ROle of all Stakeholders (SYNCHROS) coordination and support action, endowed with almost €2 million from 2019 to 2021. It aims "to establish a sustainable European strategy for the development of the next generation of integrated cohorts, thereby contributing to an international strategic agenda for enhanced coordination of cohorts globally, in order to address the practical, ethical, legal, and methodological challenge of optimizing the exploitation of current and future cohort data, toward the development of stratified and personalized medicine as well as facilitating health policy." To achieve its objectives, the first activity proposed in SYNCHROS was to map the population, patient, and clinical trial cohort integration landscape. This would give the project a first look at the challenges faced and the solutions tried by different groups and, more importantly, would provide a list of principal investigators of these initiatives who could be contacted during the development of the common strategy. This study reports the results of the mapping of the initiatives that integrate patient cohorts. The mapping of population cohorts will be reported elsewhere.
The aim of the study was to obtain a non-exhaustive, but representative, list of these initiatives carried out in recent times in the world. To our knowledge, there is no other repository of integration initiatives of patient cohorts. Although excellent single cohort repositories exist, like the Maelstrom catalog, repositories of initiatives that integrate several patient cohorts could not be found.
This mapping will provide researchers with a useful tool to find initiatives in their areas of interest with which they can share or analyze harmonized data.

METHODS
The initiatives included in the mapping were obtained from three different sources:
1. Systematic searches, carried out in MEDLINE and the Maelstrom catalog.
2. Suggestions of potential initiatives to be included in the mapping provided by partners of the SYNCHROS consortium.
3. References and links provided by the initiatives detected in the two previous sources.
The inclusion criteria were as follows: a) initiatives that integrated patient, clinical, or disease cohorts; b) individual patient data meta-analyses and mega-analyses; and c) at least one cohort included in the initiative having information about the sample at two or more time points (at least two waves).
The exclusion criteria were as follows: a) initiatives that only integrate population cohorts or clinical trials, without including patient cohorts; b) initiatives published before the year 2000; and c) initiatives that did not provide information in English.

MEDLINE Search
The process started with searches restricted to papers published in English from 2000 to 2019 using the terms selected by consensus among the SYNCHROS partners. Those terms which obtained fewer than 500 hits were retained, and the abstracts of the hits were reviewed to find new terms that were used in subsequent searches. In some cases, the term "cohort" was added to these searches to limit the number of hits.
The final search strategy used is given as follows: (cohort OR "prospective study" OR "longitudinal study" OR "individual meta-analysis"[All Fields] OR "individual participant data meta-analysis"[All Fields] OR "individual patient data metaanalysis"[All Fields] OR "individual meta analysis"[All Fields] OR "individual participant data meta analysis"[All Fields] OR "individual patient data meta analysis"[All Fields] OR "meta analysis using individual"[All Fields] OR "meta-analysis using individual"[All Fields] OR "meta analysis of individual"[All Fields] OR "meta-analysis of individual"[All Fields] OR "megaanalysis"[All Fields] OR "mega analysis"[All Fields]) AND ("harmonization study" OR "integration study" OR "integration initiative" OR "integrated study" OR "merged cohort" OR "data pooling" OR "pooled sample" OR "combined data" OR "combining data" OR "harmonized data" OR "harmonised data" OR "harmonizing data" OR "data harmonization" OR "data harmonisation" OR "data sharing" OR "common database" OR "multiple cohorts" OR "multiple longitudinal studies" OR "international consortium" OR "collaborative effort").

"[...] It also provides information about harmonized data generated by these research networks." We looked for initiatives included in the "Networks" section of the catalog.
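As an illustration, the MEDLINE strategy quoted above can be assembled programmatically from its two term groups, which makes its boolean structure (study-design terms AND harmonization terms) explicit. This is a minimal sketch, not the tooling used in the study; the helper names `quoted` and `or_group` are hypothetical, and the language and date restrictions described earlier are not encoded here.

```python
def quoted(term, field=None):
    """Wrap a term in double quotes, optionally adding a field tag such as [All Fields]."""
    phrase = f'"{term}"'
    return f"{phrase}[{field}]" if field else phrase


def or_group(terms):
    """Join already-formatted terms with OR inside one pair of parentheses."""
    return "(" + " OR ".join(terms) + ")"


# Study-design terms, copied from the strategy in the text.
design_terms = ["cohort", quoted("prospective study"), quoted("longitudinal study")]
design_terms += [
    quoted(term, "All Fields")
    for term in [
        "individual meta-analysis",
        "individual participant data meta-analysis",
        "individual patient data metaanalysis",
        "individual meta analysis",
        "individual participant data meta analysis",
        "individual patient data meta analysis",
        "meta analysis using individual",
        "meta-analysis using individual",
        "meta analysis of individual",
        "meta-analysis of individual",
        "megaanalysis",
        "mega analysis",
    ]
]

# Harmonization and integration terms, also copied from the text.
harmonization_terms = [
    quoted(term)
    for term in [
        "harmonization study", "integration study", "integration initiative",
        "integrated study", "merged cohort", "data pooling", "pooled sample",
        "combined data", "combining data", "harmonized data", "harmonised data",
        "harmonizing data", "data harmonization", "data harmonisation",
        "data sharing", "common database", "multiple cohorts",
        "multiple longitudinal studies", "international consortium",
        "collaborative effort",
    ]
]

# A record must match at least one term from each group.
query = or_group(design_terms) + " AND " + or_group(harmonization_terms)
print(query)
```

Keeping the two groups as separate lists makes it easy to add or drop a term and regenerate the full string without editing the parentheses by hand.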

Selection of Initiatives
Initiatives that were obtained from the systematic searches and provided by the partners were evaluated against the inclusion and exclusion criteria by two different investigators. In case of a disagreement, a third reviewer was consulted.
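The selection rule described above (two independent evaluations, with a third reviewer consulted on disagreement) can be sketched as follows. The function name and signature are hypothetical, and the sketch assumes that the third reviewer's judgment settles the disagreement, which the text does not state explicitly.

```python
from typing import Callable


def final_decision(reviewer_1: bool, reviewer_2: bool,
                   consult_third: Callable[[], bool]) -> bool:
    """Return True if the initiative is included, False if it is excluded.

    reviewer_1 and reviewer_2 are the two independent include/exclude decisions.
    consult_third is only called when they disagree (assumed tie-breaker).
    """
    if reviewer_1 == reviewer_2:
        return reviewer_1
    return consult_third()


# Agreement: the third reviewer is never consulted.
print(final_decision(True, True, consult_third=lambda: False))
# Disagreement: the third reviewer's decision is used.
print(final_decision(True, False, consult_third=lambda: True))
```

Passing the third reviewer as a callable reflects that the consultation only takes place when it is needed.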

Extraction of Information
The following information was extracted from each initiative: name of the initiative, principal investigator, partners, name of the institution responsible for the initiative, funding resources, contact person, information source, whether the research team is currently active, main objectives, criteria for the cohorts to be included in the initiative, type of harmonization (prospective/retrospective), number of cohorts included in the initiative (the total number and the number of harmonized cohorts), whether more cohorts are foreseen to be harmonized, number of participants (the total number and the number of participants with harmonized data), age range of the sample, threats to representativeness of the sample, maximum number of variables that have been harmonized, including those where harmonization was not possible for all the cohorts, setting of the harmonized cohorts (local-regional/national/international, including country of origin of the cohorts), and a brief description of the population considered by the initiative.
All this information was retrieved from the webpage and/or the scientific article that presented the initiative. Missing information was requested from the principal investigators of the projects, who were contacted initially by email and, if there was no answer, by phone call or by post.
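One possible way to represent the extracted items is a record with one optional field per item, so that anything still missing can be listed before contacting the principal investigator. This is only a sketch: the class, the field names, and the example values are hypothetical and do not reproduce the actual SYNCHROS repository schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class InitiativeRecord:
    """One initiative in the mapping; None marks information not yet retrieved."""
    name: str
    principal_investigator: Optional[str] = None
    partners: Optional[List[str]] = None
    responsible_institution: Optional[str] = None
    funding: Optional[str] = None
    contact_person: Optional[str] = None
    information_source: Optional[str] = None
    team_currently_active: Optional[bool] = None
    main_objectives: Optional[str] = None
    cohort_inclusion_criteria: Optional[str] = None
    harmonization_type: Optional[str] = None  # "prospective" or "retrospective"
    cohorts_total: Optional[int] = None
    cohorts_harmonized: Optional[int] = None
    more_cohorts_foreseen: Optional[bool] = None
    participants_total: Optional[int] = None
    participants_harmonized: Optional[int] = None
    age_range: Optional[str] = None
    threats_to_representativeness: Optional[str] = None
    max_variables_harmonized: Optional[int] = None
    setting: Optional[str] = None  # local-regional / national / international
    population_description: Optional[str] = None


def missing_fields(record: InitiativeRecord) -> List[str]:
    """Names of the items that would have to be requested from the investigators."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Hypothetical, partially filled record (not a real initiative).
example = InitiativeRecord(
    name="Hypothetical initiative",
    harmonization_type="retrospective",
    cohorts_total=5,
)
print(missing_fields(example))
```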

ANALYSIS
Results of the identification process of the initiatives are presented in Figure 1.
Partners of the SYNCHROS project provided 39 initiatives. Of those, 28 were excluded, mainly because there was no data harmonization or because eligibility could not be ascertained owing to a lack of response from the principal investigators. The remaining 11 initiatives were selected. A search of the references and links provided by these initiatives yielded two additional ones.
In the MEDLINE search, out of 843 hits obtained, 677 were excluded after reading their title and abstract. The remaining 166 articles were read and, of those, 140 were excluded. The main reasons for exclusion were that initiatives dealt only with population cohorts, that they had already been submitted by partners or already presented in another reference, or that the integration was only cross-sectional. In the end, 26 initiatives were selected. The reference lists of these initiatives yielded five additional ones.
The search in the Maelstrom catalog only provided initiatives that harmonized population cohorts.
Overall, 44 initiatives were retrieved. They are presented in Table 1, which shows a selection of the most relevant information obtained from each initiative. Complete information can be found in the repository of the SYNCHROS project (https://www.synchros.es). The initiatives are ordered by the type of disease covered (starting with those that consider several diseases) and then alphabetically. For almost half of the 44 initiatives (20), no further information could be obtained from the principal investigators.
Eight initiatives (BIOMAP, CINECA, EHDEN, ESCAPE-NET, HarmonicSS, HARMONY, Lifebrain, and ReCoDID) have recently started adding cohorts; 21 are led by active research teams; and 12 are adding, or considering adding, cohorts now. Nevertheless, a substantial amount of information on the activity status of the initiatives is missing.
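The identification counts reported above can be cross-checked with simple arithmetic. The snippet below uses only the figures stated in the text; the variable names are illustrative.

```python
# Initiatives suggested by SYNCHROS partners.
partner_suggested = 39
partner_excluded = 28
partner_selected = partner_suggested - partner_excluded
partner_followup = 2  # found through references and links of the selected ones

# MEDLINE search.
medline_hits = 843
excluded_on_title_abstract = 677
articles_read = medline_hits - excluded_on_title_abstract
excluded_after_reading = 140
medline_selected = articles_read - excluded_after_reading
medline_followup = 5  # found in the reference lists of the selected ones

# The Maelstrom catalog only returned population-cohort initiatives.
maelstrom_selected = 0

total = (partner_selected + partner_followup
         + medline_selected + medline_followup
         + maelstrom_selected)

print(partner_selected, articles_read, medline_selected, total)
```

The intermediate values agree with the 11 partner-suggested initiatives, the 166 articles read, and the 26 initiatives selected from MEDLINE, and they add up to the 44 initiatives retrieved overall.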
In the selected initiatives, the most represented group of diseases is cancer (10 initiatives), followed by infectious diseases (8 initiatives, of which 5 focus on HIV) and cardiovascular disease (4 initiatives). Five initiatives have harmonized data from more than one type of disease. Some other diseases and conditions that produce a high burden in high-income countries (6) are represented (dementia, osteoarthritis), but others on that list (unipolar depressive disorders, alcohol use disorders, hearing loss, chronic obstructive pulmonary disease, diabetes mellitus, road traffic accidents) are missing, as are poorly defined conditions with a well-defined impact on life expectancy and quality of life (like back pain or functional deterioration). There is one initiative about a specific rare disease (Sjögren syndrome).
A sizable number of initiatives have harmonized other types of cohorts in addition to patient cohorts. The After Breast Cancer Pooling Project, BIOMAP, CLL-IPI, HARMONY, and the initiatives on obsessive-compulsive disorder and pulmonary embolism have harmonized at least one clinical trial cohort. CINECA, ESCAPE-NET, Lifebrain, and the project "Seasonal plasticity of cognition" have also harmonized population cohorts. BiomarCaRE and the National Cancer Institute Cohort Consortium have harmonized all three types of cohorts: patient, population, and clinical trial cohorts.
Most of them (33) have an international scope, compared to seven national initiatives and one regional/local initiative. Two initiatives report that they include cohorts from across the world and eight initiatives incorporate cohorts from both high-income and low- and middle-income countries (LMIC); 30 (75%) initiatives only include cohorts from high-income countries, and none harmonize data from LMIC alone. Most initiatives are partnered with universities, hospitals, and research institutes. Governmental institutions take part in a few of them (9). Patient associations and pharmaceutical companies are rarely present as partners. The number of partners ranges between 2 and more than 100, with a median of 12. Three quarters comprise 20 partners or fewer.

Table 1 (excerpt, MIRACUM): Disease cohorts. Seven states of Germany. The spotlight is here on the data integration centers that will be embedded in the hospital IT-infrastructure and will facilitate the collection and exchange of data within the consortia university hospitals. Furthermore, we will elaborate a programme for strengthening medical informatics by extending the academic offer, including new professorships in the field of medical informatics, a novel, innovative master programme and personnel training. The MIRACUM partners have agreed to share data, based on interoperable data integration centers, develop common and interoperable tools and services, realize the power of such data and tools in innovative IT solutions, which shall enhance patient-centered collaborative research as well as clinical care processes, and finally to strengthen biomedical informatics in research, teaching and continued education.
Most initiatives have been or are funded by American (12) or European (10) institutions. Canadian funding comes third (4). The vast majority have received public funding alone (22). Five have received combined funding from public institutions and non-profit organizations. Private funding was provided in isolation to one initiative (RESPOND), combined with public funding to another one (EHDEN), and combined with non-profit funding to a third one (Tumor necrosis factor α antagonist use).
Their objectives may be classified into four general categories (some initiatives share more than one): determining the prognosis of subgroups of patients (14), providing a repository of patients (11), establishing the efficacy (6) or safety (4) of treatments, and exploring risk factors and biomarkers of diseases (10).
The median number of cohorts included in each initiative is 5, ranging from 1 (in an initiative that also harmonizes population cohorts) to 58; three quarters include 17 cohorts or fewer. The number of individuals included varies widely, from 57 to 310 million (Sentinel initiative). The median is 6,841. Eight out of 37 (21.6%) initiatives have harmonized fewer than 1,000 patients and the same proportion have harmonized 100,000 patients or more. Twenty-six have harmonized all or almost all the cohorts incorporated into the initiative, two (EHDEN and CINECA) are still in the process of harmonizing their cohorts, and another two (CNODES and COSMIC) harmonize data on a project-by-project basis.
Eight initiatives included patients of all ages, eight included only adult patients, three included only children, and two included exclusively older people.
Among the initiatives that have declared the number of variables in their harmonized database, that number ranges from 10 to more than 1,000 (median 24), with two out of 15 (13.3%) including more than 1,000 variables.
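The percentages quoted above for participant counts and harmonized variables follow directly from the stated numerators and denominators, as this short check illustrates. The helper `percent` is a hypothetical name introduced for the example.

```python
def percent(part: int, whole: int, digits: int = 1) -> float:
    """Share of `part` in `whole`, as a percentage rounded to `digits` decimals."""
    return round(100 * part / whole, digits)


# Initiatives with fewer than 1,000 harmonized patients: 8 of the 37 reporting.
share_small = percent(8, 37)

# Initiatives with more than 1,000 harmonized variables: 2 of the 15 reporting.
share_many_variables = percent(2, 15)

print(share_small, share_many_variables)
```

The denominators (37 and 15) are the numbers of initiatives that reported the relevant figure, not the 44 initiatives retrieved overall.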
Four initiatives harmonized administrative databases. Thirty-three were retrospective vs. four prospective. The great majority do not report major threats to the representativeness of their samples.

DATA AVAILABILITY STATEMENT
The dataset generated by this study can be found on the webpage of the SYNCHROS project: https://www.synchros.es.

AUTHOR CONTRIBUTIONS
ÁR-L performed the database searches, extracted the information from the initiatives, and drafted the article. ÁR-L, LR-U, and CK evaluated the initiatives against inclusion and exclusion criteria. All authors made substantial contributions to the conception and design of the paper, analysis and interpretation of data, reviewed the manuscript, and read and approved its final version.

FUNDING
This work was supported by the European Union's H2020 Programme under grant agreement number 825884.