Phenotype-Agnostic Molecular Subtyping of Neurodegenerative Disorders: The Cincinnati Cohort Biomarker Program (CCBP)

Ongoing biomarker development programs have been designed to identify serologic or imaging signatures of clinico-pathologic entities, assuming distinct biological boundaries between them. Identified putative biomarkers have exhibited large variability and inconsistency between cohorts, and remain inadequate for selecting suitable recipients for potential disease-modifying interventions. We launched the Cincinnati Cohort Biomarker Program (CCBP) as a population-based, phenotype-agnostic longitudinal study. While patients affected by a wide range of neurodegenerative disorders will be deeply phenotyped using clinical, imaging, and mobile health technologies, analyses will not be anchored on phenotypic clusters but on bioassays of to-be-repurposed medications as well as on genomics, transcriptomics, proteomics, metabolomics, epigenomics, microbiomics, and pharmacogenomics analyses blinded to phenotypic data. Unique features of this cohort study include (1) a reverse biology-to-phenotype direction of biomarker development in which clinical, imaging, and mobile health technologies are subordinate to biological signals of interest; (2) hypothesis-free, causal and data-driven analyses; (3) inclusive recruitment of patients with neurodegenerative disorders beyond clinical criteria-meeting patients with Parkinson’s and Alzheimer’s diseases; and (4) a large number of longitudinally followed participants. The parallel development of serum bioassays will be aimed at linking biologically suitable subjects to already available drugs with repurposing potential in future proof-of-concept adaptive clinical trials.
Although many challenges are anticipated, including the unclear pathogenic relevance of identifiable biological signals and the possibility that some signals of importance may not yet be measurable with current technologies, this cohort study abandons the anchoring role of clinico-pathologic criteria in favor of biomarker-driven disease subtyping to facilitate future biosubtype-specific disease-modifying therapeutic efforts.


INTRODUCTION
We have long assumed that the neuropathological findings of aggregated α-synuclein (α-syn) into Lewy bodies and Lewy neurites define and cause Parkinson's disease (PD), that aggregations of amyloid (Aβ) into plaques and tau into neurofibrillary tangles define and cause Alzheimer's disease (AD), and that the distribution of these proteins explains their clinical heterogeneity (Espay et al., 2020). These pathological findings are, however, ubiquitous and do not correlate with agnostic postmortem analysis: α-syn, Aβ, and tau aggregation are frequent "co-pathologies" in AD and PD (Irwin et al., 2017; Boyle et al., 2018; Karanth et al., 2020) and can be found even in supersurvivors without dementia or parkinsonism (Head et al., 2009; Wallace et al., 2019). The overlapping pathological features may instead reflect clinical characteristics shared by PD and AD (Scarmeas et al., 2004, 2005; Kehagia et al., 2010). Indirect evidence from human studies suggests that protein aggregation in sporadic cases may in fact be protective and not capable of discriminating clinical disease subtypes (Espay et al., 2019). As a result, it has become imperative to transition from the century-old, clinico-pathological convergent model on which diseases are classified to a systems biology framework, in which genotype and biomolecular abnormalities, rather than clinical phenotypes alone, define nosology and drive therapeutics (Espay et al., 2017).
Given these premises, we have recently launched, at the University of Cincinnati's James J. and Joan A. Gardner Center for Parkinson's Disease and Movement Disorders, a phenotype-agnostic biomarker-discovery program aimed at characterizing biological subtypes of neurodegenerative disorders, particularly those best suited for targeting with therapies available for repurposing. This cohort study has unique features compared to ongoing [e.g., Parkinson's Progression Marker Initiative (PPMI)] or newly assembled cohorts [e.g., Luxembourg study, Personalized Parkinson Project (PPP)] (Table 1) (Kenneth et al., 2011; Hipp et al., 2018; Bloem et al., 2019). The main novelty of our cohort study is a design based on the assumption that we do not know which biomarkers have clinical relevance at the individual level. Accordingly, the recruitment will be deliberately inclusive of different neurodegenerative phenotypes, with the expectation that biological subtypes may not align with clinicopathological subtypes.
Here we summarize the methodological aspects of this cohort study, including phenotypic measures and analytic approach, and discuss anticipated challenges.

THE CINCINNATI COHORT BIOMARKER PROGRAM
This is an omics-based, longitudinal, structural causal model, non-phenotype-driven population-based study. We will enroll a total of 4,000 patients with neurodegenerative diseases and 1,000 healthy age-matched controls with yearly follow-up for at least 5 years, extended to 10 and beyond contingent on additional funding. At each visit, patients will undergo a similar clinical, paraclinical, and biospecimen collection. Pragmatic approaches such as streamlining data gathering (prioritizing biospecimen collection) will be allowed if important to retain subjects and minimize dropouts. The exploratory nature of this study rendered it unsuitable for funding considerations by agencies giving continued preference for hypothesis-based studies based on the prevailing clinico-pathologic model of neurodegenerative diseases, which remains the gold standard for nosology, biomarker validation, and disease modification. As a result, this study was funded through philanthropy, with major support by the James J. and Joan A. Gardner Family Foundation. The main aim is to identify biological outliers defining molecular disease subtypes, with a focus on those suitable for targeting with already available therapies (repurposing) in future built-in adaptive clinical trials.

Inclusion and Exclusion Criteria
Given the inclusive nature of the study, we are recruiting subjects older than 18 years of age exhibiting a range of parkinsonisms representing PD and PD-like disorders, such as progressive supranuclear palsy, multiple system atrophy, and corticobasal syndrome, as well as AD and AD-like disorders, such as frontotemporal dementias, normal pressure hydrocephalus, and vascular dementia. The enrollment of young subjects could help in the identification of early biomarkers in specific conditions (e.g., genetic). However, our enrollment will be initially focused on the elderly population seeking care at the University of Cincinnati Gardner Center, which receives referrals from a wide range of Cincinnati-area neurologists. The Center evaluates a representative population of neurodegenerative disorders seeking care in the Cincinnati area. We will also recruit age- and sex-matched healthy controls. Controls who manifest signs of neurological disease during the study assessments will be reclassified as cases.
Although "Cases" and "Controls" are determined by virtue of the presence or absence of neurological symptoms, respectively, our inclusion criteria for neurodegenerative disorders are otherwise deliberately inclusive, based on the premise that we do not know a priori in which clinical phenotypes the first targetable molecular subtypes will be identified. As noted in Section "Data Analysis and Management," none of the proposed analyses will use the classification of participants into cases or controls, nor any phenotypic subtype created therein, as independent variables. Nevertheless, all participants will be referred by a neurologist to make sure they present specific signs of parkinsonism or dementia. In case of doubt, the Principal Investigator will decide if the subjects fit inclusion criteria. Only subjects with recognized causes or contributors for their motor or cognitive manifestations (e.g., vitamin B12 deficiency) and those requiring aggressive medical management will be excluded.

Ethics, Collection, and Storage of Biological Samples
The study protocol was approved by the Institutional Review Board of the University of Cincinnati (protocol number 2020-0039). Informed consent is obtained from all subjects with the conduct of the study fully adhering to the principles of the Declaration of Helsinki. Biospecimens will be collected from subjects and healthy controls, including peripheral blood, urine, and stool.
Plasma is being isolated from blood collected in EDTA vacutainers and aliquoted for future use, including isolation of plasma proteins and extracellular vesicles (EVs). As all cells secrete EVs, they are abundant in all bodily fluids and have been shown to carry diverse species of nucleic acids, proteins, and lipids (van Niel et al., 2018). Plasma will be subjected to size exclusion chromatography (70 nm qEV original, Izon Science) to separate EVs from soluble proteins. The EVs present in each sample will be quantified using nanoparticle tracking analysis (NanoSight NS300, Malvern Panalytical), and their surface proteins characterized by a flow cytometry method optimized for vesicle analysis (Wiklander et al., 2018). Following isolation, we will extract RNA and sequence the mRNA present within these vesicles using methods developed for single-cell RNA sequencing. In order to amplify the most informative signals in total EV mRNA, we will utilize known neuron, astrocyte, and oligodendrocyte cell surface markers using immunoprecipitation (Miltenyi Biotec).
A urine sample is being collected in a sterile kit during in-clinic visits. Stool samples are aliquoted into preservative containers (OMNIgene.Gut, DNA Genotek, Corp.) immediately after passage. Samples are transferred to −80 °C storage within 72 h. DNA is subsequently extracted from 0.25 g of stool using the PowerFecal Pro extraction kit (Qiagen, Inc.). DNA sequencing libraries will be constructed (Nextera XT, Illumina, Corp.) and pooled for sequencing on an Illumina sequencing machine (NextSeq500, Illumina, Corp.). Sequencing reads will be aligned to a microbial genome database using Kraken (Wood and Salzberg, 2014) to determine the assemblage of microorganisms present in each fecal sample (Quigley, 2017). Biospecimens are processed and aliquoted for downstream use consistent with the strategy of future use/sharing of the samples. All sample metadata are tracked via the DT Biobank's LIMS system to catalog the chain of custody and processing details. Stool samples are stored at −80 °C in the Microbial Genomics and Metagenomics Laboratory at Cincinnati Children's Hospital. Participants are also asked to participate in an optional brain donation program.
Genomics, transcriptomics, proteomics, metabolomics, epigenomics, and microbiomics will be processed from our biological samples. We will use validated methods for the analysis of the samples to ensure feasibility and reproducibility of the study in future independent cohorts. The specific methods will be selected at a later time; this will give us greater flexibility in the choice of assays as the analytic technologies become less expensive. Also, we may add other '-omics' (e.g., lipidomics, etc.) in the future.

Gait and Postural Stability Outcome Measures Obtained Using Mobile Health Technologies
Gait and postural stability are measured in the following conditions (Axivity, Ltd., Newcastle upon Tyne, United Kingdom): (1) Two-minute Walk: Subjects are asked to walk a straight path for 2 min. Parameters include: stride length, gait speed, stride width, and stride asymmetry; (2) Instrumented Timed Up and Go (iTUG): Subjects are instructed to sit comfortably in an armless chair. At the "go" signal, they rise from the chair without using support, walk 3 m, turn 180° and walk back; (3) Postural Sway: Subjects are asked to stand with their hands at their sides and feet together spaced by a wooden wedge on a firm surface; (4) 360° Turn: Subjects are instructed to turn in a complete circle (360°), first to the left, and then to the right. Other measures include: (1) Tapping test: Subjects are asked to tap on the smartphone screen for 30 s; (2) Rest and postural tremor tests: Subjects hold their arm out straight for 30 s, and subsequently rest their arms in the lap while counting down from 100; and (3) Voice and speech tests: Subjects are asked to say "aaaah" at a comfortable pitch and loudness, and subsequently recite a short, phonetically-balanced passage, into an Android-based smartphone microphone.
A 3-Tesla brain MRI will be obtained within 6 months of baseline. A comprehensive protocol including 3D T1 fast spoiled gradient echo (FSPGR), 3D T2-weighted, 3D T2-FLAIR, susceptibility weighted imaging (SWI), resting state functional MRI (fMRI), diffusion tensor imaging (DTI), and 3D arterial spin labelling (ASL) will be performed. The 3D T1 FSPGR sequence provides volumetric analysis of regional atrophy. T2 and FLAIR sequences will be analyzed for chronic small vessel disease, including white matter disease, lacunar infarcts, and dilated perivascular spaces. SWI will provide information on iron deposition in the deep nuclei and microbleeds. Resting state fMRI will be analyzed for changes in functional connectivity. DTI tractography analysis will provide information on white matter integrity.

At Home Sensor-Based Assessment
Participants are provided with smartwatches (Sony Corporation, Tokyo, Japan) for at-home 24-h continuous collection of sensor data such as accelerometry and wrist-based photoplethysmography, from which estimates of multiple behavioral parameters, including sleep behavior quality, heart rate variability and step count will be obtained.

Data Storage and Process
All biological samples are stored in a dedicated Biobank at Cincinnati Children's Hospital Medical Center (CCHMC) and the Discover Together Biobank using established protocols for processing, storage, and future analysis. The database was designed to account for the longitudinal study design, linkage to multi-omics measurements and formats, and capacity to store big data. The stored data are labeled according to processed or unprocessed data, methods, and type of omics data. All the samples are coded using an identifier reflecting sites and subject number. All the samples are preprocessed for background correction, quality control and standard deviation of the intensity ratios. Prior to conducting analyses, normalization using LOWESS or quantiles, scaling with baseline correction, outlier removal, and K-nearest neighbor imputation of missing values (where less than 20% of data are missing) will be performed. BioMart will be used for the database and Bioconductor for data processing and analyses, along with specific software for sequence, network, read, mining, and pathway analyses according to their specific purposes. We plan to create an online platform where de-identified and analyzed data can be shared. To protect confidentiality and prevent bias, all imaging data will be de-identified and transmitted with unique study identification numbers to the imaging core lab, utilizing a HIPAA-compliant secure platform. Imaging readings will be recorded on electronic case report forms and integrated seamlessly with the clinical data.
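As an illustration of one of the normalization options named above, quantile normalization can be sketched in a few lines. This is only a minimal sketch in Python under stated assumptions: the study itself names Bioconductor for processing, the function name `quantile_normalize` is hypothetical, every sample is assumed to contain the same markers with no missing values, and ties are broken by original order rather than averaged.

```python
from statistics import mean


def quantile_normalize(samples):
    """Quantile-normalize samples (one list of marker intensities per sample).

    Assumption: all samples hold the same markers in the same order and
    contain no missing values. Ties are broken by original position.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must contain the same number of markers")

    # For each sample, marker indices ordered from smallest to largest value.
    orders = [sorted(range(n), key=lambda i, s=s: s[i]) for s in samples]

    # Reference distribution: mean of the r-th smallest value across samples.
    reference = [
        mean(s[order[r]] for s, order in zip(samples, orders)) for r in range(n)
    ]

    # Replace each value by the reference value that shares its rank.
    normalized = []
    for order in orders:
        out = [0.0] * n
        for rank, index in enumerate(order):
            out[index] = reference[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    raw = [[5, 2, 3, 4], [4, 1, 4, 2], [3, 4, 6, 8]]
    for row in quantile_normalize(raw):
        print(row)
```

After normalization every sample shares the same set of values, so differences between samples reflect rank order rather than overall intensity.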

Data Quality Management
A pre-analytical standard operating procedure (SOP) has been developed. The multiple steps included are aimed at minimizing biases in forming and analyzing substudy cohorts. The following elements of the SOP are highlighted: (a) subjects are selected only by neurologists; (b) controls are selected from the same population and time period as cases; (c) a substantially large sample size will permit estimating rare molecular subtypes; (d) pragmatic assessments are used to minimize dropouts and maximize adherence to protocol over a long observational period. Finally, our interdisciplinary team is meeting regularly to review the quality controls of data collection, SOP adherence, data-gathering issues, and concerns related to ethics, data storage, data processing, and management.

DATA ANALYSIS AND MANAGEMENT

Aim
The main aim is to identify biologically unique subgroups, with emphasis on those suitable for repurposing of already available therapies using proof-of-concept adaptive clinical trials.

Sample Size and Statistical Power
The sample size of this study was computed using several simulations under various conditions. We used the Qiu and Joe formula (10 × d × k) (Qiu and Joe, 2009), where d is the number of variables included for clustering and k is the number of clusters, and the formula 70 × d (Dolnicar et al., 2014). Applying the latter formula to estimate moderate adjusted Rand index values yields a sample size of 3500 with 50 biological markers (70 × 50). This sample size is powered for detecting at least 10 subtypes with 40 biological markers using the Qiu and Joe formula. Furthermore, a total of 800 healthy controls are required to form a comparative group based on 1:2 case-control design for detecting small Cohen effect sizes (D = 0.2) between groups with more than 90% power and 5% level of significance. The sample size suggested in this study is more than sufficient to detect small (odds ratio 1.2 or standardized mean difference 0.20) to moderate (OR = 1.5 or SMD = 0.50) expected associations between individual subtypes and clinical outcomes, depending on the types of outcomes and subtypes, with more than 80% power and 5% level of significance and covariates accounting for 10 to 50% of variance in a given outcome using logistic regression analysis. This sample size also ensures adequate power for detecting small to moderate Cohen's effect sizes (SMD 0.20-0.50) using two-sided unpaired t-tests. The sample size estimation was also found to be sufficient using a data-driven sample size algorithm (DSD) (Billoir et al., 2016) and, for high-dimensionality data settings, the MV power algorithm (Guo et al., 2010). We note here that these formulas depend upon assumptions (such as Gaussianity) which may not hold for these data or for the kinds of clustering analysis we plan to use in this study; they should therefore be considered only a reasonable justification of the sample size.
Although a sample size of 3500 patients and 800 healthy controls was estimated as sufficient, we plan to enroll 4000 cases and 1000 controls in order to account for potential dropouts. The sample size will most likely need to increase to identify heretofore unanticipated molecular subtypes.
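The two rules of thumb quoted above are simple enough to check by hand; the short Python sketch below only restates them as functions (the function names are hypothetical, not from the cited papers). With the marker counts given in the text, 70 × d gives 3500 for 50 markers, while 10 × d × k gives 4000 for 40 markers and 10 subtypes, which coincides with the planned enrollment of cases.

```python
def qiu_joe_sample_size(d, k):
    """Rule of thumb 10 x d x k: d clustering variables, k clusters."""
    return 10 * d * k


def dolnicar_sample_size(d):
    """Rule of thumb 70 x d: d clustering variables."""
    return 70 * d


if __name__ == "__main__":
    print(dolnicar_sample_size(50))     # 3500
    print(qiu_joe_sample_size(40, 10))  # 4000
```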

Exploratory Data Analysis
All potential biomarkers will be compared between cases and controls using a bootstrap test to screen for significant biomarkers from each omics platform. Thereafter, to reduce the dimensionality of the multi-omic data, we will apply Bayesian exponential family principal components analysis (BE-PCA) (Shakir et al., 2008), a generalization of principal component analysis (PCA), which is a widely used method of statistical analysis and simplification of data sets (Wold et al., 1987). We avoid the use of simple PCA because some of the variables we measure in this project are likely to be non-Gaussian.
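The text does not specify which bootstrap test will be used for screening. As one possible reading, the sketch below implements a pooled-resampling bootstrap test of a difference in means for a single marker; the function name, the choice of test statistic, and the number of resamples are assumptions made for illustration, and a real screen across many markers would also need a multiple-testing correction.

```python
import random
from statistics import mean


def bootstrap_mean_diff_test(cases, controls, n_boot=2000, seed=0):
    """Two-sided bootstrap p-value for a difference in means.

    Under the null hypothesis both groups share one distribution, so both
    groups are resampled with replacement from the pooled values.
    """
    rng = random.Random(seed)
    observed = abs(mean(cases) - mean(controls))
    pooled = list(cases) + list(controls)
    n_cases = len(cases)

    extreme = 0
    for _ in range(n_boot):
        resample = rng.choices(pooled, k=len(pooled))
        diff = abs(mean(resample[:n_cases]) - mean(resample[n_cases:]))
        if diff >= observed:
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_boot + 1)


if __name__ == "__main__":
    # Hypothetical marker levels, for illustration only.
    cases = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4, 5.1, 5.2]
    controls = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1]
    print(bootstrap_mean_diff_test(cases, controls))
```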

Data-Driven Causal Inference
The analytical approach of this study will be based on the latest techniques from statistics and data science (Little, 2019), revolving around causal modeling and inference of the interaction between all the variables captured in the study across genomics, transcriptomics, proteomics, metabolomics, epigenomics, microbiomics, and pharmacogenomics data (Figure 1). Starting with a simple causal model built using existing datasets, the model can be used for various purposes, including simulating randomized trials using causal inference, and acting as a guide to designing pragmatic trials to collect appropriate data to "fill in" missing information in the causal model. Results of these simulated trials will then further inform the modeling and statistical analysis choices, with the end goal of deriving a simple, mechanistic model that is both explanatory and predictive, which can be used to extract "subtypes" most likely to respond to therapies (Pearl, 2010).
The justification for the use of these techniques is that they aim to minimize the misleading effects of reliance on speculative and unproven theories of disease, behavior and symptom mechanisms while avoiding the problems of purely data-driven modeling, which can be easily confounded by unmeasured variables, poor-quality data or mischaracterized measurement processes.
These advances in causal inferential methods rely on a synthesis of two analytical techniques (Little and Badawy, 2019): (1) Data-driven approaches. These approaches often have high predictive accuracy, and can capture high-dimensional, non-Gaussian, non-linear relationships. Machine learning is one example. The primary drawback is their limited explanatory power and high sensitivity to irrelevant confounding effects, which inevitably creep into measurements. (2) Causal modeling approaches. A set of probabilistic relationships is drawn up to describe the mechanistic processes explaining the data. Because these models traditionally require fully-specified probabilistic relationships between variables, they often do not make quantitatively accurate predictions, but they do allow realistic, causal interactions among biological, behavioral and symptom expression processes to be built into the analysis. This causal structure is essential in this study given the sheer number of variables and the resulting complexity of interaction between them.
We propose to use a synthesis of these two approaches, which can be described as data-driven causal inference. This aims to exploit the advantages of the high predictive accuracy of data-driven approaches and the realism of causal modeling. It respects the causal structure of the real world captured by the measurements, and is verified against the high-dimensional, non-Gaussian measured data with nonlinear interactions. Because its structure can disentangle confounding factors in the measured data, it promises to circumvent the problems of erroneous clinico-pathological reasoning and to prevent data analyses that are heavily biased by spurious correlations.
Technically, data-driven causal inference involves finding variables and their covariates (Figure 1), isolating the mechanism predicting these variables using causal bootstrapping (Little and Badawy, 2019) or other causal adjustment methods (Pearl, 2010), then using the data to fit a predictive model of that isolated mechanism. The isolated mechanisms can then be assembled into a full, predictive causal network. After examining the associations of identified biological subtypes with clinical characteristics and outcomes, the severity of subtypes, their motor and non-motor functionalities, and progression pattern will be determined by integrating data from biological interpretation of subtypes as well. Visual interpretations obtained using Bayesian exponential family PCA and other dimensionality reduction techniques, and their relationship with clinical neurodegenerative disease subtypes, will be summarized to generate a global view of each subtype. The main benefit of this data-driven causal inference model is not validation in separate populations but the identification of suitable candidates within the cohort for future repurposing therapy approaches.
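To make the idea of causal adjustment concrete, the sketch below applies back-door adjustment by stratification, one standard adjustment in Pearl's framework, to a toy binary example. It is not the causal bootstrapping procedure of Little and Badawy, and the records are invented for illustration: `x` is a hypothetical marker flag, `y` a hypothetical phenotype flag, and `z` a single measured confounder such as age group. The toy data are built so that the marker has no effect within either stratum, yet the unadjusted comparison suggests one.

```python
from collections import defaultdict


def naive_effect(records):
    """P(y=1 | x=1) - P(y=1 | x=0), ignoring the confounder z."""
    def rate(x_value):
        ys = [y for x, y, _ in records if x == x_value]
        return sum(ys) / len(ys)
    return rate(1) - rate(0)


def backdoor_effect(records):
    """Back-door adjusted effect: sum_z P(z) * [P(y=1|x=1,z) - P(y=1|x=0,z)]."""
    strata = defaultdict(lambda: {0: [], 1: []})
    for x, y, z in records:
        strata[z][x].append(y)

    n = len(records)
    effect = 0.0
    for z, by_x in strata.items():
        if not by_x[0] or not by_x[1]:
            raise ValueError(f"stratum {z!r} lacks subjects with x=0 or x=1")
        weight = (len(by_x[0]) + len(by_x[1])) / n
        effect += weight * (
            sum(by_x[1]) / len(by_x[1]) - sum(by_x[0]) / len(by_x[0])
        )
    return effect


def make_group(x, z, n, n_positive):
    """Toy helper: n records (x, y, z) of which n_positive have y = 1."""
    return [(x, 1, z)] * n_positive + [(x, 0, z)] * (n - n_positive)


# Invented data: z drives both x and y; x has no effect within each stratum.
records = (
    make_group(0, 0, 40, 4) + make_group(1, 0, 10, 1)
    + make_group(0, 1, 10, 5) + make_group(1, 1, 40, 20)
)

if __name__ == "__main__":
    print(naive_effect(records), backdoor_effect(records))
```

The unadjusted difference is driven entirely by the confounder, while the stratified estimate recovers the absence of an effect. Stratification only works for a small number of discrete, measured confounders; the high-dimensional setting described in the text needs the more general methods it cites.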

Subtyping Based on Individual Markers From Integrative Analysis
The analysis of the biological data should lead to clustering subjects with shared biomolecular alterations regardless of phenotype (Espay et al., 2017). In data-driven biological subtyping, the "truth" is unknown and the analysis hypothesis free. Clustering is a major method for disease subtyping based on high-dimensional omics data (Wang and Gu, 2016). We will apply clustering methods to identify subtypes in genomics, transcriptomics, proteomics, metabolomics, epigenomics, microbiomics, and pharmacogenomics. There are currently two main methods for the fusion clustering of multi-omics data [i.e., iCluster, similarity network fusion (SNF)] based on the sample similarity network. Studies have shown that SNF has better performance in disease subtyping than iCluster (e.g., in cancer) (Wang et al., 2014; Wang and Gu, 2016). We will perform unsupervised clustering on the processed data by SNF and validate similarities and dissimilarities in identified subtypes using moCluster and pattern fusion analysis by adaptive alignment of multiple heterogeneous omics data. Because clustering analysis is an unsupervised learning method, the results cannot be tested against a ground truth, which is what usually establishes the classification accuracy of supervised learning techniques on a training set. We can also perform bioinformatics analysis, such as differential expression analysis and functional enrichment analysis, for different subtypes and compare the differences among them. Data-driven subtypes will be determined using various parameters described above. Deep phenotyping from clinical (e.g., development of clinical milestones such as falls, progression of motor and non-motor symptoms, etc.), paraclinical (e.g., mobile health technologies), and neuroimaging data (e.g., brain atrophy) will be used as outcome measures or dependent variables. The longitudinal design, with multiple follow-ups, will give us information about the causal role of potentially druggable biomarkers. The relationship between biomarkers and disease will require similar assessments in the control group.

FIGURE 1 | Basic causal model of proposed relationships between measured variables in the CCBP cohort study. Arrows between variables (in circles) indicate the dominant direction of causal influence between them. In this study, machine learning is used to model predictive relationships, but these should also be causal, not merely associational, relationships. For example, in predicting phenotype (effect) from omics data (causes), confounders such as subject age influence both cause and effect variables, which makes it critical to take these into consideration when using predictive machine learning algorithms.

Subtyping Based on Composite Markers From Integrative Analysis
The clustering of markers (joint expressions of important features) arising from different omics measurements may be useful in identifying unique subtypes of patients, as opposed to using patterns of individual markers to form patient subtypes. This procedure typically involves a two-stage framework of clustering. The first stage of clustering groups the subset of variables into disjoint segments, whereas the second stage creates patient subtypes by exploring the patterns in the clusters of markers identified in the first stage. We will utilize unsupervised feature selection methods such as sparse partial least squares (sPLS), sparse canonical correlation analysis (sCCA) (Witten and Tibshirani, 2009), and variable cluster analysis (VCLUS) in the first stage, followed by moCluster (Meng et al., 2016) and SNF in the second stage to determine subtypes.

Subtyping Based on Outlier and Non-Gaussian Markers
Heterogeneity may exist in the identified subtypes of patients. Generally, clustering approaches are conducted to determine subtypes and variable selection after removing outliers and non-Gaussian data. As opposed to removing outliers and non-Gaussian data, several unique subtypes and biological heterogeneity can be obtained by determining subtypes based on outlier markers. In this regard, two novel approaches can be adopted to identify outlier markers as well as non-normal markers. We will employ outlier profile and pathway analysis (OPPAR) using the modified cancer outlier profile analysis (mCOPA) (Wang et al., 2012). The mCOPA is used to identify markers that are outliers, either up-regulated or down-regulated. We will also apply the maximum ordered subset t-statistics (MOST) (Karrila et al., 2011) method for identifying bimodally distributed markers. After selecting the appropriate set of non-normal and outlier markers, moCluster and SNF methods will be used to cluster patients into homogeneous patterns of non-normal and outlier markers. These steps of identifying subtypes will be replicated for gene set enrichment analysis using OPPAR.
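The general idea behind outlier-profile methods, flagging subjects whose level of a marker lies far from the bulk of the cohort in either direction, can be illustrated with a robust median/MAD scaling. The sketch below is only that illustration: it is not the mCOPA algorithm as implemented in OPPAR, and the function names and the threshold of 5 scaled units are assumptions.

```python
from statistics import median


def robust_scores(values):
    """Center by the median and scale by the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; the marker is (nearly) constant")
    return [(v - med) / mad for v in values]


def outlier_subjects(values, threshold=5.0):
    """Return (up, down): indices of subjects with outlying marker levels.

    The threshold is a hypothetical cut-off chosen for illustration only.
    """
    scores = robust_scores(values)
    up = [i for i, s in enumerate(scores) if s > threshold]
    down = [i for i, s in enumerate(scores) if s < -threshold]
    return up, down


if __name__ == "__main__":
    # Invented marker levels for ten subjects.
    levels = [10, 11, 9, 10, 12, 10, 11, 9, 30, -8]
    print(outlier_subjects(levels))
```

Unlike mean and standard deviation, the median and MAD are barely moved by the outliers themselves, which is why this kind of scaling keeps a small outlying subgroup visible instead of absorbing it.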

Subtyping Based on Dynamic Network Biomarkers
Individual sets of omics may have limitations, such as poor sample quality or data sparsity; network-based stratification can be used to overcome these limitations and identify unique patient subtypes. We will employ a network-based stratification approach for baseline omics data that identifies patients with genes in similar network regions (Hofree et al., 2013). The dynamic network biomarkers (DNBs) method examines time-dependent alterations in biomarkers. We will select the markers in cases that are not statistically different from controls at baseline and determine the longitudinal changes in these markers according to disease progression or treatment response. MoCluster and SNF will then be applied to determine subtypes based on the changes in these initially non-significant markers.

Bioassay Development for Currently Available Therapies
First, genomics, transcriptomics, proteomics, metabolomics, epigenomics, and microbiomics data will serve to identify potentially altered molecular pathways for each global neurodegenerative subtype (Figure 2). Bioassay candidates will be selected depending on candidates identified by relevant pathway analyses. For example, from the genomics data, we will perform genome-wide association study (GWAS) analysis to obtain SNPs of each subtype and then identify the potential pathogenic genotype and the pathways with which they are associated. Viable bioassay candidates will be selected, determined by the generation of high-throughput clinically relevant assays for the quantification of expression and/or biologic state of candidates.
Second, online databases, including OMIM and PubMed, will be searched for related mechanistic information. Specifically, we will collect information on the effects of gain of function (GOF) and loss of function (LOF) in human and/or mammalian models and remove targets that can significantly aggravate the corresponding phenotype. Targets will be obtained through the candidate analysis above, and candidate drugs with repurposing potential will be recognized for future proof-of-concept clinical trials from identified pathways/protein combinations and drug-related protein information (Table 2).
We plan to work with industry partners to develop/utilize bioassays for the presumed mechanisms of action of each of the candidate drugs. Given the phenotype-agnostic nature of this study, after the identification of a bioassay-based abnormality suggesting vulnerability to a specific drug, a proof-of-concept clinical trial will be designed to match the drug with the bioassay-defined clinical cohort in order to evaluate the preliminary safety and efficacy of such a to-be-repurposed intervention.

Reliability and Validation of Patient Subtypes
Various approaches will be used to assess replicability, naturalness, and validation of cluster subtypes. The reliability will be assessed by the adjusted Rand statistic and percentage agreement on cross-validated hold-out testing. The validation will be assessed by comparing the clusters across different clustering methods (SNF, moCluster; sPLS and sCCA; and OPPAR followed by SNF, moCluster) and by the concordance index (c-statistic), evaluating the predictive performance of each cluster on primary outcomes across different methods.
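For readers unfamiliar with the adjusted Rand statistic, the sketch below computes it for two cluster assignments of the same subjects using the standard pair-counting formula. It is a minimal standard-library illustration with a hypothetical function name; it returns 1 for identical partitions regardless of how clusters are labeled, and values near 0 for agreement no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same subjects."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same subjects")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two subjects are required")

    # Pairs of subjects placed together in both, in A only, and in B only.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2          # largest attainable agreement
    if maximum == expected:
        return 1.0  # both partitions are trivial and identical in structure
    return (pairs_joint - expected) / (maximum - expected)


if __name__ == "__main__":
    # Same partition with swapped cluster names.
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

In the validation described above, one labeling could come from one clustering method or cross-validation fold and the other from a second method or fold.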
to assess phenomena associated with brain neurodegeneration (Lehmann-Werman et al., 2016). Nevertheless, selected biological alterations associated with central nervous system neurodegeneration can also be detected in other tissues (Kaushik and Cuervo, 2015;Lehmann-Werman et al., 2016); for instance, EVs will be used as a platform for "liquid biopsies." EVs have been shown to transport this molecular cargo directly between neighboring cells, as well as to distant cells via blood and other fluids (van Niel et al., 2018). EVs bear both surface proteins and intracellular contents from their parent cells into peripheral fluids, which are then accessible without the invasiveness of tissue biopsy (El Andaloussi et al., 2013). Moreover, in the future, EVs may also serve as a delivery system for therapies given that, as native nanoparticles, they benefit from immune tolerance and the ability to cross biological barriers (van Niel et al., 2018;Wiklander et al., 2019).

Relevance of Biomarkers
Neurodegeneration starts years prior to symptom onset (Cacabelos, 2017). This creates difficulties in distinguishing between early biomarkers, related to causal disease mechanisms, and late biomarkers, which may be end results of other processes, may themselves be pathogenic, or may result from the response to various treatments (Espay et al., 2017). Moreover, early or late biomarkers may be transient or constant across neurodegenerative disorders, so the importance of a given biomarker may be underestimated or overestimated depending on the time of data acquisition. A population-based study design with control subjects, multiple visits, longitudinal assessments, and next-generation statistical analysis may help mitigate these issues.

Development of Bioassays
Some of the known mechanisms of therapies with repurposing potential (Table 2) may not be relevant to disease pathogenesis in any subtype, even if bioassays can be developed to measure their range in a laboratory. Some bioassay candidates may be difficult to deploy or measure with existing technology in a manner that would make them clinically viable. Connecting specific biomarkers to disease stage or progression will be difficult given our study design. This concern will be ameliorated by using promising bioassays to select patients for future proof-of-concept drug studies. Such studies will contribute toward separating primary from secondary biologic mechanisms of each neurodegenerative subtype.

Uncertainty About Extent of Unknowns
While the data-driven design of this study favors the collection of data without a priori hypotheses for later analysis using discovery algorithms (Kim et al., 2016), a major challenge is to define which biologically promising targets may be more relevant than any of the currently known biomarkers. Also, some technologies may be insufficiently sensitive for potentially relevant biomarkers or may yield false-negative results. As for the known variability of prior omics data, we expect it to be attenuated by an unbiased analysis that is not anchored on diagnostic or phenotypic data. The creation of a robust biobank is designed to mitigate these difficulties by providing the opportunity to re-analyze samples and data in the future.

THE "ALL OF US" PROGRAM
The "All of Us" program is an important effort funded by the NIH starting in 2015, aiming to collect clinical, paraclinical, and biological data in a very large population not preselected for the presence of neurodegenerative disorders (All of Us Research Program Investigators, 2019). The goal of the program is to enroll at least 1 million persons nationwide from 340 recruitment sites (All of Us Research Program Investigators, 2019). This effort represents a significant step forward in the understanding of human health and disease. However, the lack of focus on neurodegenerative disorders (or any other disorder) represents an important limitation from the standpoint of our research objectives. Compared with "All of Us," our study aims to combine an "inclusive" approach to all neurodegenerative disorders with standardized clinical questionnaires and scales, in-clinic and at-home wearable technologies, and more extensive biological sampling. Nevertheless, a future collaboration between these two approaches stands to accelerate the understanding of neurodegenerative disorders.

CONCLUSION
This phenotype-agnostic, population-based, bio-subtyping and bioassay development program will provide longitudinally collected clinical and biological data to characterize patients affected by neurodegenerative diseases: not to understand diseases, but to understand how individuals are affected by them. The inclusivity and large number of deeply phenotyped individuals (currently classified under a range of neurodegenerative disorders) and the causal model-driven nature of analyses, blinded to the clinical disease classification, are unique elements in the design of this study, expected to identify small but molecularly suitable subsets of subjects for embedded proof-of-concept adaptive clinical trials. Our goal is to identify the first molecular subset of individuals for whom an available therapy can be repurposed before the end of the 2020s. Despite many anticipated challenges, the ascertainment of biological subtypes will help to materialize the promise of precision medicine for patients affected by neurodegenerative disorders.

AUTHOR CONTRIBUTIONS
AS organized, executed the research project, conceived and wrote the first draft of the manuscript. LM and MK organized, executed the research project and critically revised the manuscript. AKD and LL conceived the statistical methods and critically revised the manuscript. JV, APD, PL, MP, BW, EH, BS, EK, AV, LW, DBH, MR, CT, DWH, SE, KE, and RF critically revised the manuscript. ML organized the research project, conceived the statistical methods and critically revised the manuscript. AE organized, executed and supervised the research project, conceived and wrote the first draft of the manuscript. All authors contributed to the article and approved the submitted version.

FUNDING
The CCBP has received major funding through a grant from the Gardner Family Foundation.