A Standardized Brain Molecular Atlas: A Resource for Systems Modeling and Simulation

Accurate molecular concentrations are essential for reliable analyses of biochemical networks and the creation of predictive models for molecular and systems biology, yet protein and metabolite concentrations used in such models are often poorly constrained or irreproducible. Challenges of using data from different sources include conflicts in nomenclature and units, as well as discrepancies in experimental procedures, data processing and implementation of the model. To obtain a consistent estimate of protein and metabolite levels, we integrated and normalized data from a large variety of sources to calculate Adjusted Molecular Concentrations. We found a high degree of reproducibility and consistency of many molecular species across brain regions and cell types, consistent with tight homeostatic regulation. We demonstrated the value of this normalization with differential protein expression analyses related to neurodegenerative diseases, brain regions and cell types. We also used the results in proof-of-concept simulations of brain energy metabolism. The standardized Brain Molecular Atlas overcomes the obstacles of missing or inconsistent data to support systems biology research and is provided as a resource for biomolecular modeling.


Methods
Flux balance and flux variability analyses are among the most common types of constraint-based modelling; they allow prediction of the flow of metabolites in a network of reactions and, subsequently, of the metabolic capacities of biochemical networks. One substantial limitation for the interpretability of results in such studies is the high uncertainty in the properties of the investigated system, such as a wide range of mathematically plausible reaction fluxes. To this end, many alternatives have been proposed to better constrain the models (Gavai et al. 2015; Heckmann et al. 2018; Lularevic et al. 2019; Pandey, Hadadi, and Hatzimanikatis 2019; Sánchez et al. 2017; Tian and Reed 2018), with gene expression data and enzyme turnover being among the most widely proposed. One approach incorporates protein expression and allocation, producing so-called Metabolism and Expression (ME) models (O'Brien and Palsson 2015); these are available for simple organisms, but great effort is needed to apply this approach to accurate simulations of mammalian metabolism. On the other hand, the limited capacity of enzymes constrains the available range of reaction fluxes (Noor et al. 2016). The capacity of enzymes, which defines maximal reaction rates, is reflected in their activities and amounts; the latter has often been represented by the expression of the corresponding genes. At the same time, using protein concentrations instead of gene expression can yield more biologically accurate results, because regulatory processes of protein synthesis and degradation cause gene-protein level differences. While there are many variations of the flux balance and flux variability methods, the key idea is to analyse the flow of metabolites through the metabolic network, optimizing a specified objective function under mathematical constraints on the reaction rates (Orth, Thiele, and Palsson 2010; Anand, Mukherjee, and Padmanabhan 2020).
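In symbols, the optimization underlying FBA can be stated compactly (standard notation, not reproduced from the manuscript's own equations): with S the stoichiometric matrix, v the flux vector and c the objective coefficients,

```latex
\max_{v}\; c^{T} v
\quad \text{subject to} \quad
S v = 0, \qquad v_{\min} \le v \le v_{\max}.
```

FVA then re-minimizes and re-maximizes each individual flux v_i subject to the same constraints, plus a bound on how far the objective may fall below its optimum.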
FBA has a wide range of applications in bioengineering and biotechnology, but its use for brain cells remains limited; nonetheless, some studies are available, including those addressing neuron-astrocyte coupling. The main factors that affect the rates and fluxes of biochemical reactions are catalytic activities, concentrations of enzymes, availability of reactants and co-factors, and the physical state of the system, such as temperature. Under the assumptions that the same enzymes have similar activities in neurons and astrocytes, and that the conditions and the presence of reactants and co-factors are also comparable, enzyme concentrations should explain the differences in reaction fluxes when comparing neurons and astrocytes. We therefore aimed to evaluate whether protein concentrations used as constraints result in meaningful relative maximum reaction capacities in the neuron and the astrocyte. Comparison of the results to what can be achieved with other types of flux constraints is out of the scope of our study. We used our protein concentration data to constrain large-scale models of neuron and astrocyte metabolism, and to perform flux balance analysis (FBA) and loopless flux variability analysis (FVA) (Desouki et al. 2015; Schellenberger, Lewis, and Palsson 2011), maximizing an objective function of total adenosine triphosphate (ATP) production. FVA is used to evaluate reaction flux boundaries, within which multiple equally optimal solutions can be contained. Loopless FVA is a modification of this method that aims to eliminate thermodynamically infeasible loops (Schellenberger, Lewis, and Palsson 2011). We used the mouse metabolism model iMM1415 (Sigurdsson et al. 2010) as a core reconstruction, but we updated the objective function to represent ATP production as given by equation (5) below.
Using the gene-protein-reaction (GPR) rules from the original model, which relate proteins to biochemical reactions, we set constraints on the lower and upper bounds of fluxes based on protein concentration data from the Brain Molecular Atlas for neurons, astrocytes, cortex and hippocampus, producing four models, respectively. In the case of complex gene-protein-reaction relationships containing AND and OR expressions, constraints were set based on the logic of subunits or isoforms of enzymes being rate-limiting for the reaction. Natural-logarithm protein concentrations were scaled to the range of zero to one and then multiplied by standard bounds of -100,000 and 100,000 mmol/gdw/hr for lower and upper bounds, respectively. Reversibility of reactions was taken into account by setting the relevant bounds to zero.
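A minimal sketch of this constraint-setting step is shown below (function names are hypothetical; we assume here that the minimum concentration is rate-limiting for AND-joined subunits, that OR-joined isoforms contribute additively, and that bounds scale linearly with the min-max-scaled log concentration, which is one plausible reading of the rule described above, not the exact notebook code):

```python
import math

STD_BOUND = 100_000.0  # standard flux bound, mmol/gdw/hr


def gpr_concentration(gpr, conc):
    """Resolve a nested GPR expression to an effective enzyme concentration.

    `gpr` is either a gene id (string) or a tuple like ("AND", g1, g2)
    or ("OR", g1, g2).  AND (subunits): the least abundant subunit is
    rate-limiting -> min.  OR (isoforms): parallel capacity -> sum.
    """
    if isinstance(gpr, str):
        return conc[gpr]
    op, *args = gpr
    values = [gpr_concentration(a, conc) for a in args]
    return min(values) if op == "AND" else sum(values)


def flux_bounds(effective_conc, log_min, log_max, reversible=True):
    """Scale ln(concentration) to [0, 1], then to flux bounds.

    Irreversible reactions get a lower bound of zero.
    """
    scaled = (math.log(effective_conc) - log_min) / (log_max - log_min)
    upper = scaled * STD_BOUND
    lower = -upper if reversible else 0.0
    return lower, upper
```

For example, an isoform pair with concentrations 5 and 2 would yield an effective concentration of 7 under the OR rule, while a two-subunit complex with the same values would yield 2 under the AND rule.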
For loopless FVA, we set the fraction-of-optimum parameter to 0.9 to reduce the number of biologically implausible solutions. We used the Gurobi solver.

Results
As all this information is available through the Brain Molecular Atlas, we applied it to perform loopless flux variability analysis (FVA) (Desouki et al. 2015; Schellenberger, Lewis, and Palsson 2011) with the objective function set to ATP production, as described in the Methods section. Only healthy-state data were used for this analysis. We compared individual maximal reaction capacities (Vmax) for the main energy metabolism pathways of the neuron and the astrocyte based on results from loopless FVA (Supplementary Figure 11A).
Qualitatively, our simulations show the expected differences between neuronal and astrocytic energy metabolism, consistent with reported experimental results, as detailed in the following observations. First, there is evidence that hexokinase is highly enriched in neurons compared to astrocytes, in mouse as well as in human cortex (Lundgaard et al., 2015). Differences between neurons and astrocytes in the maximal capacity of tricarboxylic acid cycle enzymes (citrate synthase), pyruvate kinase flux and pyruvate transport to mitochondria can be explained by the differing preferences for oxidative versus glycolytic energy metabolism in these cell types. Moreover, it is known that the malate-aspartate shuttle (MAS) is active in neurons but much less so in astrocytes (McKenna et al. 2006). In particular, the aspartate-glutamate carrier aralar1 is highly enriched in neurons compared to astrocytes (Bélanger et al., 2011). Our simulation qualitatively supports this, as the difference between neuronal and astrocytic "aspartate-glutamate mitochondrial shuttle" (ASPGLUm) flux capacities is positive. Citrate synthase (CSm) has a higher maximal capacity in the neuron than in the astrocyte in the simulation, consistent with reports of higher citrate synthase activity in neurons than in astrocytes (Hassel and Brâthe, 2000). Next, a large difference in maximum capacities is observed for nucleoside diphosphate kinase (NDPK), which is known to be important for presynaptic function (Krishnan et al., 2001), to play a role in neural development, and to be associated with a variety of brain disorders (Ansoleaga et al., 2016; Hwan Kim et al., 2002; Qiu et al., 2018), and which is reported to be expressed mainly in neurons (Garcia-Esparcia et al., 2015). NDPKs in general are multifunctional proteins (Attwood and Muimo, 2018).
The biological context of the compared processes is shown in Supplementary Figure 11B to clarify their relations to other pathways, including those involved in neuronal signal transduction and the astrocytic glutamate-glutamine cycle.

Supplementary Presentation on the Methods Walkthrough
Proteins.
Data mining.
Step 1: collect protein data. The corresponding code is in MADIP/step_1_collect_protein_data.ipynb, which calls functions from MADIP/collect_protein_data.py. This module loads publicly available data from the sources referenced in the manuscript "A Standardized Brain Molecular Atlas: a resource for systems modeling and simulation".
At this step we read the data (xlsx and txt formats) from the supplementary materials of the referenced sources; MADIP/collect_protein_data.py contains individual functions to read the data from the different sources.
Step 2: align protein and gene identifiers. The corresponding code is in MADIP/step_2_protein_ids_alignment.ipynb, which calls helper functions from MADIP/protein_ids_alignment_helpers.py for aligning protein and gene ids among the different data sources. Several filtering procedures are also performed at this step, as described below.
First, we load the data frame produced at step 1 in the Jupyter notebook MADIP/step_1_collect_protein_data.ipynb.
We check for UniProt protein ids that potentially come from experimental contaminants ("CON_" at the beginning of a protein id indicates a potential contaminant according to the MaxQuant annotation). We dropped those 490 entries from the data frame, leaving 2,142,284 entries.
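The contaminant filter amounts to a prefix check on the protein id; a minimal sketch (the record layout, a list of dicts with a hypothetical id key, is ours, not the notebook's):

```python
def drop_contaminants(rows, id_key="uniprot_id"):
    """Remove entries whose UniProt id carries MaxQuant's contaminant
    prefix "CON_".  `rows` is a list of dict records."""
    return [r for r in rows if not r[id_key].startswith("CON_")]
```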
Next, we manually retrieve additional UniProt data from https://www.uniprot.org (the latest data retrieval was done on 21 July 2020) by the following queries: [10116]" OR taxonomy:"Homo sapiens (Human) [9606]") AND reviewed:no. These data are needed in the later stages of the nomenclature alignment. The queried data were read and formatted by the function process_uniprot_mapping_data() from MADIP/protein_ids_alignment_helpers.py (called from the notebook MADIP/step_2_protein_ids_alignment.ipynb). Next, we check gene-name consistency within studies using the function check_GN_consistency_within_studies(df_all) from MADIP/protein_ids_alignment_helpers.py (called from the same notebook). This finds gene names that are repeated in different entries of the same source in combination with other gene names (which occurs due to protein annotation and identification). Example: UCHL3;UCHL4 and UCHL3.
Next, we removed partial duplicates of the form "GN1;GN2" and "GN1" within studies, because we need unique, unambiguous data. This reduced the data from 2,142,284 to 2,140,373 entries.
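One way to detect such partial duplicates is sketched below; we assume here that any within-study entries whose gene-name sets overlap are treated as ambiguous and dropped, which is an interpretation of the rule above rather than the exact notebook logic:

```python
from collections import defaultdict


def drop_partial_duplicates(entries):
    """Drop within-study entries whose gene-name sets overlap with
    another entry of the same study (e.g. "UCHL3;UCHL4" vs "UCHL3").

    `entries` is a list of (study, gene_names) tuples, where gene_names
    is a ";"-separated string.
    """
    # count, per study, how many entries mention each individual gene
    counts = defaultdict(lambda: defaultdict(int))
    for study, names in entries:
        for gene in names.split(";"):
            counts[study][gene] += 1
    # keep an entry only if none of its genes appears in any other
    # entry of the same study
    return [(study, names) for study, names in entries
            if all(counts[study][g] == 1 for g in names.split(";"))]
```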
Next, we addressed the UniProt annotation of the entries. We cropped the isoform index, i.e. transformed identifiers from the "PPPPPP-Number" format to "PPPPPP", because we consider the data on all isoforms of a protein combined.
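Cropping the isoform index is a small string transformation, sketched here (a generic helper, not the notebook's own function):

```python
def crop_isoform(uniprot_id):
    """Strip a trailing numeric isoform suffix from a UniProt accession,
    e.g. "P69905-2" -> "P69905"; ids without a suffix pass through."""
    base, _, suffix = uniprot_id.partition("-")
    return base if suffix.isdigit() else uniprot_id
```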
Next, we obtained the list of unique sets of UniProt identifiers, which we later used to build a graph of related ("synonymous") UniProt identifiers.
At the next step, we built two graphs of potential synonyms, one for gene names and one for UniProt protein identifiers, as described in the Methods section of the main text. We counted the occurrences of every gene name and UniProt identifier and selected the most frequent one in every connected component of the graph as the main identifier for all identifiers of that component (i.e. its potential synonyms). This produced two mapping dictionaries, one per identifier type, in each of which a component's identifiers map to the identifier found most frequently in the data. We then performed many manual checks of the key-value correspondence in those dictionaries. A few nomenclature conflicts found by the manual checks were resolved with the use of the Mouse Genome Database and UniProt; the most common gene names were kept as final identifiers in the integrated Molecular Atlas.
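The graph-based mapping described above can be sketched as follows (a generic implementation of the idea, not the notebook code; identifier sets that co-occur in an entry become graph edges, and each connected component maps to its most frequent member):

```python
from collections import defaultdict


def canonical_mapping(id_sets, occurrences):
    """Map every identifier to the most frequent identifier of its
    connected component in the synonym graph.

    id_sets: iterable of sets of identifiers reported together.
    occurrences: dict identifier -> occurrence count in the data.
    """
    graph = defaultdict(set)
    for ids in id_sets:
        ids = list(ids)
        for a in ids:
            graph[a].update(i for i in ids if i != a)
    mapping, seen = {}, set()
    for start in graph:
        if start in seen:
            continue
        # traverse one connected component
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        main = max(component, key=lambda i: occurrences[i])
        for ident in component:
            mapping[ident] = main
    return mapping
```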
These mapping dictionaries are then applied, along with a series of manual consistency checks of the identifiers, as can be seen in the Jupyter notebook MADIP/step_2_protein_ids_alignment.ipynb. Some additional checks and refinements continue in MADIP/step_3_protein_conc_calc_tpaPerSample.ipynb.

Concentration estimations.
Step 3: calculation of protein concentrations. The corresponding code is in MADIP/step_3_protein_conc_calc_tpaPerSample.ipynb. We start this notebook with additional refinements of the protein IDs, as shown there.
Next, we retrieve protein sequences from the UniProt database. We then use the protein sequences to calculate the theoretical number of peptides (peptide length is considered to be from 6 to 29 amino acids), obtained by splitting the protein sequence at the particular amino acids at which the enzymes used in the experiments would cleave it. We use the information on enzymes reported in the original data sources.
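A minimal in-silico digestion sketch is shown below; cleavage after K or R (trypsin-like) is used purely as an example, since the actual cleavage residues depend on the enzyme reported by each source, and missed cleavages are not considered here:

```python
def count_theoretical_peptides(sequence, cleave_after=("K", "R"),
                               min_len=6, max_len=29):
    """Count theoretical peptides of length min_len..max_len produced
    by splitting the sequence after every cleavage residue."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in cleave_after:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):          # C-terminal peptide
        peptides.append(sequence[start:])
    return sum(min_len <= len(p) <= max_len for p in peptides)
```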
We also obtain the molecular weights of proteins from UniProt where they are not given in the data source.
Then we apply the total protein approach (Wiśniewski et al. 2014), with some adjustments depending on the experimental data type, as shown in MADIP/step_3_protein_conc_calc_tpaPerSample.ipynb. Other transformations are used in the same Jupyter notebook for data reported in mol/g protein and mol/(g total protein).
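The core of the total protein approach can be sketched as follows (a generic formulation assuming the summed MS intensity is proportional to total protein mass; the notebook applies source-specific adjustments on top of this):

```python
def tpa_concentrations(intensities, molecular_weights):
    """Total protein approach: the concentration of protein i, in mol
    per gram of total protein, is estimated as
        intensity_i / (MW_i * total_intensity).

    intensities: dict protein -> summed MS intensity
    molecular_weights: dict protein -> molecular weight in g/mol
    """
    total = sum(intensities.values())
    return {p: i / (molecular_weights[p] * total)
            for p, i in intensities.items()}
```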
Step 4: clean metadata. The corresponding code is in MADIP/step_4_cleanMetadata.ipynb. This code is dedicated to the alignment of age categories for the data coming from the different sources. When the data sources explicitly state qualitative age categories, these were used. Otherwise, the estimate was made based on the days/weeks reported in the sources, with the use of https://www.jax.org/news-and-insights/jax-blog/2017/november/when-are-mice-considered-old and https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3733029/.
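The days-to-category estimate amounts to threshold binning; a sketch with illustrative cutoffs only (the thresholds below are our assumptions, loosely following the JAX guidance cited above, not the values used in the notebook):

```python
def age_category(age_days):
    """Map a mouse age in days to a qualitative category
    (illustrative thresholds, not the notebook's exact cutoffs)."""
    months = age_days / 30.4
    if months < 3:
        return "young"
    if months < 10:
        return "adult"
    if months < 18:
        return "middle-aged"
    return "old"
```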
Step 5: normalisation. The corresponding code is in MADIP/step_5_normaliz.ipynb. First, we introduce the combined identifiers compound_gene_protein_id, row_id and row_gene_id to better group entries in the data.
Then we perform additional exploratory analysis on the data.
Then we select housekeeping proteins and use the data on those proteins to perform median normalization.
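Median normalization against a housekeeping set can be sketched as follows (a generic version: each sample is rescaled so that the median of its housekeeping-protein values matches the grand median across samples; the data layout here is hypothetical):

```python
import statistics


def median_normalize(samples, housekeeping):
    """Rescale each sample so its housekeeping-protein median equals
    the grand median of those medians across samples.

    samples: dict sample -> dict protein -> concentration
    housekeeping: list of housekeeping protein ids
    """
    sample_medians = {
        s: statistics.median(v[p] for p in housekeeping if p in v)
        for s, v in samples.items()
    }
    grand = statistics.median(sample_medians.values())
    return {
        s: {p: c * grand / sample_medians[s] for p, c in v.items()}
        for s, v in samples.items()
    }
```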

Evaluation
Step 6: comparison of estimated protein concentrations to literature. The corresponding code is in BrainMolecularAtlas/main_6_plots.ipynb and BrainMolecularAtlas/main_14_absValValid.ipynb. It performs basic data-frame operations.
Step 7: comparison with PubMed mentions. The corresponding code is in BrainMolecularAtlas/main_figS4_basedOn_pipeline_concHitsAn_2_14july2020.R. It performs the selection of proteins to be analysed and the generation of queries. EUtilsSummary from the RISmed library was then used to perform the queries, but this small and easy-to-reproduce part of the code is omitted to allow a more permissive license for the entire project. Next, the background concentration and the log fold change are calculated. The figure-generating code for this particular case is also not available due to conflicting license types, but it can easily be reproduced with the ggplot2 R library.
Step 8: functional network analysis. This step is done using Cytoscape as follows:
- open Cytoscape
- load the data (we selected proteins in every cell type (neurons, astrocytes, microglia, oligodendrocytes) and brain region of interest (cerebellum, cortex, hippocampus, striatum, brainstem, thalamus, amygdala) with concentrations above 99% of overall protein levels across cell types and brain regions, correspondingly)
- click on the STRING plugin and set the data as a STRING network
- click on clustering, choose MCL, set inflation to 5 and run it
- for every cluster, click on STRING: retrieve functional enrichment
- color the clusters according to the obtained annotation, manually summarised from the retrieved functional enrichment

Table 1. Overview of data sources used in the Brain Molecular Atlas for estimation of the protein and metabolite concentrations (two sheets).
Table 2. Statistics summary and supplemental statistical analysis results (two sheets). DBM/OVS refers to the distance between medians (DBM) as a percentage of the overall visible spread (OVS).
Table 3. Results from the automated PubMed data mining for metabolites (with corresponding queries).

Supplementary Tables and Data Sheets Overview
Data Sheets 1-4. Estimated protein concentrations (split into 4 files due to file size limit in the Journal's editorial management system).

Data Sheet 5. Estimated metabolite concentrations.
Data Sheet 6. Functional enrichment of protein-protein interaction networks related to Figures 6 and 7.
Column "ClusterID" refers to the location (brain region or cell type) and the cluster index number.
Data Sheet 7. List of differentially expressed proteins related to Fig 8 and Supplementary Figures 6-9. The "GroupID" column reports the group name defining the part and figure to which each entry corresponds.