The visualization of Orphadata neurology phenotypes

Disease phenotypes are characterized by signs (what a physician observes during the examination of a patient) and symptoms (the complaints of a patient to a physician). Large repositories of disease phenotypes are accessible through the Online Mendelian Inheritance of Man, Human Phenotype Ontology, and Orphadata initiatives. Many of the diseases in these datasets are neurologic. For each repository, the phenotype of neurologic disease is represented as a list of concepts of variable length where the concepts are selected from a restricted ontology. Visualizations of these concept lists are not provided. We address this limitation by using subsumption to reduce the number of descriptive features from 2,946 classes into thirty superclasses. Phenotype feature lists of variable lengths were converted into fixed-length vectors. Phenotype vectors were aggregated into matrices and visualized as heat maps that allowed side-by-side disease comparisons. Individual diseases (representing a row in the matrix) were visualized as word clouds. We illustrate the utility of this approach by visualizing the neuro-phenotypes of 32 dystonic diseases from Orphadata. Subsumption can collapse phenotype features into superclasses, phenotype lists can be vectorized, and phenotypes vectors can be visualized as heat maps and word clouds.


Introduction
The signs and symptoms of a disease characterize its phenotype. In addition to signs (what a physician observes in a patient) and symptoms (the complaints of a patient), a clinical phenotype can include the age at the onset of a disease, its mode of onset, its rate of progression, its mode of inheritance, and its response to treatment. Some researchers include biochemical, radiological, electrophysiological, and biosensor findings as part of the disease phenotype (1)(2)(3)(4)(5). Large phenotype repositories are available on the internet. The On-Line Mendelian Inheritance in Man (OMIM) has over 9,500 disease profiles (6) and Orphadata has phenotype profiles of 4,245 rare diseases (7). The Human Phenotype Ontology (HPO) draws phenotype profiles from Orphadata and OMIM so that some genetic diseases have alternative profiles from each registry (8,9). All three repositories have sophisticated search engines that retrieve phenotype features by disease or gene (1). Phenotypic features are recorded as concepts (terms) from restricted vocabularies such as the Human Phenotype Ontology (20,246 terms) (10), or the Online Mendelian Inheritance of Man ontology (99,165 terms) (11).

Neuro-phenotypes
The June 2022 release of Orphadata lists 7,261 rare diseases, with 1,740 classified as rare neurological diseases (https://www.orphadata. com/linearisation/). Orphadata provides phenotype profiles on 1,184 rare neurologic diseases (https://www.orphadata.com/phenotypes/). Neuro-phenotyping is the deep phenotyping of neurological disease (1). We have suggested that most neuro-phenotyping can be done with a restricted vocabulary of about 1,600 concepts (12). Although lists of phenotypic features for neurological diagnoses can be retrieved from Orphadata, OMIM, or HPO, these lists are difficult to visualize.

Visualizations of disease phenotypes have limitations
OMIM, Orphanet, and HPO yield lists of phenotype features of variable length, sorted by alphabetical order, feature frequency, or body system. For example, the Orphadata annotations for Dystonia Type 13 (DYT13) are: Although useful, these lists have limitations. The lists may be long. In the Orphanet dataset, 25% of the lists are more than 34 features in length. Many of these lengths are beyond the length of 7 + 2 that is easily comprehended (13). Side-by-side comparisons of lists are difficult ( Table 1). Lists of signs and symptoms from Orphadata may contain pathologies (e.g., gliosis, Lewy bodies), radiological findings (e.g., abnormal PET FDG), biochemical findings, electrophysiological findings, and modes of inheritance. Although terms in Orphadata are from the HPO-controlled vocabulary (20,246 classes) (10), redundancies, near-synonyms, hypernyms, and hyponyms populate the lists (e.g., dysarthria and slow slurred speech; bradykinesia and hypokinesia; masked facies and hypomimia, etc.) Furthermore, OMIM, Orphadata, and HPO do not provide native methods for visualization of phenotype.

Prior work
Limited work has been done on visualizing phenotype lists retrieved from HPO, OMIM, or Orphadata. Xu et al. (14) visualized the distances between genetic diseases and their underlying phenotypes using t-SNE (stochastic neighborhood embedding) maps. The phenotype features from the OMIM dataset were used to calculate distances between genetic diseases. The t-SNE maps are a 2-dimensional representation of the distances between genetic diseases derived from multi-dimensional data. Although these t-SNE maps provide instructive information about the distances between genetic diseases, they do not reveal the details of the underlying phenotypes. Network analysis and network graphs have been used to visualize the distances between diseases based on their phenotype (15)(16)(17). However, these network diagrams do not elucidate the underlying phenotypic differences between the diseases. Several methods have been proposed to visualize disease-phenotype relationships, including radar graphs (18), co-occurrence charts (19), and sunburst diagrams (20 An additional barrier to visualizing neurology phenotype profiles is the large number of terms in the HPO (N ¼ 20, 390), making the number of columns in heat maps or tables impractical. A feature reduction strategy that chunks phenotype features into a more manageable number of superclasses is needed. For example, Hier and Pearson (25) have suggested chunking problems in the electronic health record by body system to increase the readability of the problem list. Both OMIM and HPO chunk phenotype features by body system. Orphanet chunks phenotype features by feature frequency (common to rare). Yauy et al. (26) have chunked 16,600 phenotypic traits into 390 interacting symptom groups. However, the chunking of phenotype features by body system is unlikely to yield useful visualizations because dissimilar phenotypic features are grouped together. For example, chunking concepts by a nervous system category would put the unlike concepts of hypertonia, hypotonia, hyperreflexia, and hyporeflexia into the same category, a grouping of little diagnostic value. Although the chunking of phenotype concepts by body system or other schemes helps organize phenotype features, it does not reduce the number of features. Since the HPO is a hierarchical containment ontology, we have suggested that subsumption can create superclasses of phenotypic features and reduce the number of features (27,28).

Proposed approach and use case
We propose to improve the visualization of neurology phenotypes in the Orphdata dataset utilizing a combination of subsumption, vectorization, heat maps, and word clouds.
As proof of concept, we illustrate the utility of this approach with a use case that visualizes the phenotype lists of 32 dystonic diseases from Orphadata. In 1911 Oppenheim described the disease dystonia musculorum deformans and coined the term dystonia (29). Albanese et al. (30) defined dystonia as "a rare movement disorder characterized by sustained or intermittent muscle contractions causing abnormal, often repetitive movements, postures, or both." Since the description of dystonia by Oppenheim, many forms of dystonia have been described. Dystonia is classified along two axes: clinical and etiologic (30).
Clinical classification is by age at onset, body distribution, the temporal pattern of symptoms, and associated phenotype features. Etiologic classification is by genetic versus non-genetic causation. Dystonia is one of the hyperkinetic movement disorders which also encompasses chorea, athetosis, hemiballismus, tics, tremors, stereotypy, myoclonus, and dyskinesia (31). Although all diseases labeled dystonia have a core symptom of dystonia, there is considerable variability in the clinical presentation (signs and symptoms) of the dystonias (29,32,33), making it an excellent use case for phenotype visualization. Furthermore, better characterization and classification of the dystonias is a major initiative of the European Reference Network for Rare Diseases, and Orphadata (34,35).
We downloaded the most recent Orphadata file with phenotype annotations of 4,254 rare diseases, including 1,184 rare neurological diseases. We identified 2,946 unique HPO terms used to characterize the signs and symptoms of rare neurological diseases and created a lookup table to map each term to one of 30 superclasses based on subsumption and expert opinion. The lists of phenotypic features for 32 dystonic diseases from Orphadata were converted into 31-element vectors, with the first element of the vector being the disease name and the next 30 elements being the count of features (signs and symptoms) for each superclass. The full 32-row Â 31-column matrix of the dystonic diseases can be visualized as a feature map ( Figure 2); individual rows can be visualized as word clouds ( Figure 3B).  In the lower half of the An XML file with 4,254 rare disease disorders and 112,256 phenotypic annotations was downloaded (June 2022 release of Orphadata: (https://www.orphadata.com/phenotypes/). Phenotype features are coded using the HPO ontology. Orphadata defines a rare disease as affecting less than 1 in 2,000 individuals in Europe and classifies 1,184 of the diseases as rare neurological diseases. We used python to parse the XML file and create a variable-length list of phenotypic features for each disease. We retained phenotypic annotations that were clinical signs or symptoms and filtered out phenotypic annotations related to disease course (progressive, static, etc.), mode of inheritance (recessive, dominant, etc.), biochemical abnormality, radiological abnormality, pathological abnormality, or electrophysiological abnormality. Based on published literature, Orphadata classifies the frequency of each phenotypic feature from rare (1-4%) to always present (100%). We retained phenotypic features classified as occasional or higher (5-100%).

Lookup table to convert phenotype classes to superclasses (subsumption)
The HPO (10) is organized as a hierarchical subsumption ontology so that more-specific concepts in the ontology are subsumed by more general concepts (28). We identified 2,946 unique concepts that Orphadata used to phenotype neurological diseases. We collapsed these concepts into 30 superclasses using subsumption and domain expert opinion. Example class memberships and class counts are shown for each superclass below. We used python to assign each phenotypic feature (sign or symptom) to one of the thirty superclasses based on the lookup table (see Table 1 for an illustration of how individual phenotype features were mapped to superclasses). The lookup table is available in the Supplementary Materials.

Vectorization (conversion of phenotype lists to phenotype vectors)
Variable-length lists of phenotypic features were converted into vectors of fixed length 31 elements. The first element of the list was the disease label, and the following 30 elements were the counts of features in each of the 30 superclasses based on the lookup table. When the phenotype is represented as a vector, phenotypes can be compared by distance metrics. Furthermore, the magnitude of each element in the phenotype vector carries additional information that allows comparisons between diseases. For example, one disease with hyperkinetic features dystonia, chorea, and athetosis would have a hyperkinesia superclass value of n ¼ 3, whereas a disease with only dystonia would have a hyperkinesia superclass value of n ¼ 1. Such weightings could be useful in distinguishing between phenotypes of similar diseases.

Visualization (creation of heat maps and word clouds
Heat maps and word clouds were based on the phenotype vectors generated by python. Heat maps were created using the heat map widget from Orange (36). The score mapped for each superclass was the count of the phenotype features subsumed by that class. When a superclass had no features assigned to it, that superclass was dropped from the heat map. Word clouds were produced using the word cloud widget from Orange. Word size in the word cloud reflected the frequency of phenotypic features for a group of diseases ( Figure 1B) or a single disease ( Figure 3B).

Results
As our use case, we examined the phenotype profiles of 32 disease variants of dystonia in Orphadata. Phenotype profiles were lists of features (see Table 1 for examples of DYT4, DYT6, DYT16, and DYT27). Feature lists ranged from 5 to 48 elements, with a mean of 18.4 features +10.5. The 252 unique features in the phenotype lists were reduced by subsumption into one of the 19 available 30 superclasses (Table 1 and Figure 1A,B). This allowed visualization of the entire dystonia disease set of 32 variants as a heat map (Figure 2). This heat map allows an easy distinction of pure dystonia (e.g., DYT25 and DYT26) from dystonias with sensory loss (e.g., autosomal dominant dopa-responsive dystonia), cognitive impairment (e.g., DYT4) and hypokinesia (e.g., adultonset dystonia-parkinsonism). Individual rows in the heat map ( Figure 3A can be further visualized with word clouds which emphasize phenotypic differences between the dystonia variants (see Figure 3B for word clouds of DYT4, DY6, DYT16, and DYT 27.) (A) To characterize the 32 dystonic diseases, 528 total concepts and 252 unique concepts were used. The most frequent concepts used were dystonia, bradykinesia, generalized dystonia, dysarthria, and focal dystonia. (B) After feature reduction by subsumption, the number of superclasses needed to characterize dystonia diseases was reduced to nineteen. The largest superclass is hyperkinesia which encompasses dystonia, generalized dystonia, focal dystonia, blepharospasm, craniofacial dystonia, and others.

Discussion
Rich and detailed information on the phenotypes of neurological diseases is held in online repositories such as OMIM, HPO, and Orphadata. Detailed phenotypic data is available for download and can be used to gain insights into the inter-relationships between genes, disease, and phenotypes. Nonetheless, the visualization of the phenotypes retrieved as lists remains problematic. We identified several limitations to the visualization of disease phenotypes that included:  To address these limitations, we proposed restricting our attention to visualizing the phenotypes of rare neurological diseases in Orphadata (N ¼ 1, 184). We mapped each of the 4,505 unique features used to describe signs and symptoms in Orphadata into one of 30 superclasses (see list in the Methods section). This allowed us to convert phenotype lists of variable length to vectors of fixed length (31 elements), in which the first element of the vector was the disease label and the next 30 elements were the count of features for each of the 30 superclasses. This process of converting a list to a vector is illustrated in Table 1 for DYT4, Feature map of 32 dystonias from Orphadata. Each row is a different variant of dystonia. Each column is one of 19 phenotype superclasses. Counts in columns range from 0 to 8. The color scale is centered at 1. Rows and columns are clustered by hierarchical clustering with Ward linkage. Distances between columns are by Pearson correlation coefficient. Distances between rows are by Euclidean distance. Hyperkinesia is the most frequent feature, followed by tremor, behavior, hypokinesia, speech_language, and miscellaneous (See word cloud in Figure 1B) DYT6, DYT16, and DYT27. Only 11 of the 30 superclasses were needed to represent these four dystonias. Once phenotype lists are converted to vectors, a group of diseases can be represented as a matrix. For example, 32 dystonic diseases from Orphadata can be converted to a matrix with 32 rows (each row a disease) and 20 columns (each column a superclass of phenotypic features plus one column for the disease label) and then visualized as a heat map (Figure 2). For easy readability, individual rows (diseases) in the heat maps can be converted to word clouds to visualize better the phenotype ( Figure 3B). We have addressed limitation (1) (long feature lists) by using subsumption to collapse 4,505 phenotypic classes into 30 neurological superclasses. This subsumption of numerous phenotypic features into 30 superclasses also addressed limitation (2) (too many near-synonyms) and limitation (3) (too many unique features). Once phenotype lists of variable length are converted to vectors of fixed length, side-by-side comparisons of diseases become feasible through the use of heat maps and word clouds ( Figures 3A,B); addressing limitation (4). Another advantage of vectorization is that it allows the calculation of distances between phenotypes using standard distance metrics such as cosine and Euclidean. Figure 2 demonstrates the clustering of rows (dystonic diseases) using the Euclidean distance. We filtered out biochemical, radiological, electrophysiological, and pathological features to address limitation (5) (thus, limiting the phenotype to signs and symptoms.) This work has some significant limitations. First, collapsing granular phenotype features into superclasses by subsumption involves information loss. The superclasses retain no laterality information (left-sided versus right-sided weakness, etc.) The superclasses retain no topographical information (proximal versus distal weakness, etc.) The high information value of some granular phenotype features, such as impaired vertical gaze (a sign of progressive supranuclear palsy) or internuclear ophthalmoplegia (a sign of multiple sclerosis), is lost when the granular features are collapsed into the superclass of abnormal eye movements. Second, our current process of collapsing phenotype concepts into superclasses requires a manually constructed lookup table that assigns each concept to a superclass. Errors can be made in assigning concepts to superclasses. We are looking at ways to improve the subsumption process that collapses ontology concepts into superclasses. Third, heat map scales are non-linear. For each superclass score, we counted the number of features in that superclass. For example, a disease phenotype with the term hemiparesis would have a superclass score of 1 for weakness. In contrast, a disease phenotype with terms arm weakness and leg weakness would have a superclass score of 2. Furthermore, we did not weight phenotype features by importance. In building the features maps, a more general concept like hyperreflexia carries the same weight as a more limited concept such as increased biceps reflex. We are exploring whether normalization or other transformations of the underlying data would improve the utility of the heat maps. Fourth, the size and granularity of the superclasses were not uniform. For example, the vision superclass subsumed 450 concepts and had many different types of visual impairment, whereas the fatigue superclass subsumed only 26 concepts and reflected the concept of fatigue alone. Fifth, our selection of thirty superclasses was somewhat arbitrary and subject to modification. Although the selection of the thirty superclasses reflected domain expert opinion and the underlying structure of the ontologies, other useful partitions of the ontology into superclasses are possible. For example, chorea or dystonia could have been distinct superclasses instead of subsumed into hyperkinesia. Speech (e.g., dysarthria) and language disorders (e.g., aphasia) could have been separate superclasses. Sixth, the superclasses were restricted to neurological terms and neurological diseases. As a result, the heat maps will not be useful in visualizing the phenotypes of nonneurological diseases. Furthermore, the heat maps will not adequately visualize important non-neurological signs and symptoms of diagnostic value (such as Kayser-Fleisher rings for Wilson's disease (37)). Although true pathognomonic signs and  Table 1. Each row in the heat map represents a column of signs and symptoms from  (1,(38)(39)(40), the heat maps lack the granularity to show pathognomonic signs. Furthermore, the current heat maps do not support a drill down to the underlying granular phenotype features. Although we used Orange to create the heat maps, suitable heat maps are also available in python, and R. Other heat map color schemes are available and may give better visualizations. The Orphadata phenotype datasets are undergoing revisions and improvements. Some diseases are phenotyped more completely than others. Although the dataset is curated, omissions, errors, and discrepancies can still occur. Finally, a similar analysis could have been done with phenotypic annotations from the OMIM or HPO datasets. Despite these limitations, combining feature reduction by subsumption with vectorization of phenotype lists followed by visualization by heat maps and word clouds offers a robust method to explore neurology phenotypes. Subsumption permits the reduction of thousands of ontological concepts into a reduced number of phenotype superclasses. Vectorization allows the conversion of variable-length phenotype feature lists into superclass vectors of fixed length. Matrices of superclass vectors allow the side-by-side comparison of disease phenotypes as heat maps. Individual rows in the heat maps can be visualized with word clouds, providing an easy-to-grasp representation of a disease phenotype.

Data availability statement
Publicly available datasets were analyzed in this study. This data can be found here: www.orphadata.com/phenotypes/ and in Supplementary Materials.

Author contributions
Concept by DBH and RY. Data analysis by DBH and RY. Data interpretation by DBH, MDC, RY, and DCW III. Writing, revision, and approval by DBH, MDC, RY, and DCW III. All authors contributed to the article and approved the submitted version.

Funding
MDC received financial support from the Veterans Administration and Biogen.