What do we mean with sound semantics, exactly? A survey of taxonomies and ontologies of everyday sounds

Taxonomies and ontologies for the characterization of everyday sounds have been developed in several research fields, including auditory cognition, soundscape research, artificial hearing, sound design, and medicine. Here, we surveyed 36 of such knowledge organization systems, which we identified through a systematic literature search. To evaluate the semantic domains covered by these systems within a homogeneous framework, we introduced a comprehensive set of verbal sound descriptors (sound source properties; attributes of sensation; sound signal descriptors; onomatopoeias; music genres), which we used to manually label the surveyed descriptor classes. We reveal that most taxonomies and ontologies were developed to characterize higher-level semantic relations between sound sources in terms of the sound-generating objects and actions involved (what/how), or in terms of the environmental context (where). This indicates the current lack of a comprehensive ontology of everyday sounds that covers simultaneously all semantic aspects of the relation between sounds. Such an ontology may have a wide range of applications and purposes, ranging from extending our scientific knowledge of auditory processes in the real world, to developing artificial hearing systems.


Introduction
Sounds, i.e., the acoustic signals that populate our environment, are ubiquitous in everyday life. Humans and animals produce sounds to communicate (e.g., speech, vocalizations) and as a consequence of their actions and of their interaction with the environment (e.g., walking, flapping wings). In addition, many manufactured objects and devices that we use daily produce sounds through a variety of mechanisms and processes (e.g., hammering, combustion engines). The development of knowledge organization schemes for the characterization and systematization of natural sounds and of their relations has been an object of interest in several research fields, including cognitive and behavioural psychology, machine sound classification, sound design, music theory, soundscape analysis and medicine. Here, we survey this multifaceted literature to assess the state of the art and to aid the future development of comprehensive everyday sound ontologies.
The frameworks for the description of everyday sounds available in the literature differ greatly in terms of their methodology and complexity (Guastavino, 2018). In the simplest cases, sound classifications are obtained by grouping sounds according to some settled attribute (e.g., voice vs. tool sounds, Lewis et al., 2005; urban vs. rural soundscape contexts, De Coensel et al., 2003; a classification is a set of groups that distinguishes different kinds of things). In other more complex cases, taxonomies are provided in which sounds are arranged in nested, cumulative hierarchies, extending to certain depths (Gaver, 1993; a taxonomy typically follows a hierarchical organization based on a shared relation, e.g., is-a/subclass of). Finally, in a few cases, ontologies are proposed that describe attributes and properties of sounds as well as the connection between them in an any-to-any direction and in a machine-readable and machine-interpretable format (Cherny et al., 2016; an ontology focuses on the description of entities and their relations, and on making a computational representation of this knowledge accessible to automated reasoning). Here, we focused on taxonomies and ontologies with the goal of systematically surveying the state of the art in the characterization of sound semantics from both a quantitative and qualitative perspective.
Taxonomies and ontologies served different purposes in different research fields. In auditory cognition, they have aided research on the factors driving human perception of complex sounds (e.g., Gygi et al., 2007; Lindborg, 2016). Here, theoretically-driven sound taxonomies have been proposed based on psychoacoustic dimensions (pitch, loudness, timbre) or on the physical properties of sound-generating objects (e.g., solid, liquid, gas phases) and related physical processes (e.g., impacts, explosions, flows; Gaver, 1993, p. 1). Alternatively, sound taxonomies have been derived within a data-driven framework from the results of sound perception experiments involving various behavioral tasks (Gygi et al., 2007). Within the domain of soundscape research (Southworth, 1969; Schafer, 1977; Guastavino, 2018), taxonomies have been developed to describe and characterize complex acoustic scenes in different contexts (e.g., urban, nature) and locations (e.g., restaurant, household), their perception (Lindborg, 2016), and their impact on quality-of-life metrics such as noise pollution. In the field of sound design, description schemes are used to structure the metadata of large sound databases and for automated audio (i.e., recorded, transmitted or reproduced sound) retrieval, for example, to select sound effects for the digital content industry (e.g., film and video game industries) (Cherny et al., 2016). Machine learning and artificial hearing are other relevant research fields in which sound description schemes have been developed. This field is rapidly expanding, also thanks to recent breakthroughs and techniques such as deep neural networks (DNN), which achieved near-human performance for different sound recognition tasks (Hershey et al., 2017). In this research field, sound description schemes are developed to organize labels for training DNNs and other supervised algorithms and to assess the algorithms' performance.
Finally, taxonomies have been developed to describe in great detail specific classes of sound sources, e.g., in music theory (i.e., the discipline encompassing the methods and concepts used to compose music, where music is defined as the "vocal or instrumental sounds-or both-combined in such a way as to produce beauty of form, harmony, and expression of emotion"-Oxford Languages dictionary) to systematically classify musical instruments based on the mechanisms of sound production (von Hornbostel and Sachs, 1961) or in medicine to distinguish physiological from pathological sounds and assist in the diagnostic procedure (Prystupa, 2013).
The remainder of the article describes the work we conducted: (i) to identify publications reporting on taxonomies or ontologies of sounds; (ii) to map each manuscript/scheme to a specific research field; (iii) to quantify metrics of these knowledge organization systems (e.g., total number of classes/levels) and (iv) to label each class in the systems according to their sound descriptors type (e.g., acoustics; sound quality, sound-source semantics, or other). We then briefly discuss representative publications in each research field, with the goal of highlighting their distinctive aspects. Finally, we discuss our findings critically and suggest general guidelines for the future development of comprehensive sound ontologies. The data analyzed in this survey, as well as an implementation of each of the identified taxonomies and ontologies using the Web Ontology Language (OWL), are made available in a public repository (see Data and code availability, below).

Search and selection strategy
From October 2020 to January 2021, we used two approaches to retrieve existing taxonomies and ontologies of sounds: (a) Two systematic searches on the databases Google Scholar and Web of Science were performed on December 8th 2020. The following terms were included in the search: (i) "Audio"; "Sounds"; (ii) "Taxonomy"; "Ontology." Groups (i) and (ii) were separated by the Boolean term "AND," and the terms within (i) and (ii) were separated by the Boolean term "OR."

(b) References in key publications about sound taxonomies and ontologies.
Records were eligible for the survey if the full specification of the taxonomy/ontology was available either as a computer file (TTL, JSON, OWL), or in the manuscript. Records were excluded if the sound description scheme was a simple classification (i.e., a set of terms), or if the scheme did not provide all class headings (e.g., a dendrogram with subordinate class headings only). Taxonomies specific to music (e.g., music genres, Pachet and Cazaly, 2017; musical instruments, von Hornbostel and Sachs, 1961) were not considered, to keep the review focused on natural sounds and avoid results heavily biased toward the music domain. Note that this criterion did not completely eliminate music-related descriptors from our analyses, because some of the selected taxonomies did include music genre and musical instrument classes (e.g., Gemmeke et al., 2017). All retrieved records were combined and duplicates were removed, first using the R package dplyr (Hadley et al., 2022) and then by manually removing additional duplicates that dplyr did not detect. Records were selected by authors RdMA and BLG, first based on a screening of the abstracts, and second based on the full text. The list and characteristics of the records selected for this review are reported in Table 1.
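The automated part of the duplicate removal can be illustrated with a short sketch. The analysis itself was run with dplyr in R; the Python below is only a rough analogue, assuming that duplicates are matched on a normalised title. The record fields (`title`, `source`) and the example records are hypothetical.

```python
import re

def normalise_title(title: str) -> str:
    """Lower-case a title and collapse punctuation/whitespace so near-identical strings match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def drop_duplicates(records):
    """Keep the first record seen for each normalised title."""
    seen = set()
    unique = []
    for record in records:
        key = normalise_title(record["title"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Hypothetical records, for illustration only.
records = [
    {"title": "A taxonomy of urban sounds", "source": "Google Scholar"},
    {"title": "A Taxonomy of Urban Sounds.", "source": "Web of Science"},
    {"title": "An ontology for audio features", "source": "Web of Science"},
]
unique_records = drop_duplicates(records)
```

Matching on titles alone will miss duplicates whose titles differ between databases, which is consistent with the need for the additional manual pass described above.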

Information extraction and sound descriptor analysis
For every study included in the review, the following information was extracted: first author, title, year of publication, research field, and reasoning (i.e., data-driven or theory-driven). We then computed the total number of classes in each taxonomy/ontology. The subordinate levels of some taxonomies (Bones et al., 2018, p. 3) included repeated classes that referred to different sound exemplars of the same class presented in the behavioral experiments (e.g., "dog1," "dog2," etc.). These exemplar-level repetitions were excluded from the analyses (e.g., only "dog1" was kept).
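The exclusion of exemplar-level repetitions can be sketched as follows. This is a minimal illustration, assuming that exemplars of the same class differ only by a trailing index; the function name and the example list are hypothetical.

```python
import re

def drop_exemplar_repeats(class_names):
    """Keep only the first exemplar of each class, e.g. "dog1" out of "dog1", "dog2"."""
    kept = {}
    for name in class_names:
        base = re.sub(r"\s*\d+$", "", name)  # strip a trailing exemplar index
        kept.setdefault(base, name)          # the first exemplar seen is retained
    return list(kept.values())

# Hypothetical subordinate level of a taxonomy.
subordinate = ["dog1", "dog2", "dog3", "water 1", "water 2", "crowd"]
classes = drop_exemplar_repeats(subordinate)
n_classes = len(classes)
```

A class whose name legitimately ends in a digit would be merged by this rule, so the output of such a step would still need to be checked by hand.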
We then carried out an analysis of the semantic domains covered by the class definitions by assigning taxonomy/ontology class descriptors to the following categories (classification carried out by authors RdMA and BLG; see Figure 1): (a) Source: describing the attributes of the sound-generating objects and events, and comprising the following attribute classes: (a.i) what: denoting the vibrating object and substance that generate the sound (e.g., shoes); (a.ii) how: denoting the actions or mechanisms of sound generation (e.g., walking); (a.iii) who: denoting the sound-generating agent (e.g., person); (a.iv) where: denoting the spatial context (e.g., in a train station); (a.v) when: denoting the temporal context (e.g., in the morning; during a holiday).
Each of the above six descriptor categories had a generic sub-category comprising the category name (e.g., "signal") and a series of descriptors that could not differentiate among class exemplars (e.g., "natural sound," "auditory scene," "soundscape" for the sound source category; "perception," "experience" for the category of sensation descriptors).
Each class definition could contain multiple sound descriptions (e.g., multiple class-defining exemplars as "crumpling, crushing" in Gaver, 1993) and each sound description could contain multiple words (e.g., "train station" in Salamon et al., 2014). With the notable exception of several generic descriptors defined by multiple words (e.g., "sound source," "audio event"), the sound-descriptor classification was in general carried out on a word-by-word basis, at the level of single verbs, nouns, adjectives and adverbs in the class definitions. Each word could potentially be assigned to multiple descriptor categories (e.g., "earthquake" was classified as both what and how), with any arising ambiguities resolved within the context of each taxonomy/ontology (e.g., "pop" in "pop music" was classified as a music genre, whereas "pop" in the class definition "burst, pop" was classified as how/onomatopoeia).

Systematic search
The systematic search in Google Scholar and Web of Science yielded a total of 1,000 and 419 hits, respectively. After screening the abstracts, 49 records were selected from the results of Google Scholar and 53 records were selected from Web of Science, resulting in 88 unique records after removal of duplicate records. After full-text screening, 15 records were included for review. Additionally, 14 records were identified through references in key publications about sound taxonomies and ontologies, for a total of 29 records through the systematic search (see Figure 2, for a flow chart of the inclusion of records and Table 1 for the list and characteristics of the selected records).
The vast majority of the words (3,478, 94.41%) were assigned to one single sound descriptor class (see Figure 3). The remaining 206 words (5.59%) were instead assigned to more than one descriptor class. Most of these 206 "plurivalent" word classifications were onomatopoeias (183, 88.84%) that denoted in the majority of the cases the properties of a sound-generating action (e.g., "crack," "woosh"; how/onomatopoeia ambivalence; N = 167, 81.07%). The rest of the plurivalent word classifications (23, 16.02%) included words classified as belonging to multiple subclasses of the sound-source category.

References | Title (ID) | Research field | Type | Reasoning | N classes
Gaver (1993) | What in the world do we hear? An ecological approach to auditory event perception | Auditory cognition | Taxonomy | Theoretical | 56
Guastavino (2007) Miranda et al. (2000) Categorizing
The number (N) of classes is summed across categorization schemes from the same publication. ID, identification label for multiple categorization schemes, part of the same resource.

FIGURE
Co-occurrence of sound-descriptor word-by-word classifications across all considered knowledge organization systems. Off-diagonal values indicate words classified simultaneously in more than one descriptor class, whereas diagonal values indicate word classifications attributed to one single class. The subscript g denotes the generic descriptor classes (e.g., "sound source," "audio signal"); non-aud, non-auditory; onom, onomatopoeia.

We finally carried out a dictionary-overlap analysis to quantify the contribution of each research field to the verbal description of natural sounds in each of the semantic categories of the hierarchy of natural sound descriptors (Figure 6). Machine hearing systems included the largest number of natural sound descriptors for the entire dictionary (N unique words = 1,563), for the majority of the main classes of sound descriptors (source, signal, onomatopoeia, music, and other, N unique words = 1,130, 256, 99, 80, and 61, respectively), and for the majority of the sound descriptor subclasses (what, how, who, and non-auditory subclasses, N unique words = 444, 451, 112, and 30, respectively). This largely reflected the fact that machine hearing systems comprised, overall, the largest dictionary of sound descriptors (N unique words = 913; N unique words for systems in auditory cognition, soundscape research, sound design, and medicine = 268, 465, 396, and 50, respectively). More interestingly, systems developed within the field of sound design contained the largest dictionary of sensory descriptions (N unique words = 89) and, more specifically, for the description of the auditory attributes of the heard sounds (N unique words = 55), whereas systems developed within the field of soundscape research included the largest dictionaries for the description of the spatial and temporal sound source context (where and when, N unique words = 136 and 20, respectively).
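The kind of counting behind a dictionary-overlap analysis can be sketched with plain sets. This is not the analysis code of the survey: the exact overlap measure is not specified in the text, and the word lists below are invented for illustration.

```python
def dictionary_sizes(words_by_field):
    """Number of unique descriptor words found in the systems of each research field."""
    return {field: len(set(words)) for field, words in words_by_field.items()}

def shared_words(words_by_field, field_a, field_b):
    """Descriptor words that occur in the dictionaries of both fields."""
    return set(words_by_field[field_a]) & set(words_by_field[field_b])

# Hypothetical "where" descriptors per research field.
where_words = {
    "soundscape research": ["restaurant", "park", "street", "street", "station"],
    "machine hearing": ["street", "station", "kitchen"],
    "sound design": ["kitchen"],
}
sizes = dictionary_sizes(where_words)
overlap = shared_words(where_words, "soundscape research", "machine hearing")
```

Repeating the count for each descriptor category and each field gives a table of unique-word counts of the same form as the values reported above.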

Auditory cognition
The theoretical framework introduced by Gaver in the early 1990s has been very influential for subsequent auditory cognitive research on real-world sound perception (Gaver, 1993). Gaver considered listening to everyday sounds from the perspective of its ecological goal, i.e., source inference, rather than from a merely perceptual perspective. He produced a taxonomy of everyday sounds based on audible source attributes, linked to the physical properties of the sources and to the physical mechanisms of sound generation, rather than on common auditory perceptual dimensions. The main categorization in Gaver's taxonomy thus entails the material of the sound-producing sources: vibrating (solid) objects, aerodynamic sounds and liquid sounds. The subsequent hierarchical tree refers to simple material-specific physical mechanisms (e.g., impact and scraping for solids, explosions for gases, and dripping and splashing for liquids) and to more complex and hybrid events that combine different materials and distinct physical mechanisms (e.g., rain on a surface, drip in a container). The taxonomy is rather limited (it has a total of 56 classes), but it has the merit of highlighting the relevance of physical mechanisms (how) in the description of everyday sounds. Whereas Gaver's work was purely theoretical, records labeled in our survey as behavioral research included mostly taxonomies derived through experimental studies where participants are asked to listen to multiple sounds and to group them according to their perceptual similarity. In these studies, authors aim to understand how humans perceive and categorize sounds at different levels of semantic abstraction. In a recent study, Bones et al. (2018) derived five different taxonomies from their behavioral experiments, conducted through a web application. Participants were asked to listen to different sounds and sort them into five arbitrary groups, which they could name as they wished.
For each of the five experiments, the authors derived taxonomies by hierarchical cluster analysis of participants' responses. The "top level" of their taxonomies, the "soundscape" taxonomy, represents sounds described in the soundscape literature, including multiple indoor and outdoor environments. The taxonomy has three classes on its superordinate level: "man-made," "nature" and "people." The class "people" has the subclasses "voices" and "music." "Nature" has the subclasses "animals" and "weather," and "man-made" has the subclasses "industrial" and "household." The subordinate level includes the audio samples used in the behavioral task (e.g., "children," "crowd" as subclasses of "voices" or "water 1" or "thunder 1" as subclasses of "weather"). This taxonomy has in total 50 classes. From this top-level "soundscape" taxonomy, two categories were selected for further exploration in a new experiment: "man-made" and "nature." Two middle-level taxonomies were derived from each experiment. The "nature" taxonomy has three classes in the superordinate level: "animals," "water" and "nature" and includes 21 classes (after cleaning repeated classes in the subordinate level). The "man-made" taxonomy has two subordinate classes: "home" and "outside" and includes 28 classes. The article also includes two additional taxonomies, addressing selected sound classes ("dogs" and "engines") generated through hierarchical clustering of participants' responses. A similar specific level of sound categorization was addressed by Guyot et al. (2017), who described an experiment to understand how humans categorize liquid sounds, through a sorting task performed by 30 participants. The participants had to create classes of liquid sounds according to their physical similarities. By means of agglomerative cluster analysis, the results were subsequently grouped in a taxonomy describing continuous and discrete interactions of solids and liquids.
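Sorting data of this kind are usually converted into a pairwise dissimilarity matrix before clustering. The sketch below shows one common choice, assumed here for illustration and not taken from the reviewed studies: the dissimilarity of two sounds is one minus the fraction of participants who placed them in the same group. The sound names and groupings are invented.

```python
from itertools import combinations

def cooccurrence_dissimilarity(sortings):
    """Map each pair of sounds to 1 - (fraction of participants who grouped them together).

    `sortings` holds one entry per participant; each entry is a list of groups (sets of sounds).
    Every participant is assumed to have sorted every sound.
    """
    sounds = sorted({s for groups in sortings for group in groups for s in group})
    together = {pair: 0 for pair in combinations(sounds, 2)}
    for groups in sortings:
        for group in groups:
            for pair in combinations(sorted(group), 2):
                together[pair] += 1
    n_participants = len(sortings)
    return {pair: 1 - count / n_participants for pair, count in together.items()}

# Two hypothetical participants sorting four sounds.
sortings = [
    [{"dog", "cat"}, {"rain", "thunder"}],
    [{"dog", "cat", "rain"}, {"thunder"}],
]
dissimilarity = cooccurrence_dissimilarity(sortings)
```

A matrix built from these values could then be passed to an agglomerative clustering routine to obtain the dendrogram from which the taxonomy levels are read.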

Soundscape research
FIGURE
Example of ontological characterization of a "walking on wooden floor" sound event in an ontology that combines descriptions at the sound source and signal levels.

Brown and colleagues defined a taxonomy for soundscape studies reflecting mostly the outdoor, urban sonic environment (Brown et al., 2011). With a total of 47 classes, this taxonomy was built with the goal of generating a common framework for soundscape research and of promoting interoperability between different research fields. The taxonomy originated from a working group conducted in 2009 ("assessment of soundscape quality"), where participants from multiple disciplines were involved, such as acoustics, engineering, planning, architecture, design, park management, psychology, sound quality, sociology and geography. The classes of this taxonomy were based on two criteria: (I) they can be applied to any acoustic environment, and (II) the way the classes were named aimed to avoid judgements of value or double connotations. The "acoustic environment" class is located at the superordinate level, and is parent of the classes "indoor" and "outdoor" acoustic environments. The taxonomy is focused on the class "urban acoustic environment" and, at its subordinate level, it includes classes such as "rail traffic," "roadway traffic," "marine traffic" and "air traffic," all under the same parent class "motorized transport." Other examples of subordinate classes are "bells," "clock chimes," "fireworks," "azan" and "alarms," all subclasses of the parent class "social/communal." Lindborg (2016) provides a recent example of a soundscape taxonomy aimed at describing the sounds of a specific environment, in this case restaurants. This taxonomy could assist in developing tests for soundscape research (e.g., regarding pleasantness), or in developing sound simulations of restaurants (Lindborg, 2016). Moreover, restaurant owners could use these findings to tailor the soundscapes in their restaurant and improve customer experience. This taxonomy is data-driven, as it is based on data collected in on-site experiments, using free-form annotations and a questionnaire targeting customer behavior and environment perception at restaurants. Data from 40 different restaurants were collected, with a total of 393 subjects participating in the study. The taxonomy was built following cladistics (from the Greek "klados", branch). Cladistics considers a taxon as the unit that can be associated with the same unit at a higher level. This posed a challenge for our modeling, as a class cannot be parent and child at the same time in ontologies. We therefore created an extra class, indicating that it is the same class connected to the repeatable parent (e.g., laughter -> laughter.laughter instead of laughter -> laughter). We converted Lindborg's taxonomy into an ontology file (.OWL) based on the pictorial representation reported in the manuscript, and estimated 58 classes.
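The renaming rule used for repeated parent and child names can be sketched as follows. This is a simplified illustration operating on a list of (parent, child) pairs rather than on an OWL file; apart from "laughter", the class names are hypothetical.

```python
def disambiguate_repeated_children(edges):
    """Rename a child that carries its parent's name to "<parent>.<child>"."""
    fixed = []
    for parent, child in edges:
        if parent == child:
            child = f"{parent}.{child}"  # e.g. laughter -> laughter.laughter
        fixed.append((parent, child))
    return fixed

# Hypothetical excerpt of a cladistic tree as (parent, child) pairs.
edges = [("voice", "laughter"), ("laughter", "laughter"), ("voice", "speech")]
fixed_edges = disambiguate_repeated_children(edges)
```

After this step no class is listed as its own child, so the pairs can be written out as ordinary subclass relations.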

Machine learning and artificial hearing
Mitrovic et al. (2010) performed an extensive review of the literature on automated content-based audio retrieval, with the purpose of summarizing and organizing the (lower level) audio features that have been used. Audio features can be defined as a digital form of information representation retrieved from the original audio signal. For instance, low-level features incorporate parameters like fundamental frequency, bandwidth or mel-frequency cepstral coefficients (MFCCs). This taxonomy was tailored to assist researchers in selecting (groups of) features appropriate to the specific audio retrieval task and classifies the audio features based on their extraction process and on the domain to which the audio features belong. In total, this taxonomy has 87 classes. The description of each of these classes is beyond the scope of the present manuscript, and thus just a short overview is provided here. At the higher level of the hierarchy, the following classes are proposed: temporal domain, frequency domain, cepstral domain, modulation frequency domain, eigen domain and phase space. Commonly used features are included in the temporal and spectral domain. Subclasses of the temporal domain are amplitude, power and zero crossing. The subclasses of the frequency domain are distinguished in physical and perceptual subclasses. The physical subclasses include features such as adaptive time-frequency decomposition, autoregression coefficients and short-time Fourier transformation. The subclasses of the perceptual class are related predominantly to psychoacoustic features, such as brightness, chroma, harmonicity, loudness, pitch and tonality. In the subordinate level of the latter are the subclasses group delay, spectral flux, spectral peaks, spectral slope and subband energy ratio.
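To make the notion of a low-level temporal-domain feature concrete, the sketch below computes two textbook quantities, the zero-crossing rate and the mean power of a frame of samples. The definitions are the common ones and are assumed here for illustration; the exact formulations catalogued by Mitrovic et al. (2010) may differ.

```python
def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ (zero counts as positive)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def mean_power(frame):
    """Average of the squared sample values in the frame."""
    return sum(x * x for x in frame) / len(frame)

# A toy frame that changes sign at every sample, and a constant frame.
alternating = [1.0, -1.0, 1.0, -1.0]
constant = [0.5, 0.5, 0.5, 0.5]
zcr_alternating = zero_crossing_rate(alternating)
zcr_constant = zero_crossing_rate(constant)
power_alternating = mean_power(alternating)
```

Both features are computed directly from the waveform, which is what places them in the temporal domain of the taxonomy, in contrast with features that first require a spectral or cepstral transform.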
Whereas the taxonomy by Mitrovic et al. (2010) is mostly concerned with low-level acoustic features, the AudioSet ontology, published by the Google Research group in 2017 (Gemmeke et al., 2017), organizes real-world sounds based on their high-level (semantic) relations (see Figure 7). AudioSet consists of a hierarchical taxonomy, which is openly accessible, and a dataset of more than 2 million 10 s sound clips drawn from YouTube videos. The project is aimed at assisting the development of sound classification models. The full ontology has 632 classes. Subsets of 527 and 523 classes were then used as labels for training the VGGish and YAMNet DNN models (Hershey et al., 2017), respectively. In AudioSet, sound classes are subdivided across the following superordinate levels: Human sounds, Animal sounds, Natural sounds, Music, Sounds of things, Source-ambiguous sounds, and "Channel, environment and background." In the subordinate levels, classes like biting, cacophonia or raindrops can be found. Whereas the large majority of classes have a single parent, a few classes have multiple parents. Classes are linked to URLs from WordNet or Wikipedia. Furthermore, individual instances were added to the model containing the links to the YouTube videos. For instance, biting is a subordinate class, and it contains an rdfs:comment with a WordNet URL. It contains the following instances: Biting1 to Biting6. Each of these Biting instances has a URL of a YouTube video with biting sounds.
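Classes with more than one parent can be found by inverting the parent-to-child links. The sketch below assumes that each class is stored as a record with `id`, `name` and `child_ids` fields, in the style of the ontology file distributed with AudioSet; the ids and the links in the toy data are invented and do not reproduce the real ontology.

```python
from collections import defaultdict

def parents_by_class(entries):
    """Invert the child_ids lists: map each class id to the ids of its parents."""
    parents = defaultdict(list)
    for entry in entries:
        for child_id in entry["child_ids"]:
            parents[child_id].append(entry["id"])
    return parents

def multi_parent_classes(entries):
    """Names of the classes that are listed as a child of more than one class."""
    names = {entry["id"]: entry["name"] for entry in entries}
    parents = parents_by_class(entries)
    return sorted(names[cid] for cid, pids in parents.items() if len(pids) > 1)

# Toy records; ids and links are hypothetical.
ontology = [
    {"id": "c1", "name": "Human sounds", "child_ids": ["c3", "c4"]},
    {"id": "c2", "name": "Animal sounds", "child_ids": ["c4"]},
    {"id": "c3", "name": "Biting", "child_ids": []},
    {"id": "c4", "name": "Hiss", "child_ids": []},
]
multi = multi_parent_classes(ontology)
```

Because of such multi-parent classes the structure is a directed acyclic graph rather than a strict tree, which matters when counting classes per superordinate level.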

Sound design
Despite being a relatively new field, sound design ranked third for number of records. Cherny et al. (2016) built an ontology to optimize searches in very large audio sample databases. Typically, sound designers do this by querying audio sample management software, which relies on limited, unstructured metadata and generally returns multiple results to be inspected. As that is time-consuming, the authors engineered an ontology that enables structured queries through semantic classes, for instance, queries that can find a specific sound described by the onomatopoeia "whoosh." To develop the ontology, the authors restructured existing textual metadata of sound sample databases, analyzed various knowledge sources and examined the needs of sound designers through personal interviews. This ontology has 283 classes and deals with both basic metadata (file and library names, etc.) and semantic content (topics). Among these topics, the authors distinguished .

Medicine
Two medical sound taxonomies were identified through the systematic search, one for abdominal sounds (Dimoulas et al., 2011), and another one for lung sounds (Boucher et al., 2013). In both cases, the authors developed the taxonomies to assist diagnosis and to educate medical staff and students in distinguishing physiological from pathological sounds. The Abdominal Sounds Pattern Analysis (ASPA) Schema is a hierarchical taxonomy to classify abdominal sounds, while accounting for interfering noises, such as respiratory related sounds, ambient noises and interfering heart sounds (Dimoulas et al., 2011). In total, this taxonomy has 13 classes. Boucher and colleagues published a taxonomy of breath sounds and lung auscultation based on their literature review. In their taxonomy, "lung sounds" are subdivided in physiological and pathological sounds, which are further categorized in several subclasses. Boucher's taxonomy has in total 12 classes.

General considerations
In this article, we surveyed published taxonomies/ontologies of everyday sounds. Through a systematic search in two databases, we identified 36 taxonomies/ontologies originating from 29 publications, which could be mapped to five different research fields: auditory cognition (n = 11, 30.5%), soundscape research (n = 5, 13.8%), artificial hearing (n = 8, 22.2%), sound design (n = 5, 13.8%), and medicine (n = 3, 8.3%). Approximately half (n = 17, 47%) of the identified taxonomies/ontologies were developed as theoretical frameworks for the characterization of sounds, while the others (n = 19, 53%) were data-driven, i.e., they were derived through the analysis of behavioral data or the analysis of properties of sounds in databases. The extension of the taxonomies/ontologies, as indicated by the number of classes, differed largely and ranged from a few classes to n = 633 classes, included in Audioset (Google), the largest taxonomy in our search. Furthermore, the level of precision differed greatly between schemes and, in some cases, also within the same scheme for different sound classes.
To evaluate the identified taxonomies/ontologies within a homogeneous framework, we introduced a set of descriptors (Source, Sensation, Signal, Onomatopoeia, Music, Other), which we then used to label all extracted classes (n = 2,319). The resulting quantitative analysis showed that none of the taxonomies/ontologies covered these different descriptors at the same level of detail. The large majority of taxonomies/ontologies were developed to characterize higher-level semantic relations between sound sources. This suggests that causal listening (Schafer, 1977, or everyday listening as defined by Gaver, 1993) is a default listening mode for natural sounds. More precisely, it shows that the most informative way to describe natural sounds verbally focuses on the properties of the sound source, rather than on sensory or acoustic attributes, or an onomatopoeic dictionary. Examination of the sound descriptors that further specify the source properties indicated that the super-ordinate organization of animate (who) and inanimate (what) objects was based on commonalities either of the mechanisms of sound generation (how) or of the semantic/environmental context (what/where). The first case is most frequent for taxonomies/ontologies developed in auditory cognitive research (e.g., Gaver, 1993), whereas the latter case is most frequent in soundscape research. Audioset follows a taxonomic organization based on the semantic/environmental context for some classes, and on the sound generation mechanisms for other classes. Other taxonomies/ontologies, especially in the research field of artificial hearing and sound design, were concerned with the description of sounds at signal level (e.g., Mitrovic et al., 2010). In these cases, most of the classes described low-level acoustic features, which are directly computable from the sound waveforms, while other classes referred to perceptual auditory dimensions (e.g., pitch, loudness). 
However, in these taxonomies/ontologies, no formal attempt was made to capture simultaneously all these aspects of the semantics of everyday sounds, or to organize systematically the relation between different semantic aspects such as acoustic-sensory features and higher-level semantic properties (e.g., should acoustic-sensory features define the organization of sound exemplars of specific sound sources, or should they constitute an organization scheme orthogonal to that denoting the sound source?). As revealed by our analysis of the overlap of the sound description dictionaries in different fields, a potential route to the development of such a comprehensive ontology of natural sounds would be to integrate knowledge organization systems across the fields of machine learning (what and how descriptions), sound design (sensory and auditory descriptions) and soundscape research (where and what descriptions).

Open challenges and contribution to future research
The clear separation between taxonomies/ontologies that cover, at least partially, the high-level semantic properties of sound sources on one side and those covering the sound acoustics on the other side, highlights the lack of a comprehensive ontology covering simultaneously multiple aspects of the sound semantics. A new ontology should describe each sound in terms of a set of properties (as axioms using the Web Ontology Language-OWL), such that an automated OWL-reasoner would be able to check sound description consistency and derive taxonomies automatically. For instance, such an ontology could represent the sound "walking on a wooden floor" as a "patterned sequence (signal) of impact sounds generated by the interaction between two solids (how: mechanism): the human feet (who/what: agent, agent part) and the wooden floor (who/what: object)." With these structured descriptions, an OWL-reasoner automatically infers that "walking on a floor" and "tapping fingers on a table, " for example, are both impact sounds (Figure 8).
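The kind of inference described here can be imitated in a few lines without an OWL reasoner. In the sketch below, which is a plain-Python stand-in and not an OWL implementation, each sound is a set of (property, value) pairs and each class is defined by the pairs it requires; a sound is assigned to every class whose requirements it satisfies. All property names, values and class names are hypothetical.

```python
# Sound descriptions as sets of (property, value) pairs.
SOUNDS = {
    "walking on a wooden floor": {
        ("signal", "patterned sequence"),
        ("mechanism", "impact"),
        ("interaction", "solid-solid"),
        ("agent part", "human feet"),
        ("object", "wooden floor"),
    },
    "tapping fingers on a table": {
        ("signal", "patterned sequence"),
        ("mechanism", "impact"),
        ("interaction", "solid-solid"),
        ("agent part", "human fingers"),
        ("object", "table"),
    },
    "dripping tap": {
        ("mechanism", "drip"),
        ("interaction", "liquid-solid"),
    },
}

# Each class lists the property values a sound must have to belong to it.
CLASS_DEFINITIONS = {
    "impact sound": {("mechanism", "impact")},
    "patterned impact sound": {("mechanism", "impact"), ("signal", "patterned sequence")},
    "liquid sound": {("interaction", "liquid-solid")},
}

def infer_classes(description):
    """Return every class whose required properties are all present in the description."""
    return sorted(
        name for name, required in CLASS_DEFINITIONS.items() if required <= description
    )

inferred = {sound: infer_classes(props) for sound, props in SOUNDS.items()}
```

In an OWL ontology the class definitions would be expressed as axioms and the subset test would be replaced by the reasoner's classification, which can additionally detect inconsistent descriptions.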
In this new ontology, the descriptors we introduced could be modeled as sound properties and could be elaborated further. The current set of who/what words, as derived from the reviewed studies (see Table 1), is exhaustive and covers a wide range of everyday living and non-living sound-generating objects. However, the set of who/what classes can be easily extended by linking the ontology to existing digital resources, which already include detailed taxonomies of everyday objects (e.g., WordNet, Wikidata). We expect that this approach will result in a collection of objects large enough to cover the majority of applications. Additionally, other object taxonomies developed for specific fields (e.g., musical instruments) or applications (e.g., medical sounds) can be easily incorporated or linked to. The actual required precision of the classes (i.e., whether it is acceptable that, e.g., "seagull" == "bird") will then depend on the specific use case and on concrete aspects, such as the number of labeled sounds in the database for machine hearing applications, or the specific research question for behavioral studies. If needed, classes can be hierarchically propagated to the higher levels to match the desired requirements.
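As a minimal sketch of such hierarchical propagation, the following Python function replaces classes that have too few labeled sounds with their parent class, repeating until every remaining class is either frequent enough or has no parent. The function name `coarsen`, the `parent` mapping and the `min_count` threshold are assumptions made for illustration; the toy hierarchy is not taken from any of the surveyed taxonomies.

```python
from collections import Counter


def coarsen(labels, parent, min_count):
    """Propagate rare class labels up a hierarchy.

    labels    : list of class names, one per labeled sound
    parent    : dict mapping a class to its direct superclass (acyclic)
    min_count : minimum number of sounds a class needs to be kept as-is
    """
    labels = list(labels)
    while True:
        counts = Counter(labels)
        # Classes that are too rare and can still be moved one level up.
        rare = {c for c, n in counts.items() if n < min_count and c in parent}
        if not rare:
            return labels
        labels = [parent[c] if c in rare else c for c in labels]


# Toy hierarchy (illustrative only).
parent = {"seagull": "bird", "sparrow": "bird", "bird": "animal", "dog": "animal"}
labels = ["seagull", "sparrow", "sparrow", "dog", "dog", "dog"]

print(coarsen(labels, parent, min_count=3))
```

Here "seagull" and "sparrow" are merged into "bird" because neither reaches the threshold on its own, whereas "dog" is kept. A different criterion (e.g., one driven by a behavioral research question) would only require changing the condition that defines the rare classes.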
An important challenge for the new ontology will be to express the "how" words (typically action verbs and abstract nouns) in terms of elementary sound generation mechanisms, such as those indicated by Gaver (1993) and subsequent psychoacoustic research (e.g., Lemaitre and Heller, 2013). In the example above, "walk" is broken down into a sequence of impacts of solid objects. Performing a similar attribution of properties for all other "how" words will introduce a relevant dimension of sound description, which is immediately captured neither at the signal/acoustic level nor at the level of (linguistic) semantic relations between the sources. To the best of our knowledge, there is no available resource that categorizes verbs and nouns based on the underlying sound-generating mechanisms; thus, modeling this property will likely require expert knowledge.
Finally, additional properties could be included in the ontology to describe contextual information, such as that conveyed by where/when words. Typically, this information refers either to a physical (e.g., "restaurant") or relative/generic location (e.g., "nearby") and to temporal contexts (e.g., "at night"). An ontology of everyday sounds developed along these lines would capture the multifaceted nature of sound semantics and could be used in several research domains and practical applications, from systems organizing sounds within highly specific domains (such as those encountered in medicine) to the development of multi-task artificial hearing systems capable of captioning sound scenes in terms of the objects involved and of the pleasantness experienced by the listener. Furthermore, considering the recent trends and direction in AI research, such a resource could be useful to create neurosymbolic systems for sound analysis (Hitzler and Sarker, 2022). Importantly, using novel ontology embedding methods (e.g., Chen et al., 2021), numerical representations of sounds and auditory scenes can be derived that encode the semantics described in the graph and logical structure of the ontology and in its lexical descriptions. In turn, these numerical sound representations can be used to test and compare hypotheses in model-based analysis of empirical behavioral and neuroimaging data (e.g., Elliott et al., 2013; Giordano et al., 2013; Norman-Haignere et al., 2015; De Angelis et al., 2018).
To favor the integration of the information we gathered in this survey in future developments, we are making available a structured, machine-readable implementation of all taxonomies/ontologies, as well as the entire set of extracted classes (see Data availability statement, below). Only a few of the publications provided a digital version of the taxonomy/ontology and, in most cases, taxonomies/ontologies could only be derived from figures in the article. We reasoned that the availability of, and easy access to, digital versions of the ontologies may help future developments, facilitate their comparison, and enable intercommunication and integration across research fields. We thus converted each identified taxonomy/ontology to OWL, which is readable through any text editor and can be edited using ontology editors (e.g., Protégé).

Data availability statement
Data and OWL implementations from this study are available at https://doi.org/10.6084/m9.figshare.20813626. Further inquiries can be directed to the corresponding author.