HEAR4Health: a blueprint for making computer audition a staple of modern healthcare

Recent years have seen a rapid increase in digital medicine research in an attempt to transform traditional healthcare systems into their modern, intelligent, and versatile equivalents that are adequately equipped to tackle contemporary challenges. This has led to a wave of applications that utilise AI technologies; first and foremost in the field of medical imaging, but also in the use of wearables and other intelligent sensors. In comparison, computer audition can be seen to be lagging behind, at least in terms of commercial interest. Yet, audition has long been a staple assistant for medical practitioners, with the stethoscope being the quintessential sign of doctors around the world. Transforming this traditional technology with the use of AI entails a set of unique challenges. We categorise the advances needed in four key pillars: Hear, corresponding to the cornerstone technologies needed to analyse auditory signals in real-life conditions; Earlier, for the advances needed in computational and data efficiency; Attentively, for accounting for individual differences and handling the longitudinal nature of medical data; and, finally, Responsibly, for ensuring compliance with the ethical standards accorded to the field of medicine. Thus, we provide an overview and perspective of HEAR4Health: the sketch of a modern, ubiquitous sensing system that can bring computer audition on par with other AI technologies in the drive for improved healthcare systems.


Introduction
Following the rapid advancements in artificial intelligence (AI), and in particular those related to deep learning (DL) [1], digital health applications making use of those technologies are accordingly on the rise. Most of them are focused on diagnosis: from computer vision techniques applied to digital imaging [2] to wearable devices monitoring a variety of signals [3,4], AI tools are being increasingly used to provide medical practitioners with a more comprehensive view of their patients. Computer audition complements this assortment of tools by providing access to the audio generated by a patient's body. Most often, this corresponds to speech produced by the patients (sometimes natural, mostly prompted) [5,6,7]. However, there exists a plethora of auditory signals emanating from the human body, all of which are potential carriers of information relating to disease.
Acquiring those auditory biosignals is the first, crucial step in a computer audition pipeline. Oftentimes, this must be done in noisy environments where audio engineers have little to no control, e. g., in a busy hospital room or the patient's home. This results in noisy, uncurated signals which must be pre-processed in order to become usable, a process which is extremely laborious if done manually. Automating this process becomes the domain of the first of four outlined pillars, (I) Hear, which is responsible for denoising, segmenting, and altogether preparing the data for further processing by the downstream algorithms.
Those algorithms typically comprise learnable components, i. e., functions whose parameters are going to be learnt from the consumed data; in the current generation of computer audition systems, the backbone of those algorithms consists of DL models. These models, in turn, are typically very data 'hungry', and require an enormous amount of computational resources and experimentation to train successfully. However, in the case of healthcare applications, such data might not be available, either due to privacy regulations which prohibit their open circulation, or, as in the case of rare or novel diseases, simply because the data has not yet been collected. Yet doctors, and subsequently the tools they use, are commonly required to operate in such low-data regimes. Therefore, it is imperative to make these algorithms operational (II) Earlier than what is currently possible; this can be done, for example, by transferring knowledge from domains where data is widely available to data-sparse healthcare applications.
The first two pillars are of a more 'engineering' nature; the third one requires more theoretical advances. Statistical learning theory, which forms the foundation of DL, is based on the core assumption that data are independent and identically distributed [8]. In the healthcare domain, this translates to the assumption that the population of training patients is representative of the entire population, an assumption that often does not hold in practice. Instead, patients come from different backgrounds, and are typically organised in sub-populations. Oftentimes, the level of analysis reaches all the way down to the individual; in this case, every patient is considered 'unique'. Furthermore, the larger upside of using AI in medicine lies in providing more fine-grained information in the form of longitudinal observations. Handling the need for individualised analysis with multiple observations over time requires algorithms to operate (III) Attentively to individual, often changing, needs.
The last pillar corresponds to the translation of the mandate enshrined in the Hippocratic oath to computer audition, and more generally to AI: any technology must be developed, and must operate, (IV) Responsibly. The responsibility lies with the developers and users of the technology and is targeted towards the patients who become its objects. This informs a set of guidelines and their accompanying technological innovations on how data needs to be sourced, how algorithms must meet certain fairness requirements, and, ultimately, on 'doing good' for mankind.
Finally, we would be remiss not to mention the potential applications that can benefit from the introduction of computer audition in healthcare. This becomes the central component which permeates all aspects of the four pillars: they exist insofar as they serve the overarching goal of providing medical practitioners with novel tools that can help them understand, analyse, diagnose, and monitor their patients' Health.
An overview of the four pillars, as well as their interactions with one another, is shown in Figure 1. In the following sections, we proceed to analyse each one in more detail, and close with an overview of the healthcare applications in which we expect computer audition to make a decisive difference. Thus, we present HEAR4Health: a blueprint of what needs to be done for audition to assume its rightful place in the toolkit of AI technologies that are rapidly revolutionising healthcare systems around the world.

Hear
A cornerstone of computer audition applications for healthcare is the ability to Hear: that is, the set of steps required to capture and pre-process audio waves and transform them into a clear, useful, and high-quality signal. This is all the more true in the healthcare domain, where recordings are often made in hospital rooms bustling with activity or conducted at home by the non-expert users themselves. Therefore, the first fundamental step in an application is to extract only the necessary components of a waveform.
In general, this falls under a category of problems commonly referred to as source separation and diarisation [9,10]: the separation part corresponds to the extraction of a signal coming from a particular source amongst a mixture of potentially overlapping sources, whereas diarisation corresponds to the identification of the temporal start and end times of components assigned to specific subjects. In healthcare applications, these target components are the relevant sounds; this can include vocalisations (both verbal and non-verbal) but also other bodily sounds that can be captured by specialised auditory sensors attached to the body, or general ones that monitor the environment. These sounds need to be separated from all other sources; these may include a medical practitioner's own body sounds (e. g., their voice in doctor-patient conversations) or background environmental noise (e. g., babble noise in a hospital). Accordingly, successful preparation entails a) the ability to recognise which sounds belong to the target subject, and b) the ability to isolate those sounds in time and from overlapping sources. Traditionally, these steps are tackled by specialised pipelines, which include learnable components that are optimised in supervised fashion [10]. For example, the ability to recognise which sounds belong to the target subject is generally referred to as speaker identification [11]. While this term is usually reserved for applications where speech is the sound of interest, it can also be generalised to other bodily sounds [12].
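To ground the diarisation output described above (temporal start and end times of target activity), the sketch below segments a waveform with a simple frame-energy threshold. This is a deliberately naive stand-in: the function name and threshold are our own, and practical systems rely on learned identification and diarisation models rather than fixed thresholds.

```python
import numpy as np

def energy_segments(signal, sr, frame_ms=20, threshold=0.01):
    """Return (start_s, end_s) tuples for runs of frames whose RMS energy
    exceeds `threshold`: a toy stand-in for diarisation."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    active = [
        np.sqrt(np.mean(signal[i * frame_len:(i + 1) * frame_len] ** 2)) > threshold
        for i in range(n_frames)
    ]
    segments, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i                                   # segment opens
        elif not a and start is not None:
            segments.append((start * frame_len / sr, i * frame_len / sr))
            start = None                                # segment closes
    if start is not None:                               # still active at the end
        segments.append((start * frame_len / sr, n_frames * frame_len / sr))
    return segments

# Synthetic check: 1 s of silence, 1 s of tone, 1 s of silence at 16 kHz.
sr = 16000
sig = np.concatenate([np.zeros(sr),
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr),
                      np.zeros(sr)])
print(energy_segments(sig, sr))  # -> [(1.0, 2.0)]
```

The same frame-and-threshold skeleton carries over when the activity detector is replaced by a trained classifier per subject.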
Similarly, separation is typically done in a supervised way [10]. During the training phase, clean audio signals are mixed with different noises, and a network is trained to predict the original, clean signal from the noisy mixture. As generalisability to new types of noise sources is a necessary pre-requisite, researchers often experiment with test-time adaptation methods, which adaptively configure a separation model to a particular source [13].
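The mixing step of this recipe can be sketched in a few lines: scale the noise so that the mixture attains a prescribed signal-to-noise ratio (SNR), then keep the clean signal as the training target. A minimal illustration under our own naming, not a complete training pipeline:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio of the mixture
    equals `snr_db`, and return (mixture, clean) as one training pair."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise, clean

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s tone, standing in for speech
noise = rng.standard_normal(16000)                          # white noise, standing in for babble
mixture, target = mix_at_snr(clean, noise, snr_db=5.0)

# The achieved SNR matches the requested one by construction.
achieved = 10 * np.log10(np.mean(target ** 2) / np.mean((mixture - target) ** 2))
print(round(achieved, 2))  # -> 5.0
```

Sweeping `snr_db` over a range during training is what gives the separation model its robustness to varying noise levels.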
The crucial role of the Hear pillar becomes evident when considering data collection. There are three main data collection paradigms employed in healthcare applications: a) the (semi-)structured doctor-patient interview, b) ecological momentary assessments (EMAs) based on prompts [14], and c) passive, continual monitoring [15]. All of them require very robust patient identification and diarisation capabilities.

Earlier
The major promise of digital health applications is their ubiquitous presence, allowing for a much more fine-grained monitoring of patients than was possible in the past. This requires the systems to work on mobile devices in an energy-efficient way. Additionally, these systems must be versatile, and easy to update in the case of new diseases, such as COVID-19. This requires them to generalise well while being trained on very scarce data. However, training state-of-the-art DL models is a non-trivial process, in many cases requiring weeks or even months, and is furthermore notoriously data intensive. Moreover, the technology required, such as high-end GPUs, is often expensive and has exceptionally high energy consumption [16].
There have consequently been increasing efforts to develop AutoML approaches that optimise a large network until it is executable on a low-resource device [17,18]. Many of these approaches focus on reducing the memory footprint and the computational complexity of a network while preserving its accuracy. These techniques have shown promise across a range of different learning tasks; however, their potential has not yet been realised for audio-based digital health applications.
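As one concrete ingredient of such footprint reduction, post-training quantisation stores a network's float32 weights as 8-bit integers plus a scale factor, cutting memory four-fold at a bounded rounding cost. The following is a generic sketch of symmetric linear quantisation (names and shapes are illustrative), not a full AutoML pipeline:

```python
import numpy as np

def quantise_int8(w):
    """Symmetric linear quantisation of a float tensor to int8, returning
    the quantised weights and the scale needed to recover them."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(42)
w = rng.standard_normal((256, 256)).astype(np.float32)  # a stand-in weight matrix
q, scale = quantise_int8(w)
err = np.max(np.abs(dequantise(q, scale) - w))

print(q.nbytes, w.nbytes)  # 65536 vs 262144 bytes: a 4x memory reduction
print(err <= scale / 2)    # rounding error is bounded by half a quantisation step
```

Complementary techniques such as pruning and knowledge distillation reduce the computational cost as well as the memory footprint.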
On the issue of data efficiency, there has been a lot of research on utilising transfer learning techniques for increasing performance and decreasing the required amount of data. This is usually done by transferring knowledge from other tasks [19,20], or even other modalities [21,22]. However, in the case of audio in particular, an extra challenge is presented by the mismatch between the pre-training and downstream domains [23]. Recently, large models pre-trained in self-supervised fashion have reached exceptional performance on a variety of different downstream tasks, including the modelling of respiratory diseases [24], while showing more desirable robustness and fairness properties [25].
The implementation details of the Earlier pillar largely depend on the biomarkers related to the specific medical condition of interest. For example, in terms of mental disorders, which mostly manifest as pathologies of speech and language, it is mostly tied to generalisation across different languages. On the one hand, linguistic content itself is a crucial biomarker; on the other hand, it serves to constrain the function of acoustic features; thus, there is a need to learn multi-lingual representations that translate well to low-resource languages. For diseases manifesting in sounds other than speech signals, the Earlier pillar would then improve the data efficiency of their categorisation. For example, contrary to speech signals, for which large, pre-trained models are readily available [26], there is a lack of similar models trained on cough data; a lack partially attributable to the dearth of available data. This can be overcome, on the one hand, through the use of semi-supervised methods that crawl data from public sources [27], and, on the other hand, by pursuing (deep) representation learning methods tailored to cough sound characteristics.
When COVID-19 took the world by storm in early 2020, it represented a new, previously unseen threat for which no data was available. However, COVID-19 is 'merely' caused by a coronavirus targeting the upper and lower respiratory tracts, and thus shares common characteristics with other diseases in the same family [28]. Transferring prior knowledge from those diseases, while rapidly adapting to the individual characteristics of COVID-19, can be a crucial factor when deploying auditory screening tools in the face of a pandemic.

Attentively
Most contemporary digital health applications focus on the identification of subject states in a static setting, where it is assumed that subjects belong to a certain category or have an attribute in a certain range. However, many conditions have symptoms that manifest gradually [29], which makes their detection and monitoring over time a key proposition for future digital health applications. Furthermore, disease emergence and progression over time can vary between individuals [30,31,32]. For example, the age at onset and the progression rate of age-related cognitive decline varies between individuals [30], while there is substantial heterogeneity in the manifestation and development of (chronic) cough across different patients [33]. Focusing on these aspects of digital health by adapting to changes in distributions and developing personalised approaches can drastically improve performance.
Recent deep neural network (DNN)-based methods for personalised machine learning (ML) [34] and speaker adaptation [35] already pave the way for creating individualised models for different patients. However, these methods are still in their nascent stage in healthcare [36]. Personalised ML is a paradigm which attempts to jointly learn from data coming from several individuals while accounting for differences between them. Advancing this paradigm for speech in digital health, by utilising longitudinal data from several patients to learn to track changes in vocal and overall behaviour over time, is a necessary precondition for the digital health systems of the future. This means that time-dependent, individualised distributions are taken into account for each patient, thereby requiring the development of novel techniques better suited to the nature of this problem; in particular, developing versatile DL architectures consisting of global components that jointly learn from all subjects, and specialised ones which adapt to particular patients [37,38]. This novel framing will also enable faster adaptation to new patients by introducing and adapting new models for those patients alone.
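A minimal sketch of this global-plus-personal decomposition, with linear models standing in for the DL components and synthetic data of our own construction:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, dim = 4, 8

# Synthetic longitudinal data: one shared trend plus a small per-patient offset.
w_true_global = rng.standard_normal(dim)
w_true_local = 0.3 * rng.standard_normal((n_patients, dim))
X = rng.standard_normal((n_patients, 100, dim))
y = np.einsum('pnd,pd->pn', X, w_true_global + w_true_local)

# Jointly learn a global backbone and lightweight per-patient heads.
w_g = np.zeros(dim)
w_l = np.zeros((n_patients, dim))
lr = 0.05
for _ in range(500):
    for p in range(n_patients):
        residual = X[p] @ (w_g + w_l[p]) - y[p]
        grad = X[p].T @ residual / len(residual)
        w_g -= lr * grad      # the shared component learns from every patient
        w_l[p] -= lr * grad   # the personal component learns from this patient only

mse = np.mean([(X[p] @ (w_g + w_l[p]) - y[p]) ** 2 for p in range(n_patients)])
print(mse)  # close to zero: each patient is fitted by backbone + personal head
```

Adapting to a new patient then amounts to fitting only a fresh personal head while keeping the shared backbone fixed.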

Responsibly
The development of responsible digital health technology is a key pillar of future healthcare applications. This ensures trustworthiness and encourages the adherence of users to monitoring protocols. Consequently, addressing crucial factors and technology-related consequences in automated disease detection concerning human subjects in a real-world context is of paramount importance. This pillar intersects with all previous ones and informs their design, adhering to an 'ethical-by-design' principle which is fundamental for healthcare applications. Naturally, a first requirement that applies to all pillars is one of evaluation: all components of a healthcare application need to be comprehensively evaluated with respect to all sub-populations and sensitive attributes. This holds true for all components of a computer audition system: from extracting the target audio signal (Hear) to generating efficient representations (Earlier) and adapting to individual characteristics (Attentively), any developed methods should perform equally for different sub-populations. The evaluation could be complemented by explainability methods, which explicitly search for biases in model decisions [39].

Aside from comprehensively evaluating all methods with respect to fairness, explicit steps must be taken to improve on those [40]. To this end, adversarial [41] and constraint-based methods [42] have been proposed to learn fair representations. In adversarial debiasing, the main predictive network learns to perform its task while an adversary pushes it toward representations which obfuscate the protected characteristics. Constraint-based methods instead solve the main prediction task subject to fairness constraints (such as equality of opportunity); these methods rely on convex relaxation or game-theoretic optimisation to efficiently optimise the constrained loss function.
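To make the constraint-based route concrete, the toy sketch below trains a logistic classifier with a soft penalty on the squared demographic-parity gap (the difference in mean predicted scores between two groups), a common relaxation of a hard fairness constraint. The data, names, and penalty weight are illustrative choices of our own:

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim = 400, 5
X = rng.standard_normal((n, dim))
group = rng.integers(0, 2, n)              # protected attribute (0 or 1)
X[:, 0] += 1.5 * group                     # the first feature leaks the group...
y = (X[:, 0] + 0.5 * rng.standard_normal(n) > 0.75).astype(float)  # ...and drives the label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(lam, steps=3000, lr=0.05):
    """Logistic regression minimising loss + lam * gap**2, where gap is
    the demographic-parity gap between the two groups."""
    w = np.zeros(dim)
    in0, in1 = group == 0, group == 1
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / n                     # plain logistic-loss gradient
        gap = p[in0].mean() - p[in1].mean()
        s = p * (1 - p)                              # sigmoid derivative
        dgap = X[in0].T @ s[in0] / in0.sum() - X[in1].T @ s[in1] / in1.sum()
        w -= lr * (grad + lam * 2 * gap * dgap)      # penalised gradient step
    p = sigmoid(X @ w)
    return abs(p[in0].mean() - p[in1].mean())

print(train(lam=0.0), train(lam=20.0))  # the penalty shrinks the score gap
```

The penalty weight trades accuracy against fairness; the game-theoretic methods cited above automate this trade-off with formal guarantees.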
The second requirement placed on the three other pillars is privacy. For example, the Hear pillar could be co-opted to remove private information (e. g., by using keyword spotting to remove sensitive linguistic information). The Earlier pillar would then take the extracted signal and remove any paralinguistic information unrelated to the task; this could be achieved by targeted voice conversion that preserves any required signal characteristics but changes the patient's voice to be unrecognisable [43].
Satisfying this requirement, however, is particularly challenging for the Attentively pillar, as there is a natural privacy-personalisation trade-off: the more private information is removed, the less context remains to be utilised for the target patient. The main solution to this obstacle is the use of federated learning [44]: to ensure that sensitive information cannot be derived from central models, differential privacy methods have been proposed, such as differentially private stochastic gradient descent [45] and a private aggregation of teacher ensembles [46]. These methods would update the global model backbone discussed under the Attentively pillar, which is shared among all 'clients', while any personalised components would remain local, and thus under the protection of safety mechanisms implemented by the client institutions.
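A schematic sketch of this arrangement: each client clips and noises its update on-device before the server averages it into the shared backbone, while personalised components never leave the client. Placeholder gradients stand in for real local training, and a formal (ε, δ) guarantee would additionally require per-example clipping and calibrated noise accounting:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_clients = 16, 5

global_w = np.zeros(dim)                                   # shared backbone (server-side)
personal_w = [0.1 * rng.standard_normal(dim)               # personal heads: never transmitted
              for _ in range(n_clients)]

def dp_sanitise(update, clip_norm=1.0, noise_mult=0.5):
    """Clip an update to `clip_norm` and add Gaussian noise before it
    leaves the device: the two core ingredients of DP-style aggregation."""
    update = update * min(1.0, clip_norm / np.linalg.norm(update))
    return update + rng.normal(0.0, noise_mult * clip_norm, update.shape)

for _ in range(3):                                         # federated rounds
    client_updates = []
    for c in range(n_clients):
        local_grad = rng.standard_normal(dim)              # placeholder for a real on-device gradient
        client_updates.append(dp_sanitise(local_grad))     # sanitised before transmission
    global_w -= 0.1 * np.mean(client_updates, axis=0)      # federated averaging on the server

print(global_w.shape)  # (16,); personal_w stayed on the clients throughout
```

Only the sanitised backbone updates cross the network boundary, which is what lets personalisation coexist with the privacy requirement.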

Healthcare Applications
Naturally, any advances in computer audition targeted towards healthcare applications are inextricably tied to the specific medical conditions that lend themselves to modelling via audio; the necessary pre-requisite is that these conditions manifest themselves, at least to some extent, in auditory biomarkers emanating from the patients' bodies. Historically, a significantly higher emphasis has been placed on vocalisations compared to other body acoustics such as heart sounds [5]. Accordingly, this choice has shaped most of the existing approaches and, thus, also becomes the central point of our review. Figure 2 shows the main ICD-11 categories on which previous research has focused, ordered clockwise according to their order in the ICD-11 manual. In the following sections, we proceed to analyse each of those categories, presenting prior computer audition works that have focused on specific diseases, and discussing the impact that our HEAR4Health framework can have on them.

Infectious or parasitic diseases
This broad category covers several communicable diseases, from bacterial and gastrointestinal infections to sexually transmitted diseases and viral infections, the majority of which do not manifest in auditory biomarkers; the ones that do, however, present several auditory symptoms, such as (persistent) coughing or a sore throat. The ones predominantly appearing in the computer audition literature are: (respiratory) tuberculosis (1B10) [47,48,49,50]; pertussis (1C12) [51,52]; and influenza (1E) [53]. Existing works have predominantly focused on detecting and analysing coughs; in particular, the onset of DL and the increase in available data have unveiled the potential of detecting coughs and subsequently categorising them as pathological or not.

Sleep-wake disorders
Research in sleep-wake disorders has typically targeted breathing disorders, mainly apnoeas [71,72], while some research has focused on the detection of the resulting sleepiness [73]. Apnoeas, on the one hand, mostly manifest as very loud snoring, which is caused by a prolonged obstruction of the airways and subsequent 'explosive' inspirations. These signals can be automatically detected and analysed using auditory ML systems [74]. Daytime sleepiness, on the other hand, has been mostly studied as a speech and language disorder; it manifests in lower speaking rates and irregular phonation [75].

Diseases of the circulatory system
Auscultation has been a mainstay of the medical examination since the invention of the stethoscope by René Laennec in 1816, by now a trademark of medical practitioners around the world [84]. It is particularly useful when listening to the sounds of the heart or the lungs of a patient. Accordingly, its digital equivalent can be immensely useful in detecting pathologies of the circulatory system, such as arrhythmias or congenital heart diseases. Analysing those signals has become the topic of multiple PhysioNet challenges [85], and was also featured in the 2018 edition of the ComParE series [86], with computer audition systems being developed to detect and classify abnormal events ('murmurs') in phonocardiograms [87,88].

Developmental anomalies
Developmental disorders, such as the Angelman syndrome (LD90.0), Rett syndrome (LD90.4), and fragile X syndrome (LD55), manifest in divergent vocalisation and speech development patterns from an early age [65,98,99,100]. Infants with specific developmental disorders produce abnormal cooing sounds and fewer person-directed vocalisations, and their vocalisations are found to be of lower complexity compared to typically developing infants. From a signal perspective, these anomalies manifest in speech, first in pre-linguistic sounds and later on in the linguistic vocalisations of young children. As the emphasis is on children and young adults, these populations present an additional challenge to data collection, on the one hand due to ethical and privacy reasons, and on the other due to a potentially reduced compliance of children with recording requirements.
HEAR4Health: A blueprint for future auditory digital health

Unifying the four pillars results in a working digital health system which we name HEAR. Our system can be used to supplement the decision-making of practitioners across a wide range of diseases. In general, we anticipate two distinct functioning modes for it. On the one hand, it can be used as a general-purpose screening tool to monitor healthy individuals and provide early warning signs of a potential disease. This hearing with 'all ears open' mode takes a holistic approach, and emphasises a wide coverage of symptoms and diseases, thus functioning as an early alarm system that triggers a follow-up investigation. Following that, it can be utilised to monitor the state of patients after they have been diagnosed with a disease, or for measuring the effect of an intervention. This second, more constrained setting necessitates a 'human-in-the-loop' paradigm, where the doctor isolates a narrower set of biomarkers for the system to monitor (now with more focus and prior information about the patient's state), which is then reported back for each new follow-up.
In either case, there are stringent requirements for reliability and explainability that can only be satisfied with the use of prior knowledge, attention to the individual, and an adherence to ethical principles. Ultimately, it is user trust that is the deciding factor behind the adoption of a transformative technology. The use of computer audition in healthcare applications is currently in its nascent stages, with a vast potential for improvement. Our blueprint, HEAR4Health, incorporates the necessary design principles and pragmatic considerations that need to be accounted for by the next wave of research advances to turn audition into a cornerstone of future, digitised healthcare systems.

Figure 1: Overview of the four pillars for computer audition in healthcare.

Figure 2: ICD-11 categories that are a focal point for computer audition research.