Auditory chaos classification in real-world environments

Background & motivation: Household chaos is an established risk factor for child development. However, current methods for measuring household chaos rely on parent surveys, so existing research cannot disentangle potentially dynamic, bidirectional relations between high-chaos environments and child behavior problems.

Proposed approach: We train and make publicly available a classifier that provides objective, high-resolution predictions of household chaos from real-world child-worn audio recordings. To do so, we collect and annotate a novel dataset of ground-truth auditory chaos labels compiled from over 411 h of daylong recordings collected via audio recorders worn by N=22 infants in their homes. We leverage an existing sound event classifier to identify candidate high-chaos segments, increasing annotation efficiency 8.32× relative to random sampling.

Result: Our best-performing model successfully classifies four levels of real-world household auditory chaos with a macro F1 score of 0.701 (Precision: 0.705, Recall: 0.702) and a weighted F1 score of 0.679 (Precision: 0.685, Recall: 0.680).

Significance: In future work, high-resolution objective chaos predictions from our model can be leveraged for basic science and intervention, including testing theorized mechanisms by which chaos affects children's cognition and behavior. Additionally, to facilitate further model development, we make publicly available the first and largest balanced annotated audio dataset of real-world household chaos.

Utility sounds. Low-volume dishwasher, washing machine, microwave hum, fan noise. Includes sounds similar to white noise machines, sleep machines, wave machines (need to be calming or consistent/monotone sounds) irrespective of their volume, as they are typically placed close to the infant wearing the LENA sensor. Includes water spraying sounds from a spray bottle, water draining, trickling, dripping, pouring.
Outdoor sounds. Footstep sounds, stroller on pavements or roads, walking on gravel, wind blowing or light breeze, low hum when inside the car.
Impact/Source-ambiguous/Other sounds. Less than or about 2.5 seconds of continuous, non-jarring, everyday impact sounds or fewer than 4 distinct impact sounds. Includes impact sounds caused by another child playing quietly in the background. Includes sounds, irrespective of volume, from infant clothes rustling, uncovering plastic wrappers, phone vibrating, feet shuffling, soft tapping or patting, or sounds caused by moving the LENA sensor.

Medium Chaos (2)
Segments that contain stimulating sounds are included in this category.
Respiratory sounds. Really loud snoring, coughing loudly and repeatedly, sneezing repeatedly. Blowing raspberries on a surface.
Adult speech. Two non-contingent overlapping speeches or one overlapping conversation (can be one in the background and one in the foreground). Includes harsh, high-pitched or raised voices, commanding tones (no shouting), loud in-person singing. Includes adult crying or sudden abrupt loud vocal sounds, e.g., grunts or groans.
Child speech. Isolated infant and child attention-grabbing vocalizations or distress, hiccups between cries. Includes infant laughing or happy loud sound (not squeals), more affective or expressive tones and sounds than neutral sounds. High-pitched or annoying child speech or singing or vocalizations. Includes a child playing (has to be rough, running, a bit stimulating/chaotic).
Electronic speech. Low-volume to normal speech, sounds, singing from TV, radio, phone, speakers. Has to be "recorded" voice through an electronic device.
Music. Stimulating children's music from a toy or electronic device. Includes loud, disjointed, isolated, non-continuous toy sounds or noises. Includes stimulating instrumental music, e.g., pop music, continuous bell sounds, rattle, low-volume opera or choir singing.
Utility sounds. Vacuuming or blow drying, elevator or microwave dings, microwave door opening and closing, dishes clanking. Includes more than 2.5 seconds of running water or shower or faucet or dishwasher spray or garbage disposal sounds (has to be loud).
Animal sounds. Isolated, non-continuous or distant dog barking. Birds chirping and cat meowing.
Outdoor sounds. Vehicles passing by when outside (not loud). Car seat belt notification sound. Distant police siren, emergency vehicle or horn. Low to medium volume/distant public crowded areas (e.g., restaurants, grocery stores, playgrounds, etc.). Low to medium volume chatter in the background (not attention-grabbing, loud or chaotic).
Impact/Source-ambiguous/Other sounds. More than 2.5 seconds of continuous loud impact sounds or 4 or more distinct loud impact sounds. Includes snapping, knocking, clapping or loud infant patting, phone notification sounds. Construction sounds or vehicle sounds heard from inside the house (similar to being muffled).

High Chaos (3)
Segments that contain highly stimulating or scary/jarring sounds are included in this category.
Adult speech. Adults arguing loudly or shouting. Multiple non-contingent, overlapping conversations similar to chatter.
Child speech. High-volume infant or child crying/wailing or tantrumming. Includes screaming, shrieking, squealing (high-pitched), foreground shouting.
Electronic sounds. High-volume/loud TV, radio, speakers, phone (does not include phone calls unless people are shouting/screaming/arguing). Must include speech (clearly audible words) and other sources of sounds, e.g., exciting music, loud bass/beats, loud instruments, applause, orchestra, choir, opera, death metal music, laughter, shouting, cheering, commotion, gun shots, war cries or multiple overlapping speech.
Utility sounds. Blender or similar.
Animal sounds. Dog barking loudly and repeatedly, loud crow cawing nearby or other loud bird noises.
Outdoor sounds. Cars honking, close-by traffic or vehicle sounds if you are outside, police or emergency vehicle siren. Public environments like restaurants, schools, playgrounds, public swimming pools, etc. Loud construction sounds, lawn mower.
Impact/Source-ambiguous/Other sounds. Fire or smoke alarms. Includes loud scary-sounding/jarring impact sounds or loud abrupt sounds or crashes. Can be one sound or multiple.

COMPARING MODELS USING PAIRED T-TEST
To identify the best-performing model, we conducted paired t-tests comparing each pair of models on the Annotated dataset, the Filtered set and the Unfiltered set. Average participant-specific weighted F1 score, precision and recall, along with their standard deviations, for every model on each test set are summarized in Table S1.
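As a rough illustration of the comparison described above, the sketch below computes a paired t statistic, its degrees of freedom and a two-sided p-value from two lists of per-participant scores. It is a minimal standard-library sketch, not the analysis code used for the reported results: the function names `paired_t` and `two_sided_p` and the example scores are hypothetical, and the p-value is obtained by simple numerical integration of the Student t density. In practice a library routine such as `scipy.stats.ttest_rel` would normally be used. The sign of t depends only on the order of subtraction.

```python
import math
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """Return (t, df) for a paired t-test on per-participant scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    # One difference per participant (model A minus model B)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # t = mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson integration of the Student t density."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return coef * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * area)


# Hypothetical weighted F1 scores for four participants (not from the paper)
model_a = [0.70, 0.65, 0.72, 0.68]
model_b = [0.60, 0.58, 0.66, 0.59]
t_stat, dof = paired_t(model_a, model_b)
p_value = two_sided_p(t_stat, dof)
print(f"t({dof}) = {t_stat:.3f}, p = {p_value:.4f}")
```

Because the test is paired, each participant contributes one difference, so a comparison across 22 participants has 21 degrees of freedom, matching the t(21) values reported for the Annotated dataset.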

Model Performance on Annotated Dataset
When the models were tested on the 22 participants in the Annotated dataset in a LOPO-CV fashion, the CNN model achieved the highest average weighted F1 score, followed by RF-53f. However, when the weighted F1 scores were compared using paired t-tests, these two models were not statistically distinguishable from one another (t(21) = −0.576, p = 0.571). The baseline model, RF-3f, performed significantly worse than the CNN (t(21) = −8.06, p < .001) and the RF-53f (t(21) = −9.77, p < .001) in terms of average weighted F1 score.

Model Performance on Filtered Set
When the models were tested on all 21 participants in the Filtered set in a LOPO-CV fashion, the pattern of results mirrored the results on the Annotated dataset. The CNN had the highest average weighted F1 score, but it was not statistically distinguishable from that of the RF-53f model (t(20) = −0.377, p = 0.710). Both models performed substantially and significantly better than the baseline model, RF-3f (CNN: t(20) = −6.11, p < .001; RF-53f: t(20) = −7.24, p < .001), in terms of average weighted F1 score.

Model Performance on Unfiltered Set
When the models were tested on all 6 participants in the Unfiltered set in a LOPO-CV fashion, the CNN had the highest average weighted F1 score, which was statistically similar to that of the RF-53f model (t(5) = −0.193, p = 0.854). The baseline model, RF-3f, had the lowest average weighted F1 score, which was substantially and significantly worse than that of the CNN (t(5) = −7.52, p < .001) and the RF-53f model (t(5) = −9.21, p < .001).

Note: Models were trained using 40 hours of balanced data across four levels of auditory chaos randomly sampled from the Annotated dataset and tested using LOPO-CV on their respective test sets. Participant-specific weighted F1 score, precision and recall were computed for each fold of LOPO-CV, and average weighted metrics were calculated by averaging across the results of all 22 participants. Results for each analysis are separated using emphasis lines. Model performances in bold represent the highest average weighted F1 score achieved across all five models for that particular test set.
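The per-participant averaging described in the note can be sketched as follows. This is an illustrative standard-library sketch under stated assumptions, not the evaluation code behind Table S1: the helper names `weighted_f1` and `average_over_participants`, the participant IDs and the labels are all hypothetical, and chaos levels are assumed to be coded as integers. Each LOPO-CV fold is taken to yield the true and predicted labels for one held-out participant.

```python
from collections import Counter
from statistics import mean


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true segments per class
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class by its share of the true labels
        score += (count / total) * f1
    return score


def average_over_participants(folds):
    """folds maps a held-out participant ID to (y_true, y_pred) for that fold."""
    per_participant = {pid: weighted_f1(t, p) for pid, (t, p) in folds.items()}
    return mean(per_participant.values()), per_participant


# Hypothetical folds: two held-out participants with made-up labels
folds = {
    "P01": ([0, 0, 1, 1], [0, 1, 1, 1]),
    "P02": ([2, 3, 3, 2], [2, 3, 3, 2]),
}
average_f1, per_participant_f1 = average_over_participants(folds)
print(per_participant_f1, average_f1)
```

Averaging participant-level scores, rather than pooling all segments, gives each participant equal weight regardless of how many segments they contributed. It also yields the paired per-participant scores that a paired t-test requires.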

Table S1. Participant-specific weighted model performance for our three models on different test sets.

Table S2. Results of our training data ablation study: effect of exponentially decreasing amounts of balanced training data on model performance. All models were trained and tested using LOPO-CV on the Annotated dataset across all 22 participants.