Interactive extraction of diverse vocal units from a planar embedding without the need for prior sound segmentation

Annotating and proofreading data sets of complex natural behaviors such as vocalizations are tedious tasks because instances of a given behavior need to be correctly segmented from background noise and must be classified with a minimal false-positive error rate. Low-dimensional embeddings have proven very useful for this task because they can provide a visual overview of a data set in which distinct behaviors appear in different clusters. However, low-dimensional embeddings introduce errors because they fail to preserve distances, and because they represent only objects of fixed dimensionality, which conflicts with vocalizations whose dimensions vary with their durations. To mitigate these issues, we introduce a semi-supervised, analytical method for simultaneous segmentation and clustering of vocalizations. We define a given vocalization type by specifying pairs of high-density regions in the embedding plane of sound spectrograms, one region associated with vocalization onsets and the other with offsets. We demonstrate our two-neighborhood (2N) extraction method on the task of clustering adult zebra finch vocalizations embedded with UMAP. We show that 2N extraction allows the identification of short and long vocal renditions from continuous data streams without initially committing to a particular segmentation of the data. 2N extraction also achieves a much lower false-positive error rate than comparable approaches based on a single defining region. Along with our method, we present a graphical user interface (GUI) for visualizing and annotating data.


Prerequisites
The GUI and auxiliary code were written in MATLAB R2019b and require the Image Processing Toolbox and the Curve Fitting Toolbox. The UMAP embedding requires the algorithm's MATLAB implementation from the MATLAB File Exchange.
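As a quick sanity check before running the examples, you can verify that the required toolboxes are licensed (a sketch; the folder name for the File Exchange UMAP package is an assumption and depends on your installation):

```matlab
% Sketch: verify that the required toolboxes are available.
% license('test', ...) returns 1 if the feature is licensed, 0 otherwise.
assert(license('test', 'Image_Toolbox') == 1, ...
    'Image Processing Toolbox is required');
assert(license('test', 'Curve_Fitting_Toolbox') == 1, ...
    'Curve Fitting Toolbox is required');
% The UMAP implementation from the File Exchange must also be on the path, e.g.:
% addpath(genpath('umapFileExchange'))  % hypothetical folder name
```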

Practical User's Guide
To extract vocalizations from the example data, we recommend the following workflow:

1. Execute example_g17y2.m, which loads the data and opens two figures (Figure 1):
• The GUI (MATLAB Figure 32), which visualizes the planar embedding
• The context figure (MATLAB Figure 24), which visualizes spectrogram snippets in their temporal context

2. In the GUI (Figure 1), you will see 't=1', which indicates the onset time slice. Press the right arrow to increase the time slice; you will see the blobs move around. Press the left arrow to decrease the time slice. Negative numbers indicate time slices relative to the offset.
3. Mark both the onset and the offset of each vocalization type by ctrl + left mouse click, once on a positive and once on a negative time slice of that vocalization. Half of the dots associated with that vocalization will turn pink. Choose your slices such that the blob is well separated from other vocalizations. The true onset and offset are always back-calculated to 't=1' and 't=-1', respectively. The disk radius should be maximized so as not to miss any snippets outside the dense area, and the pixel threshold theta should be minimized to enlarge the blob. Conversely, the blob should not be so large that it overlaps with confounding vocalizations. Press the up and down arrows to change the disk radius and '-' and '=' to control theta.
Additional notes:
• Press the spacebar to toggle the visibility of the blobs and white dots.
• Press shift + up arrow and shift + '+' to skip plotting (this speeds up shaping of the blob).
• Press 0 to go back to the zero time slice.
• Press shift + right arrow or shift + left arrow to display the context of the current slice in the context figure (Figure 24).

4. When you have clicked both an onset and an offset blob associated with a vocalization, press 'w'. This writes the chosen onsets and offsets and the intermediate points to temporary variables. All purple dots will turn black.
5. If you have clicked the wrong blob or made some other mistake, press 'c' to undo all ctrl-clicks since the last press of 'w'. If you pressed 'w' by mistake, press shift + c to start again from scratch (step 3).
6. When all syllables are thus marked using ctrl-click and 'w', press 'q' to finalize the extraction and close the GUI.

In the context figure (Figure 24), press the up and down arrows to switch between vocalizations and use the mouse scroll wheel to browse through the elements. Pressing 'h' lists more keypress functions that control the figure.

Terminology
Snippet A snippet is a very short window (64 ms) of the spectrogram of a sound interval. Because subsequent snippets overlap, they form trajectories in the embedding, and snippets of similar vocalizations are scattered along similar trajectories.
Disk A disk is the circular kernel around a snippet. Increasing the disk radius increases the amount of overlap between nearby snippets and thus the chances of creating and enlarging a blob.
Theta The threshold theta defines how much disk overlap is required for creating a blob. Changing theta will increase or decrease the blobs' sizes.
Blob A blob is an area of overlapping disks in which the number of overlaps exceeds theta. Only snippets in the chosen time slice are considered for the blob.
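The interplay of the disk radius and theta can be sketched as follows (an illustrative toy example, not the GUI's actual code; the coordinates and variable names are made up):

```matlab
% Sketch: blob membership from overlapping disks (illustrative only).
% xy: n-by-2 embedding coordinates of the snippets in the chosen time slice.
xy = [0 0; 0.5 0; 1 0; 5 5];    % toy coordinates (assumption)
r = 1;                          % disk radius
theta = 2;                      % minimum number of covering disks

% Distance from every snippet to every disk center (implicit expansion, R2016b+):
D = hypot(xy(:,1) - xy(:,1)', xy(:,2) - xy(:,2)');
coverCount = sum(D < r, 2);     % how many disks cover each snippet's location
inBlob = coverCount > theta;    % snippets lying inside a blob
```

Enlarging r or lowering theta grows the blob (more snippets qualify); the isolated point at (5, 5) never joins, mirroring how a well-chosen blob excludes confounding vocalizations.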

Time slice The time slice reflects the position of a snippet within the sound interval. One can count forward, starting from the onset, or backward, starting from the offset.
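The signed time-slice convention can be read as follows (an illustrative sketch; 1-based indexing is an assumption):

```matlab
% Sketch: mapping a signed time slice t to a snippet index
% within a sound interval of n consecutive snippets.
n = 8;                    % snippets in one interval (toy value)
t =  3;  fwd = t;         % t = 3  -> third snippet after the onset
t = -2;  bwd = n + t + 1; % t = -2 -> second snippet before the offset
% t = 1 is the onset snippet; t = -1 is the offset snippet (index n).
```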
Cluster A cluster is a selection of similar sound events.