- 1Department of Electrical and Computer Engineering, University of Washington, Seattle, WA, United States
- 2Department of Applied Mathematics, University of Washington, Seattle, WA, United States
Introduction: Quantifying natural behavior from video recordings is a key component of ethological studies. Markerless pose estimation methods have provided an important step toward that goal by automatically inferring kinematic body keypoints. Such methodologies call for efficient organization and interpretation of keypoint sequences into behavioral categories. Existing approaches for behavioral interpretation often overlook the importance of representative samples in learning behavioral classifiers. Consequently, they either require extensive human annotations to train a classifier or rely on a limited set of annotations, resulting in suboptimal performance.
Methods: In this work, we introduce a general toolset which reduces the required human annotations and is applicable to various animal species. In particular, we introduce OpenLabCluster, which groups temporal keypoint segments into clusters in the latent space and then employs an Active Learning (AL) approach that refines the clusters and classifies them into behavioral states. The AL approach selects representative examples of segments to be annotated such that the annotation informs clustering and classification of all temporal segments. With these methodologies, OpenLabCluster contributes to faster and more accurate organization of behavioral segments with only a sparse subset of them being annotated.
Results: We demonstrate OpenLabCluster performance on four different datasets, which include different animal species exhibiting natural behaviors, and show that it boosts clustering and classification compared to existing methods, even when all segments have been annotated.
Discussion: OpenLabCluster has been developed as an open-source interactive graphic interface which includes all necessary functions to perform clustering and classification, informs the scientist of the outcomes in each step, and incorporates the choices made by the scientist in further steps.
1 Introduction
Analysis and interpretation of animal behavior are essential for a multitude of biological investigations. Behavioral studies extend from ethological studies to behavioral assays as a means to investigate biological mechanisms (Sturm et al., 2019; Monosov et al., 2024; Marques et al., 2018; Johnson et al., 2020; Berman et al., 2016; Jazayeri and Afraz, 2017; Datta et al., 2019; Weber et al., 2022; McCullough and Goodhill, 2021; Pereira et al., 2019). In these studies, methodologies facilitating robust, uninterrupted, and high-resolution observations are key. Indeed, researchers have been recording animal behaviors for decades with various modalities, such as video, sound, placement of physical markers, and more (Taiwanica, 2000; Morrow-Tesch et al., 1998; Han et al., 2011; Lynch et al., 2013; Nakamura et al., 2016; Buccino et al., 2018; Bain et al., 2021). Recent enhancements in recording technologies have made it possible to deploy recording devices in a variety of environments and for extended periods of time. The ability to perform longer observations across more modalities brings forward the need to organize, interpret, and associate recordings with identified repertoires of behaviors, i.e., to classify the recordings into behavioral states. Performing these operations manually would typically consume a significant amount of time and would require expertise. For many recordings, manual behavior classification becomes an unattainable task. Therefore, it is critical to develop methodologies that accelerate the classification of behavioral states and require as little involvement from the empiricist as possible (Anderson and Perona, 2014; Dell et al., 2014; Krause et al., 2013).
Early efforts in automatic behavior classification focused on raw video analysis using machine learning techniques such as convolutional neural networks (CNNs) (van Dam et al., 2020; Jia et al., 2022; Batty et al., 2019; Brattoli et al., 2021), recurrent neural networks (RNNs) (Stern et al., 2015; Murari et al., 2019), temporal Gaussian mixture models (Bohnslav et al., 2021), and temporal CNNs (Marks et al., 2022). While effective in specific scenarios, video-based methods often incorporate extraneous background information and noise (e.g., camera artifacts), which can undermine reliability, and require considerable computational resources due to the high-dimensional nature of video data (Marks et al., 2022). In contrast, approaches that concentrate on movement by utilizing body keypoints or kinematics extracted from video frames can circumvent these limitations (Xu et al., 2017; Insafutdinov et al., 2017, 2016; Sturman et al., 2020; Isik and Unal, 2023).
Markerless pose estimation techniques, such as OpenPose, DeepLabCut, Anipose, and others (Mathis et al., 2018; Deeplabcut, 2018; Cao et al., 2019; Nath et al., 2019; Lauer et al., 2022; Karashchuk et al., 2021), enable accurate keypoint detection without the need for physical markers. Furthermore, numerous related tools and approaches have been introduced to advance animal pose estimation (Pereira et al., 2019; Wu et al., 2020; Bala et al., 2020; Pereira et al., 2022; Zhang et al., 2021; Usman et al., 2022; Ye et al., 2024). Once body keypoints are estimated, behavioral segmentation can be achieved using unsupervised clustering methods, such as HDBSCAN (Ester et al., 1996; Hsu and Yttri, 2021), hierarchical clustering (Huang et al., 2021), and the Watershed algorithm (Meyer, 1994; Berman et al., 2014; Dunn et al., 2021), which group similar postural states and differentiate distinct behaviors (Marques et al., 2018; Hsu and Yttri, 2021). Dimensionality reduction techniques, including principal component analysis (PCA) and uniform manifold approximation and projection (UMAP), further enhance the representation of body keypoints for effective clustering (McInnes et al., 2018; Huang et al., 2021; Kwon et al., 2024; Hsu and Yttri, 2021; Wiltschko et al., 2020).
Recent deep learning methods have advanced latent keypoint representation learning through task-specific optimization, as demonstrated by TREBA (Sun et al., 2021) and its automated extension, AutoSWAP (Tseng et al., 2022). Contrastive learning approaches have also been proposed to refine the latent space by drawing together similar behavioral samples and separating dissimilar ones (Zhou et al., 2023; Schneider et al., 2023). A key challenge in such approaches is the selection of appropriate positive and negative samples, which is difficult without human guidance. General methods such as Predict&Cluster (Su et al., 2020) and VAME (Luxem et al., 2022) address these challenges by focusing on sequence reconstruction and future prediction, enabling the unsupervised clustering of behavioral patterns (Su et al., 2020; Luxem et al., 2022; Weinreb et al., 2024).
While unsupervised clustering can identify similar behavioral patterns (Su et al., 2020; Luxem et al., 2022; Weinreb et al., 2024), it may not effectively identify behaviors of specific interest. Supervised classification approaches address this limitation by mapping behavioral segments to behavioral categories of interest, under the guidance of annotated training data (Xu et al., 2017; Sturman et al., 2020; Segalin et al., 2021). The classification accuracy critically depends on both the choice of classifier and the quality and quantity of annotations. Early success in behavioral classification was achieved using classical machine learning approaches (Dankert et al., 2009; Jhuang et al., 2010; Segalin et al., 2021; Burgos-Artizzu et al., 2012; Hong et al., 2015; De Chaumont et al., 2019; Goodwin et al., 2024; Hsu and Yttri, 2021). These have recently been supplemented by deep learning approaches (Rousseau et al., 2000; Sakata, 2023; Zhou et al., 2023; Ye et al., 2024; Sturman et al., 2020). Nevertheless, manual annotation remains labor-intensive and subject to inter-annotator variability.
To reduce the need for manual annotation, methods such as SaLSa, which assigns uniform labels to pre-computed unsupervised clusters, and JAABA, which provides an interactive framework for correcting misclassifications, have been developed (Sakata, 2023; Kabra et al., 2013). Active learning (AL) techniques further streamline the process by automatically selecting samples for annotation, balancing annotation effort with classification accuracy (Cohn et al., 1996; Settles, 2009, 2012; Li and Shlizerman, 2020b; Tillmann et al., 2024). In particular, for behavior recognition, A-SOiD (Tillmann et al., 2024) employs AL to prioritize samples with high prediction uncertainty; however, uncertainty-based selection may inadvertently target redundant samples. Integrating clustering information with classifier uncertainty could improve the efficiency of sample selection.
In this study, we extend previous methods by jointly learning representations for AL and classifier training using poses estimated from video recordings, e.g., keypoints estimated using DeepLabCut (Mathis et al., 2018; Nath et al., 2019; Deeplabcut, 2018). In particular, we introduce the OpenLabCluster toolset, an AL-based semi-supervised platform, embedded in a graphic interface, for animal behavior classification from body keypoints. The system implements and allows the use of multiple semi-supervised AL methods. AL is performed in an iterative way, where, in each iteration, a subset of candidate segments is automatically selected for annotation, which in turn enhances the accuracy of clustering and classification. OpenLabCluster is composed of two components illustrated in Figure 1: (1) Unsupervised deep encoder-decoder clustering of behavior representations, producing Cluster Maps that depict the representations as points and show their groupings, followed by (2) Iterative automatic selection of representations for annotation and subsequent generation of Behavior Classification Maps. In each iteration, each point in the Cluster Map is re-positioned and associated with a behavioral class (colored according to that class). These operations are performed through the training of a clustering encoder-decoder [component (1)] along with a deep classifier [component (2)]. OpenLabCluster implements these methodologies as an open-source graphical user interface (GUI) to empower scientists with little or no deep-learning expertise to perform animal behavior classification. In addition, OpenLabCluster includes advanced options for experts.
 
  Figure 1. OpenLabCluster overview. (1) Clustering: input body keypoint segments are mapped to a low-dimensional space; an unsupervised encoder–decoder maps them to a Cluster Map. (2) Classification: AL algorithms automatically select candidates for annotation, after which the Cluster Map is reshaped into the Behavior Classification Map, where each point is associated with a behavioral state [Grooming (red), Hanging (blue), Rearing (blue)]. Mouse images are reproduced from frames of videos provided in the dataset of Jhuang et al. (2010).
2 Results
2.1 Datasets
Behavioral states and their dynamics vary from species to species and from recording to recording. We use four different datasets to demonstrate OpenLabCluster applicability to various settings. The datasets include videos of behaviors of four different animal species [Mouse (Jhuang et al., 2010), Zebrafish (Marques et al., 2018), C. elegans (Yemini et al., 2013), Monkey (Bala et al., 2020)] with three types of motion features (body keypoints, kinematics, segments), as depicted in Figure 2. Two of the datasets include a priori annotated behavioral states (ground truth) (Mouse, C. elegans), the Zebrafish dataset includes ground truth predicted a priori by another method, and the Monkey dataset does not include ground truth annotations. Three of the datasets have been temporally segmented into single-action clips (Mouse, Zebrafish, C. elegans), i.e., temporal segments, while the Monkey dataset is a continuous recording that requires segmentation into clips. We describe further details about each dataset below.
 
  Figure 2. Visualization of four animal behavior datasets. (A) Home-Cage mouse dataset (Mouse); (B) zebrafish free swimming dataset (Zebrafish); (C) C. elegans movement dataset (C. elegans); (D) OpenMonkeyStudio Macaque behaviors dataset (Monkey). The top row shows positions of extracted keypoints for each dataset. Image sources: (A) Images are reproduced from frames of videos in the dataset of Jhuang et al. (2010). (B) Images are reproduced from the dataset of Marques et al. (2018). (C) Images are reproduced from datasets provided by Yemini et al. (2013). (D) Images are reproduced from Figure 8 and images of Bala et al. (2020).
2.1.1 Home-cage mouse
The dataset includes video segments of 8 identified behavioral states (Jhuang et al., 2010). The Home-Cage Mouse dataset was selected for its clearly segmented videos, each with a well-defined behavioral category. In particular, it contains videos recorded by front cage cameras while the mouse moves freely and exhibits natural behaviors, such as drinking, eating, grooming, hanging, micromovement, rearing, walking, and resting. Since keypoints are not provided in this dataset, we use DeepLabCut (Mathis et al., 2018; Nath et al., 2019; Deeplabcut, 2018) to automatically mark and track eight body joint keypoints (snout, left-forelimb, right-forelimb, left-hindlimb, right-hindlimb, fore-body, hind-body, and tail) in all recorded video frames. An example of estimated keypoints overlaid on top of the corresponding video frame from a side view is shown in Figure 2A (top). To reduce the noise that could be induced by the pose estimation procedure, we only use segments for which the DeepLabCut estimation confidence is sufficiently high. We use 8 sessions for training the clustering and classification models (2,856 segments) and test classification accuracy on 4 other sessions (393 segments).
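As an illustration of this confidence-based filtering, the following is a minimal sketch that keeps only segments whose mean DeepLabCut likelihood exceeds a threshold. It assumes the standard DeepLabCut .h5 output with a (scorer, bodyparts, coords) column index; the 0.9 threshold, the file path, and the segment boundaries are illustrative placeholders rather than the exact criterion used here.

```python
import pandas as pd

def load_confident_segments(dlc_h5_path, segment_bounds, min_likelihood=0.9):
    """Keep only segments whose mean DeepLabCut likelihood exceeds a threshold (illustrative)."""
    df = pd.read_hdf(dlc_h5_path)                      # frames x (scorer, bodypart, coord)
    df.columns = df.columns.droplevel(0)               # drop the scorer level
    likelihood = df.xs("likelihood", level="coords", axis=1).to_numpy()
    xy = df.drop("likelihood", level="coords", axis=1).to_numpy()  # frames x (2 * n_keypoints)

    segments = []
    for start, end in segment_bounds:                  # list of (start_frame, end_frame) pairs
        if likelihood[start:end].mean() >= min_likelihood:
            segments.append(xy[start:end])             # (n_frames, 2 * n_keypoints)
    return segments
```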
2.1.2 Zebrafish
The dataset includes video footage of zebrafish movements and was utilized in Marques et al. (2018) for unsupervised behavior clustering using 101 precomputed kinematic features, a procedure that identified 13 clusters which were manually related to 13 behavior prototypes (see Appendix 1.3). In the application of OpenLabCluster to this dataset, we utilize only a small subset of these features (16 features) and examine whether OpenLabCluster is able to generate classes aligned with the unsupervised clustering results obtained with the full 101 features (taken as the ground truth). We use 5,294 segments for training and 2,781 segments for testing.
2.1.3 C. elegans
The dataset was recorded with Worm Tracker 2.0 while the worm was moving freely. The body contour is identified automatically from its contrast with the background, and kinematic features are then computed from it, constituting 98 features that correspond to body segments from head to tail in 2D coordinates; see Yemini et al. (2013) and Figure 2C. Behavioral states are divided into three classes: moving forward, moving backward, and staying stationary. We use ten sessions (a subset) to investigate the application of OpenLabCluster to this dataset, where the first 7 sessions (543 segments) are used for training and the remaining 3 sessions (196 segments) are used for testing.
2.1.4 Monkey
This dataset is from the OpenMonkeyStudio repository (Bala et al., 2020) and captures freely moving macaques in a large unconstrained environment using 64 cameras encircling an open enclosure. 3D keypoint positions are reconstructed from 2D images by applying deep neural network reconstruction algorithms to the multi-view images. Among the movements, 6 behavioral classes have been identified. In contrast to the other datasets, this dataset consists of continuous recordings without segmentation into action clips. We therefore segment the videos by clipping them into fixed-duration clips (10 frames at a 30 fps rate), which results in 919 segments, where each segment is ≈ 0.33 s long. OpenLabCluster receives the 3D body keypoints of each segment as inputs. Notably, a more advanced segmentation approach, such as that described in Sarfraz et al. (2021), could be implemented. Here, we focused on examining the ability of OpenLabCluster to work with segments that have not been pre-analyzed and thus used the simplest and most direct segmentation method.
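The fixed-window segmentation described above can be expressed compactly; a minimal sketch follows, assuming the reconstructed poses are available as an array of shape (n_frames, n_keypoints, 3). Variable names are illustrative.

```python
import numpy as np

def segment_fixed_windows(keypoints_3d, window=10):
    """Split a continuous recording into non-overlapping fixed-length clips."""
    n_clips = keypoints_3d.shape[0] // window
    clips = keypoints_3d[: n_clips * window].reshape(
        n_clips, window, *keypoints_3d.shape[1:]
    )
    return clips  # (n_clips, window, n_keypoints, 3); each clip is ~0.33 s at 30 fps
```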
2.2 Evaluation metrics
We evaluate the accuracy of OpenLabCluster by computing the percentage of temporal segments in the test set that OpenLabCluster correctly associates with the states given as ground truth, such that 100% accuracy indicates that OpenLabCluster correctly classified all temporal segments in the test set. Since OpenLabCluster implements a semi-supervised approach to minimize the number of annotations, we compute the accuracy given annotation budgets of 5%, 10%, and 20% of the labels, used over the AL iterations. In particular, we test the accuracy when the Top, CS, and MI AL methods implemented in OpenLabCluster are used for the selection of temporal segments to annotate. Method details are provided in Section 4 and Appendix 1.3.
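For concreteness, the accuracy measure reduces to the fraction of correctly labeled test segments; a minimal sketch follows, where `predicted` and `ground_truth` are illustrative label arrays for the test set (the annotation budget affects training only and does not enter this formula).

```python
import numpy as np

def classification_accuracy(predicted, ground_truth):
    """Percentage of test segments assigned to their ground-truth behavioral state."""
    predicted = np.asarray(predicted)
    ground_truth = np.asarray(ground_truth)
    return 100.0 * np.mean(predicted == ground_truth)
```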
2.3 Benchmark comparison
We evaluated OpenLabCluster against established animal behavior classification approaches, with and without AL. For non-AL approaches, we compared against K-Nearest Neighbor (KNN) (Cover and Hart, 1967), Support Vector Machine (SVM) (Jhuang et al., 2010), SimBA's Random Forest Classifier (RFC) (Goodwin et al., 2024), and VAME with an additional classifier (VAME+C) (Luxem et al., 2022). With respect to AL, we compared our method to A-SOiD (Tillmann et al., 2024), which employs RFC with selective sampling. Furthermore, we conducted ablation studies by evaluating OpenLabCluster with the decoder removed and explored alternative architectures by integrating VAME's encoder-decoder (OpenLabCluster-V) with various AL strategies (CS, TOP, and MI). Detailed experimental settings and additional results are provided in the Benchmark Details Section 1.1.
2.4 Outcomes
The results of the evaluation, consisting of 5 runs, are shown in Tables 1, 2, with further analysis in Figures 3, 4. We summarize the main outcomes of the evaluation and their interpretation below.
 
  Table 1. Classification accuracy of Home-Cage Mouse behaviors for an increasing number of annotated segments [reported as percentage (%)].
 
  Table 2. Classification accuracy of Zebrafish and C. elegans behaviors for an increasing number of annotated segments [reported as percentage (%)].
 
  Figure 3. Relation between accuracy and annotation. (A) The amount of annotation required to achieve 80% accuracy for classification of Home-Cage Mouse behaviors, computed for benchmark methods (KNN, SVM, C, SimBA, A-SOiD, VAME+C) and variants of OpenLabCluster with three AL methods (Top, MI, CS). (B) Prediction accuracy with increasing annotation budget on the Mouse, C. elegans, and Zebrafish datasets. (C) Confusion matrices for the Zebrafish dataset with increasing annotation budgets (5%, 10%, 20%, 100%).
 
  Figure 4. 2D tSNE projection of behavioral segments. (A) 2D t-SNE projections comparing representations derived directly from keypoints vs. the latent representation. The six K-means clusters are color-coded. For each cluster in each representation, the four samples closest to the cluster's centroid are shown as examples. The images are sourced from the dataset of Bala et al. (2020). (B) Plots of the Calinski-Harabasz (CHI) and Davies-Bouldin (DBI) scores, which assess clustering quality, for each projection method across a range of cluster numbers.
2.4.1 Accuracy of classification
We observe that the classification accuracy of OpenLabCluster across datasets is consistently higher than that of standard supervised classification methods (e.g., C, SVM, and SimBA) for almost any annotation budget. Specifically, for the Home-Cage Mouse Behavior dataset (Table 1), OpenLabCluster achieves an accuracy of 66.2% when just 143 (5% of 2,856) segments have been annotated. Accuracy improves along with the increase in the number of annotated segments, i.e., accuracy is 76.6% when 10% of segments are annotated, and 81.5% when 20% of segments are annotated. Compared to C, the encoder-only version of OpenLabCluster, OpenLabCluster achieves an average accuracy increase of approximately 12%. This improvement underscores the importance of both the encoder-decoder structure and active sample selection. Meanwhile, VAME+C, which incorporates an encoder-decoder and a classifier, shows promise by reaching an accuracy of 67.4% with only 5% annotated samples. However, it remains less optimal than OpenLabCluster-V, which attains 74.7% under the same annotation budget. These results further illustrate the effectiveness of AL for accurate behavior classification with limited labels. It should be noted that although A-SOiD also integrates active learning for sample selection, its classifier design and exclusive reliance on an uncertainty-based selection method appear to limit its performance relative to OpenLabCluster variants. Among the AL strategies TOP, CS, and MI, both CS and MI outperform TOP on the Home-Cage Mouse dataset. Notably, while AL is expected to be especially effective in sparse annotation scenarios, even when all segments are annotated (the fully supervised scenario), the accuracy of the OpenLabCluster MI approach exceeds that of the supervised classification approach (C) by 12.4% (rightmost column in Table 1). This reflects the effectiveness of the targeted selection of candidates for annotation and the use of the clustering latent representation to enhance the overall organization of the segments.
For the Zebrafish and C. elegans datasets, OpenLabCluster consistently achieves higher accuracy, except when 100% of the C. elegans dataset is annotated, demonstrating its generalizability across various animal behavior datasets. When compared to its encoder-only variant (C), OpenLabCluster exhibits an accuracy improvement of approximately 7.8% on the Zebrafish dataset and approximately 2.2% on the C. elegans dataset. These gains are less pronounced than those observed on the Home-Cage Mouse dataset. This could be associated with the lack of manually identified ground-truth behavior states for Zebrafish and with the C. elegans dataset having only three classes, a simpler semantic task that does not challenge classifiers. We can indeed observe that when all annotations are considered in the C. elegans dataset, all approaches perform well (above 92%) and a standard classifier achieves the best accuracy. In contrast to the results observed with the Home-Cage Mouse dataset, we find that OpenLabCluster-V underperforms OpenLabCluster on the Zebrafish dataset and exhibits comparable performance on C. elegans. Moreover, the CS strategy appears unsuitable for OpenLabCluster-V in the Zebrafish setting, resulting in diminished outcomes.
2.4.2 The amount of required annotations
Since accuracy varies across datasets and depends on the number of classes and other aspects, we examine the relationship between accuracy and the number of required annotations. In Figure 3A, we compute the number of annotations required to achieve 80% classification accuracy with the benchmark methods, OpenLabCluster, and OpenLabCluster-V on the Home-Cage Mouse dataset. We observe that AL methods require only 15–20% of the annotated segments to achieve 80% classification accuracy, whereas benchmark methods (KNN, SVM, C, A-SOiD, SimBA) require roughly nine times as many annotations as the optimal AL approaches. Among the AL methods, the MI and CS embedded variants of OpenLabCluster and OpenLabCluster-V achieve 80% accuracy with approximately 400 annotated samples. The variant with the TOP selection method turns out to be slightly less effective, requiring approximately 700 annotations.
We further visualize the effectiveness of pretraining (C vs. OpenLabCluster) under varying annotation budgets in Figure 3B. We observe that in most cases, OpenLabCluster methods lead to higher accuracy for a given number of annotations than the counterpart encoder-only classifier. The curves indicating the accuracy of the various OpenLabCluster AL methods (red, green, blue) show a clear gap from the C curve (dark gray), especially in the mid-range of the number of annotations. However, the performance of OpenLabCluster and C is comparable on the C. elegans dataset, likely due to the dataset's relative simplicity. In Figure 3C, we further examine class-wise confusion matrices for the Zebrafish dataset at 4 annotation budgets (5%, 10%, 20%, 100%). From visual inspection, it appears that the matrix that corresponds to 20% annotations is close to the matrix that corresponds to 100% annotations. This proximity suggests that annotation of the full dataset might be redundant. Indeed, further inspection of Figure 3C indicates that samples annotated as the LCS and BS classes (y-axis) by the unsupervised learning method are likely to be predicted as LLC (x-axis) by OpenLabCluster. One possibility for the discrepancy could be annotation errors of the prior clustering method, which is taken as the ground truth. Re-examination of the dynamics of some of the features (e.g., tail angle) further supports this hypothesis and demonstrates that the methods in OpenLabCluster can potentially identify the outlier segments whose annotation settles the organization of the Behavior Classification Map (see Appendix 1.3 for more details).
2.4.3 Organization of the latent representation
Our results indicate that the Latent Representation captured by the OpenLabCluster encoder-decoder and classifier is able to better organize behavioral segments than direct embeddings of body keypoints. We quantitatively investigate this organization with the Monkey dataset, for which ground truth annotations and segmentation are unavailable. Specifically, we obtain the Cluster Map of the segments with OpenLabCluster through the unsupervised training stage, projecting keypoints into the latent representation space. We then depict the 2D tSNE projection of the Latent Representation and compare it with the 2D tSNE projection of body keypoints in Figure 4A. The color in both plots indicates K-means clusters. We set the number of clusters to 6, as defined by the OpenMonkeyStudio dataset (Bala et al., 2020). Indeed, it can be observed that within the Cluster Map, segments are grouped into more distinct and enhanced clusters. To measure the clustering properties of each embedding, we apply the clustering metrics Calinski-Harabasz (CHI) (Caliński and Harabasz, 1974) and Davies-Bouldin (DBI) (Davies and Bouldin, 1979). CHI measures the ratio of inter- to intra-cluster dispersion, with larger values indicating better clustering. DBI measures the ratio of intra-cluster distance to inter-cluster distance, with lower values indicating better clustering. CHI and DBI are shown at the bottom of Figure 4 for varying numbers of clusters (from 2 to 20 in steps of 2). The comparison shows that the CHI index is higher for the Latent Representation than for the embedding of the keypoints regardless of the number of clusters being considered, and increases monotonically with the number of clusters. The DBI index for the Cluster Map is lower than that of the embedding of the keypoints and decreases with an increasing number of clusters. This is consistent with the expectation that clustering quality will reflect the number of behavioral types.
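Both scores are available in scikit-learn; the following minimal sketch computes them for K-means groupings of a given embedding. Here `features` stands in for either the latent vectors or the flattened keypoints, and the cluster-number sweep mirrors the 2–20 range used above; the exact evaluation code may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

def clustering_scores(features, cluster_range=range(2, 21, 2)):
    """CHI (higher is better) and DBI (lower is better) for K-means groupings."""
    scores = {}
    for k in cluster_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        scores[k] = (calinski_harabasz_score(features, labels),
                     davies_bouldin_score(features, labels))
    return scores

# Example comparison (arrays are placeholders for the two embeddings):
# scores_latent = clustering_scores(latent_vectors)
# scores_keypoints = clustering_scores(flattened_keypoints)
```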
3 Discussion
In this study, we introduce OpenLabCluster, a novel toolset for quantitative studies of animal behavior from video recordings, in terms of automatic grouping and depiction of behavioral segments into clusters and their association with behavioral classes. OpenLabCluster works with body keypoints, which describe the pose of the animal in each frame and, across frames, reflect the kinematic information of the movement exhibited in the segment. The advancement and availability of automatic tools for markerless pose estimation in recent years allow such tools to be employed in conjunction with OpenLabCluster to perform almost automatic organization and interpretation of a variety of ethological experiments.
The efficacy of OpenLabCluster is attributed to two major components: (i) an unsupervised pre-training process which groups segments with similar movement patterns and disperses segments with dissimilar movement patterns (clustering); (ii) automatic selection of segments for association with behavioral classes (AL), through which class labels are assigned to all segments (classification) and the clustering representation is refined.
We evaluate OpenLabCluster performance on various datasets of freely behaving animal species: the Home-Cage Mouse, Zebrafish, C. elegans, and Monkey datasets. For the datasets for which ground-truth labels have been annotated, we show that OpenLabCluster classification accuracy exceeds the accuracy of a direct deep classifier for most annotation budgets, even when all segments in the training set have been annotated. The underlying reason for the efficacy of OpenLabCluster is the unsupervised pre-training stage of the encoder-decoder, which establishes similarities and clusters segments within the Latent Representation of the encoder-decoder. This unsupervised pretraining through encoder-decoder modeling effectively mitigates noise from body orientation and inaccuracies in keypoint estimation to provide a summarized representation. Such a representation turns out to be useful in informing which segments, once annotated, could add semantic meaning to the groupings and refine the representation further.
In practice, we observe that even a sparse annotation of a few segments (5%–20% of the training set), chosen with appropriate AL methods, boosts clustering and classification significantly. Classification accuracy continues to improve when more annotations are performed; however, we also observe that the increase in accuracy occurs primarily in the initial annotation steps, which demonstrates the importance of employing clustering in conjunction with AL selection in these critical steps. Indeed, our results demonstrate that among the different AL approaches, more direct approaches such as Top are not as effective as those that incorporate additional metrics quantifying the uncertainty and similarity of the segments.
As we describe in the Methodology section, OpenLabCluster includes advanced techniques of unsupervised and semi-supervised neural network training through AL (Appendix 1.1). Inspired by the DeepLabCut project (Mathis et al., 2018; Nath et al., 2019; Deeplabcut, 2018), we implement these techniques jointly with a graphic user interface to enable scientists to analyze various ethological experiments without deep-learning expertise. In addition, OpenLabCluster is an open-source project and is designed such that further methodologies and extensions can be seamlessly integrated into the project by developers. Beyond ease of use, the graphic interface is an essential part of OpenLabCluster functionality, since it visually informs the scientist of the outcomes at each iteration step. This provides the possibility to inspect the outcomes and assist with additional information "on the go". Specifically, OpenLabCluster allows for pointing at points (segments) in the maps, inspecting their associated videos, adding or excluding segments to be annotated, working with different low-dimensional embeddings (2D or 3D), switching between AL methods, annotating the segments within the same interface, and more.
4 Materials and methods
Existing approaches for behavior classification from keypoints are supervised and require annotation of extensive datasets before training (Xu et al., 2017; Sturman et al., 2020). This requirement limits the generalization of classification from one subject to another, from one animal to another, from one set of keypoints to another, and from one set of behaviors to another, due to the need for re-annotation whenever such variations are introduced.
In contrast, grouping behavioral segments into similarity groups (clustering) typically does not require annotation and can be done by finding an alternative representation of behavioral segments that reflects the differences and similarities among segments. Both classical and deep-learning approaches address such groupings (Marques et al., 2018; Hsu and Yttri, 2021). Notably, clustering is a ‘weaker' task than classification since it does not provide the semantic association of groups with behavioral classes; however, it can be used as a preliminary stage for classification. If leveraged effectively as a preliminary stage, clustering can direct annotation to minimize the number of segments that need to be annotated and, at the same time, boost classification accuracy.
OpenLabCluster, which is based on this concept, first infers a Cluster Map and then leverages it for automatic selection of sparse segments for annotation (AL) that both inform behavior classification and enhance clustering. It iteratively converges to a detailed Behavior Classification Map where segments are grouped into similarity classes and each class homogeneously represents a behavioral state. Below, we describe these components.
4.1 Clustering
The inputs into OpenLabCluster are sets of keypoint coordinates (2D or 3D) or kinematic features for each time segment, along with the video footage (image frames that correspond to these keypoints). Effectively, each input segment of keypoints is a matrix whose rows index the keypoint coordinates, e.g., the first row contains the x-coordinate of the first keypoint over time, the second row the y-coordinate of the first keypoint, and so on.
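A minimal sketch of this layout, under the assumption of T frames and K 2D keypoints per frame with illustrative array names:

```python
import numpy as np

T, K = 30, 8                          # frames per segment, keypoints per frame (illustrative)
frames = np.random.rand(T, K, 2)      # (time, keypoint, x/y) as produced by pose estimation
segment = frames.reshape(T, 2 * K).T  # rows: x1, y1, x2, y2, ...; columns: time
assert segment.shape == (2 * K, T)
```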
The first stage of OpenLabCluster is to employ a recurrent neural network (RNN) encoder-decoder architecture that learns a Latent Representation for the segments, as shown in Figure 5. The encoder is composed of m bi-directional gated recurrent units (bi-GRU) (Cho et al., 2014) sequentially encoding the time-series input into a Latent Representation (a latent vector in ℝᵐ). Thus, each segment is represented as a point in the Latent Representation space ℝᵐ. The decoder is composed of uni-directional GRUs that receive the latent vector as input and decode (reconstruct) the same keypoints from it. Training optimizes the encoder-decoder connectivity weights such that the reconstruction loss, the distance between the segment keypoints reconstructed by the decoder and the input segment, is minimized; see the Appendix for further details (Section 1.1). This process reshapes the latent vector points in the Latent Representation space to better represent the segments' similarities and distinctions.
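A minimal PyTorch sketch of such an encoder-decoder is shown below. Hidden sizes, the single-layer GRUs, and the way the decoder is seeded from the latent vector are illustrative choices, not the exact OpenLabCluster architecture.

```python
import torch
import torch.nn as nn

class SeqEncoderDecoder(nn.Module):
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.encoder = nn.GRU(input_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.decoder = nn.GRU(input_dim, 2 * hidden_dim, batch_first=True)
        self.readout = nn.Linear(2 * hidden_dim, input_dim)

    def forward(self, x):                          # x: (batch, T, input_dim)
        _, h = self.encoder(x)                     # h: (2, batch, hidden_dim)
        latent = torch.cat([h[0], h[1]], dim=-1)   # (batch, 2 * hidden_dim) latent vector
        dec_in = torch.zeros_like(x)               # decode from zeros, seeded by the latent state
        out, _ = self.decoder(dec_in, latent.unsqueeze(0))
        return self.readout(out), latent           # reconstruction and Latent Representation

# Training minimizes the reconstruction loss, e.g.:
# model = SeqEncoderDecoder(input_dim=16)
# recon, latent = model(segments)                  # segments: (batch, T, 16)
# loss = nn.functional.mse_loss(recon, segments)
```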
 
  Figure 5. Latent Representation is learned by performing the reconstruction task using an encoder-decoder structure. Latent vectors (last state of the encoder) are projected onto low dimensional space with various dimension reduction techniques to visualize the Latent Representation which constitute the Cluster Map. Mouse images are reproduced from frames of videos in the dataset of Jhuang et al. (2010).
To visualize the relative locations of segments in the Latent Representation, OpenLabCluster implements various dimension reductions (from ℝᵐ → ℝ² or ℝᵐ → ℝ³), such as PCA, tSNE, and UMAP, to obtain Cluster Maps; see Figure 5 (bottom). Thus, each point in the Cluster Map is a reduced-dimensional Latent Representation of an input segment. From inspection of the Cluster Map on multiple examples and benchmarks, it can be observed that the Latent Representation clusters segments that represent similar movements into the same clusters, to a certain extent, typically more effectively than applying dimension reduction techniques directly to the keypoint segments (Su et al., 2020; Su and Shlizerman, 2020).
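A minimal sketch of producing such 2D Cluster Maps from the latent vectors with scikit-learn and umap-learn (the random array is a stand-in for the encoder's output):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # umap-learn package

latent = np.random.rand(500, 64)  # stand-in for the (n_segments, m) latent vectors
cluster_map_pca = PCA(n_components=2).fit_transform(latent)
cluster_map_tsne = TSNE(n_components=2, perplexity=30).fit_transform(latent)
cluster_map_umap = umap.UMAP(n_components=2).fit_transform(latent)
```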
4.2 Classification
To classify behavioral segments that have been clustered, we append a classifier, a fully connected network, to the encoder. The training of the classifier is based on segments that have been annotated and minimizes the error between the predicted behavioral states and the behavioral states given by the annotation (cross-entropy loss). When the annotated segments represent the states and the clusters well, the learned knowledge is transferable to other unlabeled segments. AL methods such as Cluster Center (Top), Core-Set (CS), and Marginal Index (MI) aim to select such representative segments by analyzing the Latent Representation. Top selects representative segments located at the centers of the clusters [obtained by K-means (Li and Shlizerman, 2020a)] in the Latent Representation space; this approach is effective at the initial stage. CS selects samples that cover the remaining samples with minimal distance (Sener and Savarese, 2018). MI is an uncertainty-based selection method, selecting samples that the network is most uncertain about. See Supplementary materials 1.1 for further details regarding these methods. Once segments for annotation are chosen by the AL method, OpenLabCluster highlights the points in the Cluster Map that represent them, along with their associated video segments, such that they can be annotated within the graphic interface of OpenLabCluster (by choosing the most related behavioral class). When the annotations are set, the full network, the encoder-decoder with the appended classifier, is re-trained to perform classification and predict the labels of all segments. The outcome of this process is the Behavior Classification Map, which depicts both the points representing segments in clusters and the state label associated with each point (color), as illustrated in Figure 6. In this process, each time a new set of samples is selected for annotation, the parameters of the encoder-decoder and the classifier are tuned to generate more distinctive clusters and more accurate behavioral state classification. The process of annotation and tuning is repeated, typically until the number of annotations reaches the annotation budget, or when clustering and classification appear to converge to a steady state.
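To make the selection step concrete, the following sketch gives simplified stand-ins for two of the strategies: a Top-style pick of segments nearest to K-means cluster centers, and an uncertainty-based pick using the margin between the two most probable classes. These are illustrative approximations; the exact scoring used by OpenLabCluster's Top, CS, and MI may differ in detail.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_cluster_centers(latent, n_select):
    """Top-style selection: the segment closest to each K-means cluster center."""
    km = KMeans(n_clusters=n_select, n_init=10, random_state=0).fit(latent)
    dists = np.linalg.norm(latent[:, None, :] - km.cluster_centers_[None, :, :], axis=-1)
    return np.unique(dists.argmin(axis=0))        # one representative index per cluster

def select_most_uncertain(class_probs, n_select):
    """Uncertainty-based selection: smallest margin between the top two class probabilities."""
    sorted_p = np.sort(class_probs, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]
    return np.argsort(margin)[:n_select]          # most ambiguous segments first
```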
 
  Figure 6. The Behavior Classification Map is generated by a fully connected classifier network (green rectangle) which receives the latent vectors transformed by the encoder-decoder as input and classifies them into behavior classes (example shown: 8 classes in the Home-Cage Mouse Behavior dataset). The Behavior Classification Map is generated from the Cluster Map and indicates the predicted classes of all segments.
4.3 Implementation details
OpenLabCluster code (OpenLabCluster, 2022) was developed in the University of Washington (UW) NeuroAI Lab by Jingyuan Li and Moishe Keselman. The OpenLabCluster interface is inspired by Deeplabcut (2018), whose code is used as a backbone for user interface panels, interaction with the back-end, logging, and visualization. OpenLabCluster also uses Google Active Learning Playground code (Google Active Learning Playground Github Repository, 2017) for the implementation of the K-center selection method in the Core-Set AL option. For specific usage, please see the third_party folder within the OpenLabCluster code repository (OpenLabCluster, 2022).
OpenLabCluster is available as a GitHub Repository (OpenLabCluster) at https://github.com/shlizee/OpenLabCluster and can also be installed with the Package Installer for Python (pip) via pip install openlabcluster (Jingyuan Li, 2022). The repository includes a manual, instructions, and examples.
4.4 Benchmark details
As described earlier, OpenLabCluster summarizes keypoints or kinematic features of a temporal segment into a latent representation and then classifies the behavior using this summarized representation. This approach captures the intrinsic dynamics of short behavior prototypes, in contrast to benchmark methods that compute movement features at each timestep via predefined protocols (Segalin et al., 2021; Tillmann et al., 2024) and classify behavior on a per-timestep basis. To ensure a fair comparison, we concatenated the frame-wise features within each segment and applied each frame-wise classification method to the resulting representation. Specifically, for KNN (Cover and Hart, 1967), SVM (Jhuang et al., 2010), and A-SOiD (Tillmann et al., 2024), we concatenated the frame-wise features of each action segment and then employed the classifier proposed by each method for behavior recognition. For SimBA (Goodwin et al., 2024), movement features were extracted from each frame and integrated with pose-based features. The final representation was formed by concatenating these integrated features across all timesteps within the segment. VAME (Luxem et al., 2022) closely resembles OpenLabCluster by learning a unified representation for entire sequences. In VAME+C, we pre-trained VAME, appended a classifier to its latent feature, which encodes the temporal segment's dynamics, and fine-tuned the model using a classification loss. For the Home-Cage Mouse, Zebrafish, and C. elegans datasets, sequences are pre-segmented so that each segment contains a single behavioral prototype. For the OpenMonkeyStudio dataset, which is continuously recorded, we divided the videos into fixed temporal windows. More advanced approaches, such as change-point detection algorithms (Edelhoff et al., 2016; Etemad et al., 2021), could also be employed for video segmentation.
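A minimal sketch of the concatenation step used for the per-frame benchmark classifiers, assuming all segments have the same number of frames and with illustrative names (train_segments, train_y):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def concat_segments(segments):
    """Stack each (T, d) segment of frame-wise features into one (T * d,) vector."""
    return np.stack([seg.reshape(-1) for seg in segments])

# Example with KNN:
# train_X = concat_segments(train_segments)     # (n_segments, T * d)
# knn = KNeighborsClassifier().fit(train_X, train_y)
```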
Data availability statement
Publicly available datasets were analyzed in this study. These datasets can be found at: HomeCage Mouse (https://dspace.mit.edu/handle/1721.1/49527), OpenMonkeyStudio (https://github.com/OpenMonkeyStudio/OMS_Data), Zebrafish (https://data.mendeley.com/datasets/r9vn7x287r/1), C. elegans (http://wormbehavior.mrc-lmb.cam.ac.uk/).
Ethics statement
Ethical approval was not required for the study involving animals in accordance with the local legislation and institutional requirements because the study used previously published datasets.
Author contributions
JL: Formal analysis, Writing – original draft, Data curation, Visualization, Methodology, Conceptualization, Validation, Investigation, Software, Writing – review & editing. MK: Software, Investigation, Writing – review & editing, Formal analysis, Methodology. ES: Writing – review & editing, Project administration, Supervision, Investigation, Visualization, Conceptualization, Funding acquisition, Resources.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. The authors acknowledge the partial support of the HDR Institute: Accelerated AI Algorithms for Data-Driven Discovery (A3D3), National Science Foundation grants PHY-2117997 (JL, ES) and EFRI-BRAID-2223495 (ES). The authors also acknowledge partial support from the Departments of Electrical and Computer Engineering (JL, ES) and Applied Mathematics (ES).
Acknowledgments
Authors are thankful to the Center of Computational Neuroscience and the eScience Center at the University of Washington.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Gen AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnsys.2025.1630654/full#supplementary-material
References
Anderson, D. J., and Perona, P. (2014). Toward a science of computational ethology. Neuron 84, 18–31. doi: 10.1016/j.neuron.2014.09.005
Bain, M., Nagrani, A., Schofield, D., Berdugo, S., Bessa, J., Owen, J., et al. (2021). Automated audiovisual behavior recognition in wild primates. Sci. Adv. 7:eabi4883. doi: 10.1126/sciadv.abi4883
Bala, P. C., Eisenreich, B. R., Yoo, S. B. M., Hayden, B. Y., Park, H. S., and Zimmermann, J. (2020). Automated markerless pose estimation in freely moving macaques with openmonkeystudio. Nat. Commun. 11, 1–12. doi: 10.1038/s41467-020-18441-5
Batty, E., Whiteway, M., Saxena, S., Biderman, D., Abe, T., Musall, S., et al. (2019). “BehaveNet: nonlinear embedding and bayesian neural decoding of behavioral videos,” in Advances in Neural Information Processing Systems, 32.
Berman, G. J., Bialek, W., and Shaevitz, J. W. (2016). Predictability and hierarchy in drosophila behavior. Proc. Nat. Acad. Sci. 113, 11943–11948. doi: 10.1073/pnas.1607601113
Berman, G. J., Choi, D. M., Bialek, W., and Shaevitz, J. W. (2014). Mapping the stereotyped behaviour of freely moving fruit flies. J. Royal Soc. Interf. 11:20140672. doi: 10.1098/rsif.2014.0672
Bohnslav, J. P., Wimalasena, N. K., Clausing, K. J., Dai, Y. Y., Yarmolinsky, D. A., Cruz, T., et al. (2021). Deepethogram, a machine learning pipeline for supervised behavior classification from raw pixels. Elife 10:e63377. doi: 10.7554/eLife.63377
Brattoli, B., Büchler, U., Dorkenwald, M., Reiser, P., Filli, L., Helmchen, F., et al. (2021). Unsupervised behaviour analysis and magnification (ubam) using deep learning. Nat. Mach. Intellig. 3, 495–506. doi: 10.1038/s42256-021-00326-x
Buccino, A. P., Lepperód, M. E., Dragly, S.-A., Häfliger, P., Fyhn, M., and Hafting, T. (2018). Open source modules for tracking animal behavior and closed-loop stimulation based on open ephys and bonsai. J. Neural Eng. 15:055002. doi: 10.1088/1741-2552/aacf45
Burgos-Artizzu, X. P., Dollár, P., Lin, D., Anderson, D. J., and Perona, P. (2012). “Social behavior recognition in continuous video,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition (Providence, RI: IEEE), 1322–1329.
Caliński, T., and Harabasz, J. (1974). A dendrite method for cluster analysis. Commun. Statist.-Theory Methods 3, 1–27. doi: 10.1080/03610927408827101
Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., and Sheikh, Y. (2019). Openpose: realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 43, 172–186. doi: 10.1109/TPAMI.2019.2929257
Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: encoder-decoder approaches. arXiv [preprint] arXiv:1409.1259. doi: 10.3115/v1/W14-4012
Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1996). Active learning with statistical models. J. Artif. Intellig. Res. 4, 129–145. doi: 10.1613/jair.295
Cover, T., and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Trans. Inform. Theory 13, 21–27. doi: 10.1109/TIT.1967.1053964
Dankert, H., Wang, L., Hoopfer, E. D., Anderson, D. J., and Perona, P. (2009). Automated monitoring and analysis of social behavior in drosophila. Nat. Methods 6, 297–303. doi: 10.1038/nmeth.1310
Datta, S. R., Anderson, D. J., Branson, K., Perona, P., and Leifer, A. (2019). Computational neuroethology: a call to action. Neuron 104, 11–24. doi: 10.1016/j.neuron.2019.09.038
Davies, D. L., and Bouldin, D. W. (1979). A cluster separation measure. IEEE Trans. Pattern Analy. Mach. Intellig. 2, 224–227. doi: 10.1109/TPAMI.1979.4766909
De Chaumont, F., Ey, E., Torquet, N., Lagache, T., Dallongeville, S., Imbert, A., et al. (2019). Real-time analysis of the behaviour of groups of mice via a depth-sensing camera and machine learning. Nature Biomed. Eng. 3, 930–942. doi: 10.1038/s41551-019-0396-1
DeepLabCut (2018). A Software Package for Animal Pose Estimation Github Repository. Available online at: https://github.com/DeepLabCut/DeepLabCut
Dell, A. I., Bender, J. A., Branson, K., Couzin, I. D., de Polavieja, G. G., Noldus, L. P., et al. (2014). Automated image-based tracking and its application in ecology. Trends Ecol. Evol. 29, 417–428. doi: 10.1016/j.tree.2014.05.004
Dunn, T. W., Marshall, J. D., Severson, K. S., Aldarondo, D. E., Hildebrand, D. G., Chettih, S. N., et al. (2021). Geometric deep learning enables 3d kinematic profiling across species and environments. Nat. Methods 18, 564–573. doi: 10.1038/s41592-021-01106-6
Edelhoff, H., Signer, J., and Balkenhol, N. (2016). Path segmentation for beginners: an overview of current methods for detecting changes in animal movement patterns. Movem. Ecol. 4, 1–21. doi: 10.1186/s40462-016-0086-5
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 226–231.
Etemad, M., Soares, A., Etemad, E., Rose, J., Torgo, L., and Matwin, S. (2021). SWS: an unsupervised trajectory segmentation algorithm based on change detection with interpolation kernels. Geoinformatica 25, 269–289. doi: 10.1007/s10707-020-00408-9
Goodwin, N. L., Choong, J. J., Hwang, S., Pitts, K., Bloom, L., Islam, A., et al. (2024). Simple behavioral analysis (SIMBA) as a platform for explainable machine learning in behavioral neuroscience. Nat. Neurosci. 27, 1411–1424. doi: 10.1038/s41593-024-01649-9
Google Active Learning Playground Github Repository (2017). Available online at: https://github.com/google/active-learning
Hong, W., Kennedy, A., Burgos-Artizzu, X. P., Zelikowsky, M., Navonne, S. G., Perona, P., et al. (2015). Automated measurement of mouse social behaviors using depth sensing, video tracking, and machine learning. Proc. Nat. Acad. Sci. 112, E5351–E5360. doi: 10.1073/pnas.1515982112
Hsu, A. I., and Yttri, E. A. (2021). B-SOiD, an open-source unsupervised algorithm for identification and fast prediction of behaviors. Nat. Commun. 12:5188. doi: 10.1038/s41467-021-25420-x
Huang, K., Han, Y., Chen, K., Pan, H., Zhao, G., Yi, W., et al. (2021). A hierarchical 3D-motion learning framework for animal spontaneous behavior mapping. Nat. Commun. 12:2784. doi: 10.1038/s41467-021-22970-y
Insafutdinov, E., Andriluka, M., Pishchulin, L., Tang, S., Levinkov, E., Andres, B., et al. (2017). “Arttrack: Articulated multi-person tracking in the wild,” in CVPR'17.
Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., and Schiele, B. (2016). “Deepercut: a deeper, stronger, and faster multi-person pose estimation model,” in European Conference on Computer Vision (ECCV) (Springer), 34–50.
Isik, S., and Unal, G. (2023). Open-source software for automated rodent behavioral analysis. Front. Neurosci. 17:1149027. doi: 10.3389/fnins.2023.1149027
Jazayeri, M., and Afraz, A. (2017). Navigating the neural space in search of the neural code. Neuron 93, 1003–1014. doi: 10.1016/j.neuron.2017.02.019
Jhuang, H., Garrote, E., Yu, X., Khilnani, V., Poggio, T., Steele, A. D., et al. (2010). Automated home-cage behavioural phenotyping of mice. Nat. Commun. 1, 1–10. doi: 10.1038/ncomms1064
Jia, Y., Li, S., Guo, X., Lei, B., Hu, J., Xu, X.-H., et al. (2022). Selfee, self-supervised features extraction of animal behaviors. Elife 11:e76218. doi: 10.7554/eLife.76218
Johnson, R. E., Linderman, S., Panier, T., Wee, C. L., Song, E., Herrera, K. J., et al. (2020). Probabilistic models of larval zebrafish behavior reveal structure on many scales. Curr. Biol. 30, 70–82. doi: 10.1016/j.cub.2019.11.026
Kabra, M., Robie, A. A., Rivera-Alba, M., Branson, S., and Branson, K. (2013). Jaaba: interactive machine learning for automatic annotation of animal behavior. Nat. Methods 10, 64–67. doi: 10.1038/nmeth.2281
Karashchuk, P., Rupp, K. L., Dickinson, E. S., Walling-Bell, S., Sanders, E., Azim, E., et al. (2021). Anipose: a toolkit for robust markerless 3D pose estimation. Cell Rep. 36:109730. doi: 10.1016/j.celrep.2021.109730
Krause, J., Krause, S., Arlinghaus, R., Psorakis, I., Roberts, S., and Rutz, C. (2013). Reality mining of animal social systems. Trends Ecol. Evol. 28, 541–551. doi: 10.1016/j.tree.2013.06.002
Kwon, J., Kim, S., Kim, D.-K., Joo, J., Kim, S., Cha, M., et al. (2024). Subtle: An unsupervised platform with temporal link embedding that maps animal behavior. Int. J. Comput. Vis. 132, 4589–4615. doi: 10.1007/s11263-024-02072-0
Lauer, J., Zhou, M., Ye, S., Menegas, W., Schneider, S., Nath, T., et al. (2022). Multi-animal pose estimation, identification and tracking with deeplabcut. Nat. Methods 19, 496–504. doi: 10.1038/s41592-022-01443-0
Li, J., and Shlizerman, E. (2020a). Iterate & cluster: iterative semi-supervised action recognition. arXiv [preprint] arXiv:2006.06911. doi: 10.48550/arXiv.2006.06911
Li, J., and Shlizerman, E. (2020b). Sparse semi-supervised action recognition with active learning. arXiv [preprint] arXiv:2012.01740. doi: 10.48550/arXiv.2012.01740
Luxem, K., Mocellin, P., Fuhrmann, F., Kürsch, J., Miller, S. R., Palop, J. J., et al. (2022). Identifying behavioral structure from deep variational embeddings of animal motion. Commun. Biol. 5:1267. doi: 10.1038/s42003-022-04080-7
Lynch, E., Angeloni, L., Fristrup, K., Joyce, D., and Wittemyer, G. (2013). The use of on-animal acoustical recording devices for studying animal behavior. Ecol. Evol. 3, 2030–2037. doi: 10.1002/ece3.608
Marks, M., Jin, Q., Sturman, O., von Ziegler, L., Kollmorgen, S., von der Behrens, W., et al. (2022). Deep-learning-based identification, tracking, pose estimation and behaviour classification of interacting primates and mice in complex environments. Nat. Mach. Intellig. 4, 331–340. doi: 10.1038/s42256-022-00477-5
Marques, J. C., Lackner, S., Félix, R., and Orger, M. B. (2018). Structure of the zebrafish locomotor repertoire revealed with unsupervised behavioral clustering. Curr. Biol. 28, 181–195. doi: 10.1016/j.cub.2017.12.002
Mathis, A., Mamidanna, P., Cury, K. M., Abe, T., Murthy, V. N., Mathis, M. W., et al. (2018). Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 21:1281. doi: 10.1038/s41593-018-0209-y
McCullough, M. H., and Goodhill, G. J. (2021). Unsupervised quantification of naturalistic animal behaviors for gaining insight into the brain. Curr. Opin. Neurobiol. 70, 89–100. doi: 10.1016/j.conb.2021.07.014
McInnes, L., Healy, J., Saul, N., and Großberger, L. (2018). Umap: Uniform manifold approximation and projection. J. Open Source Softw. 3:861. doi: 10.21105/joss.00861
Meyer, F. (1994). Topographic distance and watershed lines. Signal Proc. 38, 113–125. doi: 10.1016/0165-1684(94)90060-4
Monosov, I. E., Zimmermann, J., Frank, M. J., Mathis, M. W., and Baker, J. T. (2024). Ethological computational psychiatry: challenges and opportunities. Curr. Opin. Neurobiol. 86:102881. doi: 10.1016/j.conb.2024.102881
Morrow-Tesch, J., Dailey, J., and Jiang, H. (1998). A video data base system for studying animal behavior. J. Anim. Sci. 76, 2605–2608. doi: 10.2527/1998.76102605x
Murari, K. (2019). “Recurrent 3D convolutional network for rodent behavior recognition,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Brighton: IEEE), 1174–1178.
Nakamura, T., Matsumoto, J., Nishimaru, H., Bretas, R. V., Takamura, Y., Hori, E., et al. (2016). A markerless 3D computerized motion capture system incorporating a skeleton model for monkeys. PLoS ONE 11:e0166154. doi: 10.1371/journal.pone.0166154
Nath, T., Mathis, A., Chen, A. C., Patel, A., Bethge, M., and Mathis, M. W. (2019). Using deeplabcut for 3D markerless pose estimation across species and behaviors. Nat. Protoc. 14, 2152–2176. doi: 10.1038/s41596-019-0176-0
OpenLabCluster (2022). Github Repository. Available online at: https://github.com/shlizee/OpenLabCluster
Pereira, T. D., Aldarondo, D. E., Willmore, L., Kislin, M., Wang, S. S.-H., Murthy, M., et al. (2019). Fast animal pose estimation using deep neural networks. Nat. Methods 16, 117–125. doi: 10.1038/s41592-018-0234-5
Pereira, T. D., Tabris, N., Matsliah, A., Turner, D. M., Li, J., Ravindranath, S., et al. (2022). Sleap: A deep learning system for multi-animal pose tracking. Nat. Methods 19, 486–495. doi: 10.1038/s41592-022-01426-1
Rousseau, J., Van Lochem, P., Gispen, W., and Spruijt, B. (2000). Classification of rat behavior with an image-processing method and a neural network. Behav. Res. Methods Instrum. Comp. 32, 63–71. doi: 10.3758/BF03200789
Sakata, S. (2023). Salsa: a combinatory approach of semi-automatic labeling and long short-term memory to classify behavioral syllables. eNeuro 10:ENEURO.0201-23.2023. doi: 10.1523/ENEURO.0201-23.2023
Sarfraz, S., Murray, N., Sharma, V., Diba, A., Van Gool, L., and Stiefelhagen, R. (2021). “Temporally-weighted hierarchical clustering for unsupervised action segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Nashville, TN: IEEE), 11225–11234.
Schneider, S., Lee, J. H., and Mathis, M. W. (2023). Learnable latent embeddings for joint behavioural and neural analysis. Nature 617, 360–368. doi: 10.1038/s41586-023-06031-6
Segalin, C., Williams, J., Karigo, T., Hui, M., Zelikowsky, M., Sun, J. J., et al. (2021). The mouse action recognition system (MARS) software pipeline for automated analysis of social behaviors in mice. Elife 10:e63720. doi: 10.7554/eLife.63720
Sener, O., and Savarese, S. (2018). “Active learning for convolutional neural networks: A core-set approach,” in International Conference on Learning Representations.
Settles, B. (2009). Active Learning Literature Survey. Technical report, University of Wisconsin-Madison, Department of Computer Sciences.
Settles, B. (2012). Active learning. Synthesis Lect Artif Intellig Mach Learn. 6, 1–114. doi: 10.1007/978-3-031-01560-1
Stern, U., He, R., and Yang, C.-H. (2015). Analyzing animal behavior via classifying each video frame using convolutional neural networks. Sci. Rep. 5, 1–13. doi: 10.1038/srep14351
Sturm, V., Efrosinin, D., Efrosinina, N., Roland, L., Iwersen, M., Drillich, M., et al. (2019). A chaos theoretic approach to animal activity recognition. J. Mathem. Sci. 237, 730–743. doi: 10.1007/s10958-019-04199-9
Sturman, O., von Ziegler, L., Schläppi, C., Akyol, F., Privitera, M., Slominski, D., et al. (2020). Deep learning-based behavioral analysis reaches human accuracy and is capable of outperforming commercial solutions. Neuropsychopharmacology 45, 1942–1952. doi: 10.1038/s41386-020-0776-y
Su, K., Liu, X., and Shlizerman, E. (2020). “Predict and cluster: Unsupervised skeleton based action recognition,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (Seattle, WA: IEEE).
Su, K., and Shlizerman, E. (2020). Clustering and recognition of spatiotemporal features through interpretable embedding of sequence to sequence recurrent neural networks. Front. Artif. Intellig. 3:70. doi: 10.3389/frai.2020.00070
Sun, J. J., Kennedy, A., Zhan, E., Anderson, D. J., Yue, Y., and Perona, P. (2021). “Task programming: Learning data efficient behavior representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (Nashville, TN: IEEE), 2876–2885.
Taiwanica, Z. (2000). Ethom: event-recording computer software for the study of animal behavior. Acta Zool. Taiwanica 11, 47–61.
Tillmann, J. F., Hsu, A. I., Schwarz, M. K., and Yttri, E. A. (2024). A-soid, an active-learning platform for expert-guided, data-efficient discovery of behavior. Nat. Methods 21, 703–711. doi: 10.1038/s41592-024-02200-1
Tseng, A., Sun, J. J., and Yue, Y. (2022). “Automatic synthesis of diverse weak supervision sources for behavior analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (New Orleans, LA: IEEE), 2211–2220.
Usman, B., Tagliasacchi, A., Saenko, K., and Sud, A. (2022). “Metapose: Fast 3D pose from multiple views without 3D supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (New Orleans, LA: IEEE), 6759–6770.
van Dam, E. A., Noldus, L. P., and van Gerven, M. A. (2020). Deep learning improves automated rodent behavior recognition within a specific experimental setup. J. Neurosci. Methods 332:108536. doi: 10.1016/j.jneumeth.2019.108536
Weber, R. Z., Mulders, G., Kaiser, J., Tackenberg, C., and Rust, R. (2022). Deep learning-based behavioral profiling of rodent stroke recovery. BMC Biol. 20:232. doi: 10.1186/s12915-022-01434-9
Weinreb, C., Pearl, J. E., Lin, S., Osman, M. A. M., Zhang, L., Annapragada, S., et al. (2024). Keypoint-moseq: parsing behavior by linking point tracking to pose dynamics. Nat. Methods 21, 1329–1339. doi: 10.1038/s41592-024-02318-2
Wiltschko, A. B., Tsukahara, T., Zeine, A., Anyoha, R., Gillis, W. F., Markowitz, J. E., et al. (2020). Revealing the structure of pharmacobehavioral space through motion sequencing. Nat. Neurosci. 23, 1433–1443. doi: 10.1038/s41593-020-00706-3
Wu, A., Buchanan, E. K., Whiteway, M., Schartner, M., Meijer, G., Noel, J.-P., et al. (2020). Deep graph pose: a semi-supervised deep graphical model for improved animal pose tracking. Adv. Neural Inf. Process. Syst. 33, 6040–6052. doi: 10.1101/2020.08.20.259705
Xu, C., Govindarajan, L. N., Zhang, Y., and Cheng, L. (2017). Lie-x: Depth image based articulated object pose estimation, tracking, and action recognition on lie groups. Int. J. Comput. Vis. 123, 454–478. doi: 10.1007/s11263-017-0998-6
Ye, S., Filippova, A., Lauer, J., Schneider, S., Vidal, M., Qiu, T., et al. (2024). Superanimal pretrained pose estimation models for behavioral analysis. Nat. Commun. 15:5165. doi: 10.1038/s41467-024-48792-2
Yemini, E., Jucikas, T., Grundy, L. J., Brown, A. E., and Schafer, W. R. (2013). A database of Caenorhabditis elegans behavioral phenotypes. Nat. Methods 10, 877–879. doi: 10.1038/nmeth.2560
Zhang, L., Dunn, T., Marshall, J., Olveczky, B., and Linderman, S. (2021). “Animal pose estimation from video data with a hierarchical von mises-fisher-gaussian model,” in International Conference on Artificial Intelligence and Statistics (New York: PMLR), 2800–2808.
Zhou, T., Cheah, C. C. H., Chin, E. W. M., Chen, J., Farm, H. J., Goh, E. L. K., et al. (2023). Contrastivepose: a contrastive learning approach for self-supervised feature engineering for pose estimation and behavioral classification of interacting animals. Comput. Biol. Med. 165:107416. doi: 10.1016/j.compbiomed.2023.107416
Keywords: graphic user interface (GUI) for behavior recognition, animal behavior analysis, active learning, semi-supervised learning, efficient behavior recognition
Citation: Li J, Keselman M and Shlizerman E (2025) OpenLabCluster: active learning based clustering and classification of animal behaviors based on kinematic body keypoints. Front. Syst. Neurosci. 19:1630654. doi: 10.3389/fnsys.2025.1630654
Received: 18 May 2025; Accepted: 14 August 2025; Published: 31 October 2025.
Edited by:
Konstantinos Slavakis, Institute of Science Tokyo, Japan
Reviewed by:
Shuzo Sakata, University of Strathclyde, United Kingdom
Eric Allen Yttri, Carnegie Mellon University, United States
Copyright © 2025 Li, Keselman and Shlizerman. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Jingyuan Li, jingyli6@uw.edu