Automatic Behavior and Posture Detection of Sows in Loose Farrowing Pens Based on 2D-Video Images

The monitoring of farm animals and the automatic recognition of deviant behavior have recently become increasingly important in farm animal science research and in practical agriculture. The aim of this study was to develop an approach to automatically predict the behavior and posture of sows by using a 2D image-based deep neural network (DNN) for the detection and localization of relevant sow and pen features, followed by a hierarchical conditional statement based on human expert knowledge for behavior/posture classification. The automatic detection of sow body parts and pen equipment was trained using an object detection algorithm (YOLO V3). The algorithm achieved an Average Precision (AP) of 0.97 (straw rack), 0.97 (head), 0.95 (feeding trough), 0.86 (jute bag), 0.78 (tail), 0.75 (legs) and 0.66 (teats). The conditional statement, which classifies and automatically generates a posture or behavior of the sow under consideration of contextual, temporal and geometric values of the detected features, classified 59.6% of the postures (lying lateral, lying ventral, standing, sitting) and behaviors (interaction with pen equipment) correctly. In conclusion, the results indicate the potential of DNNs for automatic behavior classification from 2D videos as a basis for an automatic farrowing monitoring system.


INTRODUCTION
The monitoring of farm animals and the automatic detection of abnormal behavior have recently gained considerable importance in farm animal science research. In practical agriculture, abnormal behavior can be used, for example, as an indicator of cycle-related hormonal changes (Widowski et al., 1990) or of the occurrence of diseases (Weary et al., 2009). Compared to manual data collection, the advantage of automatic recording/sensor systems is that the documentation is continuous and objective and can lead to significant time and cost savings (Cornou and Kristensen, 2013). With the progressive development of machine learning methods, especially in digital image processing, the possibility of contactless, continuous monitoring of animals is emerging. Computer Vision (CV) algorithms allow monitoring of the entire visible body of an animal, are non-invasive, do not influence the animal and are theoretically not limited in their runtime (external power supply). If real-time processing is possible, the video data can be processed via live stream and do not have to be saved. The relevant data output can be backed up with a small amount of storage, e.g., by saving it as a table. Such capabilities are consistent with most of the characteristics required for a sensor to assess animal welfare (Rushen et al., 2012) and have already led to several research approaches using different camera systems and algorithms of varying complexity (examples are described in detail in the following passages). These approaches can be separated by camera type (e.g., 2D images, 3D depth images) and by type of monitoring [e.g., single pigs with detailed behavioral observation (mainly sows) or multiple pigs and animal interactions].
For multiple pigs, Viazzi et al. (2014) developed a detection algorithm for aggressive behaviors among fattening pigs using the intensity and specific patterns of movement recognized by 2D CV algorithms. Nasirahmadi et al. (2019) developed a 2D CV and deep learning-based method to detect the standing and lying postures of multiple pigs under varying conditions. Matthews et al. (2017) used 3D data to detect basic behaviors such as standing, feeding, and drinking with individual tracking among group-housed pigs in order to obtain information about individual animal health and welfare aspects. Approaches for single pigs can be found in the farrowing sector. One example is the monitoring of the prenatal behavior of sows to estimate the onset of farrowing, based on naturally occurring behavior deviations such as nest-building and the varying number of position changes before farrowing, using 3D accelerometers or 2D pixel movement (Pastell et al., 2016; Traulsen et al., 2018; Küster et al., 2020). More detailed approaches focused on the inter-birth interval and the prevention of asphyxia, on the counting of piglets, or on the postpartum lying behavior of sows with regard to preventing piglet crushing and gathering information on nursing behavior from 2D images and 3D depth images (Okinda et al., 2018; Zheng et al., 2018). Leonard et al. (2019) developed an accurate algorithm for sow postures such as sitting, standing, kneeling and lying as well as behaviors such as feeding and drinking, using a 3D depth threshold-based decision algorithm on the body regions of crated sows. In summary, most of the approaches addressing behavior or posture detection use 3D cameras as data input.
The high data transmission required by 3D cameras calls for complex setups, which are prone to errors in practical production systems. With the objective of avoiding such setups, the aim of this study was to determine whether the behavior and posture of sows without a farrowing crate during late pregnancy can be automatically detected by CV in 2D images acquired from a simpler, less error-prone surveillance system. The predicted postures/behaviors could be used, for example, to achieve breeding goals with regard to maternal behavior, to optimize birth management or to enable timely birth support for sows and piglets and decrease diseases and losses (Welp, 2014), as well as to gain a better understanding of sow interactions with different pen environments/equipment without the restrictions of a farrowing crate.

Video Material and Editing
The RGB-video material was recorded at the agriculture research farm Futterkamp of the Chamber of Agriculture of Schleswig-Holstein from April 2016 to January 2017 in a group-housing farrowing compartment. Each sow had her own pen [6.24 m² (large version), 5.28 m² (small version)] for farrowing (Figure 1) and the first days of the suckling period. Before farrowing, all sows could move freely between the pen areas and the common area of the compartment until 3 days antepartum (a.p.) (for more information about the group-housing compartment see Grimberg-Henrici et al., 2019; Lange et al., 2020). The cameras (Axis M3024LVE) were placed as centrally as possible on one side of each pen (see Figure 1) and recorded 24/7, with IR light during the night. The data of six cameras were stored on one Synology® network attached storage (NAS) with 8 TB storage space via an Ethernet cable (25 m) connection with Power over Ethernet (PoE). Since the pens were reconstructed during data acquisition, there were two different designs for the piglet nest type and location in the farrowing pens (new and old, see Figure 1). The sows were not restrained and were able to move freely in the pen area during the whole time. Randomly chosen videos of eleven sows, from 2 days before farrowing until the onset of farrowing, recorded with a resolution of 1,280 × 800 pixels and five frames per second, were used for training and evaluation in this study.
Using a Python script (Version 3.6), 1,500 images were randomly (uniform distribution) extracted from the videos within the previously defined subset (48 h a.p. until partum), comprising 525 images with the new and 975 images with the old pen design. The relevant objects for posture detection (head, tail, legs, teats) and for interaction detection (head, trough, straw rack and jute bag) were annotated manually by two individuals in the form of rectangular "Bounding Boxes" (BB's) using the open source tool "Ybat" 1 (Figure 2). Since annotation in the form of rectangular BB's was difficult to carry out for large objects without marking a large part of the entire image, the classes jute bag and teats were annotated differently. The jute bag was annotated only at the top of the pen wall where it was attached. The teats were annotated in varying form or in several BB's, according to their appearance. After annotation, the data set was randomly divided into a training set of 63% (945 images) and a test set of 37% (555 images), based on the common split of 2/3 (training set) to 1/3 (test set) (Witten et al. 2011, p. 152). Detailed information about the dataset structure for YOLO training and evaluation can be found in Table 1. Note that the classes jute bag and legs can occur more than once per image (legs up to four times and the jute bag up to two times, because the bag of the neighboring pen was partly visible too).
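The random division into training and test set described above can be sketched as follows. This is a minimal illustration, not the script used in the study: the function name `split_dataset`, the file stems and the fixed seed are assumptions made for the example; only the set sizes (945 and 555 out of 1,500 images) come from the text.

```python
import random


def split_dataset(image_ids, n_train, seed=0):
    """Shuffle the image ids and split them into a training and a test set.

    The seed is a hypothetical choice that only makes the example repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(image_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical file stems standing in for the 1,500 annotated images.
image_ids = [f"img_{i:04d}" for i in range(1500)]
train_ids, test_ids = split_dataset(image_ids, n_train=945)
print(len(train_ids), len(test_ids))
```

A purely random split like this does not control how often each class or pen design appears in either set, which matches the uncontrolled heterogeneity of the data set noted later in the discussion.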

Training Object Detection
The annotated dataset was used without further conversions for the training process of the "You Only Look Once"-Version 3 (YOLO V3) object detection algorithm (Redmon et al., 2016). YOLO V3 was implemented using the Darknet framework (Redmon and Farhadi, 2018) via the Jupyter notebook development environment (Kluyver et al., 2016) on a computer with an 11 GB GPU. Darknet-53 was used as the backbone of the neural network and only the YOLO layers were fine-tuned with the present data.
As starting weights, pre-existing YOLO V3 weights 2 , which are pre-trained on the ImageNet dataset 3 , were used. In total, 14,000 iterations with a batch size of 64 were performed to fine-tune the weights for the given detection tasks. The learning rate was initially set to 0.001 and was multiplied by 0.1 after 11,200 iterations (80% of total iterations) and again after 12,600 iterations (90%). The input images were downsampled to a resolution of 352 × 352 px. Following the first 1,000 iterations, the Average Precision (AP) and the F1-Score on the test set were automatically determined after every 500th iteration with an Intersection over Union (IoU) of ≥ 0.5 (Everingham et al., 2010) and a Class Confidence Score (YOLO V3 threshold) of ≥ 0.5 for each class (Redmon et al., 2016) (see Figure 5). The formulas for the evaluation metrics can be found in Table 2.
2 https://pjreddie.com/media/files/yolov3.weights (accessed March 22, 2021).
3 https://www.image-net.org/ (accessed March 3, 2021).
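The learning-rate steps and the IoU criterion described above can be illustrated with a short sketch. It is not the Darknet implementation: the function names, the convention that a step takes effect once the iteration count reaches the step value, and the box layout `(x_min, y_min, x_max, y_max)` are assumptions made for this example.

```python
def learning_rate(iteration, base_lr=0.001, steps=(11200, 12600), scale=0.1):
    """Step schedule: multiply the rate by `scale` at each step reached."""
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= scale
    return lr


def iou(box_a, box_b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clipped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A predicted box counts as a match when its IoU with the annotation is >= 0.5.
predicted = (0, 0, 10, 10)
annotated = (5, 0, 15, 10)
is_match = iou(predicted, annotated) >= 0.5
```

In this example the two boxes overlap by half of their width, which gives an IoU of one third, so the prediction would not count as a match under the 0.5 threshold.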

Classification of an Interaction or a Posture
The posture or the behavior of a sow was classified within the test set according to an ethogram reflecting typical definitions for manual behavior observation (Table 3). Following these behavioral definitions and taking human expert knowledge into account, a hierarchical conditional statement was developed.
Using the detected objects (body parts and pen equipment), their positions (pixel coordinates of the corresponding BB) as recognized by the trained model (Section Training Object Detection), and their distances to each other, this statement was implemented in a deterministic algorithm that assigns a behavior or a body posture from the ethogram to each image. Behaviors involving interactions with the trough (eating/drinking) and interactions with the jute bag or straw rack (nest-building behavior) are summarized as "interactions," since images showing these behaviors are underrepresented in the data set because sows are sitting or lying about 85-90% of the day (Lao et al., 2016). The algorithm can be subdivided into (i) the analysis of the pen environment (feeding trough, jute bag, straw rack) in relation to the position of the head (distance, activation area) in order to classify interactions (see Figure 3) and (ii) the analysis of the distances of the sow body parts [teats, head, tail and leg(s)] to each other in order to classify a posture (see Figure 4). In its first step (i), the algorithm checks in hierarchical order (1. trough, 2. jute bag, 3. straw rack) whether pen classes are detected, where they are and whether the head is within the activation area. Note that the "feeding/drinking" interaction is handled differently, since the feeding trough is not visible, due to the viewing angle, when the sow is feeding or drinking. Therefore, this path of the algorithm is triggered when no trough is detected and the head is next to the coordinates of the last detected trough (since the trough is fixed). If no interaction is classified, the second part of the algorithm (ii) loops, also in hierarchical order [1. teats, 2. head, 3. tail, 4. leg(s)], through the detected body parts. Every posture has several unique paths, which are triggered depending on the composition of the other body parts (see Figure 4). At the end of each path, a posture is assigned to the image. If YOLO V3 does not detect a body part, the image is classified as "Not classified." The performance of this multiclass classification was evaluated using the metrics in Table 2.

FIGURE 3 | Illustration of the hierarchical conditional statement for classification of interactions [eating/drinking with the trough or nest-building behavior (NBB) with jute bag or straw rack]. White arrows mean "yes," black arrows mean "no" and striped arrows highlight calculation/threshold steps. The pink highlighted path is an example of the situations in Figure 8C (interaction with jute bag) and Figure 8E (interaction with jute bag classified, although the sow is actually lying with a displaced jute bag).

Frontiers in Animal Science | www.frontiersin.org

TABLE 2 | [...] mAP: where AP(k) is the AP of class k and n is the number of classes. Weighted: the weighted metrics for multiclass classification of unbalanced datasets, where b is a behavior class, f the frequency and M a metric score.
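Step (i) of the conditional statement, as described in the text, can be sketched roughly as follows. This is a simplified illustration and not the authors' algorithm: the activation area is assumed here to be the equipment box enlarged by a fixed pixel margin, the value of `margin` is hypothetical, each class is assumed to appear at most once per image, and all function names are invented for the example.

```python
def center(box):
    """Center point of a box given as (x_min, y_min, x_max, y_max)."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def in_activation_area(head, equipment, margin):
    """True if the head center lies inside the equipment box enlarged by margin."""
    cx, cy = center(head)
    return (equipment[0] - margin <= cx <= equipment[2] + margin
            and equipment[1] - margin <= cy <= equipment[3] + margin)


def classify_interaction(detections, last_trough, margin=50):
    """Return an interaction label, or None to hand over to the posture step.

    detections maps a class name to one bounding box; last_trough is the most
    recently detected trough box (the trough is fixed in the pen).
    """
    head = detections.get("head")
    if head is None:
        return None
    # 1. Trough: it is hidden while the sow feeds, so this path is only
    #    triggered when no trough is detected in the current image.
    if ("trough" not in detections and last_trough is not None
            and in_activation_area(head, last_trough, margin)):
        return "interaction (trough)"
    # 2. Jute bag, 3. straw rack: checked in hierarchical order.
    for name in ("jute bag", "straw rack"):
        box = detections.get(name)
        if box is not None and in_activation_area(head, box, margin):
            return f"interaction ({name})"
    return None


last_trough = (100, 100, 200, 150)
feeding = classify_interaction({"head": (120, 140, 180, 200)}, last_trough)
nesting = classify_interaction(
    {"head": (600, 300, 660, 360), "trough": (100, 100, 200, 150),
     "jute bag": (620, 250, 700, 290)}, last_trough)
```

Because every rule is an explicit comparison, a wrong label can be traced to a single condition, which is the transparency the deterministic approach aims for. The posture step (ii) is not sketched here because its paths are defined in Figure 4 rather than in the text.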

Evaluation of Video Sequences
For the automatic evaluation of video sequences, the algorithm considers contextual, geometric and temporal relationships of successive images taken at a consistent interval. In order to verify the resulting detections and derive a qualified label for each image, we implemented plausibility checks. One verification method is to check for a reasonable number of detected classes and their geometrical context (e.g., is there more than one head detected? If yes, determine which head is closest to the other detected features). After saving an interaction or a posture, the algorithm performs a second plausibility check using a temporal threshold [a new posture/interaction needs to last more than 2 s, based on the defined duration of a drink nipple visit (Kashiha et al., 2013)]. If this threshold is not reached, the posture/interaction is deleted and the image is assigned the last classification (Table 4 and Equation 1). Equation 1: Temporal threshold: where X_0 is the current image and X_-1 and X_-2 are the previous and pre-previous images, respectively.
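The temporal plausibility check can be pictured with the following sketch. It is not a reconstruction of Equation 1: it assumes one label per second and interprets "more than 2 s" as a run of at least three consecutive frames with the same label; the function name `smooth_labels` and the parameter `min_frames` are hypothetical.

```python
from itertools import groupby


def smooth_labels(labels, min_frames=3):
    """Replace runs shorter than min_frames with the last accepted label."""
    smoothed = []
    for label, run in groupby(labels):
        n = len(list(run))
        if n < min_frames and smoothed:
            # Threshold not reached: fall back to the previous classification.
            smoothed.extend([smoothed[-1]] * n)
        else:
            smoothed.extend([label] * n)
    return smoothed


sequence = (["lying lateral"] * 4 + ["standing"]
            + ["lying lateral"] * 3 + ["standing"] * 3)
cleaned = smooth_labels(sequence)
```

In this example the single "standing" frame in the middle is overwritten by "lying lateral", while the final run of three "standing" frames is kept. A short run at the very start of a sequence is left unchanged because there is no earlier classification to fall back on.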

RESULTS

YOLO V3 Object Detection Algorithm
The accuracy of the detection of the trained classes (body parts and pen equipment) with YOLO V3 is given in the form of the Average Precision (AP), defined as the Area under Curve (AuC) of the recall-precision graph, evaluated on the test set (555 images). The mean Average Precision (mAP), which is the arithmetic mean of the AP of all classes, was 0.84. The pen equipment was detected with an AP of 0.97 (straw rack), 0.95 (feeding trough) and 0.86 (jute bag). The body parts showed more diverse accuracy values: 0.97 (head), 0.78 (tail), 0.75 (legs) and 0.66 (teats) (see Figure 5). During the training process, the average Intersection over Union (IoU) of all classes improved from 0.59 to 0.69 and the average F1-Score of all classes, which represents the harmonic mean of precision and recall, rose from 0.73 to 0.88. An i7 CPU with 2.1 GHz needs on average 2.57 s per frame of the test set to predict the boxes and save them to a table.
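How an AP value arises as the area under the recall-precision graph, and how the mAP follows from it, can be sketched as below. The sketch uses an all-point interpolation of the precision values, which is one common convention and not necessarily the one used by the Darknet framework; the function names and the toy detections are assumptions, and `n_ground_truth` is assumed to be greater than zero.

```python
def average_precision(scores, is_true_positive, n_ground_truth):
    """Area under the interpolated recall-precision curve of one class."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_ground_truth)
        precisions.append(tp / (tp + fp))
    # Interpolation: make precision non-increasing from right to left.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Arithmetic mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Toy example: three detections of one class, two annotated objects.
ap_toy = average_precision([0.9, 0.8, 0.7], [True, False, True], 2)
map_toy = mean_average_precision({"class a": 0.5, "class b": 1.0})
```

Whether a detection counts as a true positive is decided beforehand by the IoU and confidence thresholds described in the training section.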

Behavior/Posture Classification
The classification of posture/interaction of sows by the deterministic algorithm with the implemented hierarchical conditional statement was evaluated on the test set (555 images), since the performance level of the object detection is known for this data. The evaluation metrics (among them Specificity [True Negative Rate] and F1-Score) are given for the postures lying lateral, lying ventral, standing and sitting, for the behaviors summarized as interactions, and for no classification, together with a weighted metric score over all classes, weighted by their occurrence in the test set (Table 6). In total, the overall accuracy was 59.6% on the test set (Table 5). It is notable that the postures lying ventral, standing, and sitting account for over 80% of the total FP predictions of the algorithm, even though they make up only 30% of the data set (Table 5). These results are confirmed by the low values for Precision, Recall and F1-Score for these three classes (Table 6). An i7 CPU with 2.1 GHz needs on average 0.047 s for the classification step of one frame of the test set.

FIGURE 4 | [...] Figure 8A (lying ventral classified, although the sow is actually standing), Figure 8B (standing with detected head) and Figure 8D (standing without detected head).

TABLE 4 | [...] The bold values indicate an example where the temporal threshold was not reached and the image "x-1" gets a new posture assignment.
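The per-class metrics and their frequency-weighted average can be computed from a confusion matrix as sketched below. The two-class matrix is invented toy data and not taken from Table 5; the function names are hypothetical, and the weighting (each class score multiplied by the number of test images of that class, divided by the total) is an assumption based on the description of the weighted metric.

```python
def per_class_metrics(confusion, classes):
    """confusion[true_class][predicted_class] holds image counts."""
    total = sum(confusion[t][p] for t in classes for p in classes)
    metrics = {}
    for c in classes:
        tp = confusion[c][c]
        fn = sum(confusion[c][p] for p in classes) - tp
        fp = sum(confusion[t][c] for t in classes) - tp
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = {"precision": precision, "recall": recall,
                      "specificity": specificity, "f1": f1,
                      "frequency": tp + fn}
    return metrics


def weighted_score(metrics, name):
    """Average of one metric over all classes, weighted by class frequency."""
    total = sum(m["frequency"] for m in metrics.values())
    return sum(m["frequency"] * m[name] for m in metrics.values()) / total


# Toy confusion matrix (hypothetical counts, not the study data).
classes = ["lying lateral", "standing"]
confusion = {
    "lying lateral": {"lying lateral": 8, "standing": 2},
    "standing": {"lying lateral": 1, "standing": 4},
}
metrics = per_class_metrics(confusion, classes)
weighted_f1 = weighted_score(metrics, "f1")
```

With unbalanced classes, the weighted score is dominated by the frequent classes, so per-class values should be inspected alongside it.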

Diurnal Amount of Interactions and Postures Changes
As an example, the mean amount in minutes per hour of postures/interactions of one sow for 48 consecutive hours (48 h a.p.-partum) was evaluated. One frame per second was [...] (Hartsock and Barczewski, 1997). Compared to the 1st daytime (34 h a.p.−22 h a.p.), the amount of standing and interactions rises while the amount of lying lateral decreases. Figure 7 shows the variation of posture and behavior changes per hour of the same sow in 8-h intervals with regard to daytime and time until farrowing. In particular, the last 16 h before the onset of farrowing differ from the first 32 h, with a higher mean amount and a lower variance between the hours within an interval.

Notes to Tables 5 and 6: Overall accuracy 0.596. In total, 552 out of 555 images were classified with a body posture/behavior, while three images resulted in no detected features and were classified as "Not classified." The gray elements show the number of correct classifications for each behavior/posture. Overall accuracy = sum of correct classifications/total classifications. All values are rounded to the second decimal. The metric formulas can be found in Table 2.

Figure 8 shows three examples of failed classifications due to problems in feature detection (Figures 8A-D) or displacement of the normally fixed class jute bag (Figure 8E). Whenever YOLO V3 does not detect all visible features in an image, the classification algorithm can produce a wrong output for this image. Images A and B show the same sow in consecutive frames. In both images, the sow is standing. In image A, YOLO V3 fails to detect the left front leg of the sow, which results in a wrong posture classification (lying ventral). Images C and D also show the same sow and situation (nest-building behavior/interaction in IR-mode vision). Because the head is not detected in image D, it cannot be located in the activation area of the jute bag, and the posture is classified as standing.
Image E shows a situation where the algorithm detects and classifies correctly according to its rules, but the jute bag has fallen from the pen wall and changed its position. With the position change of the jute bag, the regularities assumed by the statement changed, which was not considered in the development and therefore results in the classification of an interaction while the sow is lying.

DISCUSSION
The results show that the present approach to automatically analyzing 2D video sequences of sow behavior before farrowing works. To our knowledge, this was the first attempt at fine-tuning a pre-trained network for object detection in combination with a deterministic behavior classification. The benefits are that the need for large annotated data sets for object detection can be bypassed by using a pre-trained network (Shin et al., 2016) and that the classification using the hierarchical conditional statement is transparent and easily adaptable, also to changing human expert knowledge about sow behavior. Another benefit is that videos can be used without pre-processing, which can be beneficial toward real-time execution. The implementation of plausibility checks enables the individual analysis of videos even if there is an additional sow next to the focus sow in the videos, which can be very helpful when videos also show areas of pens adjacent to the focus pen. The training effect (Figure 5) was relatively small, which indicates the good capability of the pre-trained YOLO V3 weights to detect shapes and objects similar to those in our tasks. The pen equipment was recognized with a high AP. This is partly due to the good visibility, but can also be explained by the static placement of the pen equipment. The body parts were partly recognized with a lower AP, but all above 0.6. Especially the classes teats (0.66), leg(s) (0.75) and tail (0.78) can be optimized. Their mobility as well as their biological variance and the change of their position in relation to the camera (viewing angle) make them more difficult to annotate and detect correctly. The main reason for reaching only average accuracy (59.6% overall) is, in our opinion, the behavior/posture classification step. The dataset was created by selecting single images with a uniform distribution, so that the heterogeneity and variance of the visible situations were not controlled.
Machine learning classification methods such as decision trees or support vector machines might increase accuracy, but need a larger dataset with increased homogeneity in terms of class quantity and higher variation in terms of class appearance. To meet these needs, methods of dataset augmentation and expansion, with techniques like geometrical transformations, flipping and rotation, might be helpful (Cubuk et al., 2019). A positive attribute of the conditional statement is that it is easier to transfer to other husbandry conditions than a machine learning classification approach. As Figure 8E shows, the classification is adaptable when it comes to differences in the positions of pen equipment (e.g., trough, straw rack or jute bag). When objects differ in design, the object detection step still has to be retrained to minimize accuracy losses. Since the data is unbalanced, the weighted F1-Score, which combines precision and recall, should be seen as the best metric to describe the performance of the present approach. It is remarkable that especially the differentiation of the postures lying ventral, standing and sitting (see Tables 5, 6) is insufficient. The problem of the differentiation arises from the fact that in the mentioned postures almost the same body parts are visible from the top-view camera perspective (head, tail, no teats, two or fewer legs). Nasirahmadi et al. (2019) found the same problems for the differentiation of standing and lying on the belly for fattening pigs. This leads to the assumption that the differentiation of these postures is more complex than just combining visible body parts with one another. More information on these behavior classes would be necessary for automatic behavioral analysis systems.
Toward a practical implementation, the overall accuracy needs to be optimized. Machine learning methods for behavior/posture detection should be tested. Furthermore, a good performance of the first step (object detection) is fundamental for the second step and therefore the basis for the overall accuracy. To improve this step, a polygon-based annotation such as that of Bolya et al. (2019) might be beneficial, as BB's also include non-object information, particularly for objects that are large and varying in position and shape such as the teats of a sow. This could optimize the accuracy of annotation, detection and localization within the image. If relevant objects (especially body parts) are not visible in the image due to occlusion, key point pose estimation using a "skeleton form" could be a useful annotation approach too (Mathis et al., 2018; Graving et al., 2019; Pereira et al., 2019). The skeleton form enables the prediction of occluded body parts, which could be an additional feature for machine learning classification methods if the accuracy is sufficient. Nevertheless, for further studies an annotation guideline with precise definitions of shape and percentage of visibility for annotating polygon masks and/or key point features within a class should be considered. The annotation guidelines of the PASCAL Visual Object Classes Challenge 2007 (VOC2007) 4 could be helpful in this context. Furthermore, the execution speed of the present approach can be optimized. A refactoring of the algorithm is planned. Additionally, the newest version of YOLO (V5) appears to be promisingly faster than V3 and could be implemented.
The example of the diurnal evaluation of an individual sow fits the results of established studies well and shows great potential to identify the onset of farrowing or possible diseases, which can affect the individual diurnal pattern of behaviors/postures. As in previous work, a sow-individual diurnal pattern needs to be taken into account to detect diseases or the onset of farrowing (Cornou and Lundbye-Christensen, 2012; Küster et al., 2020).

CONCLUSION
In conclusion, an approach was developed to analyze 2D video sequences of single loose-housed sows (top view) in order to automatically output the individual postures and interactions of the sows. This solution is composed of two steps: first, object detection implemented with YOLO V3 and, second, posture and interaction classification implemented as a human knowledge-based deterministic conditional statement using spatiotemporal information with implemented plausibility checks. It enables the automatic evaluation of 2D videos without further pre-processing and has advantages when it comes to transferability to other environments, but the overall accuracy still reached only 59.6%. While the accuracy of the object detection was sufficient, but still optimizable, the implemented classification step, which was developed as a solution to overcome issues of the present dataset composition in terms of balance, variation and size, can only be seen as a proof of concept. All in all, once the suggested future work has been carried out, this approach has potential toward a practically implementable automatic behavior surveillance of sows housed in free farrowing systems based on 2D videos.

ETHICS STATEMENT
Ethical review and approval was not required for the animal study because the animals (sows) included in this study were only videotaped. Any treatment of animals in this study was in accordance with the German legal and ethical requirements of appropriate animal procedures. Written informed consent was obtained from the owners for the participation of their animals in this study.

AUTHOR CONTRIBUTIONS
SK and IT contributed to the concept and design of the study. SK organized the database and wrote the first draft of the manuscript with support and statistical supervision from CM. PN performed methodology. PN and SK carried out statistical analysis. BS and IT supervised this work. All authors contributed to manuscript revision, read, and approved the submitted version.

FUNDING
We acknowledge support by the Open Access Publication Funds of the Göttingen University.