Pediatric Otoscopy Video Screening With Shift Contrastive Anomaly Detection

Ear related concerns and symptoms represent the leading indication for seeking pediatric healthcare attention. Despite the high incidence of such encounters, the diagnostic process of commonly encountered diseases of the middle and external presents a significant challenge. Much of this challenge stems from the lack of cost effective diagnostic testing, which necessitates the presence or absence of ear pathology to be determined clinically. Research has, however, demonstrated considerable variation among clinicians in their ability to accurately diagnose and consequently manage ear pathology. With recent advances in computer vision and machine learning, there is an increasing interest in helping clinicians to accurately diagnose middle and external ear pathology with computer-aided systems. It has been shown that AI has the capacity to analyze a single clinical image captured during the examination of the ear canal and eardrum from which it can determine the likelihood of a pathognomonic pattern for a specific diagnosis being present. The capture of such an image can, however, be challenging especially to inexperienced clinicians. To help mitigate this technical challenge, we have developed and tested a method using video sequences. The videos were collected using a commercially available otoscope smartphone attachment in an urban, tertiary-care pediatric emergency department. We present a two stage method that first, identifies valid frames by detecting and extracting ear drum patches from the video sequence, and second, performs the proposed shift contrastive anomaly detection (SCAD) to flag the otoscopy video sequences as normal or abnormal. Our method achieves an AUROC of 88.0% on the patient level and also outperforms the average of a group of 25 clinicians in a comparative study, which is the largest of such published to date. We conclude that the presented method achieves a promising first step toward the automated analysis of otoscopy video.

Ear related concerns and symptoms represent the leading indication for seeking pediatric healthcare attention. Despite the high incidence of such encounters, the diagnostic process of commonly encountered diseases of the middle and external presents a significant challenge. Much of this challenge stems from the lack of cost effective diagnostic testing, which necessitates the presence or absence of ear pathology to be determined clinically. Research has, however, demonstrated considerable variation among clinicians in their ability to accurately diagnose and consequently manage ear pathology. With recent advances in computer vision and machine learning, there is an increasing interest in helping clinicians to accurately diagnose middle and external ear pathology with computer-aided systems. It has been shown that AI has the capacity to analyze a single clinical image captured during the examination of the ear canal and eardrum from which it can determine the likelihood of a pathognomonic pattern for a specific diagnosis being present. The capture of such an image can, however, be challenging especially to inexperienced clinicians. To help mitigate this technical challenge, we have developed and tested a method using video sequences. The videos were collected using a commercially available otoscope smartphone attachment in an urban, tertiary-care pediatric emergency department. We present a two stage method that first, identifies valid frames by detecting and extracting ear drum patches from the video sequence, and second, performs the proposed shift contrastive anomaly detection (SCAD) to flag the otoscopy video sequences as normal or abnormal. Our method achieves an AUROC of 88.0% on the patient level and also outperforms the average of a group of 25 clinicians in a comparative study, which is the largest of such published to date. We conclude that the presented method achieves a promising first step toward the automated analysis of otoscopy video.

INTRODUCTION
Ear related concerns constitute the leading cause for seeking pediatric healthcare attention in the USA. Given the current lack of cost-effective confirmatory testing, accurate diagnosis and subsequent management depend on visual detection of characteristic findings during otoscope examination. Despite the frequency of such encounters, the medical literature suggests that the diagnostic accuracy for such pathology is only around 46-56%. In alignment with the bias toward over-diagnosis of ear disease, it is currently estimated that between 25-50% of all antibiotics prescribed for ear disease are not indicated (1)(2)(3). Beyond risking unnecessary medical complications and the downstream unintended consequence of potential antibiotic resistance, over-diagnosis of ear disease adds an estimated $59 million in unnecessary healthcare spending in the US per annum (4). Computer-aided diagnosis on otoscopy images (5) has been suggested as a potential tool to improve the care of ear disease. Previous studies have mainly focused on applying machine learning to classify images of the eardrum. (6) compare support vector machine (SVM), k-nearest neighbor (k-NN), and decision trees on predicting ear conditions with feature extracted by filter bank, discrete cosine transform (DCT), and color coherence vector (CCV) (6). More recently, deep learning has been applied to classify otoscopy images. Zafer studies the combination of the fused fine-tuned deep features and SVM model applied to 3 diagnostic class image classifications (7).
Furthermore, (8) compare the performance of a variety of CNN architectures and find that DenseNet (9) produces the best result in the classification of 3 diagnostic classes. The authors also use Grad-CAM (10) to visualize the important regions in the image for the model.
In the study by (11), the authors compare the performance of applying 9 kinds of ImageNet (12) pre-trained CNN networks to 6 class otoscopy image classification and then further improve the performance by ensembling the output of 2 networks. The pre-mentioned deep learning based methods overall show great performance in the respective settings, achieving high accuracy ranging from 94 to 99%.
The accuracy reported by previous study, even in multiclass settings, is remarkable, yet, all previous approaches process high quality still images of the eardrum. Unfortunately, in a clinical setting the capture of high quality still images can be challenging in practice, especially given that pediatric patients are often moving and uncooperative with the examination. The use of a single image also increased the chance of failing to capture a clinically accurate representation of the anatomy due to an incomplete view. A method that analyzes otoscopy video sequences rather than still images could help overcome the aforementioned shortcomings.
The straightforward approach to have the whole video sequence as input is to train a 3D (2D images plus temporal dimension) CNN mapping from videos to labels. However, that would require (1) heuristics to harmonize videos to a preset length, and (2) large amounts of labeled data to avoid over-fitting, especially considering that the class frequency in any otoscopy video dataset is likely to be heavily skewed toward normal cases. The present study addresses these challenges by framing otoscopy interpretation as a video anomaly detection problem. Under our setting, the model is trained with normal videos only and will flag videos that diverge from the normal as an anomaly during testing.
The contribution of this study is 2-fold. First, we devise a two stage model to apply on full video sequences collected in clinics, which differentiates our approach from all previous methods that only considered high quality still images. A region detector first extracts eardrum patches from video frames that are then used to perform anomaly detection in the following stage. This architecture restricts anomaly detection to semantically useful regions. Second, we develop the shift contrastive anomaly detection (SCAD) method that leverages color-jitter-based distributional shift-transformation to improve the separability between normal and abnormal data. Tailored to our application, our self-supervision task enforces the model to leverage subtle color features to identify abnormalities during testing. Even when the amount of training videos is relatively small, our approach generates superior binary (normal/abnormal) otoscopy video screening results compared to both a baseline method and the average of a group of clinicians. We believe that this study constitutes a promising first step toward achieving algorithmic decision support for the diagnosis of pathology of the middle and external ear and the development of a smart tool that might aid clinicians in the accurate diagnosis and management of common ear disease.

RELATED STUDY
The task of anomaly detection (13,14), also referred to as outof-distribution or novelty detection depending on the context, is to identify unfitted data samples. This is a crucial task in many real world applications such as detecting system malfunction, financial fraud, and health issues. Anomaly detection is a useful but challenging task, because anomalies are hard to define. In a supervised anomaly detection setting, abnormal samples are provided to the model in the training stage to give guidance on what qualifies as an "anomaly." However, there could be many reasons a data sample is abnormal such that collecting a representative amount of abnormal samples is a hard problem itself. This is especially challenging in medical applications, where the dataset is very likely to the heavily skewed toward the normal class. Unsupervised anomaly detection (UAD), on the other hand, does not need normal/abnormal supervision signals during training. Current approaches usually utilize only the normal data during training, which makes it more appealing to applications with a skewed dataset. Therefore, in the scope of this study, we focus only on the UAD setting where the detector can only access samples from the normal data distribution during training. Moreover, in many real world applications, the task is to perform UAD on high-dimensional imagery data (e.g., in our case, detecting anomalies from endoscopic videos), which makes the problem even more challenging.
A rich set of methods (13) has been proposed to study the UAD problem on imagery data, approaching the problem with four main paradigms: (i) Density-based methods first estimate the normal data density using methods, including Gaussian Mixture and Energy Based Models, and then detect data with a low estimated density as an anomaly. DAGMM (15) and DSEBM (16) are methods that belong to this category. (ii) One-class classifier-based methods fit a classifier, e.g., Deep SVDD (17) and DROC (18), to separate normal and all other data and then use it to detect anomalies. (iii) Reconstruction-based techniques learn a reconstruction model, e.g., AnoGAN (19), of normal images and detect anomalies as samples with high reconstruction error. (iv) Self-supervised-based methods learn a feature extractor using self-supervised tasks, such as distinguishing whether or not a certain transformation has been applied to the image (20). Then, during learning features of augmented versions of the same image should be closer than features of different images (21). Upon convergence, anomaly detection is performed on the extracted features. Notable methods along this line include SVD-RND (22), CutPaste (23), CSI (24), SSD (25), PANDA (26), and MSC (27). UAD has also been applied to medical imaging (28) across many domains, including X-ray (29,30), CT (31,32), MRI (33)(34)(35), and endoscopy (36) datasets.
Anomalies in medical imaging tend to be more subtle and only reside in a small region of the whole image, which makes it particularly challenging to apply UAD in this field. To address this problem, (37) propose to connect Contextencoding (CE) and Variational Autoencoders (VAE) to combine both reconstruction and density based anomaly scores. Doing so shows superior performances in three brain MRI image datasets. To directly target subtle anomalies in small regions, there are efforts to develop self-supervised tasks that are tailored to a certain anomaly group. It is desirable to apply the artificial transformation of data that is similar to real anomaly so that the self-supervised task can force the model to learn features that are discriminative enough to detect anomalies in testing. As an example, CutPaste (23) creates discontinuous defects by cutting a region of an image and pasting it to another location. While this method works well for industrial visual inspection, the introduced sharp discontinuity is uncommon in medical imagery making this method less useful for anomaly detection in medical data. Alternatively, foreign patch interpolation (FPI) (38) creates artificial defects by interpolating a small local patch with a foreign patch of the image. A network is trained to estimate the pixel-wise interpolating factor from the synthesized image. The estimated interpolating factor is then used as an anomaly score during testing. Success is demonstrated using Brain MRI and abdominal CT image datasets. Poisson image interpolation (PII) (39) extends this idea by using Poison image editing to blend in the foreign patch instead of direct linear interpolation, targeting more subtle and continuous irregularities. Improvement in performance is shown using chest X-ray and fetal ultrasound datasets. Meanwhile, both FPI and PII utilize self-supervised tasks targeted at regional shape anomaly which is less applicable to our problem where the subtle color anomaly is more crucial.

METHODOLOGY
The otoscope video database is skewed toward the non-disease status given the practice of capturing bilateral ear examinations and occurrence of non-ear disease presenting with symptoms such as ear pain. Dataset imbalance such as this can present a significant challenge for the development of robust design boundaries, particularly when relying on small data sets. To overcome such limitations, we propose to use UAD approach to provide computer-aided screening for otoscopy video. The formal definition of this problem is as follows.
We are given a large training dataset D train comprising only normal video sequences and a smaller testing dataset D test comprising both normal and abnormal video sequences. Given the training dataset, our objective is to learn an anomaly score function A(v) with video v as input and detect the abnormal videos as an anomaly during testing. Ideally, the function learns to map normal videos to small anomaly scores and abnormal videos to large anomaly scores. During testing, we threshold the score, where A(s) > ψ indicates an anomaly.
In otoscopy video diagnosis, the temporal relation between frames is not as influential as in action recognition tasks. Whether or not a video is abnormal is directly determined by the existence of abnormal frames therein. With this observation, we turn the video anomaly detection problem into a frame anomaly detection problem with an additional video level aggregation function. Moreover, only some of the frames in the video are vital to diagnosis, and in those frames, only the region that visualizes the eardrum. Therefore, it would be desirable to only conduct anomaly detection on such frames, and the eardrum region in particular. As shown in Figure 1, our method has two steps during training: Step 1) Supervised learning of a detection module that extracts eardrum patches x from given video v using provided labels.
Step 2) Self-supervised representation learning on the labeled region of interest to obtain an embedding function that ideally projects normal and abnormal samples to distinct regions of the embedding space. During testing, we compute anomaly scores using the detected patches and the learned embedding function.

Eardrum Detection
To detect the frames and regions showing the eardrum, we use a convolutional neural network and train it in a supervised manner using image-label pairs. Different from the standard detection problem where each image may have multiple instances, we can only have at most one eardrum in each frame. Therefore, we do not use a standard object detection architecture and rather choose to have a single convolutional neural network to predict the binary label and corresponding bounding box. We define L cls as the cross entropy loss of the groundtruth label and the classification score and define L local as the L1 loss of the groundtruth bounding box and the predicted bounding box. Then, the overall loss is simply combining the classification loss and localization loss.
Given our problem setup, we train the detection module on the normal training videos only. Even though the model does not use abnormal videos during training, we find it generalizes well to abnormal frames during testing as shown in Section 4.2.
FIGURE 1 | The screening architecture of our shift contrastive anomaly detection (SCAD) is composed of three sub-blocks: eardrum detection, representation learning, and anomaly detection. During training, eardrum detection is learned by supervised training with provided labels. Representation learning is conducted with self-supervision to obtain an embedding function φ. Both detection network and embedding function network are convolutional neural network that takes an individual frame as input. During testing, we use the embedding function φ to produce a frame level anomaly score. The video level anomaly score is then obtained by aggregating image level anomaly scores of all detected eardrum frames in a video.

Shift Contrastive Anomaly Detection
Contrastive Learning: Contrastive learning (21, 40) has become the top performing self-supervised learning method in recent years. In this paradigm, the first step of the training procedure is to sample a minibatch of size N and perform augmentation twice to each sample x i to obtain (x i ′ , x i ′′ ) termed as positive pair, producing 2N samples in total. All images are passed through a feature extractor and then features are typically scaled to the unit sphere by l 2 normalization to get a representation φ. The contrastive loss to pair (x i ′ , x i ′′ ) is then defined as where τ is a temperature hyper-parameter.
The contrastive learning objective pulls x i ′ close to x i ′′ and pushes all other samples away from x i ′ . Intuitively, this will make the learned embedding capture a latent structure that is meaningful enough to separate samples from one another and thus be beneficial for downstream tasks. This paradigm shows great success in self-supervised training for image recognition but has inherent problems for anomaly detection. To minimize the loss function, the angles between positive and negative samples need to be maximized even though both samples are from the normal class. This results in a scenario where representations of normal samples span across the whole unit sphere. During testing, anomaly samples could, therefore, be projected onto a location that is close to normal embeddings, which challenges the paradigm of anomaly detection.
Mean-Shifted Contrastive Loss: To adapt contrastive learning to anomaly detection, MSC (27) proposes the alternative mean-shifted contrastive loss. Rather than directly minimizing the contrastive loss in the representation space, MSC constructs a mean-shifted counterpart, by subtracting the center c of the whole training set and then normalizing to it the unit sphere. For a given sample x, the mean-shifted embedding is defined as The mean-shifted loss is then constructed by applying typical contrastive loss on this mean-shifted embedding space: Because the mean-shifted representation is normalized around c, the training no longer spans the normal samples across the whole unit sphere of φ(x). This makes anomaly samples more separable from normal samples in the φ(x) space. Distributional Shift Angular Loss: To further improve separability, MSC (27) uses additional angular center loss to shrink normal samples around the normalized center. The assumption is that normal data lying in a small region around the center will be more discriminative. The total loss that MSC uses then becomes L angular + L angular . This approach works well in datasets composed of regular images (e.g., CIFAR dataset (41)), where images from different classes are treated as an anomaly. Empirically, we find that this strategy does not perform as well in our medical imaging application. The main reason is that we are having much smaller semantic variations between normal and abnormal images so that their respective embeddings are still close to each other even after training using mean-shifted contrastive and angular center loss.
To address this issue, we develop the distributional shift angular center loss to further enhance separability between normal and abnormal samples. As shown in Figure 2, our method constructs additional shift-transformed samples z from x and then pushes φ(z) away from the normalized training center c. Shift-transformed samples z are created by applying distributional shift-transformations on x so that z is no longer from the original data distribution. The intuition is straightforward: not only should the normal samples occupy a small region around the center but also the samples not drawn from the original distribution cannot be too close to the center either. Hinge loss is applied to the angle between center c and shift-augmented samples z. The loss function takes the form: Ideally, these shift-transformed samples z are semantically similar to real anomalies. In the context of otoscopy images, color is an important feature in that infection may lead to the change of color of the whole or part of the eardrum. Thus, we employ three variants of color-jitter as the distributional shift-transformation. They are color-jitter-random-cut (CJ-RC), color-jitter-randomregion (CJ-RR), and color-jitter-whole-frame (CJ-WF). CJ-RC uses a random rectangle as a mask of transformation. CJ-RR is constructed by interpolating the color-jittered image and the original image using pixel-wise weights for each image. We first initialize the pixel-wise weights to be all zero, and then pick random points to be filled with one, and apply a Gaussian filter to obtain smoothed weights. CJ-WF simply applies color-jitter to the whole frame. Figure 3 presents illustrations of the three augmentation strategies. Final Loss: We combine (i) the mean-shifted contrastive loss L msc and (ii) the angular loss with distributional shift augmentations L shift_angular : This loss function is tailored to this specific medical imaging application and enjoys the best of both objectives. The meanshifted loss makes the features to be representative of the images and the angular loss encourages normal and abnormal instances to be distant from each other in the feature space. Therefore, the combination of these two losses achieves superior performances, which is demonstrated through ablation in Section 4.3. Frame Level Anomaly Score: In order to classify a sample as normal or abnormal, we use a simple criterion based on kNN. kNN predicts the label of a data point by ensembling the k closest labeled samples according to some distance measure. In this study, we use the cosine distance between the features of the target image x and those of all training images. The anomaly score is then given by: where N k (x) denotes the k nearest features to φ(x) in the training feature set {φ(z)} z∈X train . Video Level Anomaly Score: We consider a relatively simple method to aggregate the frame level anomaly score to construct the video level anomaly score. The aggregation function is simply taking the average of the frame level anomaly score across all detected eardrum patches. The video level anomaly score is defined as where {a(x)} represents the set of frame level anomaly scores for all frames in the given video s.
In summary, during training, we learn a frame level embedding function φ with all individual normal frames in the training set. During testing, we compute the kNN-based anomaly score using the embedding feature from φ for all detected frames. We then aggregate the predicted anomaly score for each detected frame in the testing video to produce a video level anomaly score A through averaging. Such a score is expected to be small for normal videos and large for abnormal videos. Any frame that is substantially different from normal frames in the training set (e.g., a red, bulging eardrum) would lead to a large distance to training examples in embedding space and then produce a large image and video level anomaly score. The threshold can be selected to meet the desired clinical requirement (e.g., a certain specificity value) and in practice is selected using the validation dataset consisting of both normal and abnormal videos.

Data Preparation
We collected a total of 100 otoscopy videos from pediatric patients that were seen for various conditions in an urban, tertiary-care pediatric emergency department that treats over 34,000 patients, aged between 0 and 21 years of age per year. Patients were recruited as a convenience sample based on their willingness to participate in the study. At this site, patient ages range from 0 to 22 years. Study protocols were approved by the local institutional review board. The length for each video ranges from 5 to 40 s. Twenty seven to thirty frames per second are extracted from the video. Two otolaryngologists (ears, nose, and throat specialists) annotated every video, and consensus agreement was used as ground truth. While the expert annotations are more granular, due to the small size of the dataset, we decided to restrict the analysis to two classes and each video is assigned into either the normal or abnormal category. Additional to the binary categorical label on video level, we also annotate frame level binary labels and bounding boxes for each appearance of the eardrum, where the eardrum occupies at least 20% of the image size in width and height. Of the 100 videos, 80 are labeled normal and 20 are labeled abnormal. We assign 60 normal videos to the training set, 10 to the validation set, and 10 to the testing set. For anomaly videos, we assign 10 to the validation set and 10 to the testing set. The testing set is used for the evaluation of the algorithms and comparison with human clinicians. We do not have overly apparent abnormal examples (e.g., the whole eardrum is filled with blood) during testing which makes it a challenging comparative study with human clinicians. Video capture of ear exam was performed using Cellscope Oto attached to a first-generation iPhone SE. Maintaining relative uniformity of video settings such as color is critical to the accuracy of the AI algorithm. Clinical deployment of such technology, therefore, presents a significant challenge given the ad hoc selection of otoscopes used in the clinical setting and perhaps more importantly the overwhelming reliance on nondigital otoscopes in clinical practice. As a result, we opted to use Cellscope Oto as an affordable solution that can potentially be deployed at scale. The Cellscope Oto device is designed to attach to an ear speculum and provides an additional light source but relies on the iPhone camera to capture the exam. We do note that the image and video quality captured on this device is not as high as reported in previous studies (6,42), especially with respect to sharpness and clarity. Unlike Cellscope Oto, however, more advanced high quality otoscopes with digital video capacity are often cost prohibitive to wider adoption.

Eardrum Detection
We first train a model for eardrum detection to extract eardrum patches. We use Resnet-101 as the backbone network and choose the stochastic gradient descent (SGD) optimizer with a 1e − 3 learning rate, 0.9 momentum, and 1e − 4 weight decay. The model is trained using all frames from the normal training set videos for 5,000 epochs with a batch size of 128. The input images are resized to 256 by 256 and randomly cropped to 224 by 224, followed by a random selection of data augmentation of color-jitter, cutout, rotation, and shearing.
Results: Since under our setup there is at most one instance of a single class, we evaluate the performance of our model by the accuracy of eardrum detection under different intersection over union (IoU) (43) thresholds. A correct detection is defined as a frame that produces a correct frame-level classification and, if the given frame shows the eardrum, also produces a bounding box with IoU score larger than a certain threshold in that given frame. As shown in Table 1, our model produces high accuracy across varied IoU thresholds. A visualization of eardrum detection in video sequences can be found in Figure 4. Note that even though the model is supervised only on instances from training normal videos, it generalizes well to both normal and abnormal videos in the testing set. This indicates that the network extracts features that are more consistent with the image semantic and agnostics to the details on the eardrum.

Anomaly Detection
Following the architecture suggested by (27), we construct the feature extractor φ to be the ResNet-101 convolutional neural network followed by an additional l2 normalization layer. We augment the normal samples by sequentially applying 224 by 224-pixel crop from a randomly resized image and random horizontal flips. The model is initialized with the ImageNet pretrained model and then the two last blocks of the ResNet-101 are fine-tuned for 5,000 epochs with the loss function in Equation (7). The temperature τ in Equation (2) is set to be 0.25. We use an SGD optimizer with a learning right of 1e − 5 weight decay of 5e − 5 and no momentum. For each minibatch size of 60, we sample one frame from each video to prevent similar frames treated as negative pairs. During training, we use the groundtruth bounding box to extract patches from videos. We use KNN (k=2) in computing the frame-level anomaly score. We conduct experiments with three random seeds for evaluations.
Results: We evaluate our methods on the anomaly detection task in the previously introduced otoscopy video dataset. We adopt the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision Recall Curve (AUPRC) as anomaly detection performance score. We compare our approaches against the recently proposed top performing method MSC (27). The training and testing procedures are the same as ours except that it does not utilize shift-transformations and uses the angular loss (3) instead of shift angular loss (4) in computing the final loss function. In Table 2, we present a comparison among three variants of SCAD and three state of the art methods SSD (25), PANDA (26), MSC (27) evaluated by AUROC and AUPRC. We see that the SCAD-CJ-WF variation performs the best, reaching an AUROC of 88.0% and AUPRC   of 87.9% on average. This suggests that (i) color is indeed a very important feature in separating anomaly from normal in otoscopy images, and that (ii) color jitter on the whole image can produce semantically more similar samples to the real anomaly. This could be due to that both random cut and random regions introduce artificial patterns on the image that are utilized in the self-supervised training but are not seen in the abnormal samples during testing. To intuitively visualize the feature embedding produced by different methods, we plot the latent embedding for all testing frames and the corresponding kNN based framelevel anomaly score in Figure 5. As we can see, SCAD-CJ-WF generates the embedding that is most separable both qualitatively and quantitatively as measured by the anomaly score. Training Objective: The individual effect of each loss component is presented in Table 3. We note that neither L msc or L angular individually performs well in our dataset. L shift_angular outperforms all other individual objectives and combining it with L msc results in further improvement.

Comparison With Human Clinicians
To further evaluate our performance in a real-world setting, we performed a cross-sectional study comparing the performance of clinicians vs. our model. The study was approved by the local institutional review board.
Study Setting: Clinicians who routinely evaluate eardrums by otoscopy in a primary care, urgent care, or emergency medicine setting were invited to participate in this study. Qualifying clinicians included physicians (MD/DO), nurse practitioners (NP), and physician assistants (PAs). The is clinician group reflects frontline healthcare providers who typically diagnose and manage ear disease. Exclusion criteria were practicing outside of the United states, employment by the Johns Hopkins University or Johns Hopkins Hospital, visual impairment limiting video analysis, or non-English speakers. Recruitment took place through digital messaging via professional email mailing lists and professional social media groups during November 2020.
Study Procedure: Ten normal and ten videos designated as acute otitis media (AOM) by ground truth were selected as the testing set of images, as described above. Clinicians completed an online survey in which they were instructed to review the 20 eardrum videos without any further clinical information provided. Following each video, participants were asked to indicate whether the eardrums appeared normal or abnormal. They were also asked to rate their level of confidence in the accuracy of their diagnosis using a Likert scale (1= not at all confident to 4 = very confident). Participants were also asked to report the number of times they viewed each video. Participant responses were collected and managed using REDCap electronic data capture tools (44) hosted at Johns Hopkins University.  Study Population: A total of 59 clinicians were recruited and assessed for eligibility. After removing ineligible, declined, or incomplete surveys, 25 surveys were included for analysis. The process is shown in the consort flow diagram in Figure 6. Participant demographics are summarized in Table 4. All respondents are early-to-mid career physicians with the majority being female and practicing pediatrics.
Results: Clinician confidence has a mean of 3 and a SD of 0.5 and the view count per video has a mean of 1.5 and an SD of 0.5. We use linear regression to analyze the correlation of both clinicians' confidence and view count with their score. No statistically significant correlation between clinician confidence and the accuracy score (P=.07) or with view count and score (P=.32) were found (a P-value of ≤.05 was considered statistically significant).
In Table 5, we compare anomaly detection methods against clinicians in accuracy, sensitivity, specificity, and precision. We compute the metrics for all individual clinicians and then compute the mean and SD of the group. We choose the decision threshold of machine learning approaches of which the sensitivity (TPR) on the validation set to be 90%, and then report the metric values on the testing set. Compared to clinicians, our methods perform better and achieve lower variances across all metrics. We further notice that SCAD-CJ-WF outperforms both clinicians and all other methods in most metrics.
In Figure 7, we provide further visualizations comparing our best performing model SCAD-CJ-WF to clinicians. All seeds of our model produced a ROC curve that exceeds the average clinician response. Controlling the same False Positive Rate, our model outperforms the clinician performance in True Positive Rate (Sensitivity) by a large margin from 61.5 to 90%. Besides the improvement in average performance, our model also generates more consistent predictions. We can see from the figure that different clinicians produce results that are not consistent with each other. While the best clinician included in our study did achieve a perfect True and False Positive Rate, the majority of clinician evaluations are scattered with a large variation. The wide variability and lower accuracy of clinician's scores from this sample are comparable to prior reports of pediatricians and general practitioners at identifying AOM from otoscopy (45,46). This variability between clinician scores reflects the clinical challenge of accurately diagnosing AOM. Individual factors such as a clinicians' training and experience, combined with patient factors such as cooperativeness and a non-obstructed view contribute to a successful diagnosis. In contrast, our model across different seeds generated more consistent and more accurate results. Given the clinician variability to diagnose AOM Other practices include emergency medicine, family medicine, and urgent care.   illustrated in our comparative study and previous studies, there is significant value to improve diagnosis in a clinical setting by adopting deep learning based anomaly detection screening like that of the present study.
Discussion: Our presented method shows promising performance in anomaly detection on otoscopy video sequences. The whole workflow runs at 15 ms per frame on a single NVIDIA Quadro RTX 6000 GPU. This suggests the feasibility of adopting computer-aided diagnostics to real time detect and report anomalies may assist front-line clinicians in primary care and eventually even patients at home.
The eventual successful deployment of such technology necessitates more than the successful development of a diagnostic AI algorithm with high accuracy. For such technology to integrate within current healthcare models and work flow, there is a need to create a whole new technological infrastructure.
The implementation of such technology beyond its obvious need to be adopted by healthcare providers, perhaps the more pressing questions where the required financial investment for the creation of this infrastructure will come from.
Meanwhile, our study exhibits several limitations at the current stage. (i) The dataset is relatively small, skewed toward normal, and collected through convenience sampling. As such our data serve to provide some additional support for the proof of concept but does not negate the need for a larger and more rigorously collected data-set before such technology becomes clinically viable. The development of a large video database would likely further improve the algorithm's diagnostic performance and enable training of specific diagnoses as opposed to being limited to detecting normal vs. abnormal ear exams. (ii) Cellscope Oto does not permit pneumotoscopy (blow air at the eardrum to evaluate for movement) which can also serve as a critical step for accurate diagnosis of ear disease. (iii) In our comparative study, clinicians were not provided with patient history which might increase their diagnostic accuracy in a clinical setting.

CONCLUSION
In this study, we present a two-stage anomaly detection method that is designed to perform normal/abnormal classification for otoscopy videos and that can be developed based on a small dataset and highly skewed class distribution. We demonstrate that our method outperforms baseline algorithms and the average of 25 clinicians, constituting a promising step toward a computer-aided diagnosis of the middle and external ear pathology. The further development of computer-aided diagnosis methods, like ours, could contribute to timely and high-quality management of ear disease.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because the institutional review board restrict data to be accessed only by approved personals. Requests to access the datasets should be directed to Weiyao Wang, wwang121@jhu.edu.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the Johns Hopkins Medicine Institutional Review Boards. Written informed consent from the participants' legal guardian/next of kin was not required to participate in this study in accordance with the national legislation and the institutional requirements.

AUTHOR CONTRIBUTIONS
WW, AT, and MU contribute to this study on the algorithm design, implementation, and evaluation. CS, JC, and TC contribute to this study on the clinical comparative