Effectiveness of an image analyzing AI-based Digital Health Technology to identify Non-Melanoma Skin Cancer and other skin lesions: results of the DERM-003 study

Introduction: Identification of skin cancer by an Artificial Intelligence (AI)-based Digital Health Technology could help improve the triage and management of suspicious skin lesions.
Methods: The DERM-003 study (NCT04116983) was a prospective, multi-center, single-arm, masked study that aimed to demonstrate the effectiveness of an AI as a Medical Device (AIaMD) to identify Squamous Cell Carcinoma (SCC), Basal Cell Carcinoma (BCC), pre-malignant and benign lesions from dermoscopic images of suspicious skin lesions. Suspicious skin lesions that were suitable for photography were photographed with 3 smartphone cameras (iPhone 6S, iPhone 11, Samsung 10) with a DL1 dermoscopic lens attachment. Dermatologists provided clinical diagnoses and histopathology results were obtained for biopsied lesions. Each image was assessed by the AIaMD and the output compared to the ground truth diagnosis.
Results: 572 patients (49.5% female, mean age 68.5 years, 96.9% Fitzpatrick skin types I-III) were recruited from 4 UK NHS Trusts, providing images of 611 suspicious lesions. 395 (64.6%) lesions were biopsied; 47 (11%) were diagnosed as SCC and 184 (44%) as BCC. The AIaMD AUROC on images taken by iPhone 6S was 0.88 (95% CI, 0.83–0.93) for SCC and 0.87 (95% CI, 0.84–0.91) for BCC. For Samsung 10 the AUROCs were 0.85 (95% CI, 0.79–0.90) and 0.87 (95% CI, 0.83–0.90), and for the iPhone 11 they were 0.88 (95% CI, 0.84–0.93) and 0.89 (95% CI, 0.86–0.92) for SCC and BCC, respectively. Using pre-determined diagnostic thresholds on images taken on the iPhone 6S, the AIaMD achieved a sensitivity and specificity of 98% (95% CI, 88–100%) and 38% (95% CI, 33–44%) for SCC; and 94% (95% CI, 90–97%) and 28% (95% CI, 21–35%) for BCC. All 16 lesions diagnosed as melanoma in the study were correctly classified by the AIaMD.
Discussion: The AIaMD has the potential to support the timely diagnosis of malignant and premalignant skin lesions.


Introduction
Non-Melanoma Skin Cancer (NMSC) is the fifth most common form of all types of cancer worldwide, with the most common NMSC types being Basal Cell Carcinoma (BCC), accounting for 75% of cases, and Squamous Cell Carcinoma (SCC), accounting for 23% of NMSC cases (1). In the UK, there are around 156,000 NMSC cases diagnosed, resulting in 920 deaths, per annum. The actual incidence of NMSC may be higher, however, as it is known to be under-reported due to the number of multiple diagnoses per patient. Incidence rates of skin cancer have increased by over 2.5-fold (169%) since the early 1990s and are projected to rise by 14% in the UK between 2023 and 2025 (2). While NMSCs make up most skin cancer diagnoses, melanoma has a much higher mortality rate due to its high risk of metastasis, and early diagnosis is critical. When melanoma is caught early, the chances of survival are greatly improved (3).
Currently, diagnosis of NMSC is usually clinical, with subsequent histological confirmation following excision and specialist interpretation (4). To facilitate early diagnosis, alongside managing patient concern, a high proportion of 'suspicious moles' are referred from primary care on the two-week wait pathway, which has seen an increase from 332,000 referrals in 2015/16 to 509,000 referrals in 2019/20 (5). However, a high proportion of these lesions are benign (6), with the main diagnoses being melanocytic naevi or seborrheic keratosis. Due to the nature of these referrals, they are awarded an inappropriate priority at the expense of more serious disorders. As a result, healthcare services are under pressure from the number of patients being referred for specialist evaluation, onward biopsies and subsequent management of suspicious skin lesions, such that a decreasing percentage of patients referred on a two-week wait pathway are seen within 14 days (5). There is a need to improve the diagnostic accuracy of skin lesion assessment earlier in this process, in order to minimize unnecessary referrals and skin biopsies.
Deep Ensemble for the Recognition of Malignancy (DERM) is a Digital Health Technology that includes an Artificial Intelligence as a Medical Device (AIaMD) algorithm that is able to analyze dermoscopic images of a skin lesion and determine the presence of melanoma in pigmented lesions, with a similar accuracy to clinicians specialized in skin cancer detection (7). The AIaMD has been trained and tested on dermoscopic images of skin lesions with confirmed diagnoses of a range of malignant and non-malignant lesions and sub-types. This helps ensure that, for example, melanoma lesions with a different clinical appearance, like amelanotic melanoma (8), would be classified as melanoma. However, the AIaMD would not be expected to identify skin cancer from different image types, such as those from reflectance confocal microscopy. The AIaMD is also able to detect BCC and SCC, premalignant and selected benign lesions [such as Intraepidermal Carcinoma (IEC/SCC in situ), actinic keratosis, seborrheic keratosis, and benign melanocytic nevi], providing additional information to aid the clinician in differentiating skin cancers, including melanoma, from benign conditions. The AIaMD provides a high degree of accuracy in the diagnosis of NMSC using historical dermoscopic images, but clinical validation is necessary to demonstrate its utility in clinical practice. DERM is a Class IIa UKCA marked medical device and has been deployed in clinical pathways within the UK since 2020.

Materials and methods
The DERM-003 study was a prospective, multi-center, single-arm, cross-sectional, blinded study (NCT04116983), designed to demonstrate the effectiveness of the AIaMD to identify SCC and BCC. Secondary objectives included demonstrating the effectiveness of the AIaMD to identify premalignant and benign conditions, comparing the AIaMD performance to dermatologists, and demonstrating the feasibility of image capture in a clinic setting. Ethical approval for the study was granted by the Leicester South National Research Ethics committee.
Eligible participants were patients attending dermatology clinics with at least one suspicious skin lesion that was suitable for photographing. Lesions were defined as suspicious by a dermatologist, with no requirement on lesions being of a particular type or pigmentation. Patients provided written informed consent for the study. Recruitment was on a consecutive, competitive recruitment basis in 4 UK hospitals between June 2020 and February 2022. Lesions needed to be less than 15 mm in diameter, not located on an anatomical site unsuitable for photographing (genitals, hair-bearing areas, under nails) or in an area of visible scarring or tattooing, and not previously biopsied, excised or otherwise traumatized. Suitable lesions were photographed by three smartphones (iPhone 6S and iPhone 11, Apple Inc.; Samsung Galaxy S10) with (dermoscopic image) or without (macroscopic image) a Dermlite DL1 Basic (DermLite LLC) lens attached, providing a 10x magnification. In addition, one dermoscopic image of healthy skin was taken by each camera. The AIaMD assessment was not shared with the investigator, who managed the patient in accordance with standard of care. The patient had completed the protocol-defined procedures once the photographs had been taken. For each lesion included in the study, a clinical diagnosis and the clinician's assessment of the likelihood of skin cancer, using a four-point Likert scale (unlikely, equivocal, likely, highly likely), were collected. Where a biopsy was taken, the histopathology-confirmed diagnosis was collected and categorized as melanoma, SCC, BCC, IEC, Actinic Keratosis (AK), Atypical, Benign or other. When there was histopathological uncertainty in the diagnosis, investigators reported the most likely diagnosis. 'Other' diagnoses were reviewed by the Chief Investigator.
Images of skin lesions were captured electronically and securely transferred to DERM for analysis by the AIaMD. All images were analyzed by DERM v3 after the completion of the study. The AIaMD generates a numeric output (continuous scale) for each of the examined classes, which reflects its confidence that the lesion is that condition. The sum of the numeric output of all classes is always 1. Threshold settings are defined for each lesion type, above which a lesion is classified as that lesion type. The AIaMD returns the most serious lesion type where the confidence score is above the threshold setting.
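The thresholding and hierarchy logic described above can be illustrated with a minimal sketch. The severity order beyond melanoma ranking above SCC and BCC, the class names, and every threshold value below are hypothetical placeholders chosen for illustration; the actual DERM v3 settings are not given in the text.

```python
# Illustrative sketch only: the order and thresholds are assumptions, not DERM's settings.
HIERARCHY = ["melanoma", "SCC", "BCC", "IEC", "AK", "atypical", "benign"]  # most serious first (assumed)
THRESHOLDS = {  # hypothetical per-class thresholds
    "melanoma": 0.05, "SCC": 0.10, "BCC": 0.15, "IEC": 0.20,
    "AK": 0.25, "atypical": 0.30, "benign": 0.35,
}

def classify(scores):
    """Return the most serious class whose confidence exceeds its threshold.

    `scores` maps class name to confidence; the confidences must sum to 1.
    Returns None when no class exceeds its threshold.
    """
    if abs(sum(scores.values()) - 1.0) > 1e-6:
        raise ValueError("confidence scores must sum to 1")
    for label in HIERARCHY:  # walk from most to least serious
        if scores.get(label, 0.0) > THRESHOLDS[label]:
            return label
    return None

# BCC has the highest score, but melanoma is above its own (lower) threshold
# and ranks higher in the hierarchy, so it is the label returned.
example = {"melanoma": 0.06, "SCC": 0.04, "BCC": 0.60, "benign": 0.30}
print(classify(example))
```

Under these assumed settings, the returned label is not necessarily the class with the highest confidence score, which is why sensitivity for serious lesion types can be high while specificity is comparatively low.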

Statistical aspects
Patients and lesions that did not meet the inclusion criteria were excluded from the Intention To Treat (ITT) population, as were those lesions without a final diagnosis available. Lesions with no AIaMD result available (missing dermoscopic images, and/or where these failed the DERM v3 image quality assessment) were excluded from the Per Protocol (PP) population. The primary analyses were conducted on biopsied lesions in the PP population only. Area Under the Receiver Operator Characteristic (AUROC) curves were used to examine the association of the algorithm's confidence scale with the histopathology-confirmed diagnosis (biopsied lesions) or clinical diagnosis (non-biopsied lesions). The co-primary outcome measures of the study were the one-against-all AUROC for both SCC and BCC. The iPhone 6S camera was used as the reference device. The study aimed to demonstrate that both co-primary endpoints were above 0.9.
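As a worked illustration of a one-against-all AUROC, the sketch below uses the rank-based (Mann-Whitney) formulation: the AUROC equals the probability that a randomly chosen target lesion receives a higher confidence score than a randomly chosen non-target lesion, with ties counted as one half. The scores and diagnoses are invented for the example, and the function name is hypothetical; the study's own analysis was performed in R and its exact routine is not described.

```python
def one_vs_all_auroc(scores, diagnoses, target):
    """AUROC for `target` against all other diagnoses (Mann-Whitney formulation)."""
    pos = [s for s, d in zip(scores, diagnoses) if d == target]
    neg = [s for s, d in zip(scores, diagnoses) if d != target]
    if not pos or not neg:
        raise ValueError("need at least one target and one non-target lesion")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # target lesion ranked above non-target lesion
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Invented example: confidence that each lesion is SCC, with its ground-truth diagnosis.
scc_confidence = [0.9, 0.7, 0.4, 0.2, 0.1]
ground_truth = ["SCC", "BCC", "SCC", "benign", "benign"]
print(round(one_vs_all_auroc(scc_confidence, ground_truth, "SCC"), 3))  # 5 of 6 pairs ordered correctly
```

Because this measure depends only on the ranking of confidence scores, it is independent of the decision thresholds, unlike sensitivity and specificity.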
Assuming the true AUROC of the AIaMD is 0.98 and an incidence rate of 11% for SCC and 43% for BCC, a sample size of 45 SCC and 50 BCC lesions was required to demonstrate the AUROCs were superior to 0.9 at alpha = 0.05, with 90% power. A sample size of 543 patients, with an average of 1.2 lesions per patient, was expected to provide sufficient numbers of lesions diagnosed as SCC and BCC, but recruitment remained open until 45 SCC lesions had been included in the study.
Diagnostic accuracy indices (sensitivity, specificity, predictive values, false-positive rates, and false-negative rates) were calculated using decision thresholds determined prior to the image analysis, and applying the hierarchy within the AIaMD. The hierarchy means that, if the AIaMD identifies a lesion as potentially either a BCC or melanoma, it will return the classification of melanoma. Therefore, for a lesion diagnosed as SCC, an output from the AIaMD of "suspected melanoma" is considered a true positive, whereas for a lesion diagnosed as melanoma, an output from the AIaMD of "suspected SCC" is a false negative. The definition of true positive will therefore vary depending on the lesion type being assessed. The likelihood assessment scale was used to calculate a clinician AUROC that could be compared to the AIaMD.
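The hierarchy-dependent definition of a true positive can be sketched as follows. This is an illustration under stated assumptions rather than the study's analysis code: the severity order (melanoma, then SCC, then BCC) is assumed, the lesion data are invented, and treating every non-target lesion as a negative when computing specificity is a simplification that the text does not confirm.

```python
SEVERITY = ["melanoma", "SCC", "BCC"]  # assumed order, most serious first

def counts_as_positive(output, target):
    """An output is positive for `target` if it is the target or a more serious class."""
    if output not in SEVERITY:
        return False
    return SEVERITY.index(output) <= SEVERITY.index(target)

def sensitivity_specificity(pairs, target):
    """`pairs` is a list of (ground_truth, aiamd_output) tuples."""
    tp = fn = tn = fp = 0
    for truth, output in pairs:
        positive = counts_as_positive(output, target)
        if truth == target:
            if positive:
                tp += 1
            else:
                fn += 1
        else:  # simplification: all non-target lesions are treated as negatives
            if positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one target and one non-target lesion")
    return tp / (tp + fn), tn / (tn + fp)

# Invented lesions: (ground truth, AIaMD output)
lesions = [
    ("SCC", "melanoma"),  # true positive for SCC: a more serious class was flagged
    ("SCC", "SCC"),
    ("SCC", "benign"),    # false negative for SCC
    ("benign", "benign"),
    ("benign", "BCC"),    # less serious than SCC, so not a positive for SCC
    ("melanoma", "SCC"),  # false negative for melanoma
]
print(sensitivity_specificity(lesions, "SCC"))
print(sensitivity_specificity(lesions, "melanoma"))
```

The same pair of outputs is scored differently depending on the target: "suspected melanoma" on an SCC lesion counts toward SCC sensitivity, while "suspected SCC" on a melanoma lesion counts against melanoma sensitivity.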
The influence of patient and lesion variables that may affect the AIaMD's accuracy was investigated. The following co-variates were examined: age, sex, Fitzpatrick skin type, skin cancer risk factors including past medical history of skin cancer, lesion body location, experience of reviewing clinician, lesion change, patient's level of concern, clinician's assessment of likelihood of skin cancer, malignancy sub-type and staging.
A p-value of <0.05 was regarded as statistically significant, and all tests were two-tailed. Statistical estimates of accuracy are reported with 95% Confidence Intervals (CIs). Statistical analysis was conducted using R language version 4.1.3 (The R Project for Statistical Computing).

Results
A total of 572 patients consented to the study, providing 611 suspicious lesions. Nine patients (6 lesions) were withdrawn / excluded from the study. Eighteen lesions were excluded from the ITT population due to failing to meet eligibility criteria, resulting in 18 patients being excluded due to no eligible lesions. Two further lesions were excluded from the PP population due to missing AIaMD results, resulting in 1 further patient being excluded from the PP population (Figure 1). Of the lesions included in the PP population, 96.7% had images available from all three combinations of hardware, 2.9% had 2 images available, and 2 lesions had just one image available. Nine images failed image quality checks.
Forty-three lesions in the PP population were diagnosed as SCC and 176 as BCC (Table 3) by histopathology. A further 22 lesions were diagnosed as SCC or BCC by clinical diagnosis only, and were excluded from the primary analysis. These lesions did not undergo a biopsy because the dermatologist chose to treat the lesion (n = 10), the patient refused biopsy (n = 3), or for another reason (n = 9), including the biopsy occurring outside the study window. The PP population also included 16 lesions diagnosed as melanoma, and two lesions diagnosed as other malignancies [one Neuroendocrine, and one Spitzoid tumor of uncertain malignant potential (STUMP)] (Supplementary Table 1). Most malignancies were at an early stage.
When pre-set threshold settings were applied, the sensitivity of the AIaMD to identify malignant lesions was above 90%, and the specificity of the AIaMD for malignant lesions was above 41.5% for each individual malignant lesion type and for all malignant lesions (Table 6). Both "other malignant" lesions were classified as malignant by the AIaMD using images from all cameras. The sensitivity and specificity of the AIaMD were more variable for other lesion types, particularly atypical lesions, where the sensitivity varied between 38.1% for the Samsung and 86.4% for the iPhone 6S. In comparison, based on the suspected diagnosis documented at the time of their assessment, clinicians labeled fewer melanoma and SCC lesions accurately than the AIaMD (melanoma sensitivity of 81.2% compared to >93% by the AIaMD, SCC sensitivity of 63.6% compared to >90%), and more BCC lesions (sensitivity of 97.5% compared to <96%). Conversely, clinicians achieved a much higher specificity for malignant lesions and were more accurate at identifying benign lesions than the AIaMD.
Univariate analyses and multiple logistic regression analyses were performed on the FA population, filtered for those images with a final diagnosis available, to identify patient and lesion characteristics that might have influenced the accuracy of the AIaMD results and clinical diagnosis. Age above 60 was associated with a non-significant reduction in the accuracy of both dermatologists and the AIaMD to identify malignant lesions in images from the iPhones (Odds Ratio (OR) = 0.37-0.88, p > 0.16) and a minor improvement in images from the Samsung 10 (OR = 1.07-1.18, p > 0.7). The impact only reached significance (p = 0.034) for the AIaMD with images from the iPhone 11, in patients aged 74-82. Fitzpatrick skin type had no significant impact on the ability of either the AIaMD or clinicians to accurately identify malignant lesions; however, no cancers were detected in patients with Fitzpatrick skin types V and VI. Indeed, the only factor associated with a significant improvement in the accuracy of dermatologists to identify malignant lesions was a likely or high likelihood of skin cancer (OR > 7, p < 0.018), and in the accuracy of the AIaMD was a high level of patient concern (OR = 1.95, p = 0.008).

Discussion
The DERM-003 study is the first prospective, powered, clinical validation study that specifically evaluates the ability of the AIaMD to identify NMSC. Previously, the performance of the AIaMD to identify melanoma was evaluated (7), though this was on an earlier version of the software which focused solely on the identification of melanoma. DERM v3 is designed to identify SCC and BCC, alongside melanoma, as well as a range of premalignant, atypical and benign lesions often mistaken for skin cancer. The study recruited patients in dermatology clinics across the UK, such that the population reflects the aging, primarily Caucasian, population seen in these clinics. Although patients with Fitzpatrick skin types V and VI were recruited, no skin cancers were diagnosed in these patients. Indeed, only 2.2% of the study population had Fitzpatrick skin type IV-VI, limiting the generalizability of these results for patients with darker skin tones. However, this reflects the trend seen in other clinical studies, and in the real world, where few patients with Fitzpatrick skin types IV-VI are seen in dermatology clinics with suspicious skin lesions (7,9), and as such the study population can be seen as representative of the population that DERM would be used on. Robust performance evaluation of technologies, such as DERM, in patients with darker skin types may only be possible through post-market surveillance analyses, where more patients with these skin types can be evaluated (10). Similarly, the study included lesions across a good distribution of body locations, including those with higher sun exposure (head, neck, upper body) and lower limbs, where lesions can look different, and a range of skin cancer sub-types and stages that are seen in dermatology clinics. The study also included two "other malignant" lesions, which were diagnosed as STUMP and neuroendocrine, and a range of benign lesions.
When the study was designed, the calculations used to determine the success criteria and sample size were based on in silico performance data, which provided an assumption that the true AUROC for both SCC and BCC was 0.98. The clinical performance of AI-based devices has frequently been shown to be lower than that of laboratory-based data (11)(12)(13), and as such an expectation that the true AUROC achieved by the AIaMD on fresh clinical data would be comparable to laboratory results was perhaps unrealistic. Although the study failed to meet either of the co-primary endpoints, the AUROCs achieved by the AIaMD for SCC and BCC were still high and at least comparable to dermatologists. Indeed, the AUROCs of the clinical diagnosis for SCC and BCC lesions do not reach 0.9 either, indicating that even between clinician and histology there is substantial diagnostic variability. This may be a reflection of clinical practice, where uncertainty of diagnosis drives a conservative view and decision to biopsy. Reassuringly, the AUROC produced by the AIaMD for melanoma was higher than that previously reported (7), demonstrating an improved performance of the AIaMD over the earlier version of the algorithm.
It should be noted that for non-biopsied lesions, the clinical diagnosis was used as the ground truth against which both the AIaMD and clinical diagnosis were compared. Clinical diagnosis therefore will appear more accurate in an all-lesion population, compared to a biopsy-only population, for those lesion types where a high proportion do not have a histopathology diagnosis, specifically BCC, AK, and benign lesions. Despite this, the AUROCs achieved by the AIaMD for non-malignant lesions are comparable to those achieved by dermatologists in an all-lesion population, and indeed are notably higher than dermatologists in a biopsy-only population.
The study assessed the performance of the AIaMD on images captured by three smartphone cameras available in the UK market at the time of the study. They were chosen to demonstrate performance of the AIaMD across different physical hardware devices (camera specification), operating systems, and price points, and included a reference combination (iPhone 6S/DL1) which Skin Analytics has used in a previous study (7). Across the three cameras, the AUROCs for melanoma, SCC and BCC were very similar, indicating a good generalizability of the algorithm across the image capture hardware used. Although a greater variability across the cameras is seen for non-malignant lesions, the AUROCs achieved by the AIaMD from all cameras are still high.
The thresholds used to determine the sensitivity and specificity of the AIaMD were defined at the beginning of the study to be suitable for use in a secondary care setting. The sensitivity achieved by the AIaMD for melanoma, SCC and all malignant lesions was higher than that achieved by clinical diagnosis alone, though clinicians referred these lesions for biopsy, so their management decision ensured a sensitivity of 100%. Even for BCC, sensitivity achieved by the AIaMD was around 95% using images from all cameras, and the sensitivity and specificity of the AIaMD to identify premalignant and atypical lesions are at a level that is clinically useful. Additionally, the specificity and NPV values for malignant lesions indicate that the AIaMD could aid the appropriate management of benign lesions. The threshold settings used in live deployments of the AIaMD differ from those used in this study, and the sensitivity across all malignant lesions achieved in the real world has been demonstrated to be even higher (10), demonstrating the value in optimizing the settings within the AIaMD for the population it is being used to assess. The sensitivities achieved by the AIaMD for non-malignant lesions are more variable across the cameras than seen for malignant lesions, specifically atypical and benign lesions. Similarly, there was only a moderate concordance between the outputs produced by the AIaMD when analyzing images captured by the different image capture hardware. This may be due to variances in the hardware and post-processing software, or a factor of the threshold settings used by the AIaMD to assign the output label. If the confidence scores produced by the AIaMD on images of the same lesion taken on two different cameras were similar, but fell either side of the threshold set, the AIaMD output label from each image could be different. Since the AUROCs for these lesions were similar, this suggests that the thresholds applied could be optimized for the image capture hardware being used, to achieve the best sensitivity.
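The threshold effect described above can be made concrete with a small sketch. All numbers are invented for illustration, including the shared and per-camera thresholds; they are not DERM settings or study data.

```python
# Hypothetical confidence scores for the same atypical lesion imaged on two cameras.
scores_by_camera = {"iPhone 6S": 0.32, "Samsung Galaxy S10": 0.28}

# One shared threshold (hypothetical): similar scores fall on either side of it.
SHARED_THRESHOLD = 0.30
shared_labels = {cam: score > SHARED_THRESHOLD for cam, score in scores_by_camera.items()}

# Per-camera thresholds (hypothetical): the labels now agree.
PER_CAMERA_THRESHOLD = {"iPhone 6S": 0.30, "Samsung Galaxy S10": 0.26}
tuned_labels = {
    cam: score > PER_CAMERA_THRESHOLD[cam] for cam, score in scores_by_camera.items()
}

print(shared_labels)  # discordant: flagged on one camera only
print(tuned_labels)   # concordant: flagged on both cameras
```

A difference of 0.04 in confidence barely changes the ranking of lesions, so the AUROC is largely unaffected, yet it is enough to flip the thresholded label and therefore the sensitivity measured on each camera.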
The multivariate analysis identified a different impact of patient factors on the accuracy of malignant lesion detection by the AIaMD compared with previously reported analyses (7). This may reflect a change in how the AIaMD works between the two versions assessed. However, since the impact of patient factors on the accuracy of dermatologists is also different, it may be more a reflection that the previous study focused on melanoma detection, whereas this analysis considered all malignant lesions included in the study population. Further analyses are needed to understand whether these translate into a clinically relevant reduction in sensitivity and/or specificity of the AIaMD in different patient groups.
The main limitation of the DERM-003 study is the clinical setting in which it was conducted, and therefore the population studied. The study was conducted in UK secondary care dermatology clinics in order to include sufficient numbers of SCC and BCC lesions in the study population, and to easily capture the histopathology-confirmed diagnosis of biopsied lesions and a dermatologist's clinical assessment of the lesion. This means the study population was made up of patients and lesions that dermatologists determined were suitable for inclusion in the study, which may not be representative of all patients and lesions that would be assessed by DERM. For example, lesions that were clearly benign may have been excluded by a study dermatologist, whereas a less experienced clinician may use DERM on such lesions to support their patient management decision. That said, the study recruited a broader spectrum of lesions compared to a previous study (7), where the study population was limited to patients with a pigmented lesion that was due for biopsy. The results of this study are therefore more generalizable to the population of patients seen in secondary care in the UK. Indeed, data from ongoing post-market surveillance monitoring indicate that DERM can be deployed safely as an adjuvant tool in live clinical services accessible to patients with eligible skin lesions (i.e., excluding those under nails, on genitalia or on hairy areas of skin), from a broad range of age groups and most representative skin types with suspicious skin lesions, with sensitivity and specificity in line with target thresholds and performance demonstrated in clinical studies (10). Finally, the reliance on clinical diagnosis as the ground truth for non-biopsied lesions not only artificially increases the performance metrics for the dermatologists, as discussed above, but potentially impacts the apparent performance of the AIaMD on non-biopsied lesions. The clinical diagnosis of skin cancer by clinicians is based on the subjective interpretation of morphological features, and as such variability in the clinical diagnoses given by dermatologists is known to exist (14). The reliance on one dermatologist to provide the clinical diagnosis used as the ground truth for non-biopsied lesions introduces a potential bias to the results for both the AIaMD and dermatologists. The use of a panel of dermatologists to provide a consensus diagnosis would have provided greater confidence in the clinical diagnosis ground truth, and provided an independent diagnosis against which to compare the investigating dermatologist.
In conclusion, even though the study failed to meet its co-primary endpoints, the results from the DERM-003 study showed that the AIaMD can detect NMSC and premalignant lesions with a similar level of accuracy as dermatologists, and that taking the images was a quick and well tolerated process. DERM could provide dermatologist-level assessment of suspicious skin lesions earlier in the patient pathway, potentially enabling the earlier diagnosis of malignant lesions and improvement of differentiation between harmless and potentially harmful lesions by non-specialists.

ROC curves for SCC (left) and BCC (right) produced by the AIaMD when assessing images of all lesions, taken by different cameras.

FIGURE 1
Consort diagram. Number of patients in the ITT/PP population = number of patients who have at least one lesion that fulfills the ITT/PP inclusion criteria for at least one capture device; Number of lesions in the ITT/PP datasets = number of lesions from patients included in the ITT/PP population, that fulfill the ITT/PP inclusion criteria for at least one capture device.

TABLE 1
Patient demographics by analysis population.

TABLE 2
Lesion characteristics by analysis population.
FA, Full Analysis; ITT, Intention-to-Treat; PP, Per Protocol; SD, Standard Deviation; GPwSI, General Practitioner with Special Interest. Number of lesions equates to the number of lesion records created in the study database; the lesion count is based on clinician-provided information on the number of lesions they assessed for each patient.

TABLE 3
Breakdown of lesion diagnoses in the PP population.

FIGURE 2
ROC curves for SCC (left) and BCC (right) produced by the AIaMD when assessing images of biopsied lesions, taken by different cameras.

TABLE 5
AUROC of clinician assessment of likelihood of skin cancer.

TABLE 4
AUROCs produced by DERM, using images taken on each camera.
AUROC, Area Under the Receiver Operator Characteristic Curve; SCC, Squamous Cell Carcinoma; BCC, Basal Cell Carcinoma; IEC, Intraepidermal Carcinoma; AK, Actinic keratosis; CI, Confidence Intervals. Because of the necessity for a dermoscopic image of the lesion to be available for assessment by DERM, the number of lesions included was different for each camera.

TABLE 6
Diagnostic performance metrics of clinicians and DERM, using images from each camera, for all lesions in the Per Protocol population.