A prospective multicenter clinical research study validating the effectiveness and safety of a chest X-ray-based pulmonary tuberculosis screening software JF CXR-1 built on a convolutional neural network algorithm

Background Chest radiography (chest X-ray or CXR) plays an important role in the early detection of active pulmonary tuberculosis (TB). In areas with a high TB burden that require urgent screening, there is often a shortage of radiologists available to interpret the X-ray results. Computer-aided detection (CAD) software that uses artificial intelligence (AI) may help solve this problem. Objective We validated the effectiveness and safety of pulmonary tuberculosis imaging screening software that is based on a convolutional neural network algorithm. Methods We conducted prospective multicenter clinical research to validate the performance of pulmonary tuberculosis imaging screening software (JF CXR-1). Volunteers aged 15 years or older, with or without suspected pulmonary tuberculosis, were recruited for CXR examination. The software reported a probability score of TB for each participant. The results were compared with those reported by radiologists. We measured sensitivity, specificity, consistency rate, and the area under the receiver operating characteristic curve (AUC) for the diagnosis of tuberculosis. Adverse events (AEs) and severe adverse events (SAEs) were also evaluated. Results The clinical research was conducted in six general infectious disease hospitals across China. A total of 1,165 participants were enrolled, and 1,161 were included in the full analysis set (FAS). Men accounted for 60.0% (697/1,161). Compared with the results of the radiologist board, the software showed a sensitivity of 94.2% (95% CI: 92.0–95.8%) and a specificity of 91.2% (95% CI: 88.5–93.2%). The consistency rate was 92.7% (95% CI: 91.1–94.1%), with a Kappa value of 0.854 (P = 0.000). The AUC was 0.98. In the safety set (SS), which consisted of 1,161 participants, 0.3% (3/1,161) had AEs that were not related to the software, and no SAEs were observed.
Conclusion The software for tuberculosis screening based on a convolutional neural network algorithm is effective and safe. It is a potential candidate for solving tuberculosis screening problems in high-TB-burden areas that lack radiologists.


Introduction
Tuberculosis (TB) remains one of the leading causes of death worldwide, killing 1.4 million people per year (1). Although the disease is largely curable and preventable, an estimated 2.9 million of the 10 million people who fell sick with TB were not diagnosed or reported to the World Health Organization (WHO) (2). Therefore, there is a pressing need to improve the early diagnosis of TB disease, so that treatment can be initiated promptly and the transmission of Mycobacterium tuberculosis reduced. One effective strategy is systematic screening, which should distinguish people with a high probability of TB from those without. Chest radiography (chest X-ray or CXR) is a widely used and cost-effective screening tool for TB detection, achieving a sensitivity and specificity of 85.0% and 96.0%, respectively, for TB-related abnormalities (2-5). However, in low-resource areas with a high TB burden that urgently require mass screening, interpreting CXRs is labor-intensive and time-consuming because trained health personnel who can read them are scarce (6-11).
Computer-aided detection (CAD) technologies, especially those with AI algorithms, vastly increase image-reading capacity and have similar or even better diagnostic accuracy than human readers (12-14). These technologies make it possible to perform accurate mass screening with fewer resources. Thus, the WHO has recommended CAD as an alternative to human interpretation of CXR for screening and triaging pulmonary TB in individuals aged 15 years or older (2). The cutting edge of CAD is artificial intelligence, especially deep learning, in which convolutional neural networks (CNNs) are the most promising algorithms for visual tasks (15-17). JF CXR-1 (version 2) (JF Healthcare, Jiangxi, China) is CNN-based CAD software that simultaneously detects multiple thoracic diseases on CXR, such as TB, lung mass, and lung nodules. The software was trained on 14,160 CXRs from township-level hospitals across China. In the testing phase, 13,122 CXRs were provided for JF CXR-1 to detect TB. Among them, 31.5% (4,127/13,122) were pulmonary tuberculosis, 16.3% (2,143/13,122) were other pulmonary diseases such as pneumonia, pulmonary abscess, and lung cancer, and 52.2% (6,852/13,122) were normal. JF CXR-1 achieved an AUC of 0.94, a sensitivity of 0.91 (3,755/4,127), and a specificity of 0.81 (7,286/8,995), which meet the WHO's criteria for the target product profile. To validate its effectiveness and safety in clinical use, we conducted prospective multicenter clinical research in six general infectious disease hospitals in mainland China.

Study setting and population
The study was conducted at Shanghai Public Health Clinical Center, Beijing Chest Hospital, the Third Hospital of Zhenjiang, Chongqing Public Health Medical Center, Jiangxi Province Chest Hospital, and Hebei Chest Hospital. All of them are designated tuberculosis hospitals in China, and healthy individuals also attend the medical examination center of each hospital. Participants were recruited from visitors to the tuberculosis clinics and medical examination centers of these hospitals since June 2020. The inclusion criteria were (1) being aged 15 years or older, (2) being willing to undergo a CXR examination or able to provide a posterior-anterior CXR image (DICOM format) taken within the last 40 days, and (3) voluntarily participating and providing written informed consent. The exclusion criteria included (1) a history of obsolete pulmonary tuberculosis, (2) neutropenia, (3) infection with the human immunodeficiency virus (HIV), (4) CXR images that did not meet the diagnostic requirements, (5) a history of hematological disorders, (6) a history of pulmonary lobectomy, (7) a history of mental illness or cognitive disorder, (8)

Procedures
Each participant received a complete blood count (CBC) test and an HIV antibody test. Participants with neutropenia and/or HIV infection were excluded according to the results of the blood tests. Each of the remaining participants then received a digital posterior-anterior CXR. Although not all centers had the same X-ray machines, efforts were made to keep the acquisition parameters as similar as possible. The CXR images were saved in DICOM format and imported into the AI-based CAD software (JF CXR-1 v2, produced by Jiangxi Zhongke Jiufeng Smart Medical Technology Co., Ltd.). JF CXR-1 and the radiologist group read every CXR image independently and were blinded to all other information, including age and sex. The AI algorithm produced a probability score for each anonymized image to predict its likelihood of being TB-positive.
A score >0.35 indicated a high probability of tuberculosis, prompting further diagnostic measures such as microbiological tests and a chest CT scan (at a threshold score of 0.35, the combination of sensitivity, specificity, and Kappa value was the best). The radiologist group consisted of eight certified senior radiologists from a third-party organization. They used the Diagnosis for Pulmonary Tuberculosis of the Chinese Health Industry Standards as the fundamental criteria for TB X-ray screening. All radiologists had at least 10 years of experience in Grade-A tertiary hospitals. Each CXR image was read independently by five senior radiologists. CXR reports with a diagnostic suggestion of "suspect TB" or "TB" were considered positive for TB; "normal" and "other abnormal" CXRs were considered negative for TB. The final decision among the five radiologists followed the majority opinion. In cases where three radiologists shared the same opinion but the other two disagreed, the image was sent to three other radiologists for arbitration, and the final decision of the radiologist group was determined by the three "arbitrators" in a written report. Two months after the CXR examination, the clinical diagnosis information of every participant was collected. TB diagnosis followed China's National TB Diagnosis Guideline (WS288-2017).
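The reading rules described above can be summarized in a short Python sketch. It is illustrative only and is not the software's or the study's actual implementation: the function names are hypothetical, and the rule that the three arbitrators decide by majority among themselves is an assumption, since the text states only that they issue the final decision in a written report.

```python
from typing import Optional, Sequence

TB_THRESHOLD = 0.35  # operating point stated in the text (score > 0.35)
POSITIVE_LABELS = {"TB", "suspect TB"}  # report wording counted as TB-positive


def software_positive(score: float, threshold: float = TB_THRESHOLD) -> bool:
    """Flag an image when its probability score exceeds the threshold."""
    return score > threshold


def report_positive(label: str) -> bool:
    """Map a report label to TB-positive (True) or TB-negative (False)."""
    return label in POSITIVE_LABELS


def panel_decision(primary: Sequence[str],
                   arbitration: Optional[Sequence[str]] = None) -> bool:
    """Combine five independent reads into one reference label.

    A 4-1 or 5-0 result is settled by the majority. A 3-2 split is
    referred to three other radiologists for arbitration.
    """
    if len(primary) != 5:
        raise ValueError("expected five primary reads")
    positives = sum(report_positive(label) for label in primary)
    if positives in (2, 3):  # 3-2 split in either direction
        if arbitration is None or len(arbitration) != 3:
            raise ValueError("a 3-2 split requires three arbitration reads")
        # Assumption: the three arbitrators also decide by majority.
        return sum(report_positive(label) for label in arbitration) >= 2
    return positives >= 4
```

For example, five reads of "TB", "TB", "TB", "normal", and "other abnormal" form a 3-2 split, so the label returned depends entirely on the three arbitration reads.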

Data analysis
All the data were statistically described, including baseline information, effectiveness data, and safety data. The final decision of the radiologist group was set as the reference standard against which the software was compared. For effectiveness, the main evaluation indicators were sensitivity and specificity, with reference to the results of the radiologists' board. The secondary evaluation indicators were the area under the receiver operating characteristic (ROC) curve (AUC) and the consistency rate with the final diagnosis (including bacteriologically confirmed and clinical diagnoses). If an image was diagnosed as TB or non-TB by both the radiologist group and the AI software, it was defined as a true positive (TP) or a true negative (TN), respectively. If an image was diagnosed as TB by the radiologist group but non-TB by the software, it was defined as a false negative (FN). If an image was diagnosed as non-TB by the radiologist group but TB by the software, it was defined as a false positive (FP). Sensitivity = TP/(TP + FN), specificity = TN/(TN + FP), and the consistency rate = (TP + TN)/(TP + TN + FP + FN). The ROC curve was obtained by plotting sensitivity on the Y-axis against 1 - specificity on the X-axis. The AUC was the area under the ROC curve.
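As an illustration of the definitions above, the following Python sketch computes sensitivity, specificity, and the consistency rate from paired reference and software labels, and estimates the AUC by sweeping the decision threshold over the observed scores and applying the trapezoidal rule. The function names and the example data are hypothetical, and the study's actual statistical software is not stated in the text.

```python
from typing import Dict, Sequence


def screening_metrics(reference: Sequence[bool],
                      predicted: Sequence[bool]) -> Dict[str, float]:
    """Sensitivity, specificity and consistency rate from paired labels."""
    pairs = [(bool(r), bool(p)) for r, p in zip(reference, predicted)]
    tp = sum(1 for r, p in pairs if r and p)
    fn = sum(1 for r, p in pairs if r and not p)
    tn = sum(1 for r, p in pairs if not r and not p)
    fp = sum(1 for r, p in pairs if not r and p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference")
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "consistency_rate": (tp + tn) / (tp + tn + fp + fn),
    }


def roc_auc(reference: Sequence[bool], scores: Sequence[float]) -> float:
    """Area under the ROC curve (sensitivity vs. 1 - specificity)."""
    n_pos = sum(1 for r in reference if r)
    n_neg = len(reference) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative reference")
    points = [(0.0, 0.0)]  # (1 - specificity, sensitivity)
    # Sweep each observed score as a cut-off, from strictest to most lenient.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for r, s in zip(reference, scores) if r and s >= t)
        fp = sum(1 for r, s in zip(reference, scores) if not r and s >= t)
        points.append((fp / n_neg, tp / n_pos))
    # Trapezoidal rule over consecutive ROC points; the last point is (1, 1).
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data, not taken from the study.
example = screening_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

Because the curve is traced over all thresholds, the AUC does not depend on the particular operating point chosen for the software.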
Sensitivity is the proportion of patients with a condition who test positive; it reflects the ability of a test or instrument to yield a positive result for a subject with that condition. In our study, it means the ability of the AI software to detect TB on the CXRs, compared with the radiologist group.
Specificity is the proportion of people without the disease who are correctly excluded by the test. It is important for ruling out people without the disease during screening. In our study, it refers to the ability of the AI software to rule out non-TB participants. Ideally, a test should provide high sensitivity and specificity. For safety, adverse events (AEs) and severe adverse events (SAEs) were monitored for every participant from the time their CXR was imported into the AI software until 2 weeks later.
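The estimates in this study are reported with 95% confidence intervals and a Kappa value. The text does not state how the intervals were calculated, so the Python sketch below uses the Wilson score interval as one common choice for a proportion, together with Cohen's kappa for a 2 x 2 agreement table. The function names and the example counts are hypothetical and are not taken from the study's tables.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Cohen's kappa: agreement between two raters beyond chance."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # same quantity as the consistency rate
    # Chance agreement from the marginal totals of each rater.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical counts for illustration only.
low, high = wilson_interval(90, 100)
kappa = cohen_kappa(tp=40, fp=10, fn=10, tn=40)
```

Kappa corrects the consistency rate for the agreement expected by chance, which is why it is reported alongside the raw consistency rate.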

Role of the AI developers
The AI developer had no role in study design, data collection, analysis, or manuscript writing, but provided us with a free account to use the software and free technical support.

Results
Between 6 November 2020 and 4 June 2021, 1,218 participants were screened, and 53 failed screening. Thus, 1,165 were enrolled; 1,161 were included in the Full Analysis Set (FAS) and Safety Analysis Set (SS) after one dropout and three eliminations, and 1,150 were included in the Per Protocol Set (PPS) after 11 protocol deviations (Table 1; Figure 1). The participants' ages ranged from 15 to 86 years, with a mean of 44.4 ± 16.4 years. Men accounted for 60.0% (697/1,161), while women accounted for 40.0% (464/1,161).
In the FAS, the reports of 10 subjects were lost. Thus, the results of 1,151 subjects were analyzed. According to the radiologist group, 601 images were positive for TB, while 550 were negative for TB. Compared with the reference standard, the sensitivity of the software was 94.2% (95% CI: 92.0-95.8%), while the specificity was 91.2% (95% CI: 88.5-93.2%). The consistency rate was 92.7% (95% CI: 91.1-94.1%) with a Kappa value of 0.854 (P = 0.000) (Table 2). The AUC was 0.98 (Figure 2). After 2 months of clinical evaluation following enrollment, the diagnosis remained unclear for 29 of the 1,161 participants. Among the remaining 1,132 participants, 687 were diagnosed with tuberculosis, while the others were not. Taking the final diagnosis (including bacteriologically confirmed and clinically diagnosed) as the ground truth, the sensitivity of the software was 78.9% (75.7-81.8%), the specificity was 89.9% (86.7-92.4%), and the consistency rate was 83.2% (80.9-85.3%) with a Kappa value of 0.662 (P = 0.000) (Table 3).
In the PPS, the results of 1,150 subjects were analyzed. We found that 600 images were positive for TB, while 550 were negative for TB. Compared with the reference standard, the sensitivity of the software was 94.2% (92.0-95.8%), while the specificity was 91.2% (88.5-93.2%). The consistency rate was 92.7% (91.1-94.1%) with a Kappa value of 0.854 (P = 0.000) (Table 4). The AUC was 0.98 (Figure 3). The diagnosis remained unclear for 29 out of 1,161 participants after 2 months of clinical evaluation following enrollment. Among the remaining 1,121 participants, 582 were diagnosed with tuberculosis, while the other 539 were not. Taking the final diagnosis (including bacteriologically confirmed and clinically diagnosed) as the ground truth, the sensitivity of the software was 79.3% (76.1-82.2%), the specificity was 90.5% (87.4-92.9%), and the consistency rate was 83.7% (81.4-85.7%) with a Kappa value of 0.671 (P = 0.000) (Table 5). Since it has already been reported that the JF CXR-1 v2 algorithm   S1). Nevertheless, the sample size was only 233, which was too small to draw a robust conclusion.
In the safety set (SS), which included 1,161 participants, 0.3% (3/1,161) had AEs, and no SAE was reported. One participant twisted his ankle while running, and two had drug-induced dermatitis after starting anti-tuberculosis treatment. The AEs were mild, generally not bothersome, and unrelated to the software.

Discussion
In the era of AI applications, CNNs are the state of the art in image recognition and have attracted many researchers to develop algorithms that could replace human readers in identifying tuberculosis on CXR images (3, 15, 18-24). However, most of the software has been tested only in retrospective analyses (3, 13, 14, 17, 19, 25-27) or only on datasets (3, 18, 22, 23). Few studies have focused on software evaluation in prospective clinical contexts (28, 29). JF CXR-1 v2 has been certified by the National Medical Products Administration of China for the screening and auxiliary diagnosis of active pulmonary tuberculosis in individuals aged 15 years or older without immunodeficiency. We conducted this prospective multicenter clinical research to evaluate the performance of JF CXR-1 v2 in recognizing tuberculosis in persons without immunodeficiency and aged 15 years or older. As described above, 1,151 subjects in the FAS were analyzed. Compared with the results of the radiologist board (considered the reference standard), the software showed a sensitivity of 94.2% (95% CI: 92.0-95.8%) and a specificity of 91.2% (95% CI: 88.5-93.2%), and the consistency rate was 92.7% (91.1-94.1%) with a Kappa value of 0.854 (P = 0.000). The AUC was 0.98 (Figure 2). The results were very close in the PPS. The study of Nijiati et al. (17), which evaluated the performance of a trained CNN-based AI model for screening TB on CXRs in an underdeveloped area, reported a sensitivity, specificity, consistency rate, and AUC of 85.7%, 94.1%, 91.0%, and 0.910, respectively. Notably, Nijiati et al. (17) also took the results of radiologists as the reference standard because their purpose was screening rather than triage. In a prospective study of a pilot active TB onsite screening project (4), in which the reference standard was bacteriologically confirmed and clinically diagnosed TB, JF CXR-1 had a sensitivity, specificity, and AUC of 100.0%, 95.7%, and 0.978 at threshold 30, and of 75.0%, 96.8%, and 0.859 at threshold 50. When the same reference standard was used in our study, the sensitivity, specificity, and consistency rate were 78.9% (75.7-81.8%), 89.9% (86.7-92.4%), and 83.2% (80.9-85.3%) with a Kappa value of 0.66 (P = 0.000) in the FAS, and 79.3% (76.1-82.2%), 90.5% (87.4-92.9%), and 83.7% (81.4-85.7%) with a Kappa value of 0.67 (P = 0.000) (Table 5) in the PPS. The first open evaluation of JF CXR-1 was reported by Qin et al. in 2021 (13). JF CXR-1 was one of the five commercial AI algorithms evaluated for TB triaging in Dhaka, Bangladesh, a high-burden setting. CXRs from 23,954 individuals were included in the analysis, and Xpert was set as the reference standard. In that study, JF CXR-1 significantly outperformed radiologists. However, JF CXR-1 did not meet the WHO's Target Product Profile (TPP) for a triage test of at least 90.0% sensitivity and at least 70.0% specificity (30), similar to InferRead DR and Lunit INSIGHT. When the sensitivity was fixed at 90.0%, the specificity was 61.1% (60.4-61.8%), while when the specificity was fixed at 70.0%, the sensitivity was 85.0% (83.8-86.2%). Moreover, the AUC was 0.849 (0.843-0.855). The software was found to halve the number of Xpert tests required while maintaining a sensitivity above 90.0%. JF CXR-1 had higher sensitivity for most of the decision thresholds (above ∼0.15), which makes it better suited as a screening tool. Codlin et al. (25) conducted an independent evaluation of CAD software for TB screening that included 12 software products. The performance of each was compared against both an expert and an intermediate human reader. Xpert results were the reference standard, and half of the 12 software programs, including JF CXR-1, performed on par with the expert reader. The AUC of JF CXR-1 was 0.77 (0.73-0.81), ranking third among the 12 evaluated CAD software products.
Currently, there are 17 available or upcoming AI-CAD products for TB detection, including JF CXR-1. Nevertheless, only two are in the catalog of the Global Drug Facility of the Stop TB Partnership: CAD4TB software (The Netherlands) and InferRead DR Chest (Japan). JF CXR-1 is still under evaluation by the WHO (2020 report). CAD4TB was one of the five AI-CAD products compared for TB triaging in the retrospective study of Qin et al. (13), alongside InferRead DR, Lunit INSIGHT, JF CXR-1, and qXR. CAD4TB was the top performer, with a sensitivity, specificity, and AUC of 90.0% (89.0-91.0%), 72.9% (72.3-73.5%), and 0.903, respectively. In addition, 49.0% of the people triaged by CAD4TB would be recalled for confirmatory tests in a program focused on capturing almost all people with tuberculosis, while the percentage was 57.0% with JF CXR-1 (13). In a prospective study (31) that recruited symptomatic adults in a Pakistani hospital and used mycobacterial culture of two sputum samples as the reference, CAD4TB had a sensitivity, specificity, and AUC of 93.0% (90.0-96.0%), 69.0% (67.0-71.0%), and 0.87 (0.85-0.86), respectively. In brief, the AUC of CAD4TB ranges from 0.71 to 0.94 (14, 32), and nearly all the reference standards were set according to bacteriological results. The sensitivity, specificity, consistency rate, and AUC of JF CXR-1 in our study were non-inferior to most results from other AI-based software. However, since the reference standard in our study was the human reader instead of bacteriological results, the application of JF CXR-1 in clinical use is limited. Nevertheless, it could play an important role in screening programs, especially in resource-constrained areas (4), because bacteriologically confirmed TB accounts for only 63.0% of TB cases (1). Moreover, the purpose of screening is to identify people suspected of having TB who need further diagnostic evaluation.
Our study has the advantages of a prospective design and clinical validation, in which the CXR data had not been used for software training or testing. There are also limitations.
First, participants with obsolete tuberculosis were not included. Because obsolete and active tuberculosis can look similar on CXR, the performance of the AI software might be lower once obsolete tuberculosis is included. Second, because our purpose was to evaluate the screening performance of JF CXR-1, the main reference standard was the radiologists' reading rather than the bacteriological results; bacteriological screening was not feasible for financial reasons in a less developed, high-TB-burden country like China. Third, when it comes to real-world use, whether the software performs better than human readers remains uncertain because a comparison against bacteriological results was not conducted. Fourth, since the tool has shown poor performance in the age group >60 years (13), future studies are needed to fine-tune the algorithm in this area. AI technology is used not only for human resource optimization but also for better output. Therefore, we plan to conduct clinical research that includes participants with obsolete pulmonary tuberculosis and to further evaluate the triage performance (with bacteriological results) of JF CXR-1 in the future.
Furthermore, since the application of AI in tuberculosis screening is still under exploration, there are some practical considerations, especially in resource-constrained settings. First, some people might have little trust in AI and worry about data safety. Therefore, people undergoing TB screening should be made aware that the screening tool uses AI software, and they should give informed consent before proceeding with the screening process. Second, safety should be guaranteed for AI in healthcare (33). Unlike AI-based clinical decision support (CDS) software, screening software might be safer because diagnostic procedures follow screening. However, the reliability, validity, and stability of the software still need to be checked from time to time. Third, the network infrastructure might still be insufficient in resource-constrained settings, and there is the cost of establishing, maintaining, and repairing the network. CNNs also have limitations: (1) they have high computational requirements, and (2) because they have multiple layers, the training process takes a particularly long time if the computer does not have a powerful graphics processing unit (GPU). Even if TB screening with AI could save much money, whether resource-constrained areas could afford these other costs remains uncertain.
In our study, the software for tuberculosis screening based on a convolutional neural network algorithm was effective and safe, with satisfactory diagnostic performance. It is a potential candidate for solving tuberculosis screening problems in high-TB-burden areas that lack radiologists.

FIGURE Participant selection process and the diagnostic workflow. FAS, full analysis set; SS, safety analysis set; PPS, per protocol set; AUC, area under the receiver operating characteristic curve. *, details are in the methods section.
Shanghai Public Health Clinical Center; Center 02, Beijing Chest Hospital; Center 03, The Third Hospital of Zhenjiang; Center 04, Chongqing Public Health Medical Center; Center 05, Jiangxi Province Chest Hospital; Center 06, Hebei Chest Hospital.FAS, full analysis set; PPS, per protocol set; SS, safety set.
*The reports of ten subjects were lost during an office relocation due to COVID-19.
TABLE Performance of JF CXR-1 against final diagnosis (FAS).