Metrics to guide development of machine learning algorithms for malaria diagnosis

Automated malaria diagnosis is a difficult but high-value target for machine learning (ML), and effective algorithms could save many thousands of children's lives. However, current ML efforts largely neglect crucial use case constraints and are thus not clinically useful. Two factors in particular are crucial to developing algorithms translatable to clinical field settings: (i) Clear understanding of the clinical needs that ML solutions must accommodate; and (ii) task-relevant metrics for guiding and evaluating ML models. Neglect of these factors has seriously hampered past ML work on malaria, because the resulting algorithms do not align with clinical needs. In this paper we address these two issues in the context of automated malaria diagnosis via microscopy on Giemsa-stained blood films. First, we describe why domain expertise is crucial to effectively apply ML to malaria, and list technical documents and other resources that provide this domain knowledge. Second, we detail performance metrics tailored to the clinical requirements of malaria diagnosis, to guide development of ML models and evaluate model performance through the lens of clinical needs (versus a generic ML lens). We highlight the importance of a patient-level perspective, interpatient variability, false positive rates, limit of detection, and different types of error. We also discuss reasons why ROC curves, AUC, and F1, as commonly used in ML work, are poorly suited to this context. These findings also apply to other diseases involving parasite loads, including neglected tropical diseases (NTDs) such as schistosomiasis.


INTRODUCTION
Malaria and some neglected tropical diseases (e.g., schistosomiasis) involve parasite loads that can be detected in microscopy images of a substrate (e.g., blood or filtered urine). They are thus amenable, though difficult, targets for automated diagnosis via machine learning (ML) methods. These diseases are also very high-value ML targets: They are serious global health challenges affecting hundreds of millions of people, especially children, in underserved populations [1,2]. However, ML methods developed for malaria diagnosis using Giemsa-stained blood films have so far largely failed to translate to useful deployment, for several reasons.
(i) The task is difficult: for example, malaria parasites are small; field blood films are highly variable and often full of distractor objects; and the low limits of detection required for clinical use result in low signal-to-noise ratios (e.g., one parasite per 30 large fields of view).
(ii) ML development has typically proceeded in a heavily ML-centric mindset, without careful attention to (or even knowledge of) the domain specifics, use cases, and clinical requirements of malaria. This yields algorithms that, almost by design, fail to meet clinical needs and cannot be built upon (see Figure 1).
(iii) ML development can only optimize what is measured, so a crucial prerequisite for successful development is a set of task-relevant metrics [3]. These tailored metrics have largely been lacking for malaria, for which ML development has instead been guided by generic and ill-suited ML metrics such as object-level ROC curves.
This paper seeks to accelerate the ML community's progress towards translatable solutions for malaria diagnosis, by describing tools and techniques which we have found to be essential for development of clinically effective ML algorithms. It captures lessons learned by our group over a decade of applying ML to malaria diagnosis. The resulting algorithms [5,6,7] represent, to our knowledge, the most effective and also most extensively field tested [8,9,10,11,12] ML algorithms yet built for fully automated diagnosis of malaria on Giemsa-stained blood films. These field trials showed that our algorithms, though state-of-the-art, still fall short of the clinical demands, and highlight the need for more robust algorithms to truly impact malaria diagnosis.
Figure 1. Left: Effective ML (AI) solutions must interlock with domain requirements and will be shaped by non-ML pressures from the use case. Right: Solutions developed with an ML-centric approach, neglecting the use case, will fail to match clinical needs. ("Interlocking" metaphor due to Dr. Scott McClelland; jigsaw outline from [4].)
The paper is structured as follows: Section 2 details aspects of ML work that depend on a grasp of the clinical use-case (e.g., how the disease is diagnosed in the field), lists malaria documents especially relevant to ML work, and discusses other domain knowledge resources. Section 3 describes ML metrics tailored specifically to malaria and NTDs that can be applied during development of ML algorithms to optimize their clinical effectiveness, and also describes problems with some commonly-used ML metrics.
To be clinically useful an ML solution must fit into a larger, ML-independent context. It must interlock with other pieces that are shaped by clinician needs, site requirements, protocols currently in use, patient needs, business environment, etc. [13]. This strong constraint to mesh with non-ML considerations is often overlooked by ML practitioners, leading to algorithms that are elegant (from an ML perspective) but useless (from a clinical perspective) [14].
In particular, a clinically useful ML algorithm must fit into an existing care structure and meet or exceed existing clinical performance targets. So understanding these clinical constraints is a basic prerequisite for algorithm development. (We set aside the complex case of a disruptive technology potentially altering existing care protocols. Such cases of course require careful analysis.) This section discusses some crucial points to consider, and lists resources for learning about malaria use cases.

Important domain specifics
Several domain-specific details are fundamental to effective algorithm development:

Basic facts about the clinical needs:
For example, what are the proper uses of thick vs. thin blood films for malaria?

Performance metrics relevant in the clinic:
Examples include patient-level sensitivity and specificity, and limit of detection (LoD). This knowledge enables ML researchers to tailor salient metrics to guide algorithm development (like those we give in Section 3), define objective functions, do internal assessment, and report algorithm results meaningfully.

Performance specifications:
Clinicians are unwilling to reduce patient care standards, so ML models must perform at least as well as current practice to be deployable. Field performance requirements are thus vital concerns, even if a particular model iteration does not attain them (since the work can then be built upon or extended).

Domain-specific obstacles and shortcuts:
Some difficult details need special treatment, and others allow for valuable shortcuts. For example, malaria parasites can exist at various depths of a thick blood film, so a single image plane will not capture all parasites in focus. On the plus side, the nuclei of white blood cells (WBCs) are plentiful in thick films and stain similarly to malaria parasite nuclei, so they can serve as a ready-made color reference for the rare (or absent) parasites. Shortcuts matter because generic methods applied as-is are unlikely to hit clinical performance requirements, which is a much harder task than simply outdoing another generic method in an ML-style comparison.

Structuring annotations and training sets:
Annotations and training data are central to ML success, and must be tailored to the task. For example, malaria ring forms (the youngest parasite stage) typically have both a round nucleus and a crescent-shaped cytoplasm (examples in Fig 4). However, after drug treatment the rings often lack visible cytoplasm, appearing in thick films as dark round dots which are very similar to a common distractor type. As a result, they have outsized impact on decision boundaries and require special care as to annotation and inclusion in training sets.
Avenues to acquire vital domain expertise include (i) documentation and (ii) connecting with domain experts.

Documentation
Effective ML solutions need to design in accommodations to non-ML (e.g., clinical) constraints. Therefore, literature review to inform ML work should extend well beyond ML methods and focus on the clinical use-case itself, without an ML-centric filter. Documentation of use cases and standards of care is published by various agencies, including the World Health Organization (WHO), ministries of health, and non-government organizations (e.g., the Bill and Melinda Gates Foundation, the Global Fund, and the Worldwide Antimalarial Resistance Network).
Below we list some references that are especially relevant to ML researchers designing algorithms for automated malaria diagnosis using Giemsa-stained blood films.

Appropriate evidence for ML:
• The WHO has issued guidelines on how to generate meaningful evidence for ML-based medical tools [15], especially Section 1. This document is important for ML as applied to any medical use case. Crucially, evidence of algorithm performance during development must be firmly grounded in the clinical use case. This requirement underpins the metrics described below in 3.2-3.11.

Protocols for malaria microscopy:
Various groups have published diagnosis protocols which detail the clinical task.
• WHO's guidelines are an essential resource [16] and [17] (see especially SOPs 8 and 9 for diagnosis and quantitation).
• WWARN and the WHO have developed protocols tailored to research contexts (e.g., drug resistance sentinel sites) [20].

Evaluation tests:
• The WHO has developed a system to evaluate malaria microscopists. This uses a set of 56 blood slides with carefully specified parasitemias and species [21] (section 6). The "WHO 56" evaluation reflects the tasks and accuracies required in the clinic and is thus a valuable and challenging test for ML algorithms. Its difficulty gives an appreciation of the skills of human field microscopists. The defined competency levels offer clear and clinically meaningful performance targets for ML algorithms. Note that the "WHO 56" differs slightly from the previous version (the "WHO 55") found in [22].
• A similar but distinct evaluation set of blood slides, tailored to research rather than clinical contexts, is detailed in [20].

Neglected tropical diseases:
• The WHO has defined target product profiles, including sensitivity and specificity requirements, that are relevant to automated ML systems targeting schistosomiasis [23].

Other performance specifications:
• The above documents also provide detail concerning other general product requirements relevant to any ML solution that aims for translation to clinics. These issues include time-to-result, throughput, electricity/battery constraints, price, and (implicitly) computational constraints.

ML publications:
• Some ML papers (e.g., [10,24]) cite non-ML documents relevant to use case, but this is not (yet) common practice. So ML-based literature search is insufficient.

Domain experts
Domain experts are a vital source of guidance and collaboration. They include field experts, i.e. those who work in field clinics or who do field-based research; and subject matter experts, such as WHO personnel and long-time researchers in the space (these groups overlap). The value of their experience and insight to effective algorithm development cannot be overstated.
As an example, our group's entire ML program for malaria diagnosis has depended absolutely upon expert input from a technical advisory panel, as well as on continued contacts and advice from field clinics. To the degree that our work has succeeded, this expert input has been the key ingredient (along with the closely entwined matter of data collection and curation). We would argue that ML development can only progress towards clinically useful algorithms when domain expertise is somehow integrated into the team (recent examples include [25,26,27,28,29,30] and for schistosomiasis [24,31]).
Connecting with such experts is made easier by two things. First, people (on average) love to talk about their work. Second, field experts are often (again, on average) open to engaging with ML solutions and happy to co-author serious research.
Sources for contacts include: (i) published work, e.g., who is leading and authoring/co-authoring relevant studies; (ii) academic institutions with concentrations of research in the space; (iii) online interest groups, e.g., on LinkedIn; and (iv) non-ML conferences, their attendees, and proceedings, e.g., the American Society of Tropical Medicine and Hygiene.

SALIENT METRICS
Salient metrics are essential to ML work, both to guide development and to report results meaningfully. Unfortunately, the metrics routinely applied to ML work on malaria (e.g., object-level precision, recall, AUC, and F1 score) have disqualifying drawbacks in the malaria context.
A 2018 review of automated malaria detection papers [32] described serious problems (which still persist): reported metrics are incomplete and not comparable between studies; metrics are object-based (not patient-based) and are thus not relevant to the clinical task; train and test sets contain objects from the same patient, which contradicts the patient-level focus; and datasets are too small. In addition, incorrect assumptions are built into algorithms: for example, diagnosis on thin blood films is common in ML papers, despite being contrary to clinical practice due to practical obstacles [17,33] (though see recent work on thin film spreaders in [34,35]).
In this section, we first discuss some of the problems with commonly-used ML metrics (3.1). We then describe in detail some alternate metrics which have high clinical relevance for the malaria use case (3.2-3.11). These metrics are effective tools both to guide ML development and to report meaningful performance results. They are suitable for ML models that target diseases involving parasite loads such as malaria, NTDs, or more generally any pathology where diagnosis is determined by the presence of a variable number of abnormal objects (e.g., pixels or cells in a histopathology slide).

Problems with ROCs, AUCs, and Precision
ML practitioners choose metrics to evaluate model performance by (i) what is customary, familiar, and convenient; (ii) what has been done by previous authors; (iii) what can generate the "state of the art" (SOTA) comparisons required for publication in the ML community; and (iv) what is acceptable to ML reviewers. This creates a closed loop which perpetuates the use of certain metrics without regard to their effectiveness. When entrenched metrics do not assess algorithm performance in a clinically relevant way, this loop blocks progress towards deployable solutions.
Several commonly-used ML metrics, including object-level ROC curves, AUC, object precision, and F1 score, appear frequently in the ML malaria literature. But in the malaria context these are flawed measures of performance, inappropriate as evidence per [15], and should therefore be avoided (they can be useful intermediate measures for internal algorithm work).

Object-level ROC curves and AUC
Object-level ROC curves, and the associated Area Under Curve (AUC), are routinely reported by ML research papers involving parasite detection. However, they have three key weaknesses in this context (except perhaps as intermediate measures for internal algorithm work).
First, they do not address the clinical need for patient-centric care. In particular, they ignore the crucial matter of patient-level variability of object-level accuracy (this variability is discussed in 3.3 and 3.4).
Second, real samples often have a large imbalance between distractors and positive objects, especially at parasitemias near clinical LoD. A common situation is a model that diagnoses malaria on thin films by labeling individual RBCs as infected or not. Here, 5 million RBCs/µL vs. 100 p/µL at LoD gives 50,000 negative objects for each positive object, so a 0.999 AUC can coexist with an average of 50 FPs per parasite at LoD (a very low SNR). Since one detected parasite and one FP object have equal impact on diagnosis (as determined by exceeding a threshold T), FP noise will swamp the diagnostic signal of detected parasites.
In such cases with large class imbalance (say D:1), the leftmost $\frac{1}{D}$th vertical sliver of the ROC curve, with y-axis rescaled to be full width, reflects a more meaningful (and more sobering) ROC, because this expanded sliver visually weights TP counts and FP counts equally, as shown in Figure 2.
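The imbalance arithmetic above can be checked with a few lines of code. This is only an illustrative sketch: the function name is hypothetical, and the object-level false positive fraction of 0.001 is an assumed operating point for a classifier with a very high AUC, since the text does not specify one.

```python
def fps_per_parasite(negatives_per_ul, parasites_per_ul,
                     object_fpr, object_sensitivity=1.0):
    """Expected false positives per detected parasite in 1 uL of blood."""
    false_positives = negatives_per_ul * object_fpr
    true_positives = parasites_per_ul * object_sensitivity
    return false_positives / true_positives


RBCS_PER_UL = 5_000_000        # thin film: every RBC is a candidate object
LOD_PARASITES_PER_UL = 100     # clinical limit of detection for malaria

# Negative objects per positive object at LoD (the "D" in D:1).
imbalance = RBCS_PER_UL / LOD_PARASITES_PER_UL

# Assumed operating point: 0.1% of uninfected RBCs mislabeled as infected.
ratio = fps_per_parasite(RBCS_PER_UL, LOD_PARASITES_PER_UL, object_fpr=0.001)

print(f"{imbalance:,.0f} negatives per parasite")
print(f"{ratio:.0f} FPs per detected parasite")
```

Even with perfect object-level sensitivity, a seemingly excellent object-level error rate leaves the detected parasites heavily outnumbered by false positives.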
Third, the object-level ROC curve depends heavily on how distractors are defined. For example, when using thick films to diagnose malaria, "distractor" can mean (i) only the most difficult objects that closely resemble parasites; or (ii) any dark blob; or even (iii) every pixel in an image. Figure 3 shows an example in which considering only "difficult" distractors (top) results in a low AUC, while considering additional, mostly "easy" distractors (bottom) gives a higher AUC with no change in actual performance as measured by FPR.
More informative than the object-level ROC is the Free ROC (FROC), which plots sensitivity vs FPR per cV. FROCs for object level are useful for development work: they clarify where gains can be made by favorably trading off object-level sensitivity for lower FP rates. When datasets lack sufficient numbers of patients, FROCs on pooled objects can provide some insight into algorithm performance, with the caveat that they ignore patient-level variability.
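As a rough sketch of how pooled-object FROC points might be computed, the snippet below sweeps an object score threshold and reports (FPs per cV, sensitivity) pairs. The function name, the input layout, and the toy scores are all hypothetical; it assumes every candidate object has a classifier score and a ground-truth label, and that the total examined volume is known.

```python
def froc_points(parasite_scores, distractor_scores, total_volume,
                thresholds, cv=1.0):
    """Return (FPs per cV, object-level sensitivity) for each score threshold.

    parasite_scores:   scores of annotated parasites (pooled over samples)
    distractor_scores: scores of non-parasite objects (pooled over samples)
    total_volume:      total examined volume, in the same units as cv
    cv:                clinically relevant unit of substrate, e.g. 1 uL of blood
    """
    points = []
    for c in thresholds:
        detected = sum(s >= c for s in parasite_scores)
        false_pos = sum(s >= c for s in distractor_scores)
        sensitivity = detected / len(parasite_scores)
        fp_per_cv = false_pos * cv / total_volume
        points.append((fp_per_cv, sensitivity))
    return points


# Toy scores, for illustration only.
points = froc_points(
    parasite_scores=[0.9, 0.8, 0.4, 0.3],
    distractor_scores=[0.1, 0.3, 0.6, 0.85],
    total_volume=2.0,
    thresholds=[0.5, 0.25],
)
print(points)
```

Unlike the ROC x-axis, the FROC x-axis does not depend on how many "easy" distractors are counted as negatives, only on how many objects are mislabeled per unit volume.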

Patient-level ROCs
Patient-level ROCs can give a useful sense of algorithm behavior near the clinical performance requirements and are well worth reporting when sufficient data exists to plot them. However, there are two caveats. First, the only salient portion of a patient-level ROC is the region near clinically relevant operating points (e.g., specificity 90%). Second, because sensitivity is parasitemia-dependent (3.4), the ROC is dependent also. Thus, a given algorithm may have much higher AUROC on a population with primarily high parasitemias than on one with lower parasitemias.

Precision
Object-level Precision is the ratio of detected parasites over all detected objects, $\frac{tp}{tp+fp}$, and often appears as an ML metric. This metric, as used, tends to badly underestimate the effects of parasite-to-distractor imbalances at the low LoDs required for clinical use, as follows.
In ML papers, precision is often calculated on datasets with the clinically unrealistic situation of roughly balanced parasite and distractor counts, either because the numbers of objects have been artificially balanced or because the positive samples had high parasitemias (i.e. many parasites per volume V). Since FPs roughly scale with volume V, high parasitemia samples yield much more balanced TP:FP ratios, which tend to give precisions which do not generalize to low parasitemia samples.
For example, a precision of 0.99 calculated on samples with P ≈ 10,000 p/µL corresponds to 100 FPs per µL (assuming perfect sensitivity). At the required LoD of 100 p/µL, these same 100 FPs correspond to 100 parasites, giving precision = 0.5, a much less attractive result.
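The same arithmetic can be written as a short sketch. The helper names are hypothetical, and the calculation assumes (as the example does) perfect object-level sensitivity and an FP count per µL that does not change with parasitemia.

```python
def fp_per_ul_from_precision(precision, parasitemia, sensitivity=1.0):
    """FPs per uL implied by a reported precision at a given parasitemia."""
    tp = parasitemia * sensitivity
    return tp * (1.0 - precision) / precision


def precision_at(parasitemia, fp_per_ul, sensitivity=1.0):
    """Precision at a given parasitemia, holding FPs per uL fixed."""
    tp = parasitemia * sensitivity
    return tp / (tp + fp_per_ul)


# Precision of 0.99 reported on high-parasitemia samples (~10,000 p/uL).
fp_per_ul = fp_per_ul_from_precision(0.99, parasitemia=10_000)

# The same FP load evaluated at the clinical LoD of 100 p/uL.
precision_at_lod = precision_at(100, fp_per_ul)

print(f"FPs per uL: {fp_per_ul:.0f}")
print(f"precision at LoD: {precision_at_lod:.2f}")
```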
The related metric F1, the harmonic mean of precision and object-level sensitivity (also problematic, as noted in 3.4), is a similarly misleading metric for reporting algorithm results, and in addition has no clinical utility.
The rest of this section (3.2-3.11) discusses metrics that better reflect malaria's clinical use case.

Patient level metrics
The importance of assessing algorithm performance at the patient level cannot be over-emphasized. The basic unit of clinical care is the patient (we set aside population-level diagnostics such as for Vitamin A deficiency [36]), so the most relevant metrics are defined at the patient level, not the object level. Performance assessed across pooled objects can be a useful intermediate step during ML development, but it is fundamentally unrealistic, because (i) it does not match the clinical task; (ii) it ignores interpatient variability; and (iii) it is dominated by high parasitemia samples. For example, consider four malaria-positive patients, with
Patient 1: 50,000 parasites/µL (p/µL).
Patients 2, 3, 4: each 300 p/µL.
Suppose the algorithm detects all parasites in {1}, and misses all parasites in {2,3,4} (a realistic scenario due to interslide variability). Then the object-level sensitivity is 98%, while patient-level sensitivity is 25%.
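The four-patient example can be reproduced in a few lines. The data layout is hypothetical, and for simplicity the sketch treats a patient as correctly diagnosed whenever at least one parasite is detected, which is an assumption and not a clinical decision rule.

```python
# Parasites present and parasites detected per uL, for each positive patient.
patients = [
    {"parasites": 50_000, "detected": 50_000},  # Patient 1: all detected
    {"parasites": 300, "detected": 0},          # Patients 2-4: all missed
    {"parasites": 300, "detected": 0},
    {"parasites": 300, "detected": 0},
]

# Pooled object-level sensitivity: dominated by the high-parasitemia patient.
pooled_sensitivity = (
    sum(p["detected"] for p in patients) / sum(p["parasites"] for p in patients)
)

# Patient-level sensitivity under the simplifying assumption stated above.
patient_sensitivity = (
    sum(1 for p in patients if p["detected"] > 0) / len(patients)
)

print(f"object-level (pooled): {pooled_sensitivity:.0%}")
print(f"patient-level:         {patient_sensitivity:.0%}")
```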
We have found that two metrics, each defined on a per-patient basis, are particularly useful: false positive rate (FPR) and sensitivity. Each is calculated separately for each patient, using algorithm accuracy on objects within that patient's sample. These are covered in 3.3 and 3.4, and underpin other metrics related to specificity (3.5), LoD (3.6), and quantitation (3.9).
Interpatient variability (as in Figure 4) poses great difficulty for ML, so it must be factored into algorithm evaluation. It is captured by the standard deviations of FPR and sensitivity (cf. 3.3, 3.4), to the degree that the dataset captures interpatient diversity.
A related issue is interclinic variability. For example, clinics can use different stain variants (e.g., Giemsa, Field, and JSB). Even clinics with nominally identical protocols can differ substantially (see e.g., [11] and a detailed example in [8]). Besides variations in presentation, different clinics may produce populations of samples with differently distributed FPRs and sensitivities. Implications of this for tuning algorithms are covered in 3.5.

FP Rate
False positive rate (FPR) is the number of distractors mislabeled as parasites per clinically relevant unit of substrate, hereafter cV, e.g., 1 µL of blood (malaria), 10 mL urine (Schistosoma haematobium), 1 gram stool (other Schisto), a specified number of cells in a histological sample, etc.; but not "per image tile", which generally has no clinical relevance (though image tiles can often be translated into the microscopy "Fields of View" used in protocols). Malaria ML papers with some FPR analysis include [37,25,6,7]. Crucially, FPR is calculated separately for each patient. We denote the vector of FPRs for the population of patients as F.
FPR is not object-level specificity, which is a commonly reported but highly flawed measure in this context (see 3.1).
While FPR can be calculated for any sample, FPRs on positive samples may be erroneously boosted by mis- or unannotated parasites. Thus, the population's FPR distribution is best characterized using negative samples only.
Interpatient variability makes the standard deviation of FPR, σ(F), a crucial performance measure. The mean FPR µ(F) is less relevant because it can be subtracted out, as shown in 3.6 and 3.9. However, since it tends to scale roughly with σ(F), it can give a hint as to the relative magnitude of σ(F) (see e.g., [6,7]).
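A minimal sketch of computing the per-patient FPR vector F and its summary statistics is given below. The FP counts and examined volumes are invented for illustration, the function name is hypothetical, and the use of the sample (n-1) standard deviation is an assumption.

```python
from statistics import mean, stdev


def fpr_per_cv(fp_count, examined_volume, cv=1.0):
    """False positives per clinically relevant unit cV for one patient.

    fp_count:        distractors mislabeled as parasites in the examined volume
    examined_volume: volume actually examined, in the same units as cv
    """
    return fp_count * cv / examined_volume


# Hypothetical negative samples: (FP count, examined volume in uL).
negative_samples = [(12, 0.1), (30, 0.1), (5, 0.1), (200, 0.1)]

# F: one FPR per patient, in FPs per uL.
F = [fpr_per_cv(fp, vol) for fp, vol in negative_samples]

mu_F = mean(F)
sigma_F = stdev(F)  # sample standard deviation across patients

print([round(f) for f in F])
print(f"mean FPR: {mu_F:.1f} per uL, std dev: {sigma_F:.1f} per uL")
```

In this toy set a single "dirty" sample dominates σ(F), which is the kind of interpatient variability a pooled-object FPR would hide.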
In datasets with insufficient numbers of patients, an FPR calculated over pooled objects has some value as a lower bound on F. In particular, it can be compared to the clinical LoD requirement. For example, a pooled-object FPR of 5,000/µL, vs. a required 100 p/µL LoD (malaria), is a clear sign that work is still needed. Multiple splits of a set of pooled objects do not simulate σ(F), because each split will include the full patient diversity.
Aside: Samples with high FPRs are sometimes criticized as being due to "poor sample preparation". However, except for extreme cases this is in the eye of the beholder: human clinicians readily and successfully diagnose "dirty" samples on which ML algorithms fail. Thus, the need to improve sample prep is to a large degree a need to accommodate ML methods' struggles with handling highly variable sample presentations. See [11], and a detailed example in [8].

Sensitivity
Sensitivity (aka recall) is the fraction of positive items in a set that are correctly labeled: Sensitivity = $\frac{tp}{tp+fn}$, where tp = true positives, i.e. positive items labeled correctly, and fn = false negatives, i.e. positive items labeled as negative or missed. The "items" can be parasites (object-level) or malaria-positive patients (patient-level).

Pooled object sensitivity
Sensitivity over a pooled set of parasites from multiple patients has some value as an intermediate assessment metric during ML development (e.g., as a loss function for gradient descent training), if it is analysed carefully to avoid problems such as imbalanced parasitemias distorting the object pool (cf. the example given in 3.2).

A clinically realistic and useful version of object-level sensitivity measures each patient separately:
Per patient object-level sensitivity is the fraction of parasites in the examined volume V of a positive sample that are correctly labeled (e.g., by means of an object score threshold C), $\frac{tp}{tp+fn}$, where tp = parasites labeled correctly, and fn = parasites labeled as distractors (or missed). There is no constraint on the size of V or parasitemia, but sensitivities for patients with few parasites are less reliable (cf. the law of large numbers). Each patient's object-level sensitivity is calculated separately. We denote the vector of sensitivities for the (malaria-positive) population as S. S underpins metrics related to LoD (3.6) and quantitation (3.9).

Patient-level sensitivity
Patient-level sensitivity is sensitivity in the usual clinical sense of the fraction of positive patients correctly diagnosed (not S). It is of course a vital metric clinically, but is complex to interpret because it depends on two things:
(i) The particular parasitemia distribution of the tested set: Patients with low parasitemias (close to the LoD) are harder to identify. In malaria for example (where LoD ≈ 100 p/µL), if all patients have parasitemias > 1000 p/µL, 100% sensitivity is (hopefully) trivial, while if all parasitemias are under 50 p/µL, very low sensitivity is likely.
(ii) The particular specificity: Sensitivity and specificity are paired and move in opposite directions, as seen in ROC curves.
Thus, reporting patient-level sensitivity is uninformative and even misleading unless one also reports (i) the parasitemia distribution, and (ii) the associated specificity on negative samples. The WHO competency levels are an important example: These levels crucially assume the parasitemia distribution of the WHO 56 diagnosis slide set, viz. 20 negative slides and 20 positive slides with parasitemias between 80 and 200 p/µL [21]. WHO competency level ratings do not apply to results on distributions with higher parasitemia samples.
A principled way to maximize patient-level sensitivity is given in 3.7.

Effect of species on sensitivity
Algorithm sensitivity results should be broken down by species as well as by parasitemia, because malaria species has strong impact on patient-level sensitivity. This is due to the unique synchronization and sequestration behaviors of P. falciparum [38]:
(i) In falciparum the large, distinctive late stage forms sequester out of the peripheral blood, leaving only the smaller ring forms that are harder to detect and disambiguate from distractor objects (especially in thick films). As a result, non-falciparum species (i.e. vivax, ovale, malariae, knowlesi) are much easier to diagnose (given equal parasitemias), which allows an algorithm to have lower LoD and higher patient-level sensitivity.
(ii) falciparum parasites tend to synchronize in peripheral blood, with the presenting parasites forming a narrow age distribution. This strongly impacts diagnostic methods that target the biomarker hemozoin: non-falciparum samples can be very sensitively detected due to the reliable presence of late-stage, high-hemozoin parasites [39], but even high parasitemia falciparum samples can lack detectable hemozoin due to synchronized populations of early stage ring forms [40,41,42], resulting in drastically different sensitivities by species. (Hemozoin appears to be a sensitive biomarker for falciparum in cultured blood because synchronization is absent.) This is a high-stakes issue because falciparum is much more often fatal than non-falciparum species.

Specificity
Specificity is the fraction of negative items (distractor objects or patients) that are correctly diagnosed as negative: Specificity = $\frac{tn}{tn+fp}$, where tn = true negatives (negative items in V labeled correctly), and fp = false positives (negative items labeled incorrectly).

Object-level specificity
Object-level specificity, even if calculated for each patient separately, has little usefulness and can be highly deceptive (see 3.1).

Patient-level specificity
Patient-level specificity, i.e. in the usual clinical sense, is highly salient. Clinical goals of high specificity include not overwhelming the health care system, avoiding excess treatments, and preventing misattribution. Thus, clinical use-cases generally require a high specificity (e.g., 90% for malaria diagnosis [21], 97.5% for schistosomiasis [23]).
Specificity is closely tied to FPR (3.3 above) and can be readily tuned for an algorithm that labels objects: Suppose that objects have been detected then labeled by some method (e.g., a threshold C on object scores), that F (from 3.3) is gaussian, and that patient diagnosis is determined by a threshold T on the number of positively-labeled objects per cV (i.e. a standard "detect, classify, count, then threshold" approach). To attain a target specificity K, one can set
$$T = \mu(\mathbf{F}) + \alpha\,\sigma(\mathbf{F}) \qquad (1)$$
where α is found via the (one-sided) error function and K. Alternate formulations for the case of nongaussian F are given in section 3.8.
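The threshold calculation can be sketched as follows, assuming the gaussian form T = µ(F) + α σ(F), with α the one-sided standard normal quantile for K (consistent with the later use of α = 1.65 for 95% specificity). The function name and FPR values are hypothetical; `statistics.NormalDist` is part of the Python standard library (3.8+).

```python
from statistics import NormalDist, mean, stdev


def threshold_for_specificity(F, K):
    """Diagnosis threshold T (objects per cV) for target patient specificity K.

    F: per-patient FPRs on validation negatives, assumed roughly gaussian.
    K: target patient-level specificity, e.g. 0.95.
    """
    alpha = NormalDist().inv_cdf(K)  # one-sided quantile, about 1.645 for K=0.95
    return mean(F) + alpha * stdev(F)


# Hypothetical per-patient FPRs (FPs per uL) on validation negatives.
F = [40.0, 55.0, 62.0, 48.0, 70.0, 51.0]

alpha_95 = NormalDist().inv_cdf(0.95)
T = threshold_for_specificity(F, K=0.95)
print(f"alpha = {alpha_95:.3f}, T = {T:.1f} objects per uL")
```

A patient would then be called positive when the count of positively-labeled objects per cV meets or exceeds T.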
Negative samples are easier to obtain and trivial to annotate (assuming accurate patient-level ground truth), and specificity depends only on negative samples. So T can ideally be tuned on a separate, dedicated validation set of negatives that capture a sufficient range of FPRs (both "dirty" and "clean" samples).
Note that different clinics can have widely different FPR distributions F. Because σ(F) determines both specificity (Eqn 1) and LoD (3.6), different clinics may require different hyperparameters to hit the target patient specificity K, leading to different LoDs. Thus, tuning an algorithm for deployment may involve multiple validation sets of negatives (by clinic), with clinic-dependent trade-offs between specificity and higher LoD.

Limit of detection
LoD can be directly probed using holdout sets of low parasitemia positive samples. These are not as useful for training anyway, as they supply few parasite objects. However, this is impractical because it's hard to acquire enough malaria samples near the LoD.
We can calculate a useful estimate of LoD from F and S as follows:
• Denote the putative LoD as L parasites per cV, and suppose that a patient is diagnosed as "positive" when N ≥ T, where N is the number of positively-labeled objects per cV. Note that N = TP + FP in positive patients, and N = FP in negative patients, where TP and FP denote counts per cV, so $TP = tp \cdot \frac{cV}{V}$ where tp is the number of parasites correctly labeled in V (similarly $FP = fp \cdot \frac{cV}{V}$).
• Make T high enough to enforce 95% specificity on negative samples, as described in [6], by setting α to 1.65 std devs in Eqn 1:
$$T = \mu(\mathbf{F}) + 1.65\,\sigma(\mathbf{F})$$
• Then for positive samples the worst case is a very "clean" sample with low FPR, such as the 5th percentile of samples with FP = µ(F) − 1.65σ(F). In this case we must depend mostly on detected parasites to ensure N ≥ T for a positive diagnosis. Suppose for ease that the sample has average sensitivity = µ(S). Then a sample at LoD has TP = Lµ(S).
• To diagnose this positive sample correctly (but just barely, i.e. N = T), we need
$$L\,\mu(\mathbf{S}) + \mu(\mathbf{F}) - 1.65\,\sigma(\mathbf{F}) = \mu(\mathbf{F}) + 1.65\,\sigma(\mathbf{F})$$
So the estimated LoD (L per cV) has
$$L = \frac{3.3\,\sigma(\mathbf{F})}{\mu(\mathbf{S})}$$
• Optionally, +1 can be added to the numerator (i.e. require N = T + 1) to prevent unpredictable behavior should both σ(F) and µ(S) approach 0:
$$L = \frac{3.3\,\sigma(\mathbf{F}) + 1}{\mu(\mathbf{S})}$$
We have found this estimate to be a good (slightly optimistic) proxy for actual LoD when assessing algorithms during development. It has the practical advantage that low parasitemia samples are unnecessary, because the vector S can be well characterized by high parasitemia samples. It also allows useful comparison of algorithms, as it directly addresses a key clinical requirement and is anchored to the relevant unit cV.
A more nuanced (and pessimistic) proxy could account for σ(S) by having a denominator = µ(S)−β σ(S) for some β.
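The LoD estimate can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the estimate reduces to L = 2 × 1.65 σ(F) / µ(S), which follows from requiring Lµ(S) + µ(F) − 1.65σ(F) = µ(F) + 1.65σ(F). The function name, the argument names, the use of the population std dev, and the sample numbers are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def estimate_lod(fp_per_cv, sensitivities, z=1.65, add_one=False, beta=0.0):
    """Estimate LoD in parasites per cV.

    fp_per_cv     -- F: FP counts per cV, one per negative validation patient
    sensitivities -- S: object-level sensitivity, one per positive patient
    z             -- std devs for the target specificity (1.65 for 95%)
    add_one       -- add 1 to the numerator (i.e., require N = T + 1)
    beta          -- optional pessimism: denominator = mu(S) - beta * sigma(S)
    """
    sigma_f = pstdev(fp_per_cv)  # population std dev (an assumption)
    denominator = mean(sensitivities) - beta * pstdev(sensitivities)
    if denominator <= 0:
        raise ValueError("effective sensitivity must be positive")
    numerator = 2 * z * sigma_f + (1.0 if add_one else 0.0)
    return numerator / denominator


# Illustrative numbers only (not from any real dataset).
F = [2, 4, 4, 4, 5, 5, 7, 9]   # FP counts per cV; sigma(F) = 2
S = [0.5, 0.5, 0.5]            # per-patient sensitivities; mu(S) = 0.5
print(estimate_lod(F, S))                  # about 13.2 parasites per cV
print(estimate_lod(F, S, add_one=True))    # about 15.2 parasites per cV
```

Setting beta > 0 gives the more pessimistic variant with denominator µ(S) − β σ(S).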

Choosing operating points
Given a trained algorithm that uses the two hyperparameters C and T, {C, T} can be optimized in a principled way to maximize patient-level sensitivity, subject to the constraint of a fixed target specificity K:
• Set aside a validation set of negative samples. If there are sufficient positive samples to spare, optionally set these aside also.
• For each C:
- Calculate F over the validation negatives, and µ(S) over the validation positives if available, or (less ideal but workable) over the training set positives.
- Determine T = T(C, K, F), which hits the target specificity K on the validation negatives, as in 3.5.
- Estimate LoD as in 3.6.
• Select the C with the lowest LoD.
• Use this {C, T } pair as algorithm hyperparameters to process test sets, and report patient-level specificity and sensitivity.
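The selection loop above can be sketched as follows. This is a hedged illustration: it assumes a Gaussian FPR distribution, so that T = µ(F) + z σ(F), with z taken from the normal quantile for K, and it uses the LoD estimate 2z σ(F) / µ(S). The function name, the `candidates` data structure, and the numbers are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def choose_operating_point(candidates, target_specificity=0.95):
    """Pick the {C, T} pair with the lowest estimated LoD.

    candidates maps each object-score threshold C to a tuple
    (fp_per_cv_on_validation_negatives, mean_sensitivity).
    Returns (C, T, estimated_lod), or None if no C is usable.
    """
    z = NormalDist().inv_cdf(target_specificity)  # about 1.645 for K = 0.95
    best = None
    for c, (fp_per_cv, mu_s) in candidates.items():
        if mu_s <= 0:
            continue  # nothing detected at this C, so LoD is undefined
        mu_f, sigma_f = mean(fp_per_cv), pstdev(fp_per_cv)
        t = mu_f + z * sigma_f          # patient-level threshold (Gaussian F)
        lod = 2 * z * sigma_f / mu_s    # LoD estimate, parasites per cV
        if best is None or lod < best[2]:
            best = (c, t, lod)
    return best


# Illustrative numbers only: a stricter C lowers both FPR spread and sensitivity.
candidates = {
    0.5: ([2, 4, 4, 4, 5, 5, 7, 9], 0.8),
    0.9: ([0, 1, 1, 1, 1, 1, 1, 2], 0.5),
}
best_c, best_t, best_lod = choose_operating_point(candidates)
print(best_c, round(best_t, 2), round(best_lod, 2))
```

In this toy example the stricter threshold wins because its drop in σ(F) outweighs its drop in sensitivity.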

Modified LoD and operating point formulas
The methods for setting T in 3.5 and for estimating LoD in 3.6 both assume that the FPR vector F is Gaussian. In our experience this is often not the case. Rather, the FPR distribution may be asymmetrical, with mostly low-FPR samples and a few high-FPR samples. This can be handled by modifying the methods in 3.5 and 3.6 as follows:
• For µ(F), use the median of F instead of the mean of F. Similarly, if the vector S is non-Gaussian, the median can be used instead of the mean for µ(S).
• For σ(F), use one-sided std devs, which can be calculated by keeping only the points to the right (or left) of the median and reflecting them across the median as centerpoint to create a symmetric distribution. This gives, for the FPR distribution above, a large right std dev σ_R(F) and a small left std dev σ_L(F).
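• Then the new versions of Eqns 1 and 3 are
$$T = \mu(F) + \alpha\,\sigma_R(F) \tag{5}$$
$$L = \frac{1.65\,\bigl(\sigma_R(F) + \sigma_L(F)\bigr)}{\mu(S)} \tag{6}$$
where µ(F) and µ(S) now denote medians.
Two other methods of calculating T from F may be useful:
1. Set T based on the Kth percentile of F.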
2. Manually choose T based on a scatterplot of the F P counts in the validation negative samples.
For both these methods, the detected objects are assumed to be already classified.If a threshold C on object scores was used, then first T must be calculated for each C, before choosing the best {C, T } pair as in 3.7.
The manual method of choosing {C, T } takes time, but it can yield the best results in a field deployment because it is most closely tailored to the empirical FPR distribution.
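The one-sided std devs can be computed as in the sketch below. Two conventions here are assumptions rather than choices stated in the text: points exactly at the median are excluded from both halves, and the population (not sample) std dev is used, so that each one-sided std dev equals the RMS distance from the median. The threshold and LoD formulas are the median-based analogues of the Gaussian case, with σ_R(F) used for the threshold and σ_L(F) for the "clean" worst-case sample. Function names and data are illustrative.

```python
from math import sqrt
from statistics import median


def one_sided_stds(values):
    """Return (sigma_left, sigma_right) about the median.

    Reflecting one side across the median gives a symmetric set whose mean
    is the median, so its std dev is the RMS distance from the median.
    """
    m = median(values)
    right = [(v - m) ** 2 for v in values if v > m]
    left = [(m - v) ** 2 for v in values if v < m]
    sigma_r = sqrt(sum(right) / len(right)) if right else 0.0
    sigma_l = sqrt(sum(left) / len(left)) if left else 0.0
    return sigma_l, sigma_r


def robust_threshold_and_lod(fp_per_cv, sensitivities, z=1.65):
    """Median-based threshold T and LoD estimate for skewed FPR distributions."""
    sigma_l, sigma_r = one_sided_stds(fp_per_cv)
    t = median(fp_per_cv) + z * sigma_r                    # uses the right tail
    lod = z * (sigma_r + sigma_l) / median(sensitivities)  # uses both tails
    return t, lod


# Illustrative skewed FP counts per cV: mostly low, a few high.
F = [1, 1, 2, 2, 2, 5, 5]
S = [0.4, 0.5, 0.6]
t, lod = robust_threshold_and_lod(F, S)
print(one_sided_stds(F), round(t, 2), round(lod, 2))
```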

Measuring quantitation accuracy
Quantitation accuracy should be reported at the patient level due to high interpatient variability. For plotting quantitation error per patient, Bland-Altman plots are preferable because relative quantitation error is generally most important [20].
Reporting the R² value of a linear fit of estimated vs. true parasitemia (i.e., P̂ vs. P) is unsuitable when parasitemias range over orders of magnitude (common in malaria), because effects of the L2 norm almost guarantee that high-parasitemia samples will lie on the fitted line while high relative errors on low-parasitemia samples will be downplayed, giving an illusion of strong fit. Fitting log(P) rather than P helps to reduce this illusion.
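A small numerical sketch makes the R² illusion concrete. The parasitemia values below are invented for illustration, and the helper names are hypothetical. The two lowest-parasitemia patients are overestimated by 200% and 150%, yet the R² of the linear fit stays very close to 1.

```python
def r_squared(x, y):
    """R^2 of a simple least-squares line (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def relative_errors(true_p, est_p):
    """Per-patient relative quantitation error (est - true) / true."""
    return [(e - t) / t for t, e in zip(true_p, est_p)]


# Invented values spanning three orders of magnitude (parasites per uL).
true_p = [100, 1000, 10000, 100000]
est_p = [300, 2500, 11000, 101000]

r2 = r_squared(true_p, est_p)
rel = relative_errors(true_p, est_p)
print(round(r2, 4))   # very close to 1 despite the large errors below
print(rel)            # relative errors, largest for the low-parasitemia patients
```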

Estimating parasitemia
As described in [7], we can estimate the parasitemia P̂ for a given patient by
$$\hat{P} = \frac{n\,\frac{cV}{V} - \hat{F}}{\hat{S}} \tag{7}$$
where
n = number of alleged parasites found in V,
F̂ = expected FPR (e.g., µ(F)),
Ŝ = expected sensitivity (e.g., µ(S)),
cV = clinically relevant volume of substrate,
V = estimate of the volume examined.
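As a minimal sketch, the estimator can be coded directly from these definitions, assuming it takes the form P̂ = (n · cV/V − F̂) / Ŝ: scale the raw count to the clinically relevant volume, subtract the expected false positives, and correct for expected sensitivity. The function and argument names are hypothetical, and clipping negative estimates to zero is an added assumption, not something stated in the text.

```python
def estimate_parasitemia(n, examined_volume, clinical_volume,
                         expected_fpr, expected_sensitivity):
    """Estimate parasitemia per cV from a count of alleged parasites.

    n                     -- alleged parasites found in the examined volume V
    examined_volume       -- V, estimate of the volume examined
    clinical_volume       -- cV, in the same units as V
    expected_fpr          -- F-hat, expected FP count per cV (e.g., mu(F))
    expected_sensitivity  -- S-hat, expected sensitivity (e.g., mu(S))
    """
    if examined_volume <= 0 or expected_sensitivity <= 0:
        raise ValueError("volume and sensitivity must be positive")
    objects_per_cv = n * clinical_volume / examined_volume
    estimate = (objects_per_cv - expected_fpr) / expected_sensitivity
    return max(0.0, estimate)  # clipping at zero is an assumption


# Illustrative: 50 objects in 0.1 uL, cV = 1 uL, F-hat = 20 per uL, S-hat = 0.8.
print(estimate_parasitemia(50, 0.1, 1.0, 20.0, 0.8))  # about 600 per uL
```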
Three types of error affect Eqn 7: irreducible Poisson, estimates of examined volume, and counts of alleged parasites.

Irreducible Poisson error:
This is discussed below in 3.10.

Examined volume error:
Error in estimating V impacts quantitation accuracy via the cV/V term of Eqn 7. For example, thick film blood volume V is typically estimated by counting WBCs [17]. Any error in the WBC count causes proportional quantitation error. This error type can be compartmentalized, for performance evaluation purposes only, as follows:
• Manually count WBCs on a test set and use these counts to calculate V, giving oracle V estimates with zero error of this type.
• Separately report the patient-level error statistics of the WBC counter.

Parasite counting errors:
Errors in parasite count stem from patient-level variations in sensitivity and FPR, as follows:
• The number of alleged parasites per cV in the sample is (tp + fp) · cV/V = TP + FP.
• Let P be the true parasite count per cV. Then ŜP is the expected number of correctly labeled true parasites per cV, and the difference between TP and ŜP is due to deviation of the sample's sensitivity from the expected Ŝ.
• Similarly, the difference between FP and F̂ (the expected FPR) is due to the deviation of this sample's FPR from the expected value. σ(S) and σ(F) quantify these deviations over the population.
• A figure of merit to assess parasite counting error, combining a sensitivity term and an FPR term, is derived and discussed in [7].
While the FPR term is usually hardest to control, it also shrinks as 1/P, so for large P the sensitivity term dominates. This effect can be leveraged by using different operating points according to whether the initial estimated parasitemia is low or high, to favor FPR or sensitivity. In particular, different operating points are indicated for diagnosis (since the hard cases have low parasitemia, where FPR dominates) and for quantitation (high parasitemias, where sensitivity dominates).
We note that parasitemia estimates based on manual microscopy are also subject to these three error types.This complicates assessment of a model's quantitation accuracy against microscopy ground truth.

Effect of Poisson statistics
Poisson statistics for rare events give variation in the actual number of parasites in a particular sample with volume V, given a fixed true parasitemia P over the whole sample. The variation is most visible at low parasitemias, e.g., at 100 p/µL, where each RBC has a 1/50,000 chance of containing a parasite in thin film, or each WBC has a 1/80 chance of corresponding to a nearby parasite in thick film. This variability has two main impacts:
(i) For diagnosis, a low LoD requires that a large volume V be examined to ensure that at least a couple of true parasites are present at all. Otherwise, for a statistically predictable subset of positive patients the examined volume will contain 0 parasites, reducing patient-level sensitivity from the start. For malaria, with LoD of 100 p/µL, this means that at least ≈0.05 µL of blood should be examined, equivalent to 200 WBCs in thick film, or 250,000 RBCs in thin film (the difficulty of finding this many acceptable RBCs, and the long processing time required, are two reasons why thin films are not standard protocol for manual field diagnosis). The miLab platform [34] examines ≈200k RBCs on thin film (close to the green curve in Fig 5), and the Autoscope/EasyScan GO [6,11] examines >0.1 µL of thick film (the red curve in Fig 5).
(ii) For quantitation, a sufficiently high volume V (depending on P ) must be examined to control irreducible error.For more detail see S.I. of [7].
In both cases, automated systems hold a strong advantage because they can scan higher volumes than human technicians, who often by necessity work in a high Poisson error regime (S.I. of [7]).Manual microscopy protocols average multiple readers' estimates (when available) to reduce quantitation error [20,50].
When reporting results on datasets of small size, authors should understand how Poisson variability limits their estimates of algorithm performance. Low-parasitemia samples may present as negative (i.e., zero parasites) if V is too small.
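The effect of examined volume can be checked numerically. The sketch below assumes parasite counts in a volume V follow a Poisson distribution with mean P · V, as described above; the function names are hypothetical. At 100 p/µL, the chance that the examined volume contains no parasites at all is roughly 37% for 0.01 µL and below 1% for 0.05 µL.

```python
from math import exp, factorial, log


def prob_k_parasites(k, parasitemia, volume):
    """Poisson probability of exactly k parasites in the examined volume.

    parasitemia -- true parasites per uL
    volume      -- examined volume in uL
    """
    lam = parasitemia * volume  # expected number of parasites in the volume
    return exp(-lam) * lam ** k / factorial(k)


def min_volume(parasitemia, max_miss_prob):
    """Smallest volume (uL) for which P(zero parasites) <= max_miss_prob."""
    return -log(max_miss_prob) / parasitemia


for v in (0.01, 0.025, 0.05, 0.1):
    p0 = prob_k_parasites(0, 100, v)
    print(f"V = {v} uL: P(zero parasites) = {p0:.4f}")

# Volume needed so that at most 5% of 100 p/uL patients show zero parasites.
print(min_volume(100, 0.05))
```

Note that this bounds only the chance of finding zero parasites; requiring at least a couple of parasites, or controlling quantitation error, calls for larger volumes.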

Malaria species identification metrics
Identification of malaria species is one of the three tasks assessed by the WHO 56 evaluation system [21].Correct species ID matters clinically because (i) falciparum infections are much more likely to be fatal; and (ii) treatment plans differ by species [51], since (for example) the hypnozoites of vivax and ovale species require special care.
Because not all species ID errors are equal from a clinical perspective, reported results should preferably include a confusion matrix as in [7].
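A patient-level species confusion matrix is simple to tabulate. The sketch below is illustrative only: the species abbreviations, the function name, and the example pairs are assumptions, and rows are taken to be true species with columns as predicted species.

```python
from collections import Counter


def species_confusion_matrix(pairs, labels):
    """Build a confusion matrix from (true_species, predicted_species) pairs.

    Rows follow the order of `labels` for the true species,
    columns follow the same order for the predicted species.
    """
    counts = Counter(pairs)
    return [[counts[(t, p)] for p in labels] for t in labels]


labels = ["Pf", "Pv", "Po", "Pm"]  # falciparum, vivax, ovale, malariae
pairs = [
    ("Pf", "Pf"), ("Pf", "Pf"), ("Pf", "Pv"),   # one falciparum called vivax
    ("Pv", "Pv"), ("Po", "Pv"), ("Pm", "Pm"),   # one ovale called vivax
]
matrix = species_confusion_matrix(pairs, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Reporting the full matrix, rather than a single accuracy figure, shows which confusions occur, such as falciparum reported as non-falciparum.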
Aside: It is relatively straightforward to distinguish falciparum vs. non-falciparum on thick film alone [8,9,11,27], and even mixed-species infections that include falciparum can often be identified on thick film by comparing the ring stage and late stage parasite counts [10]. However, thin films are still typically needed to distinguish between the various non-falciparum species, unless the clinical use case allows geographical priors to be leveraged. A method to distinguish non-falciparum species on thick film would yield clinical benefit by eliminating the need for thin films, due to (i) the ease of thick-only workflows [52,53], and (ii) thin film problems with quality [33] and difficulty of species ID at low parasitemias [54].
Staging parasites (as ring, later trophozoite, schizont, or gametocyte) is not part of the WHO evaluation, and is not generally useful clinically, except as used during species identification or when quantitating asexual forms in non-falciparum species.In falciparum (the main target of quantitation, for drug resistance studies) the difference between ring and gametocyte is glaring.

DISCUSSION
Malaria and NTDs are amenable though difficult targets for ML methods, and successful development of translatable ML solutions would yield tremendous health care benefits for currently underserved populations.
Unfortunately, communal ML progress, in which researchers build on each other's work to reach a performance goal, is handicapped for malaria by lack of attention to clinical needs, and by widespread use of ill-suited evaluation metrics. As a result, the synergistic power of the ML community is not being applied with full force to this important task, as many papers present methods that cannot be usefully extended.
Individual ML research teams can radically improve the situation by grounding their ML work in an understanding of the use case, and by tailoring metrics to the clinical needs.We have described such metrics here: variation in FPR, per-patient sensitivity, LoD, patient-level sensitivity and specificity, and a figure of merit for quantitation.We have also listed some essential technical background reading from the WHO and others.
Peer reviewers play a special role in determining the success or failure of the communal ML effort: (i) Reviewers can assess algorithms and performance results according to whether they incorporate the requirements of the clinical use case. (ii) When authors present new metrics that are well-grounded in the use case, this can be more valuable than a comparison based on customary but inferior metrics. By recognizing when this is the case, reviewers can disrupt the cycle that perpetuates a counterproductive status quo.
With attention to the clinical use case and deliberate choice of metrics, the ML community can better equip itself to successfully address automated malaria and NTD diagnosis, and thus deliver concrete benefit to the populations suffering the dire effects of these illnesses.

Figure 2. For a 20:1 distractor-to-parasite ratio, stretching the left vertical sliver gives a more meaningful ROC curve. Diagonal red lines show operating points that give equal numbers of TPs and FPs.

Figure 3. For unchanged algorithm and FPR, ROCs are artificially improved by increasing the number of easy distractor objects. Left: object scores (positives are blue, negatives are red). Right: associated ROC curves.

Figure 5. Poisson distributions of parasite counts for various examined volumes V, assuming 100 p/µL. Low-parasitemia samples may present as negative (i.e., zero parasites) if V is too small.