Development and validation of a random forest algorithm for source attribution of animal and human Salmonella Typhimurium and monophasic variants of S. Typhimurium isolates in England and Wales utilising whole genome sequencing data

Source attribution has traditionally involved combining epidemiological data with different pathogen characterisation methods, including 7-gene multi locus sequence typing (MLST) or serotyping, however, these approaches have limited resolution. In contrast, whole genome sequencing data provide an overview of the whole genome that can be used by attribution algorithms. Here, we applied a random forest (RF) algorithm to predict the primary sources of human clinical Salmonella Typhimurium (S. Typhimurium) and monophasic variants (monophasic S. Typhimurium) isolates. To this end, we utilised single nucleotide polymorphism diversity in the core genome MLST alleles obtained from 1,061 laboratory-confirmed human and animal S. Typhimurium and monophasic S. Typhimurium isolates as inputs into a RF model. The algorithm was used for supervised learning to classify 399 animal S. Typhimurium and monophasic S. Typhimurium isolates into one of eight distinct primary source classes comprising common livestock and pet animal species: cattle, pigs, sheep, other mammals (pets: mostly dogs and horses), broilers, layers, turkeys, and game birds (pheasants, quail, and pigeons). When applied to the training set animal isolates, model accuracy was 0.929 and kappa 0.905, whereas for the test set animal isolates, for which the primary source class information was withheld from the model, the accuracy was 0.779 and kappa 0.700. Subsequently, the model was applied to assign 662 human clinical cases to the eight primary source classes. In the dataset, 60/399 (15.0%) of the animal and 141/662 (21.3%) of the human isolates were associated with a known outbreak of S. Typhimurium definitive type (DT) 104. All but two of the 141 DT104 outbreak linked human isolates were correctly attributed by the model to the primary source classes identified as the origin of the DT104 outbreak. A model that was run without the clonal DT104 animal isolates produced largely congruent outputs (training set accuracy 0.989 and kappa 0.985; test set accuracy 0.781 and kappa 0.663). Overall, our results show that RF offers considerable promise as a suitable methodology for epidemiological tracking and source attribution for foodborne pathogens.

Source attribution has traditionally involved combining epidemiological data with different pathogen characterisation methods, including 7-gene multi locus sequence typing (MLST) or serotyping, however, these approaches have limited resolution.In contrast, whole genome sequencing data provide an overview of the whole genome that can be used by attribution algorithms.Here, we applied a random forest (RF) algorithm to predict the primary sources of human clinical Salmonella Typhimurium (S.Typhimurium) and monophasic variants (monophasic S. Typhimurium) isolates.To this end, we utilised single nucleotide polymorphism diversity in the core genome MLST alleles obtained from 1,061 laboratory-confirmed human and animal S. Typhimurium and monophasic S. Typhimurium isolates as inputs into a RF model.The algorithm was used for supervised learning to classify 399 animal S. Typhimurium and monophasic S. Typhimurium isolates into one of eight distinct primary source classes comprising common livestock and pet animal species: cattle, pigs, sheep, other mammals (pets: mostly dogs and horses), broilers, layers, turkeys, and game birds (pheasants, quail, and pigeons).When applied to the training set animal isolates, model accuracy was 0.929 and kappa 0.905, whereas for the test set animal isolates, for which the primary source class information was withheld from the model, the accuracy was 0.779 and kappa 0.700.Subsequently, the model was applied to assign 662 human clinical cases to the eight primary source classes.In the dataset, 60/399 (15.0%) of the animal and 141/662 (21.3%) of the human isolates were associated with a known outbreak of S. Typhimurium definitive type (DT) 104.All but two of the 141 DT104 outbreak linked human isolates were correctly attributed by the model to the primary source classes identified as the origin of the DT104 outbreak.A model that was run without the clonal DT104 animal isolates produced largely congruent outputs (training set accuracy 0.989 and kappa 0.985; test set accuracy 0.781 and kappa 0.663).Overall, our results show that RF offers considerable promise as a suitable methodology for epidemiological tracking and source attribution for foodborne pathogens.

Introduction
Salmonellosis, one of the most common food-borne illnesses in both, the developed and developing countries (Majowicz et al., 2010;Scallan et al., 2011;Pires et al., 2021), is a disease that is associated with diarrhoea, fever and abdominal pains that occasionally can lead to death (Fabrega and Vila, 2013;Andino and Hanning, 2015).Salmonellosis was the second most reported zoonotic disease in the EU in 2020 [European Food Safety Authority (EFSA) and European Centre for Disease Prevention and Control (ECDC), 2021] and second most reported bacterial enteric disease in the US in 2022 [Centers for Disease Control and Prevention (CDC), 2023].The annual costs associated with salmonellosis in 2010 in the US were estimated to be in excess of 2.5 billion USD for 1.4 million cases (Scallan et al., 2011;Andino and Hanning, 2015).
Current classification divides the genus Salmonella into two species: enterica and bongori.Salmonella enterica is further divided into six well defined subspecies that comprise over 2,600 distinct serovars (Issenhuth-Jeanjean et al., 2014).Salmonella enterica subspecies enterica (I) is responsible for the majority of Salmonella infections in warm blooded animals (Porwollik et al., 2004), although S. Enterica subspecies diarizonae (IIIb) serovar 61:k:1,5,( 7) is host adapted and endemic in Sheep in multiple countries (Davies et al., 2001;Alvseike et al., 2004;Sörén et al., 2015;Methner and Moog, 2018) and S. Enterica subspecies arizonae (IIIa) can infect avian and mammalian host species (Katribe et al., 2009).The majority of human salmonellosis cases are caused by a minority of the described Salmonella serovars.For example, in the US in 2016 just 20 serovars were reported as a cause of >80% of human infections with over one-third of cases due to just three serovars: S. Typhimurium, S. Enteritidis, and S. Newport [Centers for Disease Control and Prevention (CDC), 2018].Similarly, in the United Kingdom (UK), S. Typhimurium and S. Enteritidis were responsible for approximately 50% of non-typhoidal Salmonella infections in England in 2019 (UKHSA, 2021).Worldwide, World Health Organization (WHO) data reported that S. Enteritidis and S. Typhimurium are the two serovars most frequently isolated in clinical practice (Fabrega and Vila, 2013).
The main cause of human non-typhoidal salmonellosis is the ingestion of contaminated food, or, especially in low to middle income countries, contaminated water (Fabrega and Vila, 2013).The source of such contamination is typically Salmonella in faeces of an infected primary animal host (or, more rarely, human host) contaminating the water supply or plant based foodstuffs, or food products obtained from an infected primary animal host, including meat (typically pork, beef, poultry, or mutton/lamb), eggs, or diary (Hald, 2013).Cross-contamination at the different stages of the food production chain (e.g., at an abattoir or a food processing plant) can also be a significant cause of contamination of foodstuffs and hence human salmonellosis infection (Andino and Hanning, 2015).Additionally, S. Typhimurium and monophasic S. Typhimurium have been shown to persist in farm environments for extended periods of time and have also been isolated from animal feed and feed ingredients (Andino and Hanning, 2015;Gosling et al., 2018;Harrison et al., 2022).Salmonella Typhimurium and monophasic variants of S. Typhimurium can infect a wide range of animal species, of which the most relevant primary sources in terms of the potential for human infection are various livestock animals and poultry, companion animals and pets (horses, dogs, and cats), and wild game mammals and birds.Whether the primary host displays any symptoms of infection is dependent on the host species and the Salmonella serovar.Primary host can often act as a reservoir of infection where the bacterium lives and multiplies in the large intestine and associated lymphoid tissue.Given the diverse range of potential primary animal hosts, and thus the numerous and complicated transmission pathways of these zoonotic pathogens (Hald, 2013), it can be difficult to determine the primary source of the S. Typhimurium and monophasic S. Typhimurium human infections for both sporadic cases and outbreaks.This information is critical for formulating efficient strategies for mitigating S. Typhimurium and monophasic S. Typhimurium infection spread in the human population.Hence, development of attribution methodologies to better understand pathogen transmission to humans is crucial.
Historically, source attribution efforts have relied on frequencymatching models [e.g., the Dutch and Hald ("Danish") models], which rely on the one-to-one matching of microbial subtypes, defined either by phenotyping (e.g., serotyping) or genotyping (e.g., 7-gene MLST), in humans and potential sources, or on probabilistic population genetics approaches that utilize genetic markers derived from genotypic subtyping methods.These methodologies have been reviewed in several recent publications (e.g., Pires et al., 2009Pires et al., , 2014;;Mughini-Gras et al., 2018, 2019).Pires et al. (2014) reviewed the utility of these approaches for attribution of human salmonellosis cases.The high-throughput sequencing of bacterial strains has been increasingly used for routine surveillance and outbreak investigations.Generated whole genome sequencing (WGS) data can additionally be of use to accurately discriminate between human infecting pathogens originating from different primary sources thus allowing for development and application of ever more sophisticated attribution models (Franz et al., 2016).
Machine learning (ML) models are computer algorithms that improve with experience and have been increasingly applied to analyse various large and complex genetic and genomic datasets (Libbrecht and Noble, 2015).Recently, there has been a proliferation of studies applying ML algorithms to WGS data of zoonotic bacterial isolates to answer questions related to attribution [e.g., primary host species of S. Typhimurium (Zhang et al., 2019, Munck et al., 2020a), food source of Listeria monocytogenes (Tanui et al., 2022), geographic origin of S. Enteritidis (Bayliss et al., 2023)], disease risk in humans (Njage et al., 2019a), or host disease severity (Karanth et al., 2022).RF models are widely used supervised classification ML algorithms that have been applied in a range of different research fields and are particularly useful for making predictions based on the WGS data 10.3389/fmicb.2023.1254860Frontiers in Microbiology 03 frontiersin.org(Ogutu et al., 2011).A RF algorithm generates multiple decision trees and subsequently aggregates the output produced by each individual decision tree to arrive at the consensus set of predictions.Importantly, the different decision trees are uncorrelated as each tree is exposed to a random subset of the data (variables and model features), which minimizes bias and error.Using this approach, here we describe an application of a supervised classification RF algorithm on a WGS derived set of core genome MLST (cgMLST) genetic markers to assign the primary sources to 662 S. Typhimurium and monophasic S. Typhimurium sporadic and outbreak human clinical cases detected between 2012 and 2018 in England and Wales.
2 Materials and methods

Strains and sequencing
Prior to sequence quality control (QC), the animal isolate dataset comprised the WGS data of 463 S. Typhimurium and monophasic S. Typhimurium sequence type (ST)19, ST34, ST128, ST313, ST323, ST568, and ST2105 isolates.All STs belonged to eBurst group (eBG) 1, with the exception of ST2105 that belonged to eBG167.The analysed animal isolates were collected by the Animal and Plant Health Agency (APHA) between 2012 and 2020 (majority of these isolates were from 2013-2018) as part of routine surveillance of livestock farms across England and Wales, monitoring, control programs, outbreak investigations, and for research projects.The isolates originated from eight primary source classes (animal species or groups of animal species).Grouping of primary hosts into the distinct primary sources was performed as described in Munck et al. (2020b) for the UK animal isolates: Cattle, Pigs, Sheep, OtherMammals (companion animals that were mostly dogs and horses), Broilers, Layers (egg laying hens), Turkey, Game (game birds: pheasant, quail, pigeon).
The pre-QC human isolate dataset comprised the WGS data of 852 S. Typhimurium and monophasic S. Typhimurium ST19, ST34, ST213, ST313, ST323, and ST3235 (all eBG1) isolates collected from salmonellosis patients in England and Wales between 2012 and 2018.Only a few human isolates in this dataset were from prior to 2014.The WGS data and the metadata of the human isolates were provided by the United Kingdom Health Security Agency (UKHSA).
Animal isolates were sequenced at APHA Weybridge using either the MiSeq or NextSeq benchtop Illumina sequencers.Paired-end libraries were prepared with the Illumina Nextera XT DNA Library Preparation Kit from DNA extracted with the MagMAX CORE Nucleic Acid Purification Kit (ThermoFisher Scientific, Applied Biosystems, Foster City, CA) following the manufacturer's instructions.Human isolates were sequenced at UKHSA as previously described (Chattaway et al., 2019).The fastq files of the 852 human isolates were downloaded from the NCBI GenBank Sequence Read Archive (BioProject PRJNA248792) using fasterq-dump of SRA Toolkit v2.9.6 1 .Both the animal and human isolate datasets included samples that were linked with the 2015-2018 S. Typhimurium DT104 outbreak in England and Wales [Animal and Plant Health Agency (APHA), 2017].

Quality control of the WGS data
The whole genome sequences of the 1,315 animal and human S. Typhimurium and monophasic S. Typhimurium isolates were subjected to rigorous filtering prior to usage in the downstream analyses.BBDuk software (Bushnell, 2014) was used for removing adapter sequence and terminal bases with PHRED scores below 20 from each of the reads.Trimmed reads below 50 bases were filtered out.If just one of a pair of reads was under 50 bases, the other read in the pair was also removed.FastQC (Andrews, 2010) was run on the WGS data before and after read trimming to assess improvements in sequence quality.De novo genome assemblies were generated from the trimmed fastq reads using shovill v0.9.0 2 and analysed with QualiMap 2 (Okonechnikov et al., 2016) and Quast v5.0.2 (Gurevich et al., 2013) to obtain the mean coverage across the genome and evaluate the quality metrics (based on contigs of size 500 bases or larger).Only isolates with mean depth of sequence data post read filtering of at least 30X, genome assembly size between 4,750,000 and 5,250,000 bases, N50 >30,000, and the number of assembled contigs <500, were retained for cgMLST allele calling.The final dataset comprised 1,244 isolates, of which 435 were animal and 809 human isolates.

Scoring of cgMLST alleles
MentaLiST (Feijao et al., 2018) was used to call the cgMLST alleles against the 3,002 locus cgMLST EnteroBase scheme (version from September 2019) (Alikhan et al., 2018;Zhou et al., 2020) from the trimmed R1 and R2 fastq files of the 1,244 retained isolates.Default MentaLiST parameters were used but the minimum kmer depth required to call an allele was set to five.Novel alleles detected after the first MentaLiST run were introduced into the cgMLST scheme following the steps outlined in the MentaLiST manual.The −t parameter was set to one, and the -m parameter was set to 10. Second MentaLiST run with the updated cgMLST scheme produced several novel alleles that were generated from the novel alleles introduced after the first MentaLiST run at three different cgMLST loci in 10 different isolates.Novel alleles identified after the second MentaLiST run were treated as missing data.
Where there was an indication of multiple possible alleles (i.e., more than one allele with 100% kmer coverage), the allele calls with the highest number of votes were accepted and included in the downstream analyses.The cgMLST alleles with kmer coverage below 100% of the minimum kmer depth required to call an allele were treated as missing data.If an isolate had missing data at greater than 5% of the 3,002 cgMLST loci, it was not included in the subsequent analyses.Using these criteria, a further 183 isolates were removed from the dataset.The 1,061 retained isolates comprised 399 (ST19, ST34, ST128, ST313, ST323, ST568-all eBG1) animal and 662 (ST19, ST34, ST213, ST313, ST323-all eBG1) human S. Typhimurium and monophasic S. Typhimurium isolates (Supplementary Table S1).Missing data (1.1% of all allele calls) within the retained dataset was imputed utilising an iterative imputation method based on a random forest implemented in the missForest R package (Stekhoven and Bühlmann, 2012).Imputation was performed on all 1,061 isolates with the default parameters.

Phylogenetic tree construction and hierarchical clustering of 399 animal isolates
Phylogenetic analyses were carried out to investigate clustering according to primary source of the 399 animal isolates originating from eight primary source classes: 77 isolates from Cattle, 165 from Pigs, 47 from Sheep, 56 from OtherMammals (including 1 from ferret, 4 from cats, 20 from dogs, 31 from horses), 19 from Broilers, 7 from Layers, 11 from Turkey, and 17 from Game (including 4 from pigeon, 5 from pheasant, 8 from quail) (Supplementary Table S1).
According to the metadata recorded in the APHA LIMS database, the host type of the 399 animal isolates was an animal species as specified above.However, inspecting the farm sampling sheets indicated that 285 isolates were sampled from an actual animal (including animal post-mortem samples but also faeces sampled from a pen floor or poultry house boot swab samples) and 27 isolates were sampled from the farm environment (including samples from mud puddle, farm equipment, dust).For 87 isolates there was no data on whether these isolates were sampled from animal hosts or farm environment although many of the postcodes where these samples were obtained from indicated these isolates were sampled from a livestock farm (Supplementary Table S1).Six of the unspecified source isolates were collected from farms from which one or more of the actual animal source isolates were also collected from.For the 46 unspecified source OtherMammals isolates the sampling locality situation was somewhat different as many of the sampling postcodes that these isolates were obtained from indicated individual pet owner or veterinary surgery addresses but given this class of hosts (i.e., pet animals such as dogs or horses) it is highly plausible that all those isolates were sampled from individual pets.
A multiple sequence alignment (MSA) was computed with snippy v 4.4.5 3 from the trimmed WGS data of 399 animal isolates and the outgroup strain S. Typhimurium eBG138, ST36, SRR8820637 against the reference strain S. Typhimurium eBG1, ST19, LT2 AE006468.Recombination events were removed using Gubbins (Croucher et al., 2015), and polymorphic sites were extracted from the filtered MSA with SNP-sites (Page et al., 2016).RAxML-NG v0.9.0 (Kozlov et al., 2019) was used for phylogeny construction based on the resulting core single nucleotide polymorphism (SNP) alignment, which comprised 5,683 sites.RAxML-NG was run with the generalized time-reversible (GTR) nucleotide substitution model plus gamma correction, searching 100 trees (50 random and 50 parsimony-based starting trees) to find a tree with the best scoring topology.Branch support was computed via 2,500 bootstrap replicates (Felsenstein's bootstrap proportions).The Newick file of the best scoring maximum likelihood tree with the bootstrap support values was imported into iTol (Letunic and Bork, 2019) for tree display and annotation.The tree was rooted at the SRR8820637 outgroup strain.
3 https://github.com/tseemann/snippyA Bayesian clustering algorithm (BAPS) that inferred the population genetic structure of the 399 animal isolates was implemented through the rhierbaps R package (Cheng et al., 2013;Tonkin-Hill et al., 2018).The program was run on the SNP alignment after removal of the reference and the outgroup strains.The resulting alignment comprised 5,336 polymorphic sites.Clustering was performed with three hierarchical levels and 40 initial clusters.The n.extra.roundsparameter was set to 100,000,000 to ensure convergence of the algorithm.

SNP address
SNP address strain level nomenclature (Dallman et al., 2018) was employed to assign a SNP address to the each of the 1,061 retained isolates using SnapperDB.The SNP address 60.11.15.16.458.459.x was used to define the 201 DT104 outbreak related isolates (eBG1, ST19) at the 5 SNP threshold.
A genetic relationship amongst the 60 animal (26 from Cattle, 1 from Pigs, 20 from Sheep, and 13 from OtherMammals) and 141 human DT104 outbreak related isolates was explored by computing a phylogenetic tree following the steps outlined above.The core SNP alignment for this dataset comprised 868 variable sites.

Feature selection (data pre-processing and feature selection algorithms)
Feature selection was performed on the 3,002 cgMLST loci prior to their utilization as the predictors or model features to minimize the redundancy, and hence model running time, and to avoid overfitting.Feature elimination was performed on the 399 animal isolate dataset in several steps, by progressively reducing the number of cgMLST loci to retain only those that exhibited allele calls that were the most useful for distinguishing between isolates from the different primary source classes.First, the caret R package (Kuhn, 2008) was used to eliminate 835 zero variance loci (loci monomorphic within the 399 isolates) and 333 near zero variance loci (loci with two alleles only, one allele appearing in 398 out of 399 isolates).Next, the remaining 1,834 loci were checked for correlation with the findCorrelation function of caret.Nine hundred and thirty two loci with an absolute correlation value of minimum 0.9 were eliminated and 902 loci with an absolute correlation below 0.9 were retained.
The 902 remaining cgMLST loci were subjected to two different feature selection algorithms: the rfe function of caret and the Boruta function of the Boruta R package (Kursa and Rudnicki, 2010), with the set of loci retained as model inputs based on the combined outputs of both algorithms.Prior to running the feature selection algorithms, the 399 animal isolate dataset was split in a randomised manner into the model training and test sets, with 80% of the isolates used as the training set and 20% as the test set.The 80:20 split ratio was also maintained for each of the eight primary source classes.Hence, the training set comprised a total of 322 animal isolates: 62 Cattle, 132 Pigs, 38 Sheep, 45 OtherMammals, 16 Broilers, 6 Layers, 9 Turkey, and 14 Game isolates; whereas the test set comprised a total of 77 animal isolates of which 15 were Cattle, 33 Pigs, 9 Sheep, 11 OtherMammals, 3 Broilers, 1 Layers, 2 Turkey, and 3 Game isolates (Table 1 Phylogenetic trees were constructed for the 322 training and 77 test set animal isolates utilising only the variable sites from the 163 cgMLST loci retained as ML model input.The chimeric reference genome for both trees was constructed by concatenating the sequences of allele "1, " as per the September 2019 version of the Salmonella cgMLST EnteroBase scheme, for each of the 163 retained cgMLST loci.Computation of the MSA, the core SNP alignment, and the phylogenetic trees was performed as described above.The training set core SNP alignment had 678 sites, and the test set core SNP alignment had 449 sites.For both trees, branch support was computed via 10,000 bootstrap replicates.

Random forest models applied to the full dataset
Three RF models differentiated by the model tuning procedure: RF1 and RF2 (both ran in randomForest R package; Liaw and Wiener, 2002) and RF3 (ran in ranger R package; Wright and Ziegler, 2017) (Table 2) were applied to predict the primary source classes of the training and test set animal isolates (Table 1).Subsequently, the outputs of the three models were analysed and compared.Selection between the RF1, RF2, and RF3 models was performed by comparing the accuracy (percentage of correctly classified isolates) and kappa (a measure similar to accuracy that also takes into account the possibility of the correct classification occurring by chance) of the training and test set predictions obtained for each tuned model (Table 2), and also by investigating the incorrectly assigned isolates (such as which primary source class an isolate was incorrectly assigned to).The selected model (see Results) was then applied to predict the primary source for each of the 662 human isolates (Table 1).The source with the highest probability of assignment was considered the model predicted source for each human isolate.For each primary source class, the sum of probabilities of assignment indicated the number of human isolates that were assigned to a source (Munck et al., 2020a).One hundred and forty one human isolates related to the DT104 outbreak were used to validate model performance by contrasting the model predicted primary sources against the epidemiologically linked primary sources.The caret R package was used for all modelling work (details in the Supplementary material).

Random forest model without the clonal DT104 animal isolates
To assess the influence of the clonal DT104 animal isolates on model performance, a RF1-no DT104 model was run without the 60 DT104 outbreak related animal isolates (Table 1).The RF1-no DT104 model utilized the same random forest algorithm and resampling methodology as the RF1 model (Table 2).The RF1-no DT104 model used 421 cgMLST loci as model features that were retained after applying the rfe and Boruta feature selection algorithms as described above.After model training and tuning, the RF1-no DT104 model (Table 2) was subsequently applied to predict the primary sources of the 662 human isolates.

Phylogenetic relationship of the 399 animal isolates
Phylogenetic analysis of the 399 animal isolates revealed varying degrees of genetic relatedness (Figure 1).As expected, the 60 DT104  1).All 60 isolates were assigned to BAPS cluster 1 together with another 80 isolates at the first hierarchical level of the BAPS clustering algorithm.The 140 BAPS cluster 1 isolates were genetically identical at the 100 SNP threshold and they occupied neighbouring clades on the phylogenetic tree (Figure 1).At the second hierarchical level of BAPS, the 60 DT104 outbreak related isolates were all assigned to a single cluster not shared with other isolates.The 322 training and 77 test set isolates (Table 1) were evenly distributed throughout the phylogenetic tree (Figure 1).There was reasonably well defined clustering by host type amongst the 399 animal isolates (Figure 1).The Pigs isolates were largely confined to two phylogenetic tree clades that overlapped with BAPS clusters 4 and 3 (Figure 1).BAPS cluster 4 comprised almost exclusively monophasic S. Typhimurium isolates.BAPS cluster 6 comprised Sheep isolates, BAPS cluster 5 largely comprised Game isolates, and BAPS cluster 1 isolates were mostly from the Cattle, OtherMammals, and Sheep primary sources (Figure 1).
The structures of both the animal isolate training (Figure 2A) and test (Figure 2B) set phylogenetic trees based on the variable sites from the 163 cgMLST loci used as model features were highly concordant with the tree constructed from the core genome variable sites (Figure 1).Thus, selection of model features was not biased to specific primary sources.

Classification of the human isolates
The hyperparameter tuned RF1, RF2, and RF3 models were highly congruent in their ability to correctly predict the primary source classes of the training and the test set animal isolates as evidenced by the highly similar training and test set accuracy and kappa values produced by these models (Table 2).Details of the assignments of the training and the test set isolates to the different sources by the tuned RF1, RF2, RF3 models that led to the selection of the tuned RF1 model (with mtry = 109; Tables 3, 4) for prediction of the primary sources of 662 human isolates are provided in the Supplementary material and the Supplementary Tables S2-S5.
Applying the RF1 model to the 322 animal training set isolates, 4.7% isolates sampled from animal sources, 8.7% sampled from farm environment, and 14.9% sampled from sources of unspecified origin were incorrectly assigned to their actual primary source class.Of the 77 animal test set isolates, 17.0% sampled from animal sources, and 40.0%sampled from sources of unspecified origin were incorrectly assigned, whereas 100.0% of the farm environment isolates were correctly assigned.Therefore, assignment of isolates obtained from sources of unspecified origin had the lowest accuracy.For the entire animal isolate dataset, there were 18 incorrectly assigned isolates sampled from sources of unspecified origin, the majority of which (n = 13) were OtherMammals isolates incorrectly assigned to the Cattle or Pigs primary source classes.Five of these 13 isolates, all incorrectly assigned to the Cattle, were DT104 outbreak related (SNP address: 60.11.15.16.458.459.x) and hence it is possible that this was the reason why the RF1 model misassigned those five isolates and not because they were sampled from sources of unspecified origin.Interestingly, of the other eight of the 13 incorrectly assigned OtherMammals isolates, one was assigned to Cattle and seven assigned to Pigs.This result indicated that the companion animals could possibly be acting as secondary hosts that were infected by the farm animals.Furthermore, on 13 different occasions when the farm environment isolates and isolates sampled from sources of unspecified origin were collected from the same premises as the animal source isolates, the unknown sampling type isolates shared the same SNP address up to the 5 SNP threshold with isolates of known provenance.Thus, for example, a Pigs farm environment isolate sampled from rat faeces shared a SNP address with three isolates sampled from the same premise from Pigs (and all were assigned to the Pigs primary source class).In the entire animal isolate dataset, only two isolates that were sourced from the farm environment were assigned by the RF1 model to an incorrect primary source.Overall, the results of the RF1 model were acceptable even if the provenance of all isolates included in the animal training and test sets was not 100% clear.
Overall, the RF1 generated assignments of human isolates to animal primary sources correlated well with the known epidemiological data.In this dataset, 141 (21.3%) of the 662 human isolates were related to the S. Typhimurium DT104 outbreak that was primarily linked to the consumption of beef and mutton that became contaminated due to poor farm and abattoir practices (Figure 4).Of these, 136 (96.5%) isolates were assigned to Cattle, 3 (2.1%) to Sheep, and 2 (1.4%) to OtherMammals (Figure 4 and Supplementary Table S6).Therefore, out of a total of 314 human isolates that were assigned to Cattle, 136 (43.3%) were the DT104 outbreak related isolates.There was a marked contrast in how confidently the tuned RF1 model assigned these two groups of human isolates (the DT104 outbreak related isolates and the    The values along the diagonal (in bold) indicate the number of isolates correctly assigned by the model to their actual primary source class.The values above and below the diagonal indicate the number of isolates incorrectly classed by the model not to their actual primary source class (column headers) but to the model predicted source (row names).The isolates were assigned to the primary source class with the highest model computed probability of assignment.Balanced accuracy is the average of the sensitivity (true positive rate) and specificity (true negative rate) values for each primary source class.The values along the diagonal (in bold) indicate the number of isolates correctly assigned by the model to their actual primary source class.The values above and below the diagonal indicate the number of isolates incorrectly classed by the model not to their actual primary source class (column headers) but to the model predicted source (row names).The isolates were assigned to the primary source class with the highest model computed probability of assignment.Balanced accuracy is the average of the sensitivity (true positive rate) and specificity (true negative rate) values for each primary source class. 10.3389/fmicb.2023.1254860 Frontiers in Microbiology 10 frontiersin.organd of the 55 isolates assigned to Sheep, 47 (85.5%) isolates were of DT104 (including isolates with different SNP address to the DT104 outbreak related isolates) (Supplementary Table S6).Taken together, the above results strongly indicate that the human isolates for which there was epidemiological primary source data were assigned with high confidence to the expected primary source classes.Therefore, these outputs gave credence to the overall performance of the RF1 model.
Furthermore, 214 human isolates that were not part of the DT104 outbreak were previously analysed as part of the Horizon2020 COMPARE project (Munck et al., 2020b) where the Pigs primary source class was found to be the largest contributor to human infection (Arnold et al., 2021).The outputs of RF1 correlated with this finding as of the 163 human isolates assigned to Pigs,126 (77.3%) were the COMPARE project isolates (Supplementary Table S6).(A) The tuned RF1 model based assignment of 662 human isolates to eight primary source classes.The isolates are ordered by their probability of assignment to the Cattle (314 assigned isolates), followed by the Pigs (163 assigned isolates), OtherMammals (85 assigned isolates), Sheep (55 assigned isolates), Broilers (33 assigned isolates), Game (eight assigned isolates), and Layers (four assigned isolates) primary sources.(B) The tuned RF1-no DT104 model based assignment of 662 human isolates to eight primary source classes.The isolates are ordered by their probability of assignment to the Cattle (355 assigned isolates), followed by the Pigs (149 assigned isolates), OtherMammals (72 assigned isolates), Sheep (51 assigned isolates), Broilers (26 assigned isolates), Game (seven assigned isolates), and Layers (two assigned isolates) primary sources.No human isolates were assigned to the Turkey primary source class by either of the two models.Each vertical bar represents a single human isolate.The colour composition of each bar reflects the probability of assignment of an isolate to each of the eight primary sources.The more uniform the colour the higher the probability of assignment of an isolate to a single primary source class.

Feature importance for RF1
The top 15 RF1 model features (cgMLST loci), ranked in accordance with mean decrease in accuracy, are presented in Table 5. cgMLST locus STMMW_21601 was the feature that ranked the highest overall.STMMW_21601 was the most important feature for correct classification of the Cattle, Pigs, Sheep, and Broilers primary sources.For OtherMammals it was the fifth most important feature, for Layers the fourth most important feature, and for Turkey and Game the second most important feature.Locus STMMW_21601 represents gene yegO, a multidrug transporter subunit MdtC (Table 5).

RF1-no DT104 model performance
The tuned RF1-no DT104 model, which was run without 60 DT104 outbreak related animal isolates, produced training set accuracy of 0.989 (95% CI: 0.969-0.998)and kappa of 0.985 (Table 2).RF1-no DT104 incorrectly assigned 3/275 training set isolates (Table 6).The test set accuracy of the tuned RF1-no DT104 model was 0.781 (95% CI: 0.660-0.875)and kappa 0.663, values that were comparable to the test set accuracy and kappa produced by the tuned RF1 model (Table 2).In total, 14 of the RF1-no DT104 test set isolates were assigned by that model to an incorrect primary source class (Table 7).Four of these isolates were also part of the test set for the RF1 model and all four were assigned to incorrect primary sources by that model.
Of the 275 animal training set isolates, 1.0% sampled from animal sources and 1.9% sampled from sources of unspecified origin were incorrectly assigned by the RF1-no DT104 model.For the 64 test set animal isolates, 18.2% obtained from animal sources and 41.2% of isolates obtained from sources of unspecified origin were attributed to an incorrect primary source class by the RF1-no DT104 model.All training and test set isolates sampled from the farm environment were assigned correctly.
There were broad similarities in how the tuned RF1 (Figure 3A) and RF1-no DT104 (Figure 3B) models classed the 662 human isolates (Supplementary Table S6).RF1-no DT104 assigned 355 (53.6%) human isolates to Cattle, 149 (22.5%) to Pigs, 72 (10.9%) to OtherMammals, 51 (7.7%) to Sheep, 26 (3.9%) to Broilers, 7 (1.1%) to Game, and 2 (0.3%) to Layers (Figure 3B and Supplementary Table S6).Attributing human isolates to primary source classes based on the sum Phylogenetic tree of the 60 animal and 141 human RF1, RF2, and RF3 model S. Typhimurium isolates that were linked to the DT104 outbreak via the 60.11.15.16.458.459.xSNP address.Innermost annotation ring (Human isolate) specifies whether an isolate was collected from a human salmonellosis patient, and second annotation ring (Host) specifies the primary source class of each of the animal isolates and the RF1 assigned primary source of each of the human isolates.Bootstrap branch support values between 80% and 100% are shown on the tree.The tree is rooted at the outgroup strain SRR8820637.S6).All but one of the 141 DT104 outbreak related (SNP address: 60.11.15.16.458.459.x)human isolates were assigned by the RF1-no DT104 model to the Cattle primary source class; the one isolate was assigned to OtherMammals (according to the RF1 model output, that isolate was assigned to Cattle) (Supplementary Table S6).RF1-no DT104 model assigned 125/662 (18.9%) human isolates to a primary source with a low confidence probability of assignment of below 0.500.This included 64/125 (51.2%) isolates assigned to Cattle, 40/125 (32.0%) to OtherMammals, and 15/125 (12.0%) to Pigs (Figure 3B and Supplementary Table S6).
Twenty one of the 125 RF1-no DT104 model low confidence probability of assignment human isolates were related to the DT104 outbreak (Supplementary Table S6).Of the 355 human isolates that the tuned RF1-no DT104 model assigned to Cattle, 354 (99.7%) were of DT104 (including isolates with different SNP address to the DT104 outbreak related isolates), and of the 51 isolates assigned to Sheep, 44 (86.3%) isolates were of DT104 (including isolates with different SNP address to the DT104 outbreak related isolates) (Supplementary Table S6).Of the 149 human isolates assigned to the Pigs primary source by the RF1-no DT104 model, 146 (97.9%) were the COMPARE project isolates that were analysed as part of a previous study which found Pigs to be the main contributor to human infection (Arnold et al., 2021) (Supplementary Table S6).Therefore, these classification patterns were highly congruent with how these isolates    The values along the diagonal (in bold) indicate the number of isolates correctly assigned by the model to their actual primary source class.The values above and below the diagonal indicate the number of isolates incorrectly classed by the model not to their actual primary source class (column headers) but to the model predicted source (row names).The isolates were assigned to the primary source class with the highest model computed probability of assignment.Balanced accuracy is the average of the sensitivity (true positive rate) and specificity (true negative rate) values for each primary source class. 10.3389/fmicb.2023.1254860 Frontiers in Microbiology 14 frontiersin.org were classed by the RF1 model and overlapped closely with the expectations based on the epidemiological data or outputs of other studies.cgMLST locus STMMW_21601 was also the overall most important feature for the RF1-no DT104 model (Table 8).

Discussion
In this study, utilising 163 cgMLST loci as model features, three distinct RF models were trained, tuned, and evaluated on 399 S. Typhimurium and monophasic S. Typhimurium animal isolates.Subsequently, the best performing model was applied to predict primary source of 662 S. Typhimurium and monophasic S. Typhimurium human clinical cases.Supervised classification algorithms, including RF, exhibit properties highly suited for attribution of foodborne pathogens.Such models first learn to associate patterns in the provided genomic data of, for instance, S. Typhimurium isolates originating from different primary sources (i.e., animal host species) with a specific source.Subsequently, when applied to predict the sources of human clinical cases, the algorithm will seek out the genomic data patterns it had previously learnt to recognize and use that information to assign a primary source to each of the analysed human isolates (Munck et al., 2020a).The more data a ML algorithm has been exposed to, the more accurate it should become in making such predictions, as it has the ability to learn from the patterns in the data and hence to improve its decision making capabilities (Libbrecht and Noble, 2015).
The majority of the training set isolates (n = 322) used in the three RF models were from the following primary sources: Pigs (41.0%),Cattle (19.3%),OtherMammals (14.0%), and Sheep (11.8%).Highly unbalanced training data has previously been noted to result in a potential bias in favor of the majority class in the ML model generated predictions (Velez et al., 2007;Njage et al., 2019b).The best performing model, RF1, assigned 47.4% of the 662 S. Typhimurium and monophasic S. Typhimurium human isolates to Cattle, 24.6% to Pigs, 12.8% to OtherMammals, and 8.3% to Sheep.Thus, it cannot be concluded that the RF1 model assigned majority of the human isolates to sources which were overrepresented in the training set.While best practice is to use a balanced training dataset when implementing a RF model, this is an idealized scenario, and such datasets can be difficult to obtain.Only a proportion of the infected animals are detected by surveillance, and many primary hosts infected with S. Typhimurium or monophasic S. Typhimurium are asymptomatic and act as a reservoir of infection (Arnold et al., 2021).Furthermore, if the collection of isolates was biased towards the livestock and farm animals and there is lack of isolate collection from other potential S. Typhimurium and monophasic S. Typhimurium reservoirs, such as wild birds or animals (Skov et al., 2008), this will have a strong impact on the ability of attribution models to inform if the isolates from the rarer sources infected the human population.
In the analysed animal isolate dataset, 163 of the 399 isolates were S. Typhimurium (and in two cases monophasic S. Typhimurium) of DT104, of which 60 were clonal isolates related to the known 2015-2018 DT104 outbreak in England and Wales [Animal and Plant Health Agency (APHA), 2017].Even though S. Typhimurium of DT104 can reside in numerous host species, it is considered primarily a cattle pathogen (Poppe et al., 1998).Indeed, of the 165 isolates of DT104, 145 (87.9%) were from four mammalian primary source classes: Cattle (n = 59), OtherMammals (n = 34), Pigs (n = 28), and Sheep (n = 20).The intentional inclusion of the clonal, DT104 outbreak related isolates in the RF1 model facilitated the validation of model performance by using confirmed primary sources of human infections as model inputs.Although presence of the clonal isolates may have influenced the RF1 model outputs, the proportion of the DT104 outbreak related isolates did not exceed 50% for any of the eight primary sources.Previous studies emphasised minimizing the proportion of clonal genomes in the model training set in order to avoid artificially inflating source prediction accuracy (Zhang et al., 2019).We tested the potential model confusion by running the RF1 model without the clonal DT104 isolates (the RF1-no DT104 model) and compared the model outputs.The RF1-no DT104 model performed slightly better than RF1 at the training stage (accuracy: 0.989 vs. 0.929, kappa: 0.985 vs. 0.905) and at the test stage (accuracy: 0.781 vs. 0.779, but not kappa: 0.663 vs. 0.700), however, these metrics were highly similar thus indicating the robustness of the RF1 model to the presence of clonal isolates in the training set.The value of the kappa statistic in the range of "0.61-0.80" is indicative of "substantial" model performance (Landis and Koch, 1977) and in the range of "0.40-0.75" of "fair to good" model performance (Fleiss et al., 2003).Therefore, both models performed adequately and comparably to ML attribution models described in several recently published studies (Zhang et al., 2019;Munck et al., 2020a;Tanui et al., 2022).
A closer inspection of how the training and test isolates were attributed by the RF1 and RF1-no DT104 models revealed that of the 23 RF1 model primary source misassignments at the training stage, 15 were the DT104 outbreak related isolates.The number of training set misassignments was lower for the RF1-no DT104 model with only three incorrectly assigned isolates.There were 12 DT104 outbreak related isolates in the RF1 model test set, however, of the 17 primary source class misassignments only four were the DT104 outbreak related isolates (all incorrectly assigned to Cattle).Unlike for the training set, the majority (nine) misassignments were from another primary source class (Cattle, OtherMammals, Broilers, or Layers) to Pigs.The overrepresentation of the Pigs primary source class isolates in the training set might have been the reason for these misassignments.However, all nine isolates, of which seven were monophasic S. Typhimurium, clustered in clades that largely comprised Pigs isolates on the 163 cgMLST locus based test set phylogenetic tree.Therefore, it was more likely that the misassignment of these isolates was due to their genetic closeness to isolates representative of the Pigs source.Similarly, of the 14 test set isolates that were misassigned by the RF1-no DT104 model, 10 (originating from Broilers, Cattle, OtherMammals, Sheep, and Turkey) were incorrectly classed as Pigs.These 10 isolates clustered with Pigs isolates on the core genome alignment based phylogenetic tree, and eight were monophasic S. Typhimurium.One possible explanation for the observed test set misassignments could be that the isolates from primary sources other than Pigs were classed as Pigs by both models as these isolates did in fact originate from the Pigs primary source class that cross-infected a different primary host.
The fact that both, the RF1 and RF-no DT104 models, assigned 99.0% of the 141 DT104 outbreak related human isolates to the presumed primary sources based on the known epidemiological data supported the applicability of RF to attribution of human Salmonella  The feature importance for each of the eight primary sources the RF1-no DT104 model was trained to recognize (Cattle, OtherMammals, Pigs, Sheep, Broilers, Layers, Turkey, Game) is also specified.The EnteroBase derived gene name and gene function description is specified for each of the 15 features.Note that the top ranking feature for a specific primary source may not be shown in the table as the top 15 cgMLST loci are ordered by MeanDecreaseAccuracy. 10.3389/fmicb.2023.1254860 Frontiers in Microbiology 17 frontiersin.orgaccessory genome of the analysed isolates.With the ever-accelerating sequencing of high quality genomic data of bacterial pathogens, both those objectives ought to be very much achievable.

FIGURE 1
FIGURE 1Phylogenetic tree of the 399 animal training and test set RF1, RF2, and RF3 model isolates.Innermost annotation ring (Serovar) specifies whether an isolate was S. Typhimurium or a monophasic variant of S. Typhimurium, second annotation ring (DT104 outbreak link) specifies whether an isolate exhibited the DT104 outbreak specific SNP address of60.11.15.16.458.459.x,third annotation ring (ML model dataset) specifies whether an isolate belonged to the training or the test machine learning model dataset, fourth annotation ring (Host) specifies the primary source class of each of the isolates, and fifth annotation ring (BAPS level 1) specifies the BAPS cluster each isolate was assigned to at the first hierarchical level.Bootstrap branch support values between 80% and 100% are shown on the tree.The tree is rooted at the outgroup strain SRR8820637.

FIGURE 2 (
FIGURE 2 (A) Phylogenetic tree of the 322 animal training set RF1, RF2, and RF3 model isolates constructed from variable sites from the 163 cgMLST loci retained as machine learning model inputs.Innermost annotation ring (Serovar) specifies whether an isolate was S. Typhimurium or a monophasic variant of S. Typhimurium, and second annotation ring (Host) specifies the primary source class of each of the isolates.Bootstrap branch support values between (Continued)

FIGURE 4
FIGURE 4 ).Details of running the rfe (backwards feature selection) and Boruta (top-down feature selection) algorithms are provided in the Supplementary material.The final set of retained features comprised 130 rfe selected cgMLST loci plus additional 33 Boruta selected cgMLST loci for a total set of 163 cgMLST loci used as model features.

TABLE 1
Description of the machine learning model datasets.SNP address 60.11.15.16.458.459.x)clustered in a single clade of the phylogenetic tree (Figure

TABLE 2
Comparison of the parameters (random forest algorithm, resampling methodology, model hyperparameters, number of hyperparameter combinations tested) and the outputs (optimal model accuracy and hyperparameters, training set accuracy, training set kappa, test set accuracy, test set kappa) for three models RF1, RF2, RF3 performed on the full animal isolate dataset and the RF1-no DT104 model performed on a dataset without the clonal DT104 animal isolates.
80% and 100% are shown on the tree.The tree is rooted at the outgroup strain SRR8820637.(B) Phylogenetic tree of the 77 animal test set RF1, RF2, and RF3 model isolates constructed from variable sites from the 163 cgMLST loci retained as machine learning model inputs.Innermost annotation ring (Serovar) specifies whether an isolate was S. Typhimurium or a monophasic variant of S. Typhimurium, and second annotation ring (Host) specifies the primary source class of each of the isolates.Bootstrap branch support values between 80% and 100% are shown on the tree.The tree is rooted at the outgroup strain SRR8820637.
remainder) to the Cattle and Sheep primary sources.All 139 DT104 outbreak related human isolates were assigned to either Cattle or Sheep with high confidence probability of assignment values of above 0.500.By contrast, of the 178 human isolates with different SNP address assigned to Cattle, 15 (8.4%) were assigned with a low confidence probability of assignment of below 0.500.Equally, of the 52 human isolates not related to the DT104 outbreak assigned to Sheep, 45 (86.5%) were assigned with a low confidence probability of assignment (Figure3Aand Supplementary TableS6).Additionally, of the 314 human isolates that the tuned RF1 model assigned to Cattle, 311 (99.0%) were of DT104 (including isolates with different SNP address to the DT104 outbreak related isolates),

TABLE 3
The tuned RF1 machine learning model confusion matrix for the assignment of 322 training set animal origin S. Typhimurium and monophasic S. Typhimurium isolates to eight primary source classes.

TABLE 4
The tuned RF1 machine learning model confusion matrix for the assignment of 77 test set animal origin S. Typhimurium and monophasic S. Typhimurium isolates to eight primary source classes.

TABLE 5
The top 15 RF1 model features (cgMLST loci) ranked by the mean decrease in accuracy-an overall feature importance measure across all eight primary source classes (MeanDecreaseAccuracy).The feature importance for each of the eight primary sources the RF1 model was trained to recognize (Cattle, OtherMammals, Pigs, Sheep, Broilers, Layers, Turkey, Game) is also specified.The EnteroBase derived gene name and gene function description is specified for each of the 15 features.Note that the top ranking feature for a specific primary source may not be shown in the table as the top 15 cgMLST loci are ordered by MeanDecreaseAccuracy.

TABLE 6
The tuned RF1-no DT104 machine learning model confusion matrix for the assignment of 275 training set animal origin S. Typhimurium and monophasic S. Typhimurium isolates to eight primary source classes, excluding all isolates with the clonal DT104 outbreak SNP address of60.11.15.16.458.459.x.The values along the diagonal (in bold) indicate the number of isolates correctly assigned by the model to their actual primary source class.The values above and below the diagonal indicate the number of isolates incorrectly classed by the model not to their actual primary source class (column headers) but to the model predicted source (row names).The isolates were assigned to the primary source class with the highest model computed probability of assignment.Balanced accuracy is the average of the sensitivity (true positive rate) and specificity (true negative rate) values for each primary source class.

TABLE 7
The tuned RF1-no DT104 machine learning model confusion matrix for the assignment of the 64 test set animal origin S. Typhimurium and monophasic S. Typhimurium isolates to eight primary source classes, excluding all isolates with the clonal DT104 outbreak SNP address of60.11.15.16.458.459.x.

TABLE 8
The top 15 RF1-no DT104 model features (cgMLST loci) ranked by the mean decrease in accuracy-an overall feature importance measure across all eight primary source classes (MeanDecreaseAccuracy).