Neural machine translation of clinical text: an empirical investigation into multilingual pre-trained language models and transfer-learning

Clinical text and documents contain very rich information and knowledge in healthcare, and processing them with state-of-the-art language technology has become very important for building intelligent systems that support healthcare and social good. This processing includes creating language understanding models and translating resources into other natural languages to share domain-specific cross-lingual knowledge. In this work, we conduct investigations on clinical text machine translation by examining multilingual neural network models using deep learning such as Transformer-based structures. Furthermore, to address the language resource imbalance issue, we also carry out experiments using a transfer learning methodology based on massive multilingual pre-trained language models (MMPLMs). The experimental results on three sub-tasks including (1) clinical case (CC), (2) clinical terminology (CT), and (3) ontological concept (OC) show that our models achieved top-level performances in the ClinSpEn-2022 shared task on English-Spanish clinical domain data. Furthermore, our expert-based human evaluations demonstrate that the small-sized pre-trained language model (PLM) outperformed the other two extra-large language models by a large margin in clinical domain fine-tuning, a finding that, to our knowledge, has not previously been reported in the field. Finally, the transfer learning method works well in our experimental setting using the WMT21fb model to accommodate a new language space, Spanish, that was not seen at the pre-training stage within WMT21fb itself, which deserves more exploitation for clinical knowledge transformation, e.g. by investigating more languages. These research findings can shed some light on domain-specific machine translation development, especially in clinical and healthcare fields. Further research projects can be carried out based on our work to improve healthcare text analytics and knowledge transformation.
Our data is openly available for research purposes at: https://github.com/HECTA-UoM/ClinicalNMT.


Introduction
In recent years, Healthcare Text Analytics (HECTA) has gained more attention from researchers across different disciplines, due to its impact on clinical treatment, decision-making, and hospital operation, and its recently empowered capabilities. These developments have much to do with the latest development of powerful language models (LMs), advanced machine-learning (ML) technologies, and increasingly available digital healthcare data from social media (Griciūtė et al., 2023; Oyebode et al., 2021; Luo et al., 2022) and discharged outpatient letters from hospital settings (Henry et al., 2020; Spasic and Nenadic, 2020; Percha, 2021).
Intelligent healthcare systems have been deployed in some hospitals to support clinicians' diagnostics and decision-making regarding patients and their health problems (Noor et al., 2022; Qian et al., 2021). Such usages include key information extraction (IE) from electronic health records (EHRs), normalisation to medical terminologies, knowledge graph (KG) construction, and relation extraction (RE) between symptoms (problems), diagnoses, treatments, and adverse drug events (Wu et al., 2022; Nguyen et al., 2023; Belkadi et al., 2023). Some of these digital healthcare systems can also help patients self-diagnose in situations where no General Practitioners (GPs) and professional doctors are available (Wroge et al., 2018; Zhu et al., 2021).
However, due to language barriers and the unequal accessibility of digital resources across languages, there is an urgent need for knowledge transfer, such as from one human language to another (Costa-jussà et al., 2022; Khoong and Rodriguez, 2022). Thus, to help address digital health disparity, machine translation (MT) technologies can be of good use.
MT is one of the earliest artificial intelligence (AI) branches, dating back to the 1950s, and it has boomed in recent years along with other natural language processing (NLP) tasks due to the newly designed powerful Transformer learning model (Weaver, 1955; Vaswani et al., 2017; Devlin et al., 2018; Han et al., 2021a). Several attention mechanisms designed in Transformer deep neural models have proven themselves capable of better learning from a large amount of available digital data compared to traditional statistical and neural network-based models (Han, 2022; Kuang et al., 2018; Han and Kuang, 2018).
In this work, we investigate state-of-the-art Transformer-based Neural MT (NMT) models in connection with clinical domain text translation, to facilitate digital healthcare and knowledge transfer, with the workflow drawn in Figure 1. Being aware of the current competition over language model sizes in the NLP field, we set up the following base models for a comparison study: 1) a small-sized multilingual pre-trained Marian language model (s-MPLM), which was developed by researchers at the Adam Mickiewicz University in Poznan and by the NLP group at the University of Edinburgh (Junczys-Dowmunt et al., 2018a; Junczys-Dowmunt et al., 2018b); and 2) a massive-sized multilingual pre-trained NLLB LM (MMPLM/xL-MPLM), developed by Meta-AI and covering more than 200 languages (Costa-jussà et al., 2022). In addition to this, we set up a third model to investigate the possibility of transfer learning in clinical domain MT: 3) the WMT21fb model, which is another MMPLM from Meta-AI but with a limited number of pre-trained language pairs, from English to Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese, and the opposite direction (Tran et al., 2021).
The testing language pairs of these translation models in our work are English ↔ Spanish. As far as we know, there are no other language pairs with openly available resources in clinical domain MT. We use the international shared task challenge data from ClinSpEn2022 "clinical domain Spanish-English MT 2022" for this purpose. ClinSpEn2022 was a sub-task of the BioMedical MT track at WMT2022 (Neves et al., 2022). There are three translation tasks inside ClinSpEn2022: i) clinical case reports; ii) clinical terms; and iii) ontological concepts from the biomedical domain.
Regarding the evaluation of these LMs, we used the evaluation platform offered by the ClinSpEn2022 shared task, including several automatic metrics such as BLEU, METEOR, ROUGE, and COMET. However, the automatic evaluation results did not give any apparent differentiation between the models on some of the tasks. Furthermore, there are issues such as inconsistent model rankings across automatic metrics. To address these issues and give a high-quality evaluation, we performed an expert-based human evaluation on the three models using outputs of Task one, "clinical case report".
Our experimental investigation shows that 1) the extra-large MMPLM does not necessarily win over the small-sized MPLM on clinical domain MT via fine-tuning; 2) our transfer-learning model works successfully for the clinical domain MT task on language pairs that were not pre-trained for, but added with fine-tuning. The first finding suggests that in clinical domain-specific MT, it is better to invest in data cleaning and fine-tuning than to build extra-large LMs. Our second finding shows the capability of MMPLMs to generate a knowledge space for a new language pair for translating clinical domain text, even though this language pair was unseen in the pre-training stage in our experimental settings. This can be useful to low-resource NLP, such as the work by (Almansor and Al-Ani, 2018; Islam et al., 2021).
The rest of this article is organised as follows: Section 2 surveys other works related to ours, including clinical domain MT and NLP, large LMs, and transfer learning. Section 3 details the three LMs we deployed for the comparison study. Section 4 introduces the experimental work we carried out and the automatic evaluation outcomes. Section 5 follows up with expert-based human evaluation and the results. Finally, Section 6 concludes our work with discussion.

Related Work
Applying NLP models to clinical healthcare has attracted much attention from researchers, such as the work on disease status prediction using discharge summaries by Yang et al. (2009), temporal expression and event extraction from clinical narratives using combined methods of rules and machine learning by Kovačević et al. (2013) and recent deep-learning models by Tu et al. (2023), temporal relation modelling on treatments using prompt engineering on GPT models by Cui et al. (2023), knowledge-based and data-driven methods for the de-identification task in clinical narratives by Dehghan et al. (2015), and systematic reviews on clinical text mining and healthcare by Spasic and Nenadic (2020) and Elbattah et al. (2021).
However, using MT to help translate clinical text for knowledge transfer and improved clinical decision-making is still a relative novelty (Khoong and Rodriguez, 2022), even though it has proven its usefulness for assisting health communication, especially with post-editing strategies (Dew et al., 2018). This is partially the result of the sensitive nature of the domain and the high risk in clinical settings (Randhawa et al., 2013). Some of the recent progress on using MT for clinical texts includes the work by Soto et al. (2019), which leverages SNOMED-CT terms (Donnelly, 2006) and relations for MT between the Basque and Spanish languages; Mujjiga et al. (2019), which applies an NMT model to identify semantic concepts among "abundant interchangeable words" in the clinical domain, with experimental results showing that the NMT model can greatly improve the efficiency of extracting UMLS (Bodenreider, 2004) concepts from a single document, taking 30 milliseconds in comparison to traditional regular expression-based methods, which take 3 seconds; and Finley et al. (2018), which uses NMT to simplify the typical multi-stage workflow of clinical report dictation and even correct errors from speech recognition.
With the prevalence of multilingual PLMs (MPLMs) developed in NLP fields, it becomes necessary to test their performances in clinical domain NMT. MPLMs have been adopted for many NLP tasks since the first emergence of the Transformer-based learning structure (Vaswani et al., 2017). Among these, Marian is a small-sized MPLM led by Microsoft Translator, based upon Nematus NMT (Sennrich et al., 2017), with around 7.6 million parameters (Junczys-Dowmunt et al., 2018a). At the same time, different research and development teams have been competing in recent years in terms of the size of their LMs, such as the massive MPLMs (MMPLMs) WMT21fb and NLLB by Meta-AI, which have 4.7 billion and 54 billion parameters respectively (Tran et al., 2021; Costa-jussà et al., 2022). To investigate the performances of these models of varied sizes on clinical domain NMT with fine-tuning, we set up all three of these as our base models. To the best of our knowledge, our work is the first to compare small-sized and extra-large MPLMs in clinical domain NMT.
Close to the clinical domain, there is a biomedical domain MT challenge that has been organised along with the Annual Conference of MT (WMT) since 2016 (Bojar et al., 2016; Yeganova et al., 2021). The historical biomedical MT tasks have covered corpora of biomedical terminologies, scientific abstracts from Medline, summaries of proposals for animal experiments, etc. In 2022, this Biomedical-MT shared task introduced clinical domain data for Spanish-English language pairs for the first time (Neves et al., 2022).
As the WMT21fb model does not include Spanish in its pre-training, we also examined transfer learning for clinical domain NMT towards Spanish-English using the WMT21fb model. Transfer learning (Alyafeai et al., 2020) has proved useful for text classification and relation extraction (Pomares-Quimbaya et al., 2021; Peng et al., 2019), and for low-resource MT (Jiang et al., 2022). However, to the best of our knowledge, we are the first to test clinical domain NMT via transfer learning using MMPLMs.

Multilingual Marian NMT
First, we draw a training diagram of the original Marian model and its pre-training steps in Figure 2 according to (Junczys-Dowmunt et al., 2018a). The pre-processing step includes tokenisation, truecasing, and Byte-Pair Encoding (BPE) for sub-words. The shallow training step teaches a mid-phase translation model to produce temporary target outputs for back-translation. Then, the back-translation step produces the same amount of input source sentences to enlarge the corpus. The deep-training step first uses four left-to-right models, which can be RNN (Sennrich et al., 2017) or Transformer (Vaswani et al., 2017) structures, followed by four right-to-left models in the opposite direction. The ensemble-decoding step generates the n-best hypothesis translations for each source input segment, which are re-ranked using a re-scoring mechanism. Finally, in Marian NMT, there is an automatic post-editing step taken before the final output is produced. This step is also based on an end-to-end neural structure, modelling the set (MT-output, source sentence) → "post-edited output" as introduced by Junczys-Dowmunt and Grundkiewicz (2017).
The Marian NMT model we deployed is from the Language Technology Research Group at the University of Helsinki led by Tiedemann and Thottingal (2020). It is based on the original Marian model but continuously trained on the multilingual OPUS corpus (Tiedemann, 2012) to make the model available to more languages. It includes Spanish↔English (es↔en) pre-trained models and has 7.6 million parameters for fine-tuning.

Extra-Large Multilingual WMT21fb and NLLB
Instead of the optional RNN structure used in the Marian model, both the WMT21fb and NLLB massive-sized multilingual PLMs (MMPLMs) adopted the Transformer as the main methodology. In the original Transformer of Vaswani et al. (2017), each encoder layer has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, residual connections are employed around each of the sub-layers, followed by layer normalization. The self-attention sub-layer in the decoder stack is also modified to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
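The rule LayerNorm(x + Sublayer(x)) can be illustrated with a minimal pure-Python sketch. This is not the implementation used by any of the models in this work: the layer normalization below omits the learned gain and bias, and `toy_feed_forward` is a hypothetical stand-in for a real sub-layer.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and (almost) unit variance.

    Assumption: no learned gain/bias parameters, unlike a trained model.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_sublayer(x, sublayer):
    """Apply a sub-layer with a residual connection: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def toy_feed_forward(x):
    # Hypothetical stand-in for a position-wise feed-forward network.
    return [max(0.0, 2.0 * v - 1.0) for v in x]


out = residual_sublayer([0.5, -1.0, 2.0, 0.0], toy_feed_forward)
```

Because the residual sum requires x and Sublayer(x) to have the same length, the sketch also shows why all sub-layers share one output dimension.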

Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
Furthermore, this structure design still needs language-specific training, such as English-to-other and other-to-English used by WMT21fb.
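The description of attention as a weighted sum of values can be sketched as follows. The sketch assumes the scaled dot product of Vaswani et al. (2017) as the compatibility function and handles a single query with plain Python lists; it is an illustration, not the code of any model used in this work.

```python
import math


def attention(query, keys, values):
    """Single-query attention: a softmax-weighted sum of the value vectors.

    Compatibility function (assumption): dot(query, key) / sqrt(d_k).
    """
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the values, one output component at a time.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Two identical keys receive equal weight, so the output is the mean of the values.
mean_out, mean_weights = attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]])
```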

To further improve on this, the NLLB model designed a Conditional MoE Routing layer, inspired by Zhang et al. (2021), which asks the MoE model to decide which tokens to drop based on how computationally intensive or resource-heavy they are to process, or based on their routing efficiency. This is achieved by a binary gate, which assigns weights to the dense FFN (FFN_shared) or to MoE Gating, as in Figure 5. The Conditional MoE also removes language-specific parameters for learning.
In summary, the WMT21fb and NLLB models share very similar learning structures, the biggest difference being that WMT21fb used language-specific constrained learning. The WMT21fb model we applied is 'wmt21-dense-24-wide.En-X' (and the X-En direction), which has 4.7 billion parameters and contains the language pairs English ↔ Chinese, Czech, German, Hausa, Icelandic, Japanese, and Russian. The full NLLB model includes 200+ languages and has 54.5 billion parameters. Due to computational restrictions, we applied the distilled model of NLLB, i.e. NLLB-distilled, which has 1.3 billion parameters. The WMT21fb model does not have Spanish among its trained language pairs, while NLLB includes Spanish as a high-resource language. This is a suitable setting for us to examine transfer learning for clinical domain NMT by fine-tuning a translation model for the Spanish language on the WMT21fb model and comparing the output with the NLLB model (Spanish version).

Domain Fine-tuning Corpus
To fine-tune the three MPLMs for the English ↔ Spanish language pair towards the clinical domain, we used the medical bilingual corpus MeSpEn from Villegas et al. (2018), which contains sentences, glossaries, and terminologies. We performed data cleaning and extracted around 250K pairs of segments in this language pair for domain fine-tuning of the three models. These 250K pairs of segments were randomly chosen from the original MeSpEn corpus, and we divided them in a 9:1 ratio for training and development purposes. Because the WMT21fb pre-trained model did not include Spanish as one of its pre-trained languages, we could not use the <2es> (to-Spanish) indicator for fine-tuning. As a solution, we used <2ru> as the indicator for this purpose (to-Spanish). This poses a transfer learning challenge: to investigate whether the extra-large multilingual PLM (xL-PLM) WMT21fb has created a semantic space that can accommodate a new language pair for translation modelling using the 250K-segment corpus we extracted.
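The data preparation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the pipeline actually used: the function names and the random seed are hypothetical, and prepending the indicator to the source segment is an assumption, since the exact way the indicator is passed to the model is not specified here.

```python
import random


def split_train_dev(pairs, train_parts=9, dev_parts=1, seed=13):
    """Randomly split segment pairs into training and development sets (9:1 by default)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seed value is an arbitrary assumption
    cut = len(shuffled) * train_parts // (train_parts + dev_parts)
    return shuffled[:cut], shuffled[cut:]


def add_target_tag(source_segment, tag="<2ru>"):
    """Prefix a source segment with a target-language indicator.

    Assumption: the reused Russian indicator stands in for the missing <2es> tag
    and is prepended as plain text.
    """
    return f"{tag} {source_segment}"


toy_pairs = [(f"english segment {i}", f"segmento en español {i}") for i in range(100)]
train_pairs, dev_pairs = split_train_dev(toy_pairs)
tagged_example = add_target_tag("The patient presented with fever.")
```

Reusing an existing indicator means the model starts from the representation it learned for that language, which is one plausible reason for residual tokens of that language in the fine-tuned output.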

Model Parameter Settings
Some parameter settings for s-MPLM Marian model fine-tuning are listed below. The last activation function for the generative model is a linear layer. Within the decoder and encoder, we used the Sigmoid Linear Units (SiLU) activation function. More detailed parameter and layer settings are displayed in Figure 12 (Appendix).
• learning rate = 2e-5
The fine-tuning parameters for the WMT21fb model are the same as for NLLB-200-distilled, except for the batch size, which is set to 2. This is because the model is too large and we would get out-of-memory (OOM) errors if we increased the batch size to anything larger than 2. More details on the M2M-100 parameters and layer settings for the Conditional Generation Structure (Fan et al., 2021) that we used for the xL-MPLMs WMT21fb and NLLB-200 can be found in Figure 13 (Appendix).

Test Sets and Automatic Evaluations
The evaluation corpus we used is from the ClinSpEn-2022 shared task challenge data organised as part of the Biomedical MT track in WMT2022 (Neves et al., 2022). It has three sub-tasks: 1) EN→ES translation of 202 COVID-19 clinical case reports; 2) ES→EN translation of 19K clinical terms from biomedical literature and EHRs; and 3) EN→ES translation of 2K ontological concepts from a biomedical ontology.
The automatic evaluation metrics used for testing include BLEU (HuggingFace) (Papineni et al., 2002), ROUGE-L-F1 (Lin, 2004), METEOR (Banerjee and Lavie, 2005), SACREBLEU (Post, 2018), and COMET (Rei et al., 2020), hosted by the ClinSpEn-2022 platform. The metric scores are reported in Table 1 for the three translation tasks. In the table, the parameter 'plm.es' indicates whether the Spanish language was already included in the original off-the-shelf PLMs. Both Marian and NLLB have Spanish in their PLMs, while WMT21fb does not, which indicates that Clinical-WMT21fb is a transfer learning model for the EN↔ES language pair.
From this automatic evaluation result, the first surprising finding is that the much smaller Clinical-Marian model had most of the highest scores across the three tasks, as indicated by italics. The second finding concerns the two xL-MPLMs: even though the transfer-learning model Clinical-WMT21fb has a certain score gap to Clinical-NLLB on Task 1, it almost catches up with Clinical-NLLB on Tasks 2 and 3, even winning one of the scores, the COMET for Task 3 (0.9908 vs 0.9180). This means that the xL-MPLM has the capacity to create a multilingual semantic space and the capability to generate a new language model as long as there is a sufficient amount of fine-tuning corpus for this new language. Third, there are issues with automatic metrics. These include the confidence level of score differences (significance testing), such as the very close scores on Task 1 for the first two winner models. In addition, the winner models change across Tasks 2 and 3 depending on the metric.
We also observed that 4 percent of the tokens in the EN → ES output from the Clinical-WMT21fb model are Russian. This indicates that the model keeps Russian tokens when it does not know how to translate the English token into Spanish. This is very interesting since the Russian tokens retained in the text are not nonsense; instead, they are tokens with the correct meaning, only in a different language. This might be the reason why COMET generated a higher score for the Clinical-WMT21fb model than for Clinical-NLLB on Task 3 'ontological concept', since COMET is a neural metric that calculates semantic similarity in an embedding space, ignoring the word surface form.
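To make concrete what a surface-overlap metric such as ROUGE-L-F1 measures, the following is a simplified sketch based on the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenisation and a single reference, and it is not the implementation hosted by the shared task platform; the Spanish sentences are invented examples.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hypothesis, reference):
    """Simplified ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Invented example: the hypothesis shares only two tokens, in order, with the reference.
score = rouge_l_f1("el paciente tiene fiebre alta", "la paciente presenta fiebre")
```

Because the score depends only on token overlap with the reference, a translation that words the same content differently is penalised.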
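A share of foreign-script tokens like the one reported above could be estimated as sketched below. How the reported figure was actually measured is not stated, so this is an assumption-laden illustration: it approximates "Russian" by the Cyrillic script, tokenises on whitespace, and uses invented example segments.

```python
import unicodedata


def is_cyrillic_token(token):
    """True if any character in the token belongs to the Cyrillic script."""
    return any("CYRILLIC" in unicodedata.name(ch, "") for ch in token)


def cyrillic_token_share(segments):
    """Fraction of whitespace-separated tokens that contain Cyrillic characters."""
    tokens = [tok for seg in segments for tok in seg.split()]
    if not tokens:
        return 0.0
    return sum(is_cyrillic_token(tok) for tok in tokens) / len(tokens)


# Invented toy output: one Cyrillic token out of four tokens in total.
share = cyrillic_token_share(["dolor abdominal острый", "fiebre"])
```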
To improve the trustworthiness of our empirical investigation and generate a clearer evaluation output across the three models, we perform human expert-based evaluations in the next section.

Comparisons
To compare our much smaller Clinical-Marian model with other existing work on this shared task data, such as Optum (Manchanda and Bhagwat, 2022) and Huawei (Wang et al., 2022), we list the automatic evaluation scores in Table 2, where Optum attended all three sub-tasks, while Huawei only attended Task 2: Clinical Terminology (CT). From the comparison scores using automatic metrics, we can see that the much smaller Clinical-Marian wins some metrics in each of the tasks. In addition, Optum used their in-house clinical data as extra training resources in addition to the WMT-offered training set, while the 250K training set we used for Clinical-Marian was extracted using WMT data only. Huawei's model wins only one metric (COMET) out of five on Task 2 (CT), whereas Clinical-Marian and Optum each win two metrics out of five. This means that Huawei's performance on this task is not much better, even though they have much greater online resources and computational support.

Human Evaluation
As observed in the last section, we had two reasons to set up the expert-based human evaluation: 1) it is really surprising that the much smaller MPLM (s-MPLM) Clinical-Marian performs better than the xL-MPLMs Clinical-NLLB and Clinical-WMT21fb; 2) to verify the automatic evaluation hypothesis that Clinical-Marian really does have the best performance.

Human Evaluation Setup
To achieve both qualitative and quantitative human evaluation, we deployed a human-centric expert-based post-editing quality evaluation metric called HOPE by Gladkoff and Han (2022) (it is also called LOGIPEM and was invented by Logrus Global LLC, a language service provider). The HOPE evaluation metric has 8 predefined error types, and each error type has different levels of penalty points according to the severity level. The sentence-level and system-level HOPE score is a comprehensive score reflecting the overall quality of outputs.
First, we recruited five human evaluators who have backgrounds in professional translation, linguistics, and biomedical research. For the evaluation data set, we took all the test set output from Task 1 'clinical case' reports since this is the only task with full sentences. For the other two tasks on term and ontology level translation, MT engines can produce relatively good outcomes even without an effective encoder-decoder neural model, e.g. via a well-prepared bilingual dictionary. We prepared 100 strings for each set and delivered all the sets to five professional evaluators. The tasks consisted of strings of medical cases going in order one by one, so the context of each case is clear to the evaluator.
Each one of them was given three files for evaluation from different engines, and instructions were given on both the online Perfectionist tool that was used for evaluation and the HOPE metrics.Then, to ensure the human evaluation quality, we have also asked the strictest reviewer/evaluator to validate the work of other evaluators.The strictest reviewer is one of our experts from the language service provider industry and has our trust according to their long-term experiences in post-editing MT outputs and selecting MT engines in real world projects.The strictest reviewer made better distinctions between all three evaluated models, while the less-strict reviewers sometimes gave similar scores to these models without picking their errors rigorously.

Human Evaluation Output
The results of the evaluation can be seen in the online Perfectionist tool that was used for this purpose, as downloaded from the tool in the form of familiar Excel scorecards.They are tallied in Figure 6 and Table 3.The human evaluation clearly shows which model is the best demonstrating a large score gap in-between: the Clinical-Marian has a score of 0.801625, followed by Clinical-NLLB and Clinical-WMT21fb with scores of 0.768125 and 0.692429 respectively.
To compare the human evaluation outputs with the automatic metric scores, we also added two metrics, METEOR and ROUGE, and their average score into the figure.The reason we chose these two particular metrics is that they have a relatively positive correlation to human judgements.For the other three metrics, there are several issues that prevented their use.First, BLEU shows NLLB as being better for terms and concepts, which does not correspond to the human judgement.Moreover, BLEU shows WMT21fb concepts to be better than those of the Marian Helsinki model, which is completely incorrect.Second, COMET score for the NLLB model is higher than 1, which is clearly caused by the fact that this implementation of COMET was not normalised by the Sigmoid function.Also, this COMET score for NLLB is higher than the one for Marian Helsinki.Another error is that the COMET score for clinical cases is much better than for both Marian and NLLB, which is completely impossible due to the presence of foreign language tokens in WMT21fb output.Finally, when we see COMET scores like 0.99 and 0.949 for Concepts, the score 0.42, 0.40 and 0.34 for Cases look clearly out of line.The BLEU-HF scores for all content types are ridiculously low on the scale of [0, 1] for both Cases and especially for Terms.
Below is the list of findings made from the comparisons.
• Most importantly, all human evaluators consistently showed positive correlation with preliminary human judgement of the MT output quality.Some of them gave more rigorous evaluations than the others, but all of them rated the worst model as the worst and the best model as the best with only one exception.Results of human evaluation fully confirm initial hypothesis about the quality of outputs of different engines, which is based on initial holistic spot-check human evaluation.
• The LOGIPEM/HOPE metric shows a much greater difference in output quality than any of the automated metrics.Where the automatic score shows a 6 percent difference, human evaluation gives 14 percent.In other words, the human linguists clearly see a significant difference between output quality of different engines.Even the less-trained evaluators show a positive correlation with the hypothesis.
• Even for those automatic metrics that correlate with human judgement, the score values do not seem to be representations of the uniform interval of [0, 1]. The COMET or ROUGE score of 0.6 means that MT has generated words that are different from those in the reference, and this in turn means that even a perfect translation which is different from the reference would be rated much lower than 1. This is a huge distortion of linearity, which is metric-specific because all scores for different metrics live in their own ranges. Automatic scores appear to live on some sort of non-uniform scale of their own, which is yet another reason why they are not comparable to each other. The scale is compressed, and the difference between samples becomes statistically insignificant.
• The margin of error for all three engines is about 6%, which is about the same as the difference between the means of the measurements for different engines. This means that the difference between measurements is statistically significant, but a lot depends on the subjectivity of the reviewer, and the difference between reviewers' positions may negate the difference in scores. However, even despite the reviewers' subjectivity, the groups of measurements for different engines appear to provide a statistically and visually significant difference.
• In general, human evaluators have to be trained / highly experienced, and need to maintain a certain level of rigour. The desired target quality should be stipulated quite clearly by customer specifications, as defined in ISO 11669 and ASTM F2575. To avoid incorrect (inflated) scores and a decrease in Inter-Rater Reliability (IRR), the linguists must be either tested prior to doing evaluations or cross-validated afterwards.
• One evaluation task only takes 1 hour.There were 24 evaluation tasks in total, each task with 100 segments.It does not require setting up any data processing, software development, reference "golden standard" data or model-trained evaluation metric.It is clearly faster, more cost-effective and reliable than the research on whether an automatic metric can even pass the positive correlation test with human judgement (3 out of 5 did not in our case).While individual human measurements have variance, they are all valid and all correlate with human judgement if done with minimal training and rigour.
• Automatic metrics are not comparable across different engines, different data sets, different languages and different domains. On the contrary, human measurement is the golden universal standard that provides the least common denominator between these scenarios. In other words, if ROUGE is 0.67 for En-Fr for medical text, and ROUGE is 0.82 for En-De for automotive text, we cannot compare these numbers. In contrast, a LOGIPEM/HOPE score would mean one and the same thing across the board.
All of the above confirms the validity and interoperability of our human evaluation using LOGIPEM/HOPE metrics (Gladkoff and Han, 2022), which can be used as a single quick and easy validator of automatic metrics, and the ultimate fast and easy way to carry out analytic quality measurement to compare the engines and evaluate the quality of translation and post-editing.

Inter-Rater-Reliability
To measure the inter-rater-reliability (IRR) of the human evaluation we carried out, we summarise the evaluation output from five human evaluators on three models in Figure 7.The summaries include the average scores for each model, the score difference between these three models, and the average scores from the three models, from each person.
In this case we have continuous ratings (ranging from 0 to 1) rather than categorical ratings.Therefore, Cohen's Kappa or Fleiss' Kappa are not the most appropriate measures for this work.The Intraclass Correlation Coefficient (ICC) which measures reliability of ratings by comparing variability of different ratings of the same subject to the total variation across all ratings and all subjects would also not be appropriate here because there is a greater variation within the ratings of the same MT engine than between different MT engines.
However, we can compute the standard deviations of the evaluations by different reviewers for each engine as follows:
• Marian: approximately 0.101
• NLLB: approximately 0.100
• WMT21: approximately 0.125
These values represent the amount of variability in the ratings given by different reviewers for each engine. The confidence intervals for these measurements at a confidence level of 80% can be visualised in Figure 8. These intervals indeed overlap; however, Marian is reliably better than NLLB, and it is extremely surprising that the WMT21fb rating is that high, considering that this result was achieved with transfer learning, by fine-tuning an engine whose original PLM training dataset did not include English-Spanish. As we can see, for some reviewers who are quite tolerant of errors (e.g. Evaluator-1), the quality of all the engines is almost the same. The more proficient and knowledgeable the reviewer is, the greater the difference in their ratings.
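As a rough sketch of how an 80% confidence interval could be derived from the standard deviations reported above, the snippet below computes the interval half-width around each engine's mean score. It assumes a Student's t interval for the mean over five evaluators (4 degrees of freedom, critical value about 1.533) and treats the reported values as sample standard deviations; the source does not state which formula was used, so the resulting numbers are illustrative only.

```python
from math import sqrt

# Two-sided 80% critical value of Student's t with 4 degrees of freedom
# (five evaluators). Hard-coded assumption, not taken from the source.
T_CRIT_80_DF4 = 1.533

def ci_half_width(std_dev, n_raters, t_crit=T_CRIT_80_DF4):
    """Half-width of the confidence interval for a mean of n_raters scores."""
    return t_crit * std_dev / sqrt(n_raters)

# Standard deviations as reported in the text.
reported_std = {"Marian": 0.101, "NLLB": 0.100, "WMT21": 0.125}

half_widths = {
    engine: ci_half_width(sd, n_raters=5) for engine, sd in reported_std.items()
}
for engine, hw in half_widths.items():
    print(f"{engine}: mean score +/- {hw:.3f}")
```

Each interval is then the engine's mean score plus or minus its half-width; two engines' intervals overlap when the gap between their means is smaller than the sum of their half-widths.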

Error Analysis
We list sampled error analyses on the outputs from the fine-tuned WMT21fb and NLLB models in Figures 9, 10, and 11 for the three tasks on translations of sentences, terms, and concepts. The preferred translations are highlighted in green, and "both sounds ok" is marked in orange.
From the comparisons of sampled output sentences, we discovered that the most frequent errors in a fine-grained analysis include literal translations, oral vs written language, translation inconsistency, inaccuracy of terms, hallucinated/made-up words, and gender-related errors such as feminine vs masculine, in addition to the standard fluency and adequacy criteria commonly used by traditional MT researchers (Han et al., 2021b). For instance, in Figure 9, the first two sentences (lines 0 and 1) from the clinical-WMT21fb model are closer to written Spanish than those from the clinical-NLLB model, whose outputs are closer to oral Spanish. However, line 6 from the clinical-WMT21fb model includes the word "fuertes", meaning "strong", which is not as accurate as "severas/severe" from the other model. In addition, "de manana" in the same line is less natural than "matinal" from clinical-NLLB. Regarding gender-related issues, line 6 also provides an example: clinical-WMT21fb produced the masculine "el paciente" while clinical-NLLB produced the feminine "la paciente", although the source did not specify the gender of "the patient". Regarding literal translation, line ont-19 in Figure 11 shows that, when translating "Poor bladder function", clinical-WMT21fb gives the more literal "Mal función vesical" rather than the preferred "Función vesical deficiente" from clinical-NLLB. Hallucinations in the neural model output can also be found in Figure 11: for example, in line ont-27, "Vejícula" does not exist and is likely a mix of "vejiga" and "vesicula"; similarly, in line ont-2, "multicística" is a mix of Spanish and English, because the correct Spanish is "multiquística".
As we mentioned in Section 4, there are 4% Russian tokens in the English-to-Spanish translation outputs from the Clinical-WMT21fb model, which can be observed in Figures 9 and 11. However, they are meaningful tokens rather than nonsense: for example, the Russian tokens in line n-4 of Figure 9 mean "soon", and those in line ont-11 of Figure 11 mean "type of".
To boost knowledge transfer for digital healthcare and get the most knowledge out of available clinical resources, we explored state-of-the-art neural language models with regard to their performance in clinical machine translation. We investigated a smaller multilingual pre-trained language model (s-MPLM), Marian from the Helsinki NLP group, in comparison to two extra-large MPLMs (xL-MPLMs), NLLB and WMT21fb from Meta-AI. We also investigated the possibility of transfer learning in clinical domain translation using the xL-MPLM WMT21fb. We carried out data cleaning and fine-tuning in the clinical domain. We evaluated our work using both automatic evaluation metrics and human expert-based evaluation using the HOPE (Gladkoff and Han, 2022) framework.
The experiment has led to some far-reaching conclusions about MT models and their design, testing, and application, in particular: 1) A bigger model does not mean better quality. This premise proved to be false, evidently because researchers need vast amounts of data to train very large models, and very often such data is not clean enough. On the contrary, when we clean the data very well for fine-tuning, we can bring the model quality to much higher levels in specific domains, e.g. clinical text. We reached the point where the data quality was more important than the model's size. One key takeaway for researchers and practitioners is that if they can get 250,000 clean segments in a new low-resource language, they can fine-tune large language models (LLMs) and get a good enough engine in this language. The next step is then to continue to get clean data by post-editing the translation output from that engine. This is a very important implication for "low resource languages." 2) The automated metrics deliver an illusion of measurement: they are a good tool for iterative

Figure 1: Illustration of the Investigation Workflow

Figure 6: Comparison of Automatic Evaluations against Human Evaluation (HOPE)

Figure 7: Summary of Human Expert-Based Evaluations

Table 1: Automatic Evaluation of Three MPLMs using ClinSpEn-2022 Platform. 'plm.es' indicates whether the Spanish language is included in the PLMs.

Table 2: Model Comparisons on 3 Tasks between Clinical-Marian and Others.

Table 3: Automatic Evaluations vs Human Evaluations (HOPE) on Three MPLMs

exactly 1 if the segments, in the reviewer's opinion, do not have to be edited, and LOGIPEM/HOPE score of 0.8 means only about 20% by total wordcount of work left to be done on the text with that score, since the LOGIPEM/HOPE scoring model is designed with productivity assumptions in mind for various degrees of quality.