Professional Education
-
Doctor of Philosophy, Durham University (2025)
-
Bachelor of Science, University of Kent at Canterbury (2021)
-
PhD, Durham University, Computer Science (2025)
All Publications
-
Generalizable multilingual medical text anonymization using generative instruction tuning.
Communications medicine
2026
Abstract
Medical research depends on access to high quality data that protects patient privacy. Free text in health records contains valuable clinical detail, yet it often includes sensitive personal information that must be removed before use. Current approaches rely on manually created training data and focus mainly on narrow domains. They are difficult to scale to new medical fields and languages. This study aims to address these limitations by developing a framework that supports privacy-preserving use of medical text across diverse settings. The study introduces an annotation-free framework for training and adapting LLM-based anonymization models across diverse medical domains. Our reproducible framework includes the development of a generative medical anonymization model, leveraging synthetic data and instruction tuning of generative LLMs. Performance is evaluated on both synthetic test sets and on patient requests from a digital triage service. Accuracy, recall, precision, and the ability to maintain the original meaning of non-sensitive text are assessed. Here we show that generative models trained with the synthetic framework reach performance that exceeds strong baseline systems across several medical domains. The models preserve non-sensitive text with high fidelity and anonymize sensitive information with high accuracy. They perform well even when trained on small datasets, generalize to unseen clinical fields, and support anonymization in multiple languages without requiring additional training data in those languages. The study presents a reproducible, annotation-free approach that enables the development of effective anonymization models for medical text. The framework reduces reliance on real patient data, lowers the cost of adaptation to new settings, and supports wider use of unstructured clinical information for research and service improvement.
View details for DOI 10.1038/s43856-026-01682-8
View details for PubMedID 42288664
-
Disease and Health Surveillance in Companion Animals Using Artificial Intelligence and Machine Learning.
The Veterinary clinics of North America. Small animal practice
2026
Abstract
Companion animal disease surveillance now benefits from collated databases of electronic health records and artificial intelligence. This review examines computational approaches for analyzing unstructured veterinary clinical text, from rule-based systems through traditional neural networks to modern transformer models. Domain-adapted encoders like PetBERT enable efficient disease coding and syndromic surveillance, while generative models offer new capabilities. Topic modeling provides unsupervised pattern discovery. Key challenges include model generalization across clinical settings, privacy protection through deidentification, standardized evaluation frameworks, and environmental sustainability. Strategic deployment of appropriately sized models can advance One Health surveillance while respecting environmental responsibility.
View details for DOI 10.1016/j.cvsm.2026.03.016
View details for PubMedID 42161756
-
Comprehensive representation of health-related phenotypes in one million dogs using topic modelling of electronic health records
JOURNAL OF BIG DATA
2026; 13 (1): 50
Abstract
Historically, veterinary studies screening for breed, age and sex predisposition to disease have relied on collating small-scale studies of clinical datasets. The availability of larger datasets through groups such as the Small Animal Veterinary Surveillance Network (SAVSNET) promises access to information regarding a wide range of clinical presentations at scale; however, methodological limitations surrounding the extraction of specific disease information or screening for disease predispositions result in a substantial reduction in the number of animals studied. These studies often address very focused hypotheses - only leveraging a small fraction of the intrinsic value of the data at any one time. Here, we implemented an unsupervised machine learning methodology, creating a representation of a large volume of clinical notes collected by SAVSNET from veterinary practices across the UK. We utilise BERTopic, a topic-modelling tool based on Bidirectional Encoder Representations using Transformers (BERT) architecture, and show it is able to surface known phenotypes, such as breed predispositions to hypoadrenocorticism, diabetes mellitus and mitral valve disease, as well as potential novel patterns of disease phenotypes. This scalable and granular modelling technique facilitates the rapid interrogation of large clinical datasets, enabling the identification of a broad range of phenotypes within the population and the early detection of temporal changes indicative of emerging infectious or environmental diseases.
View details for DOI 10.1186/s40537-026-01365-0
View details for Web of Science ID 001729160700002
View details for PubMedID 41924019
View details for PubMedCentralID PMC13035608
-
PetEVAL: A veterinary free text electronic health records benchmark
edited by Demner-Fushman, D., Ananiadou, S., Miwa, M., Tsujii, J.
ASSOC COMPUTATIONAL LINGUISTICS-ACL. 2025: 341-353
View details for Web of Science ID 001616252100029
-
Premature mortality analysis of 52,000 deceased cats and dogs exposes socioeconomic disparities
SCIENTIFIC REPORTS
2024; 14 (1): 28763
Abstract
Monitoring mortality rates offers crucial insights into public health by uncovering the hidden impacts of diseases, identifying emerging trends, optimising resource allocation, and informing effective policy decisions. Here, we present a novel approach to analysing premature mortality in companion animals, utilising data from 28,159 deceased dogs and 24,006 deceased cats across the United Kingdom. By employing PetBERT-ICD, an automated large language model (LLM) based International Classification of Disease 11 syndromic classifier, we reveal critical insights into the causes and patterns of premature deaths. Our findings highlight the significant impact of behavioural conditions on premature euthanasia in dogs, particularly in ages one to six. We also identify a 19% increased risk of premature mortality in brachycephalic dog breeds, raising important animal welfare concerns. Our research establishes a strong correlation between socioeconomic status and premature mortality in cats and dogs. Areas with the lowest Index of Multiple Deprivation (IMD) scores show nearly a 50% reduction in the risk of premature mortality across cats and dogs, underscoring the powerful impact that socioeconomic factors can have on pet health and longevity. This research underscores the necessity of examining the socioeconomic disparities affecting animal health outcomes. By addressing these inequities, we can better safeguard the well-being of our companion animals.
View details for DOI 10.1038/s41598-024-77385-8
View details for Web of Science ID 001361325400010
View details for PubMedID 39567516
View details for PubMedCentralID PMC11579424
-
Text mining for disease surveillance in veterinary clinical data: part two, training computers to identify features in clinical text
FRONTIERS IN VETERINARY SCIENCE
2024; 11: 1352726
Abstract
In part two of this mini-series, we evaluate the range of machine-learning tools now available for application to veterinary clinical text-mining. These tools will be vital to automate extraction of information from large datasets of veterinary clinical narratives curated by projects such as the Small Animal Veterinary Surveillance Network (SAVSNET) and VetCompass, where volumes of millions of records preclude reading records and the complexities of clinical notes limit usefulness of more "traditional" text-mining approaches. We discuss the application of various machine learning techniques ranging from simple models for identifying words and phrases with similar meanings to expand lexicons for keyword searching, to the use of more complex language models. Specifically, we describe the use of language models for record annotation, unsupervised approaches for identifying topics within large datasets, and discuss more recent developments in the area of generative models (such as ChatGPT). As these models become increasingly complex it is pertinent that researchers and clinicians work together to ensure that the outputs of these models are explainable in order to instill confidence in any conclusions drawn from them.
View details for DOI 10.3389/fvets.2024.1352726
View details for Web of Science ID 001304911800001
View details for PubMedID 39239390
View details for PubMedCentralID PMC11376235
-
Explainable text-tabular models for predicting mortality risk in companion animals
SCIENTIFIC REPORTS
2024; 14 (1): 14217
Abstract
As interest in using machine learning models to support clinical decision-making increases, explainability is an unequivocal priority for clinicians, researchers and regulators to comprehend and trust their results. With many clinical datasets containing a range of modalities, from the free-text of clinician notes to structured tabular data entries, there is a need for frameworks capable of providing comprehensive explanation values across diverse modalities. Here, we present a multimodal masking framework to extend the reach of SHapley Additive exPlanations (SHAP) to text and tabular datasets to identify risk factors for companion animal mortality in first-opinion veterinary electronic health records (EHRs) from across the United Kingdom. The framework is designed to treat each modality consistently, ensuring uniform and consistent treatment of features and thereby fostering predictability in unimodal and multimodal contexts. We present five multimodality approaches, with the best-performing method utilising PetBERT, a language model pre-trained on a veterinary dataset. Utilising our framework, we shed light for the first time on the reasons each model makes its decision and identify the inclination of PetBERT towards a more pronounced engagement with free-text narratives compared to BERT-base's predominant emphasis on tabular data. The investigation also explores the important features on a more granular level, identifying distinct words and phrases that substantially influenced an animal's life status prediction. PetBERT showcased a heightened ability to grasp phrases associated with veterinary clinical nomenclature, signalling the productivity of additional pre-training of language models.
View details for DOI 10.1038/s41598-024-64551-1
View details for Web of Science ID 001252132200035
View details for PubMedID 38902282
View details for PubMedCentralID PMC11190214
-
Text mining for disease surveillance in veterinary clinical data: part one, the language of veterinary clinical records and searching for words
FRONTIERS IN VETERINARY SCIENCE
2024; 11: 1352239
Abstract
The development of natural language processing techniques for deriving useful information from unstructured clinical narratives is a fast-paced and rapidly evolving area of machine learning research. Large volumes of veterinary clinical narratives now exist, curated by projects such as the Small Animal Veterinary Surveillance Network (SAVSNET) and VetCompass, and the application of such techniques to these datasets is already improving (and will continue to improve) our understanding of disease and disease patterns within veterinary medicine. In part one of this two-part article series, we discuss the importance of understanding the lexical structure of clinical records and discuss the use of basic tools for filtering records based on key words and more complex rule-based pattern matching approaches. We discuss the strengths and weaknesses of these approaches, highlighting the on-going potential value in using these "traditional" approaches but ultimately recognizing that these approaches constrain how effectively information retrieval can be automated. This sets the scene for the introduction of machine-learning methodologies and the plethora of opportunities for automation of information extraction these present, which are discussed in part two of the series.
View details for DOI 10.3389/fvets.2024.1352239
View details for Web of Science ID 001156197300001
View details for PubMedID 38322169
View details for PubMedCentralID PMC10844486
-
Evaluating ChatGPT text mining of clinical records for companion animal obesity monitoring
VETERINARY RECORD
2024; 194 (3): e3669
Abstract
Veterinary clinical narratives remain a largely untapped resource for addressing complex diseases. Here we compare the ability of a large language model (ChatGPT) and a previously developed regular expression (RegexT) to identify overweight body condition scores (BCS) in veterinary narratives pertaining to companion animals. BCS values were extracted from 4415 anonymised clinical narratives using either RegexT or by appending the narrative to a prompt sent to ChatGPT, prompting the model to return the BCS information. Data were manually reviewed for comparison. The precision of RegexT was higher (100%, 95% confidence interval [CI] 94.81%-100%) than that of ChatGPT (89.3%, 95% CI 82.75%-93.64%). However, the recall of ChatGPT (100%, 95% CI 96.18%-100%) was considerably higher than that of RegexT (72.6%, 95% CI 63.92%-79.94%). Prior anonymisation and subtle prompt engineering are needed to improve ChatGPT output. Large language models create diverse opportunities and, while complex, present an intuitive interface to information. However, they require careful implementation to avoid unpredictable errors.
View details for DOI 10.1002/vetr.3669
View details for Web of Science ID 001114664100001
View details for PubMedID 38058223
View details for PubMedCentralID PMC10952314
-
PetBERT: automated ICD-11 syndromic disease coding for outbreak detection in first opinion veterinary electronic health records
SCIENTIFIC REPORTS
2023; 13 (1): 18015
Abstract
Effective public health surveillance requires consistent monitoring of disease signals such that researchers and decision-makers can react dynamically to changes in disease occurrence. However, whilst surveillance initiatives exist in production animal veterinary medicine, comparable frameworks for companion animals are lacking. First-opinion veterinary electronic health records (EHRs) have the potential to reveal disease signals and often represent the initial reporting of clinical syndromes in animals presenting for medical attention, highlighting their possible significance in early disease detection. Yet despite their availability, there are limitations surrounding their free text-based nature, inhibiting the ability for national-level mortality and morbidity statistics to occur. This paper presents PetBERT, a large language model trained on over 500 million words from 5.1 million EHRs across the UK. PetBERT-ICD is the additional training of PetBERT as a multi-label classifier for the automated coding of veterinary clinical EHRs with the International Classification of Disease 11 framework, achieving F1 scores exceeding 83% across 20 disease codings with minimal annotations. PetBERT-ICD effectively identifies disease outbreaks, outperforming current clinician-assigned point-of-care labelling strategies up to 3 weeks earlier. The potential for PetBERT-ICD to enhance disease surveillance in veterinary medicine represents a promising avenue for advancing animal health and improving public health outcomes.
View details for DOI 10.1038/s41598-023-45155-7
View details for Web of Science ID 001094273200034
View details for PubMedID 37865683
View details for PubMedCentralID PMC10590382
https://orcid.org/0000-0002-1358-4979