Abstract
Introduction:
Domestic Violence and Abuse (DVA) is a growing public health and safeguarding concern in the UK, compounded by long-standing data quality issues in police records. Incomplete or inaccurate recording of key variables undermines the ability of police, health services, and partner agencies to assess risk, allocate resources, and design effective interventions.
Methods:
We evaluated two machine learning models (Random Forest and DistilBERT) for classifying the type of victim/offender relationship (ex-partner, current partner, and family) from approximately 19,000 DVA incidents recorded by a UK police force. Models were benchmarked against a static rule-based classifier and assessed using precision, recall, and F1-score. To reduce false positives in the most challenging relationship categories, we implemented a selective classification strategy that abstained from low-confidence predictions.
Results:
Both machine learning models outperformed the baseline across all metrics, with average absolute gains of 11% in precision and 16% in recall. Ex-partner cases were classified most accurately, while current partner cases were classified with the least accuracy. Selective classification substantially improved precision for underperforming categories, albeit at the expense of reduced coverage.
Discussion:
These findings demonstrate that computational tools can enhance the completeness and reliability of police DVA data, provided their use balances predictive accuracy, interpretability, and safeguarding risks.
Introduction
Domestic Violence and Abuse (DVA) is a significant and growing problem in the UK. In England and Wales, over 2.3 million people aged 16 and over are estimated to experience DVA every year, and police record more than 1.3 million DVA incidents per year (Office for National Statistics, 2024). DVA disproportionately affects women (Barrow-Grint et al., 2022; HMICFRS, 2019), with 6.6% of women aged 16 and over experiencing abuse in the 12 months to March 2024 (Office for National Statistics, 2024).
The statutory definition of DVA in England and Wales includes those who are “personally connected,” meaning both intimate partners and relatives fall within the definition (Domestic Abuse Act, 2021). Previous research has found that risk factors for repeat victimization vary according to the type of relationship, with partners and ex-partners exhibiting more similar risk profiles compared to other familial relationships (Weir, 2024). The emotional and physical impacts of abuse have also been found to vary by relationship type (Blom et al., 2024), highlighting the importance of police accurately and consistently recording this field. In its progress report on police responses to DVA, Her Majesty's Inspectorate of Constabulary and Fire & Rescue Services (HMICFRS, 2015) included, in its checklist of data for scrutiny by Police and Crime Commissioners (PCCs), whether forces record DVA incidents broken down by the relationship between victim and perpetrator.
Accurate police-recorded crime data is vital for understanding and responding to DVA, yet it often contains missing values and inaccuracies. Across all crime types, the quality of police data in England and Wales has long been a concern. In 2014, weaknesses in data collection and processing led to police-recorded crime statistics losing their National Statistics accreditation. While there have been improvements in overall crime data recording since then (Whitehead, 2024), individual police forces still encounter difficulties adequately recording instances of DVA in police-recorded crime datasets (HMICFRS, 2019). To our knowledge, there has been no detailed analysis of the completeness or accuracy of specific data fields within these records.
Poor recording practices undermine investigations and the overall response to DVA, which increases the risk of harm to victims (Phoenix, 2023). Incomplete or inconsistent data makes it harder to identify strategic trends, detect escalation in abusive behavior, and carry out accurate risk assessments. As a result, police may miss opportunities to intervene early, fail to connect related incidents, and misjudge the level of danger victims face (Myhill and Hohl, 2019).
Correcting poorly recorded or missing data at this scale is non-trivial and beyond the capabilities of manual intervention alone. Fortunately, the increasing availability of computational solutions and machine learning algorithms can augment, and to a degree offset, much of this processing (Neubauer et al., 2023). The application of text mining and natural language processing (NLP) solutions across industries is supported by a growing body of interdisciplinary research, which shows that valuable information can be automatically extracted from unstructured data such as crime reports and case summaries (Karystianis et al., 2019; Victor et al., 2021).
However, automated prediction systems are not without risk, particularly when applied in sensitive domains such as policing. Data inherently reflects societal biases that poorly designed AI solutions can amplify (Leitgöb et al., 2023; Lum and Isaac, 2016). In the context of DVA, these biases may stem from underreporting of marginalized demographic groups or inconsistencies in police recording practices. Left unaddressed, predictive systems could misclassify cases involving underrepresented or marginalized groups, leading to missed interventions and biased police responses (Bland, 2020).
A potential way to mitigate these risks is selective classification, which allows predictive systems to abstain when confidence in a prediction falls below a specified threshold (Geifman and El-Yaniv, 2019). This auto-rejection mechanism typically improves model performance by avoiding low-confidence decisions, though at the cost of reduced coverage. In operational settings, human-in-the-loop workflows can be used to defer these abstentions to expert review (Butcher et al., 2024). While a combination of text mining and machine learning classifiers has previously been used to extract and process information from police-recorded crime data (Adily et al., 2021; Birks et al., 2020; Karystianis et al., 2019; Victor et al., 2021), to the best of our knowledge, a selective classification strategy has not yet been explored for extracting information from free-text DVA records.
We sought to address this gap by developing two machine learning algorithms to extract victim-offender relationship information from a dataset of DVA incidents obtained from a single UK police force. We first experimented with their ability to classify ex-partner, current partner, and familial cases of DVA without abstention (i.e., a conventional supervised machine learning framework). We then compared the performance to selective classification models that could reject a prediction when uncertain. This study builds upon existing research by demonstrating the feasibility of selective classification for extracting relationship data from free-text DVA police records, highlighting both its potential to improve data quality and its implications for risk assessment and operational decision-making.
Method
Dataset
Guidelines for how police forces record crime in England and Wales are governed by Home Office Counting Rules (HOCR). An incident is considered a crime if a report of criminal activity has been made to the police and attending officers uphold the view that a crime has occurred, based on the presence of a victim and a lack of evidence to the contrary. Crimes are recorded per victim (i.e., multiple incidents are recorded if a crime has more than one victim), and in the event that multiple crimes occur within a sequence, the most serious offense is recorded. The classification of a crime is made by the attending officer (Home Office, 2025).
As noted by Phoenix (2023), DVA is not defined as a single statutory offense under UK law. Consequently, it does not have its own classification code designated by the Home Office. Instead, HOCR requires police forces to manually flag incidents suspected of DVA on the basis the victim and perpetrator are personally connected (Home Office, 2025). This manual flagging process, in effect nationwide since 2015, has been linked to inconsistencies in how DVA is recorded across regions, contributing to known issues in data quality, under-reporting, and cross-force comparability (HMICFRS, 2019).
With this in mind, we acquired DVA incident data from a single UK police force with a high level of reported crime-recording accuracy (Whitehead, 2024). The data, comprising DVA incidents recorded between 2020 and 2024, was provided by the host force as a comma-separated values (CSV) file. Sharing of incident records was subject to a data sharing agreement between the University of Lancashire and the host force.
The data file consisted of 19,013 rows and 31 columns. Each row reflected a separate DVA incident. Columns comprised structured data fields including crime reference information and demographic details about the victim and the suspected perpetrator, including their age, ethnicity, gender, and relationship type. Columns also flagged the suspected presence of alcohol, drugs, firearms, and knives, and whether an arrest had been made. The final column, titled Crime Notes, included a free-text description of the incident. All data was fully de-identified by the host force as a condition of access.
Crime notes
The crime notes column formed the primary text corpus for model training and evaluation. These notes are written by attending officers at the time of the incident and consist of brief telegraphic descriptions that include a large amount of police terminology and abbreviations. The average length of a note was 46 words (median = 38; SD = 31), and the longest was 206 words. Variations in length (see Figure 1) result in crime notes that differ considerably in their semantic richness and lexical diversity, with many containing short snippets of information while others provide more detailed accounts of the incident.
Figure 1. Distribution of the length of crime notes by the number of words.
Data filtering
We first filtered the dataset to remove records likely to contribute noise, ensuring that the machine learning model was trained only on incidents with valid labels and meaningful free-text descriptions. Given our focus on extracting the relationship type from free-text, we removed all columns aside from the relationship type and crime notes. We then removed all rows where either variable was blank.
We excluded any rows where the crime notes contained duplicated or boilerplate phrases, as these can contribute to data leakage—a concept within machine learning where information about the target variable is inadvertently made available to the model during training (Kaufman et al., 2012). Data leakage essentially allows the model to cheat, artificially inflating performance and leading to an overestimation of how well it will generalize to new, unseen inputs, ultimately reducing its effectiveness in real-world settings (Zimmermann et al., 2023).
Lastly, we removed rows where the relationship type did not meet the Home Office (2025) definition of DVA and may have been flagged in error, for instance, where the suspect was recorded as an acquaintance of the victim, or where the victim declined to identify the suspect to police. After these filtering steps, our dataset comprised 14,119 confirmed DVA incidents, representing 74% of the initial dataset.
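To make these filtering steps concrete, the sketch below shows how they might be implemented with pandas. The column names (relationship_type, crime_notes), the boilerplate phrase, and the invalid tag list are illustrative assumptions rather than the host force's actual schema:

```python
import pandas as pd

# Load the incident data (column names here are illustrative assumptions)
df = pd.read_csv("dva_incidents.csv")

# Keep only the target label and the free-text field
df = df[["relationship_type", "crime_notes"]]

# Drop rows where either field is missing or blank
df = df.dropna(subset=["relationship_type", "crime_notes"])
df = df[df["crime_notes"].str.strip() != ""]

# Remove duplicated or boilerplate notes, which can leak label
# information across the train/test split
df = df.drop_duplicates(subset="crime_notes")
boilerplate = ["standard wording inserted by the recording system"]  # hypothetical
df = df[~df["crime_notes"].str.lower().apply(
    lambda note: any(phrase in note for phrase in boilerplate))]

# Exclude tags that fall outside the statutory DVA definition
invalid_tags = ["acquaintance", "suspect not identified"]  # hypothetical
df = df[~df["relationship_type"].str.lower().isin(invalid_tags)]
```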
Labeling schema for victim/offender relationship type
Our dataset included 56 unique combinations of tags to describe the relationship between the victim and the offender. Officers could input multiple relationship tags to indicate an incident with more than one suspect, although 98% of our raw data only referenced a single offender. Reflecting national DVA trends (Long and Harvey, 2020), ex-partner was the most common relationship in the raw data (40%), followed by current boyfriend/girlfriend (21%) or variations thereof (e.g., spouse) (9%). Other tags were included that related to the wider family unit, such as parent-of-offender (10%), sibling-of-offender (5%), and child-of-offender (3%).
Supervised machine learning performs best when boundaries between classes are semantically distinct. Linguistic overlap or ambiguity between classes (e.g., boyfriend/girlfriend and spouse) can increase label noise and contribute to misclassifications (Frenay and Verleysen, 2014). A classification task constructed around non-fuzzy boundaries is therefore desirable whenever possible.
Furthermore, we observed substantial class imbalance in the raw data. The most common tag, “ex-partner,” accounted for 40% of all DVA incidents, whereas “child-of-offender” made up only 3% of incidents. Whilst a degree of imbalance is inevitable in non-laboratory settings, reflecting real-world variability, large class disparities can reduce classification performance by training the model to overpredict the dominant class while failing to learn enough to accurately predict minority classes (He and Garcia, 2009).
We sought to resolve both problems by simplifying the labeling schema to three classes. As the dominant tag, “ex-partner” was not altered. We aggregated tags for “boyfriend/girlfriend” and “spouse” into a second class named “current partner,” analogous to intimate partner violence (IPV). Lastly, we grouped tags related to other family members, such as “child-of-offender” and “parent-of-offender” into a third class named “family.” Any remaining rows (482), which included non-family relations (e.g., neighbors, employees), were inspected and removed from our dataset. We illustrate the revised data distribution for this simplified schema in Table 1.
Table 1
| Class label | Freq. | Perc. (%) |
|---|---|---|
| Ex-partner | 5,828 | 46.7 |
| Current partner | 4,156 | 33.3 |
| Family | 2,502 | 20.0 |
Distribution of relationship type by class label.
Most (46.7%) DVA incidents in our final dataset were committed by an ex-partner, compared to 33.3% by a current partner, and 20% by a member of the victim's family.
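A minimal sketch of this label consolidation, continuing from the filtering sketch above; the raw tag strings shown are examples drawn from the 56 combinations rather than an exhaustive list:

```python
# Map granular relationship tags onto the simplified three-class schema
LABEL_MAP = {
    "ex-partner": "ex-partner",
    "boyfriend/girlfriend": "current partner",
    "spouse": "current partner",
    "parent-of-offender": "family",
    "sibling-of-offender": "family",
    "child-of-offender": "family",
}

df["label"] = df["relationship_type"].str.lower().map(LABEL_MAP)

# Tags outside the schema (e.g., neighbors, employees) map to NaN
# and are dropped after manual inspection
df = df.dropna(subset=["label"])
```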
Supervised machine learning
Supervised machine learning is a method for learning a mathematical function f(X) that can predict an output variable Y, in this case the relationship type, from an input X, the content of the police crime notes (Leitgöb et al., 2023). The model works by iteratively evaluating known pairs of X and Y to learn a set of parameters that minimize prediction error, thereby enabling accurate predictions on unseen data (Molina and Garip, 2019). The goal in this process is to strike a balance between fitting the training data too closely, including its noise (overfitting), and fitting it too loosely, which increases error on the training data itself (underfitting). A standard way to check this balance is to divide the dataset into separate regions for training and testing. In our case, we allocated 70% of the data to training the model and 30% to testing its performance. To ensure that each set contained a proportional representation of each relationship class, we used a stratified splitting approach via scikit-learn's train_test_split function, with the stratify parameter set to the class label. We also set a fixed random seed to ensure a reproducible split.
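The split described above corresponds closely to the following scikit-learn call; the specific seed value is an illustrative assumption, as the paper fixes a seed without reporting it:

```python
from sklearn.model_selection import train_test_split

# 70/30 stratified split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    df["crime_notes"],
    df["label"],
    test_size=0.30,
    stratify=df["label"],  # preserve class proportions in both sets
    random_state=42,       # illustrative seed
)
```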
Preprocessing the content of the police crime notes
Machine learning models do not interpret raw text in the same way humans do. Instead, text must be represented in a numerical format (Grimmer et al., 2022). An example of representing text numerically is to define each word in a document by its frequency of occurrence. Words with the highest relative frequency represent a description of the document, following Firth's (1957, p. 11) distributional hypothesis that “you shall know a word by the company it keeps.” Despite their simplicity, these methods have been shown to yield results that match or exceed more sophisticated approaches in information retrieval and document classification tasks (Salton and Buckley, 1988). More recent innovations, such as word or document embeddings, map text into dense numerical vectors that capture semantic relationships between terms (Mikolov et al., 2013). Regardless of the method, modeling text numerically typically produces high-dimensional and often sparse vector spaces, making preprocessing steps critical for reducing dimensionality and model complexity (Uysal and Gunal, 2014).
We followed a series of steps outlined by Hickman et al. (2022) to preprocess the crime notes before training our machine learning models. First, we converted all text to lowercase, ensuring that the word tokens “crime” and “Crime” are considered equivalent, thereby increasing the model's statistical power. We also removed non-alphabetic characters such as punctuation and numbers, and any additional whitespace between words.
We then automatically corrected misspellings using Levenshtein Distance, which measures the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one word into another (Levenshtein, 1966). If a word did not appear in a standard English dictionary, it was replaced with the closest match if the Levenshtein Distance was less than two edits. If no such match existed, we retained the original misspelling.
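A minimal sketch of this preprocessing routine is given below. The Levenshtein function is a standard dynamic-programming implementation; the exhaustive dictionary scan is purely illustrative, and a production pipeline would likely rely on an indexed spell-checking library:

```python
import re

def levenshtein(a: str, b: str) -> int:
    # Minimum number of single-character edits turning a into b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def preprocess(note, dictionary):
    # Lowercase, strip non-alphabetic characters, collapse whitespace
    tokens = re.sub(r"[^a-z\s]", " ", note.lower()).split()
    corrected = []
    for tok in tokens:
        if tok in dictionary:
            corrected.append(tok)
            continue
        # Nearest dictionary word; keep the misspelling if no match
        # lies within one edit of the original token
        closest = min(dictionary, key=lambda w: levenshtein(tok, w))
        corrected.append(closest if levenshtein(tok, closest) < 2 else tok)
    return " ".join(corrected)
```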
Machine learning algorithms
Algorithms are the specific rules that govern how a model processes information and makes decisions (Cormen et al., 2022). The choice of algorithm depends on several factors, including the size and type of data, as well as the available computational resources. To evaluate the effectiveness of a chosen model, it is common to compare its performance against a simple, well-understood baseline estimator, which serves as a reference point (Géron, 2022). In this work, we compare the performance of two machine learning algorithms (Random Forest and DistilBERT1) to a simple rule-based estimator, which acted as our baseline model.
Random forest
A random forest is an ensemble learning method that builds multiple decision trees, each trained on a bootstrap sample of the training data. At each branching point within a tree, a random subset of features (in this case, words from the crime notes) is considered, which helps make the trees more diverse. A prediction about an input is obtained by aggregating the output over all trees through majority voting (Breiman, 2001).
To train the random forest, we represented words within crime notes as numerical features using the Term-Frequency/Inverse-Document-Frequency (TF-IDF) method. TF-IDF is a feature engineering technique that assigns higher weights to words that appear frequently in a document but rarely across the remainder of the corpus (Salton and Buckley, 1988), serving as a numerical representation of the importance of a particular term. We built 100 decision trees and used the default hyperparameters implemented by scikit-learn, including tree depth (unlimited, no pruning), the minimum number of samples to split a node (2), and the minimum number of samples required at a leaf node (1) (Pedregosa et al., 2011).
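The configuration just described corresponds approximately to the scikit-learn pipeline below, continuing the running example (the seed is illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# TF-IDF features feeding a 100-tree forest; all other hyperparameters
# remain at scikit-learn defaults, as reported above
rf_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])

rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
```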
DistilBERT
DistilBERT is a transformer-based deep learning model for natural language processing, derived from BERT but with fewer parameters for faster inference (Devlin et al., 2019; Sanh et al., 2020). Unlike random forests, which consider each word as an individual feature, DistilBERT analyzes the whole sequence of input text using token embeddings, enabling it to capture context and meaning more readily than a bag-of-words model. During training, these embeddings are processed through multiple transformer layers, with the final layer producing a logit score for each possible class label. Logits are converted to probabilities via the Softmax function, and the label with the highest probability is selected as the model's prediction.
We fine-tuned the pre-trained distilbert-base-uncased model from the Hugging Face Transformers library (Wolf et al., 2020) using the data in our training set. Text was tokenized into subword units, converted to token IDs, and padded or truncated to a maximum sequence length of 256 tokens. The model was fine-tuned for three epochs using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 5 × 10−5 and a batch size of 8. The model's output layer was adapted to the three class labels in our dataset.
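This fine-tuning setup can be sketched with the Hugging Face Trainer API as follows. The dataset wrapper and the label-to-ID ordering are illustrative assumptions; the tokenization settings, epochs, learning rate, and batch size follow the values reported above (AdamW is the Trainer's default optimizer):

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)

LABEL_IDS = {"ex-partner": 0, "current partner": 1, "family": 2}  # assumed ordering

class NotesDataset(torch.utils.data.Dataset):
    # Wraps tokenized crime notes and integer labels for the Trainer
    def __init__(self, texts, labels):
        self.encodings = tokenizer(list(texts), truncation=True,
                                   padding="max_length", max_length=256)
        self.labels = [LABEL_IDS[lab] for lab in labels]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

args = TrainingArguments(
    output_dir="dva-distilbert",
    num_train_epochs=3,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
)
Trainer(model=model, args=args,
        train_dataset=NotesDataset(X_train, y_train)).train()
```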
Rules-based (baseline)
Our baseline model was a rules-based classifier that generates predictions using a set of static, predefined rules based on the presence of specific keywords in the crime notes (see Algorithm 1 for an illustration). For example, crime notes containing “ex-boyfriend” or “divorce” are assigned to the ex-partner class label. These rules were defined by the authors and rely on words that are strongly associated with each relationship type. If the crime notes do not contain a class-specific keyword, they are automatically classed as “ex-partner,” as it was the most frequent class.
Algorithm 1
While simple and transparent, this approach is limited to detecting patterns explicitly defined a priori, and may overlook unexpected, subtle or less direct language. Our choice of a rules-based model as our baseline was influenced by prior studies that have successfully used syntax-based rules to extract information from crime reports (Victor et al., 2021; Karystianis et al., 2019).
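For illustration, a minimal classifier in the spirit of Algorithm 1 is sketched below; the keyword lists are hypothetical stand-ins for the authors' actual rules. Checking the ex-partner keywords first prevents, for example, “ex-boyfriend” from matching the current-partner rule:

```python
# Hypothetical keyword rules; order matters, so ex-partner is checked first
RULES = {
    "ex-partner": ["ex-boyfriend", "ex-girlfriend", "ex-husband", "divorce"],
    "current partner": ["boyfriend", "girlfriend", "husband", "wife"],
    "family": ["mother", "father", "brother", "sister", "son", "daughter"],
}

def rule_based_classify(note: str) -> str:
    note = note.lower()
    for label, keywords in RULES.items():
        if any(keyword in note for keyword in keywords):
            return label
    return "ex-partner"  # fall back to the most frequent class
```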
Selective classification
A limitation of conventional supervised machine learning is that a prediction is generated for every input, even when the input contains no relevant information for the task. In our context, this can lead to invalid classifications when the crime notes contain no reference to the relationship type. In practice, especially given the inherently high-stakes nature of crime recording, we would want to avoid making predictions where it is not suitable to do so.
To resolve this, we include an abstention function after prediction. The model produces a confidence score for each class, and if the highest score falls below a predefined threshold τ (e.g., 70%), the prediction is withheld (see Algorithm 2). Increasing τ makes the model more conservative, generating predictions only when confidence is high. This approach should improve precision by reducing false positives, though at the cost of lower coverage (the proportion of the test set for which a prediction is made; Chow, 1970).
Algorithm 2
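A minimal sketch of this abstention step, assuming a classifier that exposes scikit-learn's predict_proba interface (for DistilBERT, softmax probabilities over the logits play the same role):

```python
import numpy as np

def selective_predict(model, texts, tau=0.70):
    # Predict only when the top class probability reaches the threshold tau
    probs = model.predict_proba(texts)        # shape: (n_samples, n_classes)
    confidence = probs.max(axis=1)
    labels = model.classes_[np.argmax(probs, axis=1)]
    # Abstain (None) below tau; abstentions can be routed to human review
    return [lab if c >= tau else None for lab, c in zip(labels, confidence)]
```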
Evaluation strategy
We evaluate the performance of each model by using a confusion matrix to compare the predictions to the actual values (i.e., the ground truth) in the test set. We use precision, recall, and F1-score as our evaluation metrics. We avoid using accuracy due to the between-class imbalance in our data that would disproportionately influence results toward the performance of the majority class.
Precision
Precision is the proportion of predicted positive classifications that were correct. For instance, predicting “ex-partner” when the correct value was in fact “ex-partner” is a true positive (TP), whereas an incorrect prediction would be a false positive (FP). A high precision score (max score = 1) means that the model made fewer false positive errors:

Precision = TP / (TP + FP)
Recall
Recall, also known as the true positive rate or sensitivity, is the proportion of actual positives in the test set that were correctly predicted by the model. A false negative (FN) is where the model should have predicted a given class (e.g., “ex-partner”) but failed to do so:

Recall = TP / (TP + FN)
F1-score
F1-score is the harmonic mean of precision and recall. Compared to accuracy, it is a more suitable measure of overall performance when working with imbalanced classes:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
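In practice, all three metrics can be computed per class in a single call; the label ordering below follows the schema in Table 1, continuing the running example:

```python
from sklearn.metrics import precision_recall_fscore_support

# Per-class precision, recall, and F1 on the held-out test set
precision, recall, f1, support = precision_recall_fscore_support(
    y_test,
    rf_predictions,
    labels=["ex-partner", "current partner", "family"],
)
```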
A schematic overview of the full pipeline, from data preprocessing through model training, evaluation, and selective classification, is shown in Figure 2.
Figure 2. Overview of the pipeline, illustrating data cleaning, splitting, training, confidence estimation, and prediction steps.
Results
Overall model performance—ML approaches outperform the baseline
When applied to the full test set without abstention, both machine learning models outperformed the baseline. DistilBERT achieved the highest overall performance (Precision = 83%, Recall = 82%, F1 = 82%), followed by Random Forest (Precision = 79%, Recall = 78%, F1 = 78%), compared to the baseline's lower scores (Precision = 70%, Recall = 64%, F1 = 66%). See Table 2 for a performance breakdown.
Table 2
| Model | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|
| Rule-based | 70 | 64 | 66 |
| Random Forest | 79 | 78 | 78 |
| DistilBERT | 83 | 82 | 82 |
Performance (Precision, Recall, F1) of each machine learning model (Random Forest and DistilBERT) and the baseline (Rule-Based) when predictions are made on the full test set (without abstaining).
These results represent improvements over the baseline of +9% precision and +14% recall for Random Forest, and +13% precision and +18% recall for DistilBERT. In practical terms, a 13% gain in precision for DistilBERT would prevent around 130 false classifications for every 1,000 incidents processed. In a police force handling 5,000 DVA incidents annually, this could result in 650 fewer relationship-type errors each year.
The better performance of DistilBERT likely reflects its ability to interpret context within the crime notes, for example, by recognizing whether the word “partner” is modified by time cues like “former” or “still living with.” In contrast, Random Forest with TF-IDF treats words independently, making it more vulnerable to misclassification in ambiguous cases.
Notably, even the lower-performing ML model (RF) still showed large gains over the baseline, suggesting that either approach could enhance the accuracy of DVA relationship classification in operational settings. Given the safeguarding implications of incorrect classifications, both models offer potential to reduce investigative risk and improve resource targeting.
Class-level trends—“Current-partner” is the hardest relationship to predict
Across classes, both ML models found ex-partner easiest to identify, achieving F1 scores of 82% (Random Forest) and 85% (DistilBERT; see Table 3). This is unsurprising given that 47% of training cases were labeled ex-partner, and the crime notes often contained unambiguous terms such as “ex-husband” or “former boyfriend.” DistilBERT achieved balanced precision (85%) and recall (85%), while Random Forest showed slightly more false positives, with lower precision (78%) despite higher recall (86%).
Table 3
| Class | RF Precision (%) | DB Precision (%) | RF Recall (%) | DB Recall (%) | RF F1 (%) | DB F1 (%) |
|---|---|---|---|---|---|---|
| Ex-partner | 78 | 85 | 86 | 85 | 82 | 85 |
| Current partner | 71 | 73 | 78 | 82 | 75 | 77 |
| Family | 91 | 93 | 64 | 77 | 75 | 84 |
Class-level performance (Precision, Recall, F1-score) of each machine learning model (Random Forest and DistilBERT).
In contrast, current partner was the most challenging class, with F1 scores of 75% (Random Forest) and 77% (DistilBERT). Precision was notably lower here (RF = 71%, DB = 73%), reflecting frequent misclassification of these cases. A visual inspection of the misclassified crime notes, alongside the confusion matrix (see Figure 3), shows this often occurred when the input text referenced phrases like “partner” or “boyfriend” without specifying whether the relationship had ended (e.g., “her partner had previously assaulted her”). Such wording requires contextual interpretation that may exceed the capacity of a bag-of-words model like Random Forest and can still challenge DistilBERT when temporal cues are ambiguous.
Figure 3. Confusion matrix showing the differences in class-wise classification and misclassification between Random Forest (left) and DistilBERT (right).
For the family class, both models achieved very high precision (RF = 91%, DB = 93%), but recall was substantially lower for Random Forest (64%) compared to DistilBERT (77%), creating an F1 gap of 9 percentage points (75% vs. 84%). This finding suggests DistilBERT was better able to detect familial relationships even when family terms appeared in more complex contexts, for example, “her brother, who lives nearby, was involved in the dispute.”
Selective classification improves precision, especially for the “current-partner” class
Allowing the models to abstain when prediction confidence was low substantially increased precision, with the largest gains seen for the current partner class. At a confidence threshold (τ) set to achieve 60% coverage (comparable to the baseline's natural abstention rate of 40%), precision for the current partner class rose by an average of 15 percentage points across the two models (see Table 4). In practical terms, for a dataset containing 500 current-partner cases, this gain would prevent approximately 75 false classifications.
Table 4
| Class | RF No Abstain (%) | RF Abstain (%) | DB No Abstain (%) | DB Abstain (%) |
|---|---|---|---|---|
| Ex-partner | 78 | 89 | 85 | 92 |
| Current partner | 71 | 83 | 73 | 91 |
| Family | 91 | 97 | 93 | 96 |
Precision performance of each machine learning model (Random Forest and DistilBERT) when the model could abstain if uncertain (60% coverage).
Overall, the selective classification strategy improved average precision by +9.7% for Random Forest and +9.3% for DistilBERT. Gains for the easier-to-detect classes (ex-partner and family) were smaller but still positive, reflecting the finding that these categories were already well-classified without abstention. The largest benefits came from filtering out ambiguous records, for example, notes stating “her partner, who she used to live with,” or “the suspect is a family friend,” where relationship type cannot be inferred with high certainty from text alone.
However, these precision gains came at the cost of reduced coverage, with 40% of cases left unclassified at the chosen τ. In operational DVA settings, this trade-off may still be advantageous, as unclassified cases can be flagged for manual review by experts, thus avoiding the risk of incorrect automatic classification that could misinform safeguarding measures. The biggest improvement was observed in the current partner class (an average improvement of 15% across the two models), indicating that abstention provides an efficient way to improve model performance by rejecting hard-to-classify instances.
Precision-coverage trade-offs differ between ML models
Raising τ improved precision for both models but reduced the proportion of cases classified. DistilBERT maintained far higher coverage under strict settings: at τ=90%, it still classified around 60% of cases, compared to less than 10% for Random Forest (see Figure 4). For a force processing 5,000 incidents annually, this would mean DistilBERT could automatically classify about 3,000 cases at this setting, while Random Forest would classify fewer than 500, leaving the rest for manual review.
Figure 4. Comparison of precision and coverage statistics for each machine learning model at different values of τ (x-axis). Lower values on the x-axis indicate a more lenient model, whereas higher values indicate a stricter model. Coverage (orange line) decreases as τ increases. The blue line illustrates changes in precision at different values of τ.
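Curves of this kind can be generated by sweeping τ and recording precision and coverage at each step; a sketch for the Random Forest, reusing objects from the earlier examples:

```python
import numpy as np
from sklearn.metrics import precision_score

probs = rf_model.predict_proba(X_test)
preds = rf_model.classes_[probs.argmax(axis=1)]
confidence = probs.max(axis=1)

for tau in np.arange(0.1, 1.0, 0.1):
    kept = confidence >= tau  # inputs the model does not reject
    if not kept.any():
        continue  # nothing classified at this threshold
    coverage = kept.mean()
    precision = precision_score(y_test[kept], preds[kept],
                                average="macro", zero_division=0)
    print(f"tau={tau:.1f}  coverage={coverage:.2f}  precision={precision:.2f}")
```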
This divergence in performance reflects differences in how the models generate confidence scores. DistilBERT's contextual embeddings often produce higher certainty when patterns are recognized, even with complex phrasing, whereas Random Forest's word-based features can yield lower confidence in cases where the wording is unfamiliar or partially missing.
From an operational standpoint, this means DistilBERT is better suited when large volumes of data must be processed and the cost of false positives at this scale is high, as it can remain conservative without discarding most cases. Random Forest's steep drop in coverage after τ=40% makes it more appropriate when interpretability and transparency are priorities. Examples of this might include instances where analysts are required to justify each automated decision or low-volume cases in which manual verification becomes a practical alternative.
Discussion
Our findings demonstrate that machine learning can substantially improve the accuracy of crime recording of victim–offender relationship types in DVA cases. Both models we tested, a Random Forest and a fine-tuned DistilBERT model, outperformed a rule-based baseline across all performance measures, with DistilBERT achieving the highest overall precision (83%), recall (82%), and F1-score (82%). These gains result in more reliable relationship data in at least four out of five cases, reducing the manual effort required by analysts and improving the completeness of police crime records.
Performance metrics varied by relationship category. Ex-partner cases were classified most accurately, while current-partner cases proved more challenging. We suspect this is mainly due to linguistic overlap in the crime notes. Introducing a selective classification approach, which allows the model to abstain when confidence is low, boosted precision for “current partner” cases by approximately 15%, albeit at the cost of reduced coverage. Our findings highlight the potential for computational tools to address long-standing data quality issues in policing (see Whitehead, 2024) and to do so in a way that balances accuracy with operational safety.
These findings are broadly consistent with, and extend, a growing body of research applying machine learning and text mining to extract information from DVA narratives (Karystianis et al., 2019; Adily et al., 2021; Neubauer et al., 2023). In particular, one study used supervised machine learning to identify instances of DVA in caseworker reports and found that a statistical model (k-nearest neighbor) outperformed a rule-based approach (Victor et al., 2021). Our work extends this line of research in two ways. First, we apply more recent algorithms, including a transformer-based model, to the related but distinct task of classifying victim–offender relationships, and second, we do so on substantially shorter narratives. Whereas the aforementioned study analyzed documents with a minimum length of 50 words, the average crime note in our dataset was only 46 words, with very few exceeding 100 words. Short-text classification is known to be more challenging for machine learning models due to the presence of sparse features and limited contextual cues (Song et al., 2014; Wang et al., 2017), making the observed performance gains here particularly noteworthy.
Our finding that the “current partner” class was the most difficult class to identify, with lower precision than both “ex-partner” and “family” classes, reflects the ambiguity in how police describe these relationships in crime notes. This result aligns with prior work showing that even slight variations in how DVA incidents are described can cause difficulties for both human annotators and machine classifiers (Victor et al., 2021). In our data, we observed substantial lexical overlap between the ex-partner and current partner classes, a challenge also noted by Beigman-Klebanov and Beigman (2010), who found that subtle lexical substitutions can reduce classification accuracy for both experts and automated systems. Class imbalance may have compounded this issue, as outlined by He and Garcia (2009), by limiting the number of distinctive current partner examples available during training. One possible way to mitigate this would have been the use of weighted random sampling, which could have helped the model place greater emphasis on underrepresented examples. The relatively higher performance for the “ex-partner” class likely reflects both its greater prevalence and the presence of clear lexical markers, such as “ex-,” “former,” or “divorce.” While synthetic oversampling techniques like SMOTE could potentially address class imbalances, these methods risk amplifying noise and overfitting in text-based models (Chawla et al., 2002). For the “family” class, DistilBERT's higher recall suggests that contextual embeddings may be better equipped to detect the broader set of familial relationships that are mentioned indirectly or embedded within more complex narrative structures.
We also find evidence to support the value of a selective classification approach as a precision-enhancing strategy in high-stakes applications. By allowing the models to withhold predictions when confidence was low, we achieved precision gains of around 15% when classifying “current partner,” the most challenging class, and smaller but still positive gains in the other categories. These findings are consistent with previous studies arguing that abstention improves decision reliability in domains where the cost of misclassification is high (Chow, 1970; Geifman and El-Yaniv, 2019). In our context, we observed an increase in precision, that is, a reduction in false positives, by automatically filtering out records with ambiguous cues (e.g., “her partner, who she used to live with”), which are difficult to classify correctly, a finding that aligns with Butcher et al. (2024). While this approach reduces coverage, the trade-off is operationally acceptable, as cases without a confident prediction can be flagged for human review, avoiding potentially harmful misclassifications in safeguarding and investigative contexts.
The trade-off between precision and coverage highlights important differences between our two machine learning models. DistilBERT retained substantially higher coverage at strict confidence thresholds, continuing to classify around 60% of test cases even when the minimum acceptable confidence was set to 90% certainty. Comparatively, the Random Forest, at the same confidence level, classified less than 10% of the available data. This difference likely reflects DistilBERT's ability to utilize contextual information to make confident predictions, even when explicit relationship markers are not present. In practice, this makes DistilBERT the better option when large datasets must be processed with minimal false positives. Conversely, Random Forest, with faster training times and greater transparency, may be preferable on smaller datasets, or when model interpretability is paramount (Arhiliuc and Guns, 2023).
Limitations
As noted by Karystianis et al. (2019), our models are reliant on the host force's internal quality checks to ensure the accuracy of the provided data. If the underlying records are inaccurate, prediction quality will inevitably suffer (Turner et al., 2019). In addition, due to privacy and confidentiality constraints when working with police data, such issues may go undetected. While our use of a held-out test set helps flag poorly performing models, assuming no data leakage (Kaufman et al., 2012), this safeguard is limited by the representativeness of the test data. Sensitivity checks, in which different data-splitting strategies (e.g., 60/40, 70/30, 80/20) are compared, are one solution to this challenge, although these were not undertaken here due to time constraints. A more robust alternative is K-fold cross-validation (Kohavi, 1995), which repeatedly trains and tests the model on different subsets of the data, reducing the risk of sampling bias. This too was deemed impractical due to the high computational cost of applying such methods to deep learning models. With our current setup, fine-tuning DistilBERT for three epochs2 on a single train–test split required approximately 35 min using a MacBook Pro with an M1 chip. Applying K-fold cross-validation would have increased total training time to many hours. Future work could enhance confidence in our initial findings by experimenting with different splitting and training strategies using dedicated GPU resources or high-performance computing infrastructure.
The single origin of our data limits the generalizability of our results beyond the individual force. To address this, we plan to triangulate our findings with data from other UK forces in future work. Given the challenges of acquiring police data in sufficient volume to adequately train machine learning models, complementary approaches such as text augmentation (Wei and Zou, 2019) and adversarial test cases (Morris et al., 2020) provide an alternative way to evaluate robustness and generalizability beyond the initial held-out set. A further practical consideration is that 26% of our dataset could not be used due to missing, duplicated, or inconsistent entries. While we removed these datapoints to ensure a robust and leakage-free model, they highlight the importance of improving the overall quality and completeness of administrative records within policing.
While our use of selective classification provides a practical means of enhancing precision by filtering uncertain predictions, it has notable limitations. The threshold for determining whether a prediction is sufficiently confident is model-dependent and may therefore need to be redefined when applied to different architectures or datasets. Moreover, selective classification does not resolve the underlying difficulty of ambiguous cases but improves performance by setting aside instances that are less reliably classified. In our context, however, this strategy was justified, as not all crime notes necessarily contained an explicit reference to the relationship type. Because of this, some cases should have been impossible to classify with confidence, irrespective of model performance. We view selective classification as a pragmatic strength in contexts where reducing false positives is especially important. From a practical standpoint, it was preferable to leave an input unclassified than to risk a false classification, particularly given the resource or procedural implications that may follow from mislabeling a victim incorrectly. Nevertheless, in applications where full coverage is required, complementary approaches such as targeted human annotation or expert validation of hard-to-classify cases could help address these challenges without relying solely on an abstaining model.
Ethical considerations: fairness and bias
Because our models are trained on police crime notes, their performance is shaped by the way officers record incidents. Any reporting inconsistencies along societal-level boundaries (e.g., differences in race, ethnicity, sexuality, or migrant status) risk being carried through into the model during training, potentially leading to models that are better calibrated for majority cases than for minoritized groups (Barocas et al., 2023). Disparities in representativeness, particularly within policing, can be difficult to identify and overcome (Lum and Isaac, 2016). In a preliminary inspection of our dataset, we observed that a large majority of victim ethnicity data was recorded as white. This imbalance may reflect the underlying distribution of reported cases, but it could also be compounded by incomplete or missing entries for non-white victims. With the current data, we cannot determine the pattern or causes of missingness and thus cannot reliably assess whether the models perform differently across demographic groups. Accordingly, fairness-related disparities cannot be ruled out.
Impact on policing
This study demonstrates that machine learning can be used to automatically fill in missing victim–offender relationship data in DVA cases with a high degree of accuracy, potentially improving the completeness of police crime records while reducing analyst workload. Our findings have direct operational benefits, generating more complete and accurate relationship data that can strengthen safeguarding assessments, inform investigative decisions, and improve the evidence base for policy (Whitehead, 2024).
However, these benefits depend on having sufficiently high-quality, representative data for the target variable. In our study, relationship type was a suitable candidate because it was both operationally important and sufficiently well-represented in police records to train effective models. When we tested the same approach on variables with very low variation, the model's performance dropped to nearly zero. Under these conditions, conventional supervised machine learning offers little operational value without additional data collection or careful rebalancing of the dataset.
Compared to conventional classification tasks, where an output is generated for each input, our selective classifier, with its reduction in coverage, reflects a deliberate architectural design choice. Rows where the victim–offender relationship type was absent from the crime notes were excluded, since the model cannot, and should not, predict information that is not present. In an operational setting, these no-prediction cases could be routed to expert review. A human-in-the-loop approach, where uncertain or incomplete inputs are presented to users via an interface, can significantly increase coverage while maintaining efficiency and precision (Butcher et al., 2024). Deference to a human verifier also provides an explicit opportunity to confirm when the free text does not contain the requisite information, thereby ensuring that the system complements, rather than replaces, professional judgment.
For policing more generally, the key message is that machine learning offers a viable route to targeted improvements in data quality, but it is not a one-size-fits-all solution. Before deploying such tools in live systems, forces should assess the distribution and quality of the data for the intended variable and consider extensive piloting of the approach in a controlled environment. Without such safeguards, there is a risk that biased or incomplete data could lead to misleading outputs and reinforce existing disparities (Lum and Isaac, 2016). Further research into approaches that can handle extreme imbalance will be essential if this technology is to be applied more broadly across police datasets.
Conclusions
In this study, we evaluated the capability of supervised machine learning models to automatically extract victim–offender relationship information from free-text crime notes in DVA cases. Both models demonstrated that such tools could serve as cost-effective and efficient alternatives to manual coding, accurately classifying relationship type in around four out of five cases. This finding represents a meaningful step toward addressing long-standing concerns about the completeness and reliability of police-recorded crime data (Whitehead, 2024). Given that police-recorded crime lost its status as an accredited official statistic in 2014, the application of data science methods to reliably impute missing values offers a promising route to restoring confidence in these records.
The incorporation of a selective classification function improved precision for the most challenging cases by abstaining from low-confidence predictions, though at the cost of reduced coverage. Future work should explore more advanced uncertainty quantification methods (Zhang et al., 2023) and human-in-the-loop designs (Butcher et al., 2024) to maintain high precision while improving coverage. Such developments could support operational decision-making in ways that are both data-driven and safeguard-conscious, enabling police to make better use of the information they already collect.
Statements
Data availability statement
The dataset presented in this article is not readily available due to conditions outlined in the data sharing agreement with the host force. Requests to access the datasets should be directed to darren.cook@city.ac.uk.
Author contributions
DC: Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. RW: Conceptualization, Data curation, Investigation, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing. LH: Conceptualization, Data curation, Investigation, Project administration, Resources, Supervision, Writing – review & editing.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. This work was supported by the UK Prevention Research Partnership [Violence, Health and Society; grant number: MR-VO49879/1], a Consortium funded by the British Heart Foundation, Chief Scientist Office of the Scottish Government Health and Social Care Directorates, Engineering and Physical Sciences Research Council, Economic and Social Research Council, Health and Social Care Research and Development Division (Welsh Government), Medical Research Council, National Institute for Health and Care Research, Natural Environment Research Council, Public Health Agency (Northern Ireland), The Health Foundation, and Wellcome. The funder had no role in the conceptualization, design, data collection, analysis, decision to publish, or preparation of the study.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Gen AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Footnotes
1.^Other architectures were considered in this work. However, we opted for DistilBERT as our deep learning model as it is substantially smaller than BERT while retaining strong performance.
2.^The authors wish to note that the decision to use three epochs reflected a computational trade-off instead of an exhaustive search for optimal performance.
References
Adily A. Karystianis G. Butler T. (2021). Text Mining Police Narratives to Identify Types of Abuse and Victim Injuries in Family and Domestic Violence Events (No. 630). Canberra, ACT: Australian Institute of Criminology. doi: 10.52922/ti04923
Arhiliuc C. Guns R. (2023). “Content-based classification of research articles: comparing keyword extraction, BERT, and random forest classifiers,” in ISSI 2023: The 19th International Conference of the International Society for Scientometrics and Informetrics (ISSI2023), 2–5 July, 2023 (Bloomington, IN), 4363.
Barocas S. Hardt M. Narayanan A. (2023). Fairness and Machine Learning: Limitations and Opportunities. Cambridge, MA: MIT Press.
Barrow-Grint K. Sebire J. Turton J. Weir R. (2022). Policing Domestic Abuse: Risk, Policy, and Practice. Oxfordshire: Routledge. doi: 10.4324/9781003137412
Beigman-Klebanov B. Beigman E. (2010). “Some empirical evidence for annotation noise in a benchmarked dataset,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, eds. R. Kaplan, J. Burstein, M. Harper, and G. Penn (Association for Computational Linguistics), 438–446. Available online at: https://aclanthology.org/N10-1067/ (Accessed August 13, 2025).
Birks D. Coleman A. Jackson D. (2020). Unsupervised identification of crime problems from police free-text data. Crime Sci. 9:18. doi: 10.1186/s40163-020-00127-4
Bland M. (2020). “Algorithms can predict domestic abuse, but should we let them?” in Policing in the Era of AI and Smart Societies, eds. H. Jahankhani, B. Akhgar, P. Cochrane, M. Dastbaz, and C. Sierra (Cham: Springer International Publishing), 139–155. doi: 10.1007/978-3-030-50613-1_6
Blom N. Obolenskaya P. Phoenix J. Pullerits M. (2024). Physical and emotional impacts of intimate partner violence and abuse: distinctions by relationship status and offence type. J. Fam. Viol. doi: 10.1007/s10896-024-00786-w
Breiman L. (2001). Random forests. Mach. Learn. 45, 5–32. doi: 10.1023/A:1010933404324
Butcher B. Zilka M. Hron J. Cook D. Weller A. (2024). Optimising human-machine collaboration for efficient high-precision information extraction from text documents. ACM J. Respons. Comput. 1, 1–27. doi: 10.1145/3652591
Chawla N. V. Bowyer K. W. Hall L. O. Kegelmeyer W. P. (2002). SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357. doi: 10.1613/jair.953
Chow C. (1970). On optimum recognition error and reject tradeoff. IEEE Trans. Inf. Theory 16, 41–46. doi: 10.1109/TIT.1970.1054406
Cormen T. H. Leiserson C. E. Rivest R. L. Stein C. (2022). Introduction to Algorithms. Cambridge, MA: MIT Press.
Devlin J. Chang M.-W. Lee K. Toutanova K. (2019). “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1: Long and Short Papers, eds. J. Burstein, C. Doran, and T. Solorio (Minneapolis, MN: Association for Computational Linguistics), 4171–4186.
Domestic Abuse Act, c. 17, Part 1 (2021). Available online at: https://www.legislation.gov.uk/ukpga/2021/17/part/1 (Accessed August 15, 2021).
Firth J. (1957). A synopsis of linguistic theory, 1930–1955. Stud. Linguist. Anal. 10–32.
Frenay B. Verleysen M. (2014). Classification in the presence of label noise: a survey. IEEE Trans. Neural Netw. Learn. Syst. 25, 845–869. doi: 10.1109/TNNLS.2013.2292894
Geifman Y. El-Yaniv R. (2019). “SelectiveNet: a deep neural network with an integrated reject option,” in Proceedings of the 36th International Conference on Machine Learning, 2151–2159. Available online at: https://proceedings.mlr.press/v97/geifman19a.html (Accessed August 13, 2025).
Géron A. (2022). Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media, Inc.
Grimmer J. Roberts M. E. Stewart B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press. Available online at: https://go.exlibris.link/shrJy0z6 (Accessed August 11, 2025).
He H. Garcia E. A. (2009). Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21, 1263–1284. doi: 10.1109/TKDE.2008.239
Hickman L. Thapa S. Tay L. Cao M. Srinivasan P. (2022). Text preprocessing for text mining in organizational research: review and recommendations. Organ. Res. Methods 25, 114–146. doi: 10.1177/1094428120971683
HMICFRS (2015). Increasingly Everyone's Business: A Progress Report on the Police Response to Domestic Abuse, 151. Available online at: https://assets-hmicfrs.justiceinspectorates.gov.uk/uploads/increasingly-everyones-business-domestic-abuse-progress-report.pdf (Accessed August 15, 2025).
HMICFRS (2019). The Police Response to Domestic Abuse: An Update Report, 58. Available online at: https://assets-hmicfrs.justiceinspectorates.gov.uk/uploads/the-police-response-to-domestic-abuse-an-update-report.pdf (Accessed August 7, 2025).
Home Office (2025). Crime Recording Rules for Frontline Officers and Staff, 92. Available online at: https://assets.publishing.service.gov.uk/media/67ee9b2a199d1cd55b48c769/crime-recording-rules-for-frontline-officers-and-staff-2025_26-april-2025-update.pdf (Accessed August 5, 2025).
Karystianis G. Adily A. Schofield P. W. Greenberg D. Jorm L. Nenadic G. et al. (2019). Automated analysis of domestic violence police reports to explore abuse types and victim injuries: text mining study. J. Med. Internet Res. 21:e13067. doi: 10.2196/13067
Kaufman S. Rosset S. Perlich C. Stitelman O. (2012). Leakage in data mining: formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 6, 1–21. doi: 10.1145/2382577.2382579
Kohavi R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 14, 1137–1145.
Leitgöb H. Prandner D. Wolbring T. (2023). Editorial: Big data and machine learning in sociology. Front. Sociol. 8:1173155. doi: 10.3389/fsoc.2023.1173155
Levenshtein V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics-Doklady 10, 707–710.
Long J. Harvey H. (2020). Femicide Census: Annual Report on UK Femicides 2018, 45. NIA. Available online at: https://femicidescensus.org/wp-content/uploads/2020/02/Femicide-Census-Report-on-2018-Femicides-.pdf (Accessed August 11, 2025).
Loshchilov I. Hutter F. (2019). Decoupled weight decay regularization. arXiv [Preprint] arXiv:1711.05101. doi: 10.48550/arXiv.1711.05101
Lum K. Isaac W. (2016). To predict and serve? Significance 13, 14–19. doi: 10.1111/j.1740-9713.2016.00960.x
Mikolov T. Yih W. Zweig G. (2013). “Linguistic regularities in continuous space word representations,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, eds. L. Vanderwende, H. Daumé III, and K. Kirchhoff (Association for Computational Linguistics), 746–751. Available online at: https://aclanthology.org/N13-1090/ (Accessed August 14, 2025).
Molina M. Garip F. (2019). Machine learning for sociology. Annu. Rev. Sociol. 45, 27–45. doi: 10.1146/annurev-soc-073117-041106
Morris J. X. Lifland E. Yoo J. Y. Grigsby J. Jin D. Qi Y. (2020). TextAttack: a framework for adversarial attacks, data augmentation, and adversarial training in NLP. arXiv [Preprint] arXiv:2005.05909. doi: 10.18653/v1/2020.emnlp-demos.16
Myhill A. Hohl K. (2019). The “golden thread”: coercive control and risk assessment for domestic violence. J. Interpers. Violence 34, 4477–4497. doi: 10.1177/0886260516675464
Neubauer L. Straw I. Mariconti E. Tanczer L. M. (2023). A systematic literature review of the use of computational text analysis methods in intimate partner violence research. J. Fam. Viol. 38, 1205–1224. doi: 10.1007/s10896-023-00517-7
Office for National Statistics (2024). Domestic Abuse Prevalence and Trends, England and Wales. Available online at: https://www.ons.gov.uk/peoplepopulationandcommunity/crimeandjustice/articles/domesticabuseprevalenceandtrendsenglandandwales/yearendingmarch2024 (Accessed August 15, 2025).
Pedregosa F. Varoquaux G. Gramfort A. Michel V. Thirion B. Grisel O. et al. (2011). Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Phoenix J. (2023). Improving police data collection to measure repeat demand: a focus on domestic violence and abuse. Policing J. Policy Pract. 17:paad022. doi: 10.1093/police/paad022
Salton G. Buckley C. (1988). Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24, 513–523. doi: 10.1016/0306-4573(88)90021-0
Sanh V. Debut L. Chaumond J. Wolf T. (2020). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv [Preprint] arXiv:1910.01108. doi: 10.48550/arXiv.1910.01108
Song G. Ye Y. Du X. Huang X. Bie S. (2014). Short text classification: a survey. J. Multimed. 9, 635–643. doi: 10.4304/jmm.9.5.635-643
Turner E. Medina J. Brown G. (2019). Dashing hopes? The predictive accuracy of domestic abuse risk assessment by police. Br. J. Criminol. 59, 1013–1034. doi: 10.1093/bjc/azy074
Uysal A. K. Gunal S. (2014). The impact of preprocessing on text classification. Inf. Process. Manag. 50, 104–112. doi: 10.1016/j.ipm.2013.08.006
Victor B. G. Perron B. E. Sokol R. L. Fedina L. Ryan J. P. (2021). Automated identification of domestic violence in written child welfare records: leveraging text mining and machine learning to enhance social work research and evaluation. J. Soc. Social Work Res. 12, 631–655. doi: 10.1086/712734
Wang J. Wang Z. Zhang D. Yan J. (2017). “Combining knowledge with deep convolutional neural networks for short text classification,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, ed. C. Sierra (Washington, DC: Association for the Advancement of Artificial Intelligence), 2915–2921. doi: 10.24963/ijcai.2017/406
Wei J. Zou K. (2019). EDA: easy data augmentation techniques for boosting performance on text classification tasks. arXiv [Preprint] arXiv:1901.11196. doi: 10.18653/v1/D19-1670
Weir R. (2024). Differentiating risk: the association between relationship type and risk of repeat victimization of domestic abuse. Policing J. Policy Pract. 18, 1–12. doi: 10.1093/police/paae024
Whitehead S. (2024). The Quality of Police Recorded Crime Statistics for England and Wales (Systematic Review Programme). Office for Statistics Regulation, 1–53. Available online at: https://osr.statisticsauthority.gov.uk/wp-content/uploads/2024/05/OSR_police_recorded_crime_quality_review-1-1.pdf (Accessed August 7, 2025).
Wolf T. Debut L. Sanh V. Chaumond J. Delangue C. Moi A. et al. (2020). “Transformers: state-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, eds. Q. Liu and D. Schlangen (Association for Computational Linguistics), 38–45. doi: 10.18653/v1/2020.emnlp-demos.6
Zhang D. Sensoy M. Makrehchi M. Taneva-Popova B. Gui L. He Y. (2023). “Uncertainty quantification for text classification,” in Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, eds. H. Chen, W. E. Duh, H. Huang, M. P. Kato, J. Mothe, and B. Poblete (New York, NY: Association for Computing Machinery), 3426–3429. doi: 10.1145/3539618.3594243
Zimmermann R. M. Allin S. Zhang L. (2023). “Common errors in machine learning projects: a second look,” in Proceedings of the 23rd Koli Calling International Conference on Computing Education Research, eds. A. Mühling and I. Jormanainen (New York, NY: Association for Computing Machinery), 1–12. doi: 10.1145/3631802.3631808
Summary
Keywords
natural language processing, police recorded crime, domestic violence (DV), text classification, supervised machine learning, DistilBERT, free text
Citation
Cook D, Weir R and Humphreys L (2025) Improving police recorded crime data for domestic violence and abuse through natural language processing. Front. Sociol. 10:1686632. doi: 10.3389/fsoc.2025.1686632
Received
15 August 2025
Accepted
27 October 2025
Published
24 November 2025
Volume
10 - 2025
Edited by
Katie Smith, University of Bristol, United Kingdom
Reviewed by
Dragan Stoll, ZHAW Zurcher Hochschule fur Angewandte Wissenschaften Departement Soziale Arbeit, Switzerland
Md. Aminul Islam, Ulster University - London Campus, United Kingdom
Copyright
© 2025 Cook, Weir and Humphreys.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Darren Cook, darren.cook@city.ac.uk