The untapped potential of clinical free-text: understanding privacy risk in electronic health records
Arlene Casey, Dunhill Medical Trust Research Fellow | Principal NLP Data Scientist
At Scotland’s Health Research and Innovation Conference, I recently gave a presentation (awarded best talk) to delve into our Scottish Safe Haven Network (SSHN) collaboration addressing the challenges of creating an access pathway to clinical free-text.
Clinical free-text data (typed notes from medical practitioners) is a treasure trove that constitutes 70% to 80% of health information, but often remains inaccessible for research purposes. Through our efforts, we aim to revolutionise health research, and ultimately patient care, by creating a pathway for data access.
Privacy Concerns: a major hurdle for access to clinical free-text
Despite the wealth of information embedded in unstructured clinical reports, the majority of health research relies on structured data.
One of the significant hurdles in providing clinical free-text for research lies in privacy concerns. Traditionally, manual redaction processes have been required to ensure patient confidentiality. However, these processes are time-consuming, and therefore may not able to be undertaken at the scale required or always completed within standard research project timescales.
Our work addresses this critical issue head-on, focussing on the removal of identifiers – both direct (names, dates of birth) and indirect (potentially revealing too much information, or an accumulation of information, that increases the risk of a patient being identified, such as where a medical event took place and/or detailed family information) – to pave the way for more efficient and secure access to unstructured data.
Understanding Public Perspectives
Public perceptions around using free-text data for research and managing privacy risks is integral to our work and is a foundational element of our approach.
We combined insights across our engagement work to provide a nuanced understanding of participant deliberations concerning free-text patient data and Trusted Research Environment (TRE) reliability. These public perspectives have fed directly into our approach to understanding and identifying privacy risks in free-text.
Direct Privacy Risk: a collaborative approach
Whilst there is existing research in AI models to address direct risks, often these models are not publicly available or are built on non-UK based data meaning they do not work well when applied to NHS data. The SSHN has worked together to propose a national standard on what direct identifiers are and are labelling a large set of hospital records that will be used to train a de-identification model to work across the Scottish population.
Indirect Privacy Risks: a novel approach
In a first-of-its-kind effort, our project has explored indirect privacy risks within clinical free-text — an area largely unexplored in previous research. By extracting one year's worth of hospital discharge summaries from major NHS Lothian hospitals, the team applied innovative Natural Language Processing (NLP) approaches to identify and understand potential privacy risks associated with indirect identifiers.
Through meticulous review and analysis, the team identified six main categories of indirect privacy risks: Unique Medical Events, Social, Locations and Organisations, Mental Health, Police, and Crime. The occurrence of these risks was not isolated: often, where one category surfaced within a sentence, a cascade of others could be found in subsequent reports. The age range of patients and the frequency of hospital attendance emerged as crucial factors influencing the cumulative risk of identifiability. This comprehensive categorisation underscores the necessity of extending privacy considerations beyond the scope of current research on direct identifiers.
From Data Extraction to Visualisation: a comprehensive approach
The team developed a prototype dashboard aiming to revolutionise the process at the data access stage, effectively de-risking free-text when it is shared for research purposes. The prototype helped specialists explore how they could apply a proportionate-based risk approach to better understand data privacy risks and make more-informed decisions.
The team continue to refine the prototype dashboard. Future iterations will encapsulate and track the de-risk stages, incorporating valuable insights from public engagement to ensure that semi-automated tools maintain a balance with human checks and audit trails.
The Road Ahead: bridging a gap in health data access for research
As the work successfully achieves its aims — exploring NLP-based approaches to understand risk, developing a prototype dashboard, and understanding public perspectives — we envision a future where clinical free-text is a valuable resource made securely accessible for innovative health research leading to improved patient care.
Our initiative not only breaks down barriers in accessing unstructured health data but also sets a precedent for future research endeavours seeking to harness the full potential of clinical notes in an ethical and informed manner.
Acknowledgement
This work would not have been possible without funding from DARE UK and Research Data Scotland as well as our collaborative partners across the Scottish Safe Haven Network and our academic partners at the universities of Edinburgh, Aberdeen, and Sussex.