Our Natural Language Processing (NLP) work is focussed on gaining further value from free-text health and social care records while preserving confidentiality.

Our team, led by Dr Arlene Casey, aims to deliver practical insights that drive transformative improvements in health and care services. At the same time, we aim to secure efficiencies within health and social care systems, paving the way for more effective solutions for the benefit of all.

Graphic showing examples of direct identifiers - e.g. medical record numbers, telephone numbers - and indirect identifiers - e.g. rare diseases, arrest details, family relationships. There's also the statement: Indirect identifiers are rare and hard to find, and not well defined.

The current position

Health data can be seen as either structured or unstructured. Structured data have a standard format – such as dates of birth and codes for conditions and treatments – and are currently the foundation for research once they have been de-identified. Unstructured data, which make up the majority of health data, include free-text notes written by clinicians. There is significant additional value in unstructured data, since they hold more nuanced details for understanding disease that cannot be found in structured data. However, unstructured data often contain information that can make someone identifiable, such as names and details about family members.

Before allowing researchers to access data extracts, we must ensure confidentiality, which involves addressing the privacy risks found in clinical notes. These risks go beyond direct identifiers such as names and dates, and can include what we call indirect identifiers, such as family details, the locations of accidents, or possibly details of crimes.

Currently, due to the potential identifiability of individuals through free-text, no such extracts are accessed by researchers through the DataLoch service.

Our core NLP priorities

Our NLP programme of work has two main objectives:

1. Establishing Mechanisms for Researcher Access: Developing processes that enable DataLoch to provide researchers with secure access to clinical free-text data while ensuring robust measures are in place to maintain confidentiality, through redaction or omission of direct and indirect identifiers.

2. Producing Structured Data from Clinical Notes: Developing methods to transform unstructured clinical notes into structured data to support research, such as by coding conditions, symptoms or lifestyle factors, so that researchers do not need to access free text.
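
As a rough illustration of the two objectives above, the following minimal Python sketch redacts a few direct identifiers with regular expressions and then maps phrases in a note to structured labels. It is a toy example under stated assumptions, not DataLoch's method: the function names, the patterns, the sample note and the term-to-label dictionary are all hypothetical. Real de-identification needs far more than pattern matching, particularly for indirect identifiers, which are not well defined.

```python
import re

# Hypothetical patterns for a few direct identifiers (illustrative only).
DIRECT_IDENTIFIER_PATTERNS = {
    "PHONE": re.compile(r"\b0\d{3}[ -]?\d{3}[ -]?\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "RECORD_NUMBER": re.compile(r"\bMRN[: ]?\d{6,10}\b"),
}

# Hypothetical phrase list with made-up labels; a real project would
# map to an established clinical coding scheme instead.
CONDITION_TERMS = {
    "type 2 diabetes": "CONDITION_DIABETES_T2",
    "low mood": "SYMPTOM_LOW_MOOD",
    "smoker": "LIFESTYLE_SMOKING",
}


def redact_direct_identifiers(note: str) -> str:
    """Replace each matched direct identifier with a bracketed label."""
    for label, pattern in DIRECT_IDENTIFIER_PATTERNS.items():
        note = pattern.sub(f"[{label}]", note)
    return note


def code_conditions(note: str) -> list[str]:
    """Return sorted labels for every listed phrase found in the note.

    Plain substring matching ignores negation and context
    (for example, "non-smoker" would still match "smoker").
    """
    lowered = note.lower()
    return sorted(code for term, code in CONDITION_TERMS.items() if term in lowered)


if __name__ == "__main__":
    sample = (
        "Seen 03/02/2021, MRN 12345678. Type 2 diabetes, "
        "reports low mood. Call 0131 555 0199."
    )
    print(redact_direct_identifiers(sample))
    print(code_conditions(sample))
```

The second function reflects the idea behind objective 2: if conditions, symptoms or lifestyle factors can be coded reliably, researchers can work with the structured output and never need to see the free text itself.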

The team is actively working on these priorities. We are open to further collaborations to develop NLP solutions that extract relevant information from free text and convert it into structured data.

Initial NLP projects and collaborations

Later-Life: The provision of effective care in later life is a major challenge for existing health and social care services. The complexity of needs and multiple conditions experienced by older people, as well as their diverse social contexts, are vital for understanding health and care in later life. However, this information is mostly recorded in clinical free text (GP/inpatient notes and letters/discharge summaries). We are developing methods to extract useful information from clinical notes that can be used securely for research studies or for understanding health and care service needs.

Leveraging NLP for Breast Cancer Patient Care: With increasing pressure on the NHS, there is a need to reduce inefficiencies. One example is the significant clinician time required to gather relevant patient data prior to cancer-clinic appointments. To address this issue, we will develop methods to extract information from clinical notes about a patient's medical history, treatments and side effects during breast cancer treatment. This will be used in further research with the Edinburgh Cancer Informatics Team (CIT) to develop summaries of care for both patients and clinicians.

AMBER: Antidepressant Medications: Biology, Exposure & Response: Use of health data recorded throughout patient journeys provides valuable insights into how symptoms and response to treatments vary between individuals. By using health data, we can improve our understanding of depression and treatment response. Information about treatment response is often recorded in clinical notes and, through collaboration with the AMBER team, we are developing NLP methods to extract and code these data. Through DataLoch, this coded information will then be used by researchers as part of the wider AMBER project that aims to improve the lives of individuals living with depression.

If you are interested in collaborating around our NLP priorities, please contact the team.

Connect with Us