Linked health records data

Information about the linked health records data in the Our Future Health resource, including the scope and structure of these data and how the data were generated and processed.

Linked health records data are provided to Our Future Health through a data sharing agreement with NHS England (NHSE). NHSE routinely provides data products for research through its Data Access Request Service (DARS), provided applications have passed a strict review and approvals process. These data products use data that have been provided by patients and collected by the NHS as part of their care and support. The data are collated, maintained and quality assured by the National Disease Registration Service, which is part of NHS England.


The linkage process

How does Our Future Health access linked health data?

When joining Our Future Health, participants provide consent for Our Future Health to hold information from their health records. Our Future Health has a Data Sharing Agreement (DSA) with NHS England (NHSE) which enables us to request pseudonymised health records for consented participants and link them to other data we hold. Further details about the contents of the DSA, including purpose statements, data sets and data releases are publicly available to view by searching for ‘Our Future Health’ on NHS England's Data Uses Register (external link).

Which participants are eligible for linkage?

We send participant details to NHS England for linkage every quarter. We only send details for participants who have completed and submitted a baseline health questionnaire. Due to the time required to prepare and send participant information to NHSE, complete the linkage process and receive linked data extracts back from NHSE, the participants who joined Our Future Health most recently will not yet be linked. The linked health records data set available in the Trusted Research Environment is therefore approximately one release behind the questionnaire data. For specific information on the latest data available see the Data release page.

How are participants linked to their health records?

Our Future Health sends participant information to NHS England through the Patient Validation Engine (external link) (PAVE), the NHSE cohort management system. PAVE is used to link participants to their NHS records. Our Future Health uploads the details for the eligible set of participants to PAVE to allow for validation, tracing and retention of those participants’ details for future linkage to NHSE data sets listed within our Data Sharing Agreement. Records are returned to Our Future Health every 3 months.

Our Future Health uploads the participant’s given name, family name, sex, date of birth and postcode to PAVE. We do not collect NHS Numbers from participants. Sex is based on sex registered at birth provided by participants in the questionnaire. Due to differences between the categories collected by Our Future Health and those used for linkage via PAVE, all responses other than ‘female’ or ‘male’ are submitted as NULL. Other details are provided by participants at registration to join Our Future Health.
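The recoding of sex for linkage described above can be pictured as a simple mapping. The following is a minimal sketch only: the function name, the whitespace and case normalisation, and the example response used in the usage note are assumptions for illustration, not Our Future Health's pipeline code.

```python
from typing import Optional

# Categories accepted for linkage, per the description above.
LINKAGE_SEX_VALUES = {"female", "male"}


def sex_for_linkage(questionnaire_response: Optional[str]) -> Optional[str]:
    """Return 'female' or 'male' (normalised); map anything else to None (NULL)."""
    if questionnaire_response is None:
        return None
    value = questionnaire_response.strip().lower()
    return value if value in LINKAGE_SEX_VALUES else None
```

For example, `sex_for_linkage("Female")` returns `"female"`, while a hypothetical response such as `"Prefer not to say"` returns `None`.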

NHSE links participants to their health records using the Master Person Service (MPS) (external link) which matches participants to records in the Personal Demographics Service (PDS) (external link). The PDS is a database of all people in England and Wales who have ever registered with a GP or interacted with the NHS.

If a person is linked, their pseudonymised health records are returned to Our Future Health on a quarterly basis. Name, date of birth, sex and address are not returned as part of the NHS records. Records are only returned with a pseudonymised identifier, which allows researchers to link this data to other Our Future Health data sets.

Will all participants be linked to their health records?

We are currently only linking participants to health records held by NHS England, so only participants who have interacted with the NHS in England can be successfully linked to their NHS number.

In addition, participants may not be linked if the combination of details provided to NHSE does not match any record within the PDS or does not meet the linkage criteria specified by the MPS algorithm. The MPS linkage algorithm first attempts a deterministic match of participant details to PDS records. If no exact match is found, a score-based method is used to rank the best-matching records, if any. If there are multiple, equally ranked candidate records, no match is returned. More information on the linkage criteria used by PAVE and NHSE can be found in the MPS User Guidebook (external link).
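The three-step logic described above (exact match, then score-based ranking, then no match on ties) can be sketched as follows. This is an illustration only and not the MPS algorithm: the field names, the scoring rule (a count of agreeing fields) and the `min_score` threshold are all hypothetical. The real criteria are set out in the MPS User Guidebook.

```python
from typing import Optional, Sequence

# Details uploaded to PAVE, per the text above (field names are illustrative).
MATCH_FIELDS = ("given_name", "family_name", "sex", "date_of_birth", "postcode")


def agreement_score(query: dict, candidate: dict) -> int:
    """Hypothetical score: how many match fields agree exactly."""
    return sum(query.get(f) == candidate.get(f) for f in MATCH_FIELDS)


def link_participant(query: dict, pds_records: Sequence[dict],
                     min_score: int = 4) -> Optional[dict]:
    """Return the single best-matching record, or None if there is no clear match."""
    # Step 1: deterministic match on every field.
    exact = [r for r in pds_records
             if agreement_score(query, r) == len(MATCH_FIELDS)]
    if len(exact) == 1:
        return exact[0]

    # Step 2: score-based ranking of candidates at or above a hypothetical threshold.
    scored = [(agreement_score(query, r), r) for r in pds_records]
    scored = [(s, r) for s, r in scored if s >= min_score]
    if not scored:
        return None

    # Step 3: equally ranked top candidates mean no match is returned.
    best = max(s for s, _ in scored)
    top = [r for s, r in scored if s == best]
    return top[0] if len(top) == 1 else None
```

In this sketch, a participant whose postcode has changed can still be linked if exactly one record agrees on the other four fields, whereas two records that agree equally well produce no link.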

As of the most recent release, we have attempted linkage to health records data for 1,781,135 participants. Of these, 1,703,250 (95.6%) were successfully linked to an NHS number. For more details on the latest linkage statistics see the Data release page.

Data content and processing

What information does the linked health records data contain?

For information on the data sets and fields included in the linked health records data, please refer to the Data release page.

How is the linked health records data organised in the Trusted Research Environment (TRE)?

Each data set is labelled as a separate entity. The data release includes 9 entities, organised as follows:

  • Hospital Episode Statistics

    • nhse_eng_inpat (Admitted Patient Care)

    • nhse_eng_ed (Accident and Emergency)

    • nhse_eng_outpat (Outpatients)

    • nhse_eng_ecds (Emergency Care Dataset)

  • Civil Registrations of Death

    • nhse_engwal_deaths

  • National Disease Registration Service Cancer Data (NDRS)

    • nhse_eng_canpat (Cancer Pathways)

    • nhse_eng_canreg_pattumour (Cancer Registry Patient Tumour)

    • nhse_eng_canreg_treat (Cancer Registry Cancer Treatment)

  • Linked Participants

    • participant_nhs_linked

See the Dataset Descriptions for more information on each entity.

How do I interpret the file structure of the linked data sets?

All the linked data sets use the PID field as the global key, and all data sets can be joined using this field. Each data set also contains a unique row-level identifier to use in combination with the global key. See the data dictionary for information on the global keys and unique row identifiers.
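As an illustration of joining on PID, the sketch below performs an inner join between two small in-memory tables using only the Python standard library. The rows, the PID values and the date field names are made up for the example, and the tooling available in the TRE may offer more convenient ways to join tables.

```python
from collections import defaultdict

# Illustrative rows only; real entities contain many more fields.
inpat_rows = [
    {"PID": "P001", "ADMIDATE": "2019-03-02"},
    {"PID": "P001", "ADMIDATE": "2021-07-15"},
    {"PID": "P002", "ADMIDATE": "2020-01-09"},
]
outpat_rows = [
    {"PID": "P001", "APPTDATE": "2019-04-01"},
    {"PID": "P003", "APPTDATE": "2022-02-11"},
]


def group_by_pid(rows):
    """Index rows by their PID so lookups do not rescan the whole table."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["PID"]].append(row)
    return grouped


def join_on_pid(left_rows, right_rows):
    """Inner join: one (left, right) pair per matching row combination."""
    right_by_pid = group_by_pid(right_rows)
    return [(left, right)
            for left in left_rows
            for right in right_by_pid.get(left["PID"], [])]


pairs = join_on_pid(inpat_rows, outpat_rows)
```

Because a participant can have several rows in each data set, a join on PID alone is many-to-many, so the number of joined rows can exceed the number of rows in either input.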

How do we process the data for each release?

We conducted minimal processing to create the data for the current release. We filtered the fields and years to those selected for release. The data from all years selected for release are combined into a single file for each data set which is available in the TRE. For further information on the years and fields in the current release see the Data release page and data dictionary on the Data and cohort page of our website (external link).

We conducted validation on variable format and coding against the specifications in the HES data dictionary (external link) and NDRS data dictionary (external link). We also excluded some data based on data quality criteria. More information on exclusions and any changes to the data are noted on the Data release page, in the section: 'What should I be aware of when working with the linked health records data in this release?'.

How do we de-identify the data to minimise risks of identifying participants?

The pseudonymised participant identifiers that we receive from NHSE are replaced with a new set of participant identifiers (PIDs). Our Future Health does not receive any directly identifiable information from NHSE, including name, date of birth, sex or address.

We assessed the re-identification risk for all fields planned for release. Before releasing the data, we suppressed certain sensitive values if they were present for fewer than 10 records. These data were replaced with a suppression code to differentiate them from missing values. For a table outlining the fields to which we applied value suppression see the de-identification section on the Data release page.
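The small-number suppression described above can be sketched as follows. This is a simplified illustration under stated assumptions: the suppression code `"SUPPRESSED"` is a placeholder (the actual codes are listed on the Data release page), and the real process applies only to certain sensitive values in selected fields rather than to every value in a column.

```python
from collections import Counter

SUPPRESSION_THRESHOLD = 10        # from the text: fewer than 10 records
SUPPRESSION_CODE = "SUPPRESSED"   # hypothetical placeholder code


def suppress_rare_values(values, threshold=SUPPRESSION_THRESHOLD,
                         code=SUPPRESSION_CODE):
    """Replace values seen in fewer than `threshold` records with `code`.

    Missing values (None) are left as None so that they remain
    distinguishable from suppressed values.
    """
    counts = Counter(v for v in values if v is not None)
    return [code if v is not None and counts[v] < threshold else v
            for v in values]
```

Keeping the suppression code distinct from the missing-value marker lets an analyst tell "value withheld" apart from "value never recorded".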

The release does not include healthcare provider names or codes. From Release 8, we provide a pseudonymised version of provider codes instead. We used the NHS Organisation Data Service API to generate pseudonymised codes for a current list of healthcare providers and the Archived Closed Organisation data set to obtain codes for closed providers. Any healthcare provider that is not listed in the API or the archive is labelled as 'Unknown'. Fields containing pseudonymised codes carry the '_OFH' suffix to indicate that the field was created by Our Future Health.

More information on specific de-identification steps applied in the current release can be found on the Data release page, including a table of numeric codes and textual labels for suppressed values.

How do I interpret the field names and codes?

We have retained the field names and codes from the original NHSE HES and NDRS data sets. Any fields derived by Our Future Health contain the '_OFH' suffix in the field name. Researchers may refer to the HES documentation (external link), ECDS documentation (external link) and NDRS documentation (external link) if needed. For information on which data dictionary version was used in creating the latest release see the Data release page.

Some variable names appear in more than one linked health records data set. For example, MAINSPEF (main speciality) appears in both the nhse_eng_inpat and nhse_eng_outpat entities. To check that the meaning is consistent across entities, refer to the data dictionary and corresponding coding file on the Data and cohort page of our website (external link).

We aim to keep column content and format consistent across releases. Any change in column content or format resulting from new derivations or changes in the data model will be communicated and documented on the release page. When using the data, be sure to reference columns by name and not index.
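The advice to reference columns by name rather than by index can be illustrated with a short sketch. It assumes, purely for illustration, two CSV-style extracts of the same data set whose column order differs between releases. The rows are made up, and the file format available in the TRE may differ.

```python
import csv
import io

# Two hypothetical extracts of the same data set with a different column order.
release_a = "PID,MAINSPEF,ADMIDATE\nP001,100,2019-03-02\n"
release_b = "PID,ADMIDATE,MAINSPEF\nP001,2019-03-02,100\n"


def main_specialty_by_name(text):
    """Robust: looks the column up by its header name."""
    reader = csv.DictReader(io.StringIO(text))
    return [row["MAINSPEF"] for row in reader]


def main_specialty_by_index(text):
    """Fragile: assumes MAINSPEF is always the second column."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [row[1] for row in reader]
```

Here the name-based lookup returns the same values for both extracts, while the index-based lookup silently returns admission dates for the second one.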

Each dataset contains an integer field called ROW_ID. This field is generated to create unique row identifiers within the TRE. ROW_ID is solely generated to facilitate uploading the data to the TRE and therefore should not be used for analytical purposes.


Linked data set descriptions

HES Admitted Patient Care (APC)

Admitted Patient Care (APC) contains administrative information about episodes of care where a participant is admitted into hospital, including regular day or night attendances in England. Details include dates and methods of admission and discharge, the main and treatment specialities of the consultant responsible for the patient during the episode, recorded diagnoses, and types of operations and associated dates. Electronic patient records are not included as part of the APC data set.

HES Accident & Emergency (A&E)

Accident and Emergency (A&E) contains administrative information about attendances recorded at major A&E departments, single specialty A&E departments, walk-in centres and minor injury units in England. Details include dates and times of arrival, initial treatment and departure, source of referral and attendance disposal, investigations carried out and treatments provided. This resource was retired on 31 March 2020.

HES Outpatient

Outpatient (OP) contains administrative information about outpatient appointments in England. Details include the appointment date, the source of referral, whether the patient attended, the main and treatment specialities of the consultant responsible for the patient and the types of procedures undertaken.

HES Emergency Care Dataset (ECDS)

The Emergency Care Dataset (ECDS) contains administrative information about attendances recorded at major A&E departments, single specialty A&E departments, walk-in centres and minor injury units in England. Compared to HES Accident & Emergency (A&E), ECDS contains more information on the complexity and acuity of the attending patient, more granular diagnostic information, and more details on resource usage and costings. This resource is a continuation of the HES A&E dataset. Data in the ECDS dataset are available from 1 April 2020 to the present. ECDS contains the most recent records on emergency care attendances, including provisional data. All diagnostic and investigative fields are coded using SNOMED-CT.
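Because HES A&E was retired on 31 March 2020 and ECDS is available from 1 April 2020, an analysis spanning both periods needs to draw on both entities. The sketch below shows one way to decide which entity is expected to hold an attendance on a given date. The function name is hypothetical, and the years actually included in a release should be checked on the Data release page.

```python
from datetime import date

HES_AE_LAST_DAY = date(2020, 3, 31)  # HES A&E retired on this date
ECDS_FIRST_DAY = date(2020, 4, 1)    # ECDS available from this date


def emergency_entity_for(attendance_date: date) -> str:
    """Return the entity expected to hold an attendance on the given date."""
    if attendance_date <= HES_AE_LAST_DAY:
        return "nhse_eng_ed"
    return "nhse_eng_ecds"
```

The two entities use different coding schemes (ECDS diagnostic and investigative fields use SNOMED-CT), so combining them usually also requires harmonising codes, which this sketch does not attempt.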

Office for National Statistics Death Registration

The Death Registration data set includes death registration and mortality data, such as date of death and cause of death presented using ICD-10 codes.

NDRS Cancer Registry

Cancer Registry is a collated data set of all registrable tumours as defined by the National Cancer Registration and Analysis Service (NCRAS). The NCRAS is used to build a picture of a patient's treatment from diagnosis. The data include information on patient diagnosis, the tumour and any treatment events.

NDRS Cancer Pathways

Cancer Pathways contains a summary of patient pathways from diagnosis to treatment and follow-up. It covers pathway events for tumours diagnosed from 1 January 2013 onwards.

Linked Participants

This is a table derived by Our Future Health. The linked participants table (participant_nhs_linked) lists all the participants who were successfully linked to an NHS number. This table is available alongside any approved linked health record data set and will help in identifying which participants have been linked to an NHS number but have no recorded healthcare contacts in any of the data sets provided.

In the United Kingdom, anyone who is registered for care with the NHS is assigned an NHS number. The NHS number is assigned either at birth or when NHS care is first received. This number is valid for life and is only reassigned in specific circumstances such as adoption or gender reassignment. As a result, it is possible to be linked to an NHS number yet have no healthcare contact beyond primary care. For more information on NHS numbers see the NHS website.

The linked participants table is generated from a data product provided by NHSE which lists the total number of successfully linked participants. There are limitations to generating this table, including instances where a linked participant has a secondary care record but does not appear in the list of successfully linked participants. For more information on this issue and recommendations for working with the data see the section on known data issues.
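The two situations described above (linked participants with no records, and participants with records who are missing from the linked participants table) can both be found with set differences on PID. This is a minimal sketch. The function name and the PID values in the usage note are hypothetical.

```python
def linkage_gaps(linked_pids, pids_with_records):
    """Compare PIDs in participant_nhs_linked with PIDs seen in the data sets.

    `pids_with_records` is assumed to be the pooled PIDs from every
    approved linked data set the researcher has access to.
    """
    linked = set(linked_pids)
    with_records = set(pids_with_records)
    return {
        # Linked to an NHS number but no contact in any provided data set.
        "linked_no_records": sorted(linked - with_records),
        # Has a record but is absent from the linked table (known limitation).
        "records_not_in_linked_table": sorted(with_records - linked),
    }
```

For example, `linkage_gaps(["P001", "P002", "P003"], ["P001", "P004"])` reports P002 and P003 as linked with no records, and P004 as having a record without appearing in the linked table.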
