Release 10
Information about the data released on 20 March 2025
What data is included in Release 10?
All 1,594,707 participants are included in this release. Of those, 1,594,706 participants have completed and submitted the baseline questionnaire. For 651,031 of these individuals we have generated genotype array data. A total of 1,358,955 participants were successfully linked to an NHS number, of whom 1,295,748 have at least one secondary care or death registration record.
Participant data
The participant table includes information from 1,594,707 participants who have registered and consented to join the Our Future Health programme, and submitted a complete questionnaire on or before 21 January 2025.
Questionnaire data
Release 10 of the Questionnaire data includes 1,594,706 participants who have completed either v1, v2, v2.1 or v2.2 of the Our Future Health baseline questionnaire. This includes participants who joined during the initial pilots from 2021 and after the main recruitment period began in October 2022.
participants who started the questionnaire on or after 24 May 2021 will have completed v1 of the questionnaire (N = 52,239 participants)
participants who started the questionnaire on or after 20 November 2022 will have completed v2 of the questionnaire (N = 728,881 participants)
participants who started the questionnaire on or after 21 December 2023 will have completed v2.1 of the questionnaire (N = 367,514 participants)
participants who started the questionnaire on or after 13 June 2024 will have completed v2.2 of the questionnaire (N = 446,072 participants)
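As a quick consistency check, the per-version counts listed above can be summed and compared with the Questionnaire data total. The sketch below uses only the figures quoted in this section; the variable names are illustrative.

```python
# Participant counts per questionnaire version, as quoted in the list above.
version_counts = {"v1": 52_239, "v2": 728_881, "v2.1": 367_514, "v2.2": 446_072}
questionnaire_total = 1_594_706  # participants in the Questionnaire data

total = sum(version_counts.values())
# Share of the questionnaire cohort on each version, in percent.
shares = {
    version: round(100 * n / questionnaire_total, 1)
    for version, n in version_counts.items()
}

print(f"sum of versions = {total:,}; matches total: {total == questionnaire_total}")
for version, share in shares.items():
    print(f"{version}: {share}%")
```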
Clinic measurements data
As of February 2024, over 1.3 million participants have attended an Our Future Health Clinic appointment. The current release includes a subset of 1,169,699 participants who have both completed and submitted a questionnaire and attended an appointment (both on or before 21 January 2025).
Genotype array data
The genotype data release contains information on 707,522 variants for 651,031 participants. This data was obtained using a custom Illumina Infinium Excalibur beadchip array, designed by Our Future Health in collaboration with Illumina.
Linked health records data
In total, we have attempted linkage to health records data for 1,413,393 participants who completed their questionnaire before 15 October 2024. Of these, 1,358,955 (96.1%) were successfully linked to an NHS number. 1,295,748 participants (95% of all linked participants) have at least one secondary care or death registration record in one or more of the linked health records data tables.
Linked health records data in this release includes only participants who completed their questionnaire before 15 October 2024 and therefore contains fewer participants than the current Questionnaire data release. This is due to the lag between the submission of participant details for linkage and the linked data being received, quality assured and processed.
Participant and Questionnaire data
What information does the Participant and Questionnaire data contain?
For details on what information is included in the Participant and Questionnaire data, see our Participant data and Questionnaire data pages. These pages cover how we:
de-identify data
manage re-identification risk
version control
tailor questionnaire journeys
store the data in the TRE
Due to slightly different timings of data exports, a small number of participants in the Participant data table may not have a corresponding record in the Questionnaire data table. There is one such participant in the current data release, whose questionnaire data is expected to be available in the next release.
What changes have been made as part of this release?
Up until Release 9, all participants in the Participant data table were also included in the Questionnaire data table. However, due to different timings in data exports, a small number of participants may be unique to the Participant table. Participants unique to the Participant data table will have completed and submitted a questionnaire, but their records will only be included in a future release.
There are no other changes in this release. Participants who have withdrawn from the programme have been removed from Release 10. Version v2.2 of the questionnaire remains the active live version.
What should I be aware of when working with the participant and questionnaire data in this release?
Technical data loss
A suspected system issue that occurred prior to October 2022 resulted in a small number of questionnaires submitted around that time having missing data for some questions. The missing data cannot be explained by errors in dynamic logic. We are analysing the impact and will provide further information in future releases.
Implausible age and year combinations
Responses to questions about age or year of birth are initially validated against the participant's date of birth. However, if a participant later updates their date of birth, their previous answers to these questions are not re-validated. As a result, some responses may become inconsistent with the updated date of birth. This issue affects only a small number of cases, but we plan to address it in a future release.
Updating responses to parent questions
Due to how data capture currently works, there are instances where participants update their response to a parent question which overwrites the previous one. However, the responses to downstream dynamic questions linked to the old parent response may persist. This can create a conflict with the expected logic. We are actively working on a solution to this issue, which impacts a very small percentage (less than 0.1%) of submissions across all versions.
An example of this issue occurs with sex-specific conditions. A small number of records show inconsistent combinations of self-reported sex and the corresponding sex-specific questions. This inconsistency may arise when participants change their answer to the question "What sex were you registered with at birth?" (DEMOG_SEX_1_1 or DEMOG_SEX_2_1) after they have already answered sex-specific questions. As a result, erroneous responses to questions intended for the opposite sex are retained, instead of being removed based on the updated questionnaire path.
Errors in questionnaire configuration
For comprehensive documentation on all historical bugs related to errors in the implementation of dynamic logic, please refer to Change log for Questionnaire versions. Please note that errors in logic may persist across releases, even after they have been fixed for the affected version.
Duplicate participants
We are aware that some individuals may have registered multiple times. Participants must register with a unique email address, but may otherwise enter the same personal information in multiple registrations, and we do not currently check identification documents or otherwise prevent this. Therefore, some participants may have signed up, submitted questionnaires or attended in-person appointments multiple times.
Individuals who have registered multiple times will have been assigned a unique participant identifier (PID) for each registration, and have not been excluded from the data release. Currently, it is not possible to identify these duplicates from the participant or questionnaire data alone, although researchers may assess whether questionnaire responses are largely or entirely identical between two records. Where an approved application includes linked data, lack of linkage or duplicate HES or other records can also facilitate exclusion of records of multiple registrations that have been linked to the same NHS number (see What should I be aware of when working with the linked health records data in this release?).
We are actively investigating this issue and aim to provide more information in future updates. This issue is estimated to affect approximately 0.5% of all records across datasets.
Updating records between releases
In exceptional cases, a participant’s record may appear to be modified between releases. For example, if a participant mistakenly completes a questionnaire intended for their partner, the incorrect record is deleted to allow the correct individual to submit their responses. Such cases are extremely rare, affecting fewer than 0.001% of records.
During preparations for Release 9, one questionnaire record was found to have been updated compared to previous releases. Our pipelines were adjusted to retain only the most recent record. However, due to a bug, the Release 9 record mistakenly contained a mix of old and new responses. This issue has been fully resolved in Release 10 with a scalable fix and affected only a single participant.
Clinic measurements data
What changes have been made as part of this release?
There are no other changes to this release. Participants who withdrew from the programme have been removed from Release 10. For information on how appointments are conducted, see Procedure for Clinic measurements.
What should I be aware of when working with the Clinic measurements data in this release?
Un-versioned updates to the appointments process
The current versioning approach applied to the Clinic Measurements data table includes only two major versions, which can be used to identify whether or not a participant had an appointment that included heart rhythm or third heart readings. Smaller updates to the appointments process are not captured by this versioning. These un-versioned updates include things such as:
introducing XS and XL blood pressure cuffs
changes to the order of measurements collected
addition of specific instructions for obtaining readings from pregnant individuals
For more details on versioning, please refer to the section on Change log for Clinic measurements appointment processes
Duplicate participants
As described in the Participant and Questionnaire data section, we are aware that some individuals may have registered multiple times. Participants must register with a unique email address, but may otherwise enter the same personal information in multiple registrations, and we do not currently check identification documents or otherwise prevent this.
We are also aware of a small number of instances where the same individual appears to have attended multiple in-person appointments under different registrations. In these cases, the same person may have multiple records, with different Participant IDs (PIDs), and with different Clinic Measurements data reflecting the distinct measurements taken at the different appointments. There may also be rare instances of technical errors in data capture which result in data from a single appointment being recorded under multiple PIDs (for example from successive appointments).
We are actively investigating this issue and aim to provide more information in future updates. Currently, it is not possible to identify these duplicates from the clinic measurements data alone. Where an approved application includes linked data, however, lack of linkage or duplicate HES or other records can also facilitate exclusion of records of multiple registrations that have been linked to the same NHS number (see What should I be aware of when working with the linked health records data in this release?). The issue is estimated to affect approximately 0.5% of all records across datasets.
Multiple measurements obtained for heart readings
During the original appointment process (version 1), the protocol for heart readings was to obtain only two measurements. However, in version 1, it was reported that clinicians occasionally took multiple readings and re-entered values for the first two measurements, attempting to achieve more typical results. To mitigate this, version 2 introduced the option for a third reading if abnormal measurements were recorded for the first two readings.
Missing data for third heart readings
Due to technical issues, software updates, or rare system failures, there may be isolated cases of data capture inconsistencies. As of appointment version 2, participants who have abnormal readings recorded for their first and second set of heart measurements are offered the opportunity to provide a third set of measurements, as described in the section Do all participants provide every measurement?
However, we note two exceptions:
criteria met but data missing (false negative data): participants who meet the criteria for a third reading, but have no data for a third reading
criteria not met but data provided (false positive data): participants who do not meet the criteria but do have data for a third reading
This discrepancy affects fewer than 0.01% of records. The vast majority of participants who meet the criteria for third readings in version 2 have data recorded as expected.
Data capture for height, weight and waist measurements
During appointments, the following ranges are allowed for height, weight, and waist measurements:
height: Between 90 and 299 centimetres
weight: Between 20 and 400 kilograms
waist circumference: Between 30 and 200 centimetres
These ranges are intentionally broad and may not always reflect biologically plausible measurements. The same boundaries are applied to both height and weight in the Our Future Health Baseline Questionnaire.
We have identified infrequent outliers in the clinic measurements data that suggest occasional human error during data capture, affecting less than 1% of observations. These errors are likely to include:
waist circumference may have been entered in inches instead of centimetres
height and weight measurements may have been reversed, with height entered in the weight field and vice versa
the same values may have been erroneously entered for multiple fields (e.g., height and weight, or height, weight, and waist)
No mitigation has been applied in the current release, meaning these issues will persist in the data.
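Because no mitigation is applied, researchers may want to screen for these patterns themselves. The following is a minimal sketch of such a screen, not an Our Future Health method: the function name, flag names and the two heuristic rules (weight greater than height as a sign of a possible swap, waist below 50 as a sign of a possible inch entry) are illustrative assumptions, and only the capture ranges come from the text above.

```python
# Allowed capture ranges quoted above (bounds treated as inclusive, an assumption).
RANGES = {"height_cm": (90, 299), "weight_kg": (20, 400), "waist_cm": (30, 200)}

# Illustrative threshold, not from the source: adult waists below this value
# in centimetres are unusual, so the entry may have been made in inches.
WAIST_INCHES_THRESHOLD_CM = 50


def flag_measurements(height_cm, weight_kg, waist_cm):
    """Return heuristic warning flags for one clinic record (None = missing)."""
    values = {"height_cm": height_cm, "weight_kg": weight_kg, "waist_cm": waist_cm}
    flags = []
    for name, value in values.items():
        if value is None:
            continue
        low, high = RANGES[name]
        if not low <= value <= high:
            flags.append(f"{name}_out_of_range")
    if height_cm is not None and weight_kg is not None:
        if height_cm == weight_kg:
            # Same value entered in both fields.
            flags.append("height_equals_weight")
        elif weight_kg > height_cm:
            # Weight in kg rarely exceeds height in cm; fields may be reversed.
            flags.append("possible_height_weight_swap")
    if waist_cm is not None and waist_cm < WAIST_INCHES_THRESHOLD_CM:
        flags.append("possible_waist_in_inches")
    return flags


print(flag_measurements(175, 80, 90))
print(flag_measurements(80, 175, 90))
```

Flags of this kind only mark records for review; whether to exclude or correct a record depends on the research question.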
To ensure accurate measurements are recorded, our data capture application and associated Standard Operating Procedures (SOPs) are continually updated with guidelines and prompts to assist in precise data collection. We are committed to addressing these data issues and may update our data cleansing rules in future releases.
Genotype array data
There are three categories of files included in the current release: two sets of files containing participant genotypes and one file containing sample-level information. Each set is provided as a separate data ‘entity’ within the Trusted Research Environment. Each participant is represented by a single sample in each file in the genotype data. The primary identifier is the participant ID (PID) which can be used to link these files with the Participant and Questionnaire tables.
The two sets of genotype files, pVCF and BGEN files, contain the same genotypes for the same participants and genetic variants. Each file set is split across regions for each chromosome (22 autosomal chromosomes, two sex chromosomes 'X', 'Y' and mitochondrial 'MT'), across 160 separate files. Each genotype file has an associated pVCF index file (.tbi), or BGEN index file (.bgi) specific for that chromosome region, in addition to an accompanying BGEN .sample file (.sample). The pVCF contains additional genotype metadata that is not present in the BGEN file. We provide both types of files for convenience and to improve the experience of researchers using the data. The sample-level information file contains information useful for quality control (QC) purposes, such as batch, estimated genetic sex and call rate. The regional index BED file contains the chromosome and genomic coordinates of variants present within each pVCF or BGEN file.

Table 1 - File names for array genotype data
Category | File type | Format | Entity | File name | Number of files | Description
SNV pVCF | pVCF | VCF 4.1 | snv_pvcf | ofh_snv.v6.chrZ-bXXXX.vcf.gz | 160 | pVCF containing SNV genotypes and metadata
SNV pVCF | pVCF index file | - | snv_pvcf | ofh_snv.v6.chrZ-bXXXX.vcf.gz.tbi | 160 | pVCF-associated index file
BGEN | BGEN fileset | BGEN 1.2 | snv_bgen | ofh_snv.v6.chrZ-bXXXX.bgen | 160 | BGEN file containing SNV genotypes
BGEN | BGEN fileset | BGEN 1.2 | snv_bgen | ofh_snv.v6.chrZ-bXXXX.sample | 160 | BGEN-associated sample file
BGEN | BGEN index file | - | snv_bgen | ofh_snv.v6.chrZ-bXXXX.bgen.bgi | 160 | BGEN-associated index file
Sample QC | QC metrics | - | sample_qc_metrics | ofh_sample_qc_metrics.v6.tsv | 1 | Plain-text tabular file with sample-level information
Regions index BED file | BED file | - | snv_resources | ofh_snv_regions.v6.bed | 1 | Plain-text tabular file in BED file format
File names in this data release include the following components:
an indicator that the data comes from Our Future Health participants (“ofh_”)
an indicator of file contents (“snv” or “sample_qc_metrics”)
the version number (“.v6”) of the genotype data release, to be incremented with each release
the chromosome number (“.chrZ”) for the genetic variants contained within that file, where Z can be any of 1-22, 'X', 'Y' or 'MT'
the region identifier (-bXXXX) which maps to the genomic coordinates in the BED file
a suffix representing the file type, e.g. “.vcf.gz”, per the specifications of pVCF or BGEN file sets, or “.tsv” for tab-separated values
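The naming components listed above can be pulled apart programmatically, for example when selecting all files for one chromosome. The sketch below is one possible parser for the per-region genotype files only. It assumes the region identifier is the letter b followed by digits, which the XXXX placeholder does not confirm, and the example file name in the last line is hypothetical.

```python
import re

# Pattern for per-region genotype files, e.g. "ofh_snv.v6.chr1-b0001.vcf.gz".
# Assumption: the region identifier "bXXXX" is "b" followed by digits.
GENOTYPE_FILE = re.compile(
    r"^ofh_(?P<contents>snv)"
    r"\.v(?P<version>\d+)"
    r"\.chr(?P<chrom>[1-9]|1\d|2[0-2]|X|Y|MT)"
    r"-(?P<region>b\d+)"
    r"\.(?P<suffix>vcf\.gz\.tbi|vcf\.gz|bgen\.bgi|bgen|sample)$"
)


def parse_genotype_filename(name):
    """Return the file name components as a dict, or None if the name does not match."""
    match = GENOTYPE_FILE.match(name)
    return match.groupdict() if match else None


# Hypothetical file name used purely for illustration.
print(parse_genotype_filename("ofh_snv.v6.chr1-b0001.vcf.gz"))
```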
What information do the genotype files contain?
Both the pVCF and BGEN file sets contain genotypes from all participants in the release and include genotype calls for all 22 autosomal chromosomes, X, Y and MT (mitochondrial). All genotypes are for single-nucleotide polymorphisms (SNPs) or small insertion-deletions (INDELs) aligned to the GRCh38 human reference genome. Multi-allelic variants have been split into separate biallelic records. The pVCF file metadata includes GenCall score, Log R Ratio, and B-allele Frequency, all of which are available in the FORMAT field. In the BGEN fileset, estimated genetic sex is included in the BGEN-associated sample file.
For more information on the fields present in each genotype file, please refer to the genotype data tab of the Our Future Health data dictionary. For more information on the exact genetic variants present in each genotype file, refer to the CPRA variant list. Both these files can be found on the Data and cohort page of our website (external link).
What information does the sample QC file contain?
The sample QC file contains basic sample-level information useful for QC purposes or batch effect adjustments. For more information on the fields present in the file, please refer to the genotype tab of the Our Future Health data dictionary located on our Data and cohort page (external link).
What should I be aware of when working with the genotype data in this release?
This data release does not include sample QC results, other than limited outputs from the genotype calling process including call rate, genetic sex, and related aggregate genotyping QC metrics. No variant QC results are provided. We have, however, conducted several targeted statistical checks on the data, including assessment of principal components, sample heterozygosity, sample relatedness and variant Hardy-Weinberg statistics. We plan to include the outputs of these analyses in a future release. Users are reminded to conduct their own comprehensive statistical QC appropriately tailored to their research question.
We note the following issues with the current data release which we aim to address in a future release:
Estimated call rate is based on all chromosomes, including both the X and Y chromosomes (median call rates for females may be lower than for males due to Y chromosome missingness).
The date string in the batch identifier within the sample QC metrics file is the date of genotype calling and not the date on which the genotyping assay started in the laboratory. Genotype calling using the intensity data files of some samples occurred considerably later than the laboratory assay was completed, so this date cannot be used for statistical adjustment of between-day variation introduced at genotyping stage.
The presence of non-haploid genotypes ('0/1', '0/0' and '1/1') was observed for Y and MT chromosome variants, affecting ~0.25% of Y chromosome genotype calls and a smaller proportion for MT, for both female and male samples, arising from low or noisy probe intensities for some genetic variants. Non-haploid genotypes occur outside of the pseudoautosomal regions (PARs) for the Y chromosome. These non-haploid genotype calls for haploid chromosomes should be treated as missing (no call). Note that some tools, such as plink2 or qctool may error or display unexpected behaviour when processing Y or MT chromosome files, due to the presence of these non-haploid genotypes. In future releases, non-haploid calls for Y and MT genetic variants will be set to missing in both the pVCF and BGEN files prior to data release.
A small number of genetic variants were found to have been incorrectly reported in the pVCF and BGEN files, resulting from multi-mapping probes (where a probe sequence maps to multiple locations in the genome) or multi-base SNP targets which were misaligned to the reference genome during genotype calling. These genetic variants should be excluded from analysis. We provide a list of these variants by way of an indicator column "inaccurate annotation" in the CPRA variant list file to facilitate their exclusion. You can download this file from our Data and cohort page (external link). We aim to resolve this issue in future releases of genotype data.
Changes in laboratory reagents aimed at optimising genotyping, as well as continual improvement in laboratory processes, mean that some variation in the call rate distribution is evident between batches and across time. Future cluster files, further optimised for genotype re-calling, will likely reduce the magnitude of these differences.
A small number of samples (581) were estimated to have an implausibly large number of third-degree (or closer) relatives in the genotype data release. Researchers should aim to further investigate and appropriately action any sample exclusions based on calculated genetic relatedness in the cohort when conducting their own statistical QC. Similarly, any samples detected as genetic duplicates should be treated with caution and may arise from the same participant registering multiple times. These samples should not be considered to be identical twins/triplets without further confirmation. We aim to further investigate this small number of samples and provide recommendations for appropriate actions in future data releases.
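The recommendation above to treat non-haploid Y and MT calls as missing can be sketched on VCF-style genotype strings. This is a minimal illustration rather than a supported pipeline: the function name is hypothetical, and the rule is applied to every Y and MT variant without treating pseudoautosomal regions separately, which is an assumption.

```python
HAPLOID_CHROMS = {"Y", "MT"}


def mask_non_haploid(chrom, genotype):
    """Return the call unchanged, or "." (missing) for a diploid call on Y or MT."""
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    if chrom not in HAPLOID_CHROMS:
        return genotype
    # A diploid call contains an allele separator: "/" (unphased) or "|" (phased).
    if "/" in genotype or "|" in genotype:
        return "."
    return genotype


print(mask_non_haploid("Y", "0/1"))
print(mask_non_haploid("Y", "1"))
```

Applying a step like this before handing Y or MT files to downstream tools is one way to avoid the unexpected behaviour noted for plink2 and qctool.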
What changes have been made as part of this release?
Mixed-manifest release
A change in genotyping chemical reagents was made in order to help reduce noise in the resulting genotype data. This has led to a change in the genotype array manifest and cluster files to ensure optimal performance. As a result, this release includes a merge of 297,958 samples that have been called using the A1 genotype array manifest file (700,138 variants) and A1_v1 cluster, in addition to 353,073 samples that have been called using the C2 manifest file (701,345 variants) with the IHX_C2 cluster file. The A1 manifest has now been phased out and all genotyping going forward will use the C2 manifest. All samples previously called using the A1 manifest will eventually be re-called using the C2 manifest. An additional column has been added to the sample_qc_metrics file indicating which manifest version was used, where the A1 manifest is referred to as v1, and C2 as v2.
There are 7,384 variants (1.04%) that are unique to the new C2 manifest, 6,177 variants (0.87%) that are unique to the old A1 manifest, and 693,961 variants (98.08%) that are common between both. The 693,961 variants which are common between both manifests are indicated by the “intersecting variant” column in the CPRA variant list file. Allele frequencies for these variants have been compared to ensure there is strong concordance between both versions. We do note a ceiling effect for the allele frequencies for some of the multi-allelic variants that are unique to the new C2 manifest, which will be investigated further for future releases. We recommend that researchers conduct their own comparison analyses between samples called using different manifests, and either adjust analyses for manifest version as a covariate or split analyses by manifest version.
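The sample and variant counts quoted for the two manifests fit together arithmetically, which can be useful to confirm before splitting analyses by manifest version. The short check below uses only figures from this section; the variable names are illustrative.

```python
# Figures quoted in the mixed-manifest description above.
a1_samples, c2_samples = 297_958, 353_073
a1_only, c2_only, shared = 6_177, 7_384, 693_961

total_samples = a1_samples + c2_samples      # all genotyped participants
total_variants = a1_only + c2_only + shared  # union of both manifests
a1_variants = shared + a1_only               # variants on the A1 manifest
c2_variants = shared + c2_only               # variants on the C2 manifest

# Percentage of the union that falls in each group.
pct = {
    "c2_only": round(100 * c2_only / total_variants, 2),
    "a1_only": round(100 * a1_only / total_variants, 2),
    "shared": round(100 * shared / total_variants, 2),
}

print(total_samples, total_variants, a1_variants, c2_variants)
print(pct)
```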
Linked health records data
Data quality incident impacting HES Admitted Patient Care and HES Outpatient Provisional Data
We have identified a number of participants where both deprecated and updated versions of the same episodes are present in the provisional HES Admitted Patient Care (APC) and HES Outpatients (OP) data for Release 10 and Release 11. This has resulted in the same episode potentially being listed multiple times in the provisional data. This issue has been amended from Release 12 onwards.
If you have used this data to run any analyses on episode- or appointment-level outcomes, then your results may be affected. We recommend performing one of the following actions for any impacted studies:
Re-run any analyses that use the Release 10 or Release 11 HES APC/OP provisional data, after first removing any appointments in the Outpatient data that occurred from 1 April 2024 onwards and any hospital episodes in the Admitted Patient Care data that finished after 31 March 2024
Use Release 12 (available from 18 September 2025), or the most recent release, to re-run your analyses, noting that the number of participants and episodes will have changed due to new participants, removal of withdrawn participants and an additional quarter of episodes in the data
For any new analyses, we advise using the most recent release data. Please reach out to access@ourfuturehealth.org.uk with any questions or concerns regarding your study.
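For the first option, removing the affected provisional records amounts to a date cut-off at 31 March 2024. The sketch below shows the idea on plain Python records. The field names APPTDATE (appointment date) and EPIEND (episode end date) follow HES naming conventions but are assumptions here, so check the data dictionary for the names used in the release. The sketch also assumes every record carries a date.

```python
from datetime import date

# Last date retained: appointments from 1 April 2024 onwards and episodes
# that finished after 31 March 2024 are removed.
CUTOFF = date(2024, 3, 31)


def drop_affected_records(records, date_field):
    """Keep only records whose date_field is on or before the cut-off date."""
    return [record for record in records if record[date_field] <= CUTOFF]


# Toy records for illustration only.
outpatient = [{"APPTDATE": date(2024, 3, 31)}, {"APPTDATE": date(2024, 4, 1)}]
admitted = [{"EPIEND": date(2023, 12, 5)}, {"EPIEND": date(2024, 6, 30)}]

print(drop_affected_records(outpatient, "APPTDATE"))
print(drop_affected_records(admitted, "EPIEND"))
```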
What information does the linked health records data contain?
This release contains linked health records from Hospital Episodes Statistics (HES), the National Disease Registration Service (NDRS), and Office for National Statistics (ONS) Death Registration.
The HES and NDRS data sets provide a wide range of information on patient admissions to NHS facilities, including clinical, administrative, and geographic information. The HES data sets do not contain electronic patient health records or information on medicines and dosages. For more information on how these data sets are collected and processed, please refer to the HES Data Collection page (external link) and NDRS Access Page (external link). The HES and NDRS data sets only include records collected by NHS England (NHSE), meaning these data contain only care records from NHS providers in England.
The data contains linked health records from selected Hospital Episodes Statistics (HES) data sets (Admitted Patient Care (APC), Accident & Emergency (A&E), and Outpatient) and selected cancer data sets from the National Disease Registration Service (NDRS) (Cancer Registry and Cancer Pathways). For more details on each data set, see the section on the linked health records data page entitled Linked data set descriptions.
We used the HES data dictionary v2.03 and NDRS data dictionary v5.2 for validation on variable format and codes.
All linked health records data have been provided by NHS England.
What changes have been made as part of this release?
We have updated the pseudonymised provider code list to incorporate any new providers. In total, we mapped 926 providers. 10 providers (1.1% of all providers in data) were mapped to unknown because they did not appear in either the NHS Organisation Data Service API or the Archived Closed Organisation data set.
We have added one new field to the ONS Death Registration records, S_UNDERLYING_COD_ICD10. It contains the ICD-10 code for underlying cause of death.
How are the data sets structured?
The data release includes 8 entities, organised as follows:
Hospital Episode Statistics
nhse_eng_inpat (Admitted Patient Care)
nhse_eng_ed (Accident and Emergency)
nhse_eng_outpat (Outpatients)
Civil Registrations of Death
nhse_engwal_deaths
National Disease Registration Service Cancer Data (NDRS)
nhse_eng_canpat (Cancer Pathways)
nhse_eng_canreg_pattumour (Cancer Registry Patient Tumour)
nhse_eng_canreg_treat (Cancer Registry Cancer Treatment)
Linked Participants
participant_nhs_linked
The table below summarises the available data, including number of available fields and dates for each data set. Further descriptions can be found on the Linked data set descriptions page.
The entity names indicate the data source, geographic data coverage, and name of the data set. For example, nhse_engwal_deaths indicates the data source is NHS England, the entity includes data from England and Wales, and the data set is for deaths.
Data set | Description | Date range | Number of fields
HES Admitted Patient Care | Episodes of in-patient care | 1 April 2007 to 31 October 2024 | 108
HES Accident & Emergency | Attendance of major A&E department | 1 April 2007 to 31 March 2020 | 91
HES Outpatient | Outpatient appointments | 1 April 2007 to 31 October 2024 | 55
ONS Death Registration | Death registration and mortality data | 9 June 2022 to 27 November 2024 | 20
NDRS Cancer Pathways | Cancer pathways data | 1 January 2013 to 21 July 2024 | 12
NDRS Cancer Registry Patient Tumour | Cancer treatment data at tumour-level | 1 January 1995 to 31 December 2022 | 49
NDRS Cancer Registry Cancer Treatment | Cancer data by treatment event at given tumour | 1 January 1995 to 20 June 2024 | 22
Linked Participants | Participants successfully linked to an NHS number | All participants who submitted questionnaire before 15 October 2024 | 2
How did we de-identify the linked health records data to minimise risks of identifying participants?
For categorical fields with a higher risk of re-identification, we suppressed categories which included fewer than 10 participants. To avoid the suppressed category being deduced by elimination, the next smallest category was also suppressed. Categories were suppressed by replacing the coded entries for corresponding participants with the suppression code, -999.
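The suppression rule described above can be illustrated with a short sketch. This is one reading of the rule, not the production implementation: it assumes one category code per participant, always suppresses the smallest remaining category whenever at least one category falls below the threshold, and breaks ties by first occurrence. The function and variable names are hypothetical.

```python
from collections import Counter

SUPPRESSION_CODE = -999
MIN_PARTICIPANTS = 10  # categories with fewer participants are suppressed


def suppress_small_categories(codes_by_pid):
    """Map participant ID -> code, replacing suppressed categories with -999."""
    counts = Counter(codes_by_pid.values())
    suppressed = {code for code, n in counts.items() if n < MIN_PARTICIPANTS}
    if suppressed:
        remaining = [code for code in counts if code not in suppressed]
        if remaining:
            # Also suppress the next smallest category so that the suppressed
            # category cannot be deduced by elimination.
            suppressed.add(min(remaining, key=lambda code: counts[code]))
    return {
        pid: SUPPRESSION_CODE if code in suppressed else code
        for pid, code in codes_by_pid.items()
    }


# Toy data: 50 participants with code "19", 12 with "51", 3 with "40".
toy = {f"p{i}": "19" for i in range(50)}
toy.update({f"p{i}": "51" for i in range(50, 62)})
toy.update({f"p{i}": "40" for i in range(62, 65)})
result = suppress_small_categories(toy)
print(Counter(result.values()))
```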
The following fields had suppression applied: admission source (ADMISORC), admission method (ADMIMETH), and discharge destination (DISDEST) in HES Admitted Patient Care. The table below shows which codes are suppressed in each column.
In the NDRS Cancer Registry Treatment data set, we are releasing a field which lists chemotherapy drugs received during treatment (CHEMO_ALL_DRUGS). This field also contains the name of any clinical trials a participant was enrolled in during treatment. To mitigate re-identification risk, we replaced the name of the clinical trial with ANONYMISED CLINICAL TRIAL.
Field | Suppressed codes
ADMIMETH | 2C = Baby born at home as intended; 25 = Admission via Mental Health Crisis Resolution Team; 83 = Baby born outside the Health Care Provider except when born at home as intended
ADMISORC | 38 = Penal establishment: police station; 39 = Penal establishment, court or police station / police custody suite; 40 = Penal establishment; 41 = Court; 42 = Police Station / Police Custody Suite; 48 = High security psychiatric hospital, Scotland; 49 = High security psychiatric accommodation in an NHS hospital provider; 50 = NHS other hospital provider: medium secure unit
DISDEST | 38 = Penal establishment: police station; 39 = Penal establishment, court or police station / police custody suite; 40 = Penal establishment; 42 = Police Station / Police Custody Suite; 48 = High security psychiatric hospital, Scotland; 49 = High security psychiatric accommodation in an NHS hospital provider; 50 = NHS other hospital provider: medium secure unit
What should I be aware of when working with the linked health records data in this release?
Further information on known data quality issues in the NHSE data sets can be found in the NHSE HES Data Quality Reports (external link)
Discrepancies in the number of participants in the Linked Participants entity
Please note that there are 11 participants with a linked health record who do not have their PIDs listed in the Linked Participants entity. We are working with NHSE to solve this discrepancy.
The Linked Participants entity lists all the successfully linked participants who submitted a questionnaire prior to 15 October 2024.
Using ICD-10 and OPCS-4 codes in the cohort browser
The cohort browser in the Trusted Research Environment can only filter numeric or categorical data. In the NDRS Cancer data sets, some fields with diagnosis and procedures information like ICD-10 and OPCS-4 codes are entered as strings. Therefore, it is not possible to use the cohort browser to filter by specific diagnosis and procedure codes in those data sets. To filter for specific ICD-10 or OPCS-4 codes, we recommend loading and filtering the data using a Jupyter Notebook. It is possible to access ICD-10 and OPCS-4 information in the cohort browser for the HES and ONS Death Registration data sets.
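As a minimal sketch of the notebook-based filtering recommended above, the following uses only the Python standard library on a few made-up records. The column name `diagnosis_icd10`, the `pid` values and the helper functions are hypothetical and are not taken from the data dictionary; the sketch also assumes codes may be stored with or without a dot and in mixed case, which may not match the actual data.

```python
def matches_icd10(code, prefixes):
    """Return True if an ICD-10 code string starts with any of the given prefixes."""
    if not code:
        # Missing or empty codes never match.
        return False
    # Assumption: codes may appear as "C50.9" or "C509", in any case.
    normalised = code.replace(".", "").strip().upper()
    return any(
        normalised.startswith(prefix.replace(".", "").upper())
        for prefix in prefixes
    )


def filter_by_icd10(rows, column, prefixes):
    """Keep only the rows whose code in `column` matches one of the prefixes."""
    return [row for row in rows if matches_icd10(row.get(column), prefixes)]


# Illustrative records only; real field names should come from the data dictionary.
tumour_rows = [
    {"pid": "P001", "diagnosis_icd10": "C50.9"},
    {"pid": "P002", "diagnosis_icd10": "C341"},
    {"pid": "P003", "diagnosis_icd10": None},
    {"pid": "P004", "diagnosis_icd10": "c509"},
]

# Select all records with a code in the C50 block.
c50_rows = filter_by_icd10(tumour_rows, "diagnosis_icd10", ["C50"])
```

The same prefix-matching approach would apply to OPCS-4 procedure codes held as strings, and to rows loaded into a data frame rather than a list of dictionaries.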
Duplicate participants
As described in the Participant and Questionnaire data section, we are aware that some individuals may have registered multiple times. Participants must register with a unique email address, but may otherwise enter the same personal information across multiple registrations. We do not currently check identification documents or otherwise prevent this. Individuals who have registered multiple times will have been assigned a unique participant identifier (PID) for each registration. These duplicate participants have not been excluded from the data release. Where the participant has been successfully linked to an NHS number for multiple registrations, these records appear as duplicate linked health records that have different PIDs.
These duplicate participants can be identified in the HES and NDRS data sets by locating duplicate row-level identifiers; for example, duplicate EPIKEY entries in Admitted Patient Care (APC).
We are actively investigating this issue and aim to provide more information in future updates. The issue is estimated to affect a small number of records in each linked data set (about 0.6% of records in each HES data set, and 0.5% of records in each NDRS data set).
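The duplicate check described above could be sketched as follows: group rows by the row-level identifier and report any identifier that appears under more than one PID. This is an illustrative approach, not an official method. The `pid` column name, the sample values and the function name are assumptions; only EPIKEY is named in the text.

```python
from collections import defaultdict


def find_duplicate_registrations(rows, key_column="EPIKEY", pid_column="pid"):
    """Map each row-level identifier to its PIDs when it is shared by several PIDs."""
    pids_by_key = defaultdict(set)
    for row in rows:
        pids_by_key[row[key_column]].add(row[pid_column])
    # An identifier linked to two or more PIDs suggests a duplicate registration.
    return {
        key: sorted(pids)
        for key, pids in pids_by_key.items()
        if len(pids) > 1
    }


# Made-up APC rows: episode "1001" appears under two different PIDs.
apc_rows = [
    {"pid": "P001", "EPIKEY": "1001"},
    {"pid": "P007", "EPIKEY": "1001"},
    {"pid": "P002", "EPIKEY": "1002"},
    {"pid": "P002", "EPIKEY": "1003"},
]

duplicates = find_duplicate_registrations(apc_rows)
```

Each group of PIDs returned this way is a candidate set of registrations belonging to one individual; how to handle them (for example, keeping one PID per group) is left to the analysis.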
HES provisional data may change between releases
The HES Admitted Patient Care and Outpatient data include some provisional records. These are the most recent admissions and appointments that were available for the cohort at the time the data was supplied by NHS England, but the records have not been finalised. Therefore, the data entered in these records could change slightly in future releases. Once a year, the latest full financial year of provisional data is finalised and made available to Our Future Health by NHS England.
In the current release:
any appointments in the Outpatient data that occurred from 1 April 2024 onwards are likely provisional data and subject to change in future releases
any hospital episodes in the Admitted Patient Care data that finished after 31 March 2024 are likely provisional data and subject to change in future releases
any appointments that occurred before 1 April 2024 or hospital episodes which finished prior to 1 April 2024 are likely finalised data and are not subject to change
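The cut-off rules above can be expressed as a small helper that flags a record as likely provisional from its date. This is a sketch under stated assumptions: the function and constant names are hypothetical, the caller is assumed to pass the appointment date for Outpatient records and the episode end date for Admitted Patient Care records, and the cut-off applies to the current release only.

```python
from datetime import date

# Cut-off for the current release: records dated on or after this day
# are likely provisional (equivalently, after 31 March 2024).
PROVISIONAL_FROM = date(2024, 4, 1)


def is_likely_provisional(record_date):
    """Return True/False for a dated record, or None when the date is missing."""
    if record_date is None:
        # Without a date the status cannot be inferred from these rules.
        return None
    return record_date >= PROVISIONAL_FROM


# Illustrative dates only.
finalised_example = is_likely_provisional(date(2024, 3, 31))
provisional_example = is_likely_provisional(date(2024, 4, 1))
```

Keeping the cut-off in a single constant makes it easy to update when a later release finalises another financial year of data.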
What metadata is available to help document the data release?
We provide the following data files on our Data and cohort page (external link):
Data dictionary – which defines the raw data fields and metadata information, such as labels, descriptions and units of measurements
Participant, Questionnaire and Clinic measurements coding file – which contains the details of how participant and questionnaire variables are coded within the data
Clinic measurements coding file – which contains the details of how clinic measurement variables are coded within the data
NHS England linked health records coding file – which contains coded values for fields within each of the linked health records data sets
CPRA variant list – which contains a list of genetic variant IDs which map to the genetic variants available in our genotype files
If using Microsoft Excel to browse these files, for an optimal viewing experience, ensure the encoding settings are set to UTF-8.
In the Questionnaire data section, we also provide the following files:
Human-readable versions of both version 1 and version 2 of the questionnaire - which are text copies of the baseline health questionnaire
Questionnaire logic codebook – which represents dynamic logic implemented for v2.2 of the baseline health questionnaire and can be used in conjunction with v2 of the human-readable questionnaire
In the "Do all participants provide every measurement?" section of Clinic measurements data, we also provide the following file:
Clinic measurements logic codebook - which lists, for each measurement, the conditions required for the field to be non-null in the data table
Last updated