Release 8
Information about the data released on 18 September 2024
What data is included in Release 8?
All 1,193,001 participants included in this release have completed and submitted the baseline questionnaire, and for 330,058 of these individuals we have generated genotype array data. 957,444 participants have at least one record in one or more of the linked health records data tables.
Participant data
The participant table includes information from all 1,193,001 participants who have registered and consented to join the Our Future Health programme, on or before 4 July 2024. Each record in the participant table corresponds to exactly one record in the questionnaire data.
Questionnaire data
The Release 8 questionnaire data includes participants who have completed either v1, v2, v2.1 or v2.2 of the Our Future Health baseline questionnaire. This includes participants who joined during the initial pilot in 2021 and after the main recruitment period began in the summer of 2022.
Participants who started the questionnaire on or after 24 May 2021 will have completed v1 of the questionnaire (N = 52,016 participants).
Participants who started the questionnaire on or after 20 November 2022 will have completed v2 of the questionnaire (N = 725,629 participants).
Participants who started the questionnaire on or after 21 December 2023 will have completed v2.1 of the questionnaire (N = 364,971 participants).
Participants who started the questionnaire on or after 4 July 2024 will have completed v2.2 of the questionnaire (N = 50,387 participants).
Genotype array data
The genotype data release contains information on 700,138 variants for 330,058 participants. This data was obtained using a custom Illumina Infinium Excalibur beadchip array, designed by Our Future Health in collaboration with Illumina.
Linked health records data
In total, we have attempted linkage to health records data for 988,219 participants, who completed their questionnaire before 11 April 2024. 957,444 (97%) of the 988,219 participants sent to NHSE were successfully linked to an NHS number. 957,444 participants have at least one record in one or more of the linked health records data tables.
Linked health records data in this release covers participants who completed their questionnaire before 11 April 2024 and therefore contains fewer participants than the current Questionnaire data release. This is due to the lag between the submission of participant details to NHSE and the data being received, quality assured and processed.
This release also includes 3 new datasets:
National Disease Registration Service (NDRS) Cancer Registry
National Disease Registration Service (NDRS) Cancer Pathways
Linked Participants, a new entity listing all participants successfully linked to an NHS number
Participant and Questionnaire data
What information does the Participant and Questionnaire data contain?
For more information on what is included in the Participant and Questionnaire data, see our Participant data and Questionnaire data pages. These pages also cover how we:
de-identify data
manage re-identification risk
version control
tailor questionnaire journeys
store the data in the TRE
name our variables
For the current release, all participants must have submitted a complete questionnaire, on or before 4 July 2024.
What changes have been made as part of this release?
All issues listed below have been fixed in questionnaire version v2.2.
Questions about walking
For the question "Thinking about the last 4 weeks, in a typical WEEK, on how many days did you walk for at least 10 minutes at a time?" (ACTIVITY_WALK_DAYS_1_1 in questionnaire v1 and ACTIVITY_WALK_DAYS_2_1 in v2 and v2.1), if participants respond with the answer "Unable to walk" (-2), they should skip the following questions:
"How many minutes did you usually spend walking on a typical DAY?" (ACTIVITY_WALK_MINS_1_1 or ACTIVITY_WALK_MINS_2_1)
"How would you describe your usual walking pace?" (ACTIVITY_WALK_PACE_1_1)
"Do you get short of breath walking with people of your own age on level ground?" (HEALTH_RESP_SHORT_1_1)
"Do you get a pain in either leg on walking?" (HEALTH_PAIN_LEG_1_1)
However, this logic was not working as intended: participants who reported that they were unable to walk still progressed to the questions listed above.
This issue affects v1, v2 and v2.1 of the questionnaire.
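For analyses that need the intended skip logic, one option is to blank the downstream walking responses for participants who answered "Unable to walk" (-2). The following is a minimal sketch, not an Our Future Health recommendation. It assumes each record is a plain dict keyed by field name and that `None` stands for a skipped question; the helper name `apply_walk_skip_logic` is hypothetical.

```python
# Hypothetical helper: apply the intended "Unable to walk" skip logic after the fact.
UNABLE_TO_WALK = -2

# Parent fields (v1, then v2/v2.1) and the downstream fields named in this section.
WALK_DAYS_FIELDS = ("ACTIVITY_WALK_DAYS_1_1", "ACTIVITY_WALK_DAYS_2_1")
DOWNSTREAM_FIELDS = (
    "ACTIVITY_WALK_MINS_1_1",
    "ACTIVITY_WALK_MINS_2_1",
    "ACTIVITY_WALK_PACE_1_1",
    "HEALTH_RESP_SHORT_1_1",
    "HEALTH_PAIN_LEG_1_1",
)


def apply_walk_skip_logic(record):
    """Return a copy of `record` with downstream walking answers set to None
    when the participant reported being unable to walk."""
    cleaned = dict(record)
    unable = any(cleaned.get(field) == UNABLE_TO_WALK for field in WALK_DAYS_FIELDS)
    if unable:
        for field in DOWNSTREAM_FIELDS:
            if field in cleaned:
                cleaned[field] = None
    return cleaned


# Illustrative record only; the values are invented.
example = {
    "ACTIVITY_WALK_DAYS_2_1": -2,
    "ACTIVITY_WALK_MINS_2_1": 30,
    "HEALTH_PAIN_LEG_1_1": 1,
}
print(apply_walk_skip_logic(example))
```

The original record is left untouched, so the raw responses remain available if the analysis needs them.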
Regular but inconsistent smoking behaviours
Participants who respond to the question "Do you smoke cigarettes now?" (SMOKE_STATUS_2_1) with "Yes, some days" (2), "Yes, but rarely" (3) or "No, not at all" (0), and respond to the question "Did you ever smoke cigarettes on most or all days?" (SMOKE_PREV_REG_2_1) with "No" (0) or "Prefer not to answer" (-3), should proceed to the following questions:
"Compared to 10 years ago do you smoke..." (SMOKE_CHG_2_1)
"In the time that you have smoked, have you ever stopped for more than 6 months?" (SMOKE_CHG_ABST_2_1)
"When you stopped smoking for more than 6 months, why did you stop?" (SMOKE_CHG_ABST_REASON_1_M)
However, due to an error in logic, participants who met the above criteria were not shown these questions, resulting in a loss of detailed information on past smoking behaviours.
This issue affects v2 and v2.1.
Boundary responses for questions about work status
In v2.1 a fix was applied to ensure all participants who selected options with conflicting logic for the question "Which of the following describes your current situation?" (WORK_STATUS_2_M) were asked the correct set of following questions. However, the fix now prevents participants who select boundary responses ("Prefer not to answer" (-3) and "None of the above" (-7)) from progressing to the following questions, which is incorrect.
This issue affects a small number of participants, and impacts only v2.1.
Questions about reasons for changes in alcohol consumption
The following questions should allow participants to select multiple responses. However, the question type is incorrect and only allows participants to select a single response. This results in potential data loss for participants who may have changed their alcohol consumption for several reasons:
"Which of the following, if any, do you think contributed to you stopping drinking alcohol?" (ALCOHOL_CHG_REDUCE_REASON_2_1)
"Which of the following, if any, do you think contributed to you reducing the amount you drank?" (ALCOHOL_CHG_ABST_REASON_2_1)
This issue is specific to v2 and v2.1.
As part of the fix, we have updated the response type for certain questions from a single select (previously represented as single integers) to an array type (now represented as a list of multiple integers). In the data dictionary, these questions are now marked with a 'yes' under the 'is_multi_select' column, and their field names have been updated to include an 'M' as the last character, replacing the previous '1' to signify 'multi-select.'
All responses to these questions across versions 2, 2.1, and 2.2 are stored under the same columns, which have been renamed to ALCOHOL_CHG_REDUCE_REASON_2_M and ALCOHOL_CHG_ABST_REASON_2_M, as described above. Note, however, that responses for v2 and v2.1 will contain arrays of length 1.
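Because these columns now hold lists rather than single integers, tabulating responses means counting codes across lists. The following is a minimal sketch, assuming the values load as Python lists of integers and that `None` marks a participant who was not asked the question; the helper name and the example codes are invented for illustration.

```python
from collections import Counter


def count_selected_codes(responses):
    """Count how often each code is selected across list-valued responses.

    Entries that are None (question not asked or not answered) are skipped.
    """
    counts = Counter()
    for selected in responses:
        if selected is None:
            continue
        counts.update(selected)
    return counts


# Illustrative values only: v2 and v2.1 rows hold one code, v2.2 rows may hold several.
alcohol_chg_abst_reason = [[1], [3], None, [1, 3, 5], [3]]
reason_counts = count_selected_codes(alcohol_chg_abst_reason)
print(reason_counts)
```

When comparing versions, keep in mind that a single code in a v2 or v2.1 row reflects the single-select question type, not necessarily a single reason.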
What should I be aware of when working with the participant and questionnaire data in this release?
Technical data loss
A suspected system issue that occurred prior to October 2022 caused a small number of questionnaires submitted around that time to have missing data for some questions. The missing data cannot be explained by errors in dynamic logic. We are analysing the impact and will provide further information in future releases.
Implausible age and year combinations
Responses to questions about age or year of birth are initially validated against the participant's date of birth. However, if a participant later updates their date of birth, their previous answers to these questions are not re-validated. As a result, some responses may become inconsistent with the updated date of birth. This issue affects only a small number of cases, but we plan to address it in a future release.
Multiple submissions by the same participants
We currently do not have any procedures in place to prevent individuals from registering multiple times using different contact details. We are aware of a small number of instances where the same individual may have submitted multiple questionnaires. This has not been fully investigated, but would likely affect less than 0.1% of submissions.
Updating responses to parent questions
Due to how data capture currently works, there are instances where participants update their response to a parent question which overwrites the previous one. However, the responses to downstream dynamic questions linked to the old parent response may persist. This can create a conflict with the expected logic. We are actively working on a solution to this issue, which impacts a very small percentage (less than 0.1%) of submissions across all versions.
An example of this issue occurs with sex-specific conditions. A small number of records show inconsistent combinations of self-reported sex and the corresponding sex-specific questions. This inconsistency may arise when participants change their answer to the question "What sex were you registered with at birth?" (DEMOG_SEX_1_1 or DEMOG_SEX_2_1) after they have already answered sex-specific questions. As a result, erroneous responses to questions intended for the opposite sex are retained, instead of being removed based on the updated questionnaire path.
Errors in questionnaire configuration
For comprehensive documentation on all historical bugs related to errors in the implementation of dynamic logic, please refer to How has the baseline health questionnaire changed over time? and Change log for Questionnaire versions. Please note that errors in logic may persist across releases, even after they have been fixed.
Updates to dynamic logic documentation
The first Codebooks were made available with the P6 release (19 March 2024). Minor amendments have been applied to both the human-readable templates and the codebooks to better represent the logic that is applied in v2.2 (as of 4 July 2024). Both of these files can be found on the Data and cohort page of our website (external link).
Genotype array data
There are three categories of files included in the current release: two sets of files containing participant genotypes and one file containing sample-level information. Each set is provided as a separate data ‘entity’ within the Trusted Research Environment. Each participant is represented by a single sample in each file in the genotype data. The primary identifier is the participant ID (PID) which can be used to link these files with the Participant and Questionnaire tables.
The two sets of genotype files, pVCF and BGEN files, contain the same genotypes for the same participants and genetic variants. Each file set is split by chromosome, across 25 separate files (22 autosomal chromosomes, two sex chromosomes 'X' and 'Y', and mitochondrial 'MT'). Each chromosome-specific file has its own pVCF index file (.tbi), or BGEN index file (.bgi) and accompanying BGEN sample file (.sample). The pVCF contains additional genotype metadata that is not present in the BGEN file. We provide both types of files for convenience and to improve the experience of researchers using the data. The sample-level information file contains information useful for quality control (QC) purposes, such as batch, estimated genetic sex and call rate.
Category | File type | Format | Entity name | File name | Number of files | Description
SNV pVCF | pVCF | VCF 4.1 | snv_pvcf | ofh_snv.v4.chrZ.vcf.gz | 25 | pVCF containing SNV genotypes and metadata
SNV pVCF | pVCF index file | - | snv_pvcf | ofh_snv.v4.chrZ.vcf.gz.tbi | 25 | pVCF-associated index file
BGEN | BGEN fileset | BGEN 1.2 | snv_bgen | ofh_snv.v4.chrZ.bgen | 25 | BGEN file containing SNV genotypes
BGEN | BGEN fileset | BGEN 1.2 | snv_bgen | ofh_snv.v4.chrZ.sample | 25 | BGEN-associated sample file
BGEN | BGEN index file | - | snv_bgen | ofh_snv.v4.chrZ.bgen.bgi | 25 | BGEN-associated index file
Sample QC | QC metrics | - | sample_qc_metrics | ofh_sample_qc_metrics.v4.tsv | 1 | Plain-text tabular file with sample-level information
File names in this data release include the following components:
an indicator that the data comes from Our Future Health participants (“ofh_”)
an indicator of file contents (“snv” or “sample_qc_metrics”)
the version number (“.v4”) of the genotype data release, to be incremented with each release
the chromosome number (“.chrZ”) for the genetic variants contained within that file, where Z can be any of 1-22, 'X', 'Y' or 'MT'
a suffix representing the file type, e.g. “.vcf.gz”, per the specifications of pVCF or BGEN file sets, or “.tsv” for tab-separated values
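The naming components above can be combined to build the expected list of per-chromosome file names, for example to check that all 25 files are present. The following is a minimal sketch; the helper name `genotype_file_name` is hypothetical, and it assumes the "Z" placeholder is replaced directly by the chromosome label (for example "chr1" or "chrMT").

```python
# Chromosome labels as listed above: 1-22, X, Y and MT (25 in total).
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def genotype_file_name(chromosome, suffix, version="v4"):
    """Build a per-chromosome genotype file name, e.g. ofh_snv.v4.chr1.vcf.gz."""
    if chromosome not in CHROMOSOMES:
        raise ValueError(f"Unexpected chromosome label: {chromosome!r}")
    return f"ofh_snv.{version}.chr{chromosome}{suffix}"


pvcf_files = [genotype_file_name(c, ".vcf.gz") for c in CHROMOSOMES]
bgen_index_files = [genotype_file_name(c, ".bgen.bgi") for c in CHROMOSOMES]
print(pvcf_files[0], pvcf_files[-1])
```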
What information do the genotype files contain?
Both the pVCF and BGEN file sets contain genotypes from all participants in the release and include genotype calls for all 22 autosomal chromosomes, X, Y and MT (mitochondrial). All genotypes are for single-nucleotide polymorphisms (SNPs) or small insertion-deletions (INDELs) aligned to the GRCh38 human reference genome. Multi-allelic variants have been split into separate biallelic records. The pVCF file metadata includes GenCall score, Log R Ratio, and B-allele Frequency, all of which are available in the FORMAT field. In the BGEN fileset, estimated genetic sex is included in the BGEN-associated sample file.
For more information on the fields present in each genotype file, please refer to the genotype data tab of the Our Future Health data dictionary. For more information on the exact genetic variants present in each genotype file, refer to the CPRA variant list. Both these files can be found on the Data and cohort page of our website.
What information does the sample QC file contain?
The sample QC file contains basic sample-level information useful for QC purposes or batch effect adjustments. For more information on the fields present in the file, please refer to the genotype tab of the Our Future Health data dictionary located on our Data and cohort page (external link).
What should I be aware of when working with the genotype data in this release?
This data release does not include sample QC results, other than limited outputs from the genotype calling process including call rate, genetic sex, and related aggregate genotyping QC metrics. No variant QC results are provided. We have, however, conducted several targeted statistical checks on the data, including assessment of principal components, sample heterozygosity, sample relatedness and variant Hardy-Weinberg statistics. We plan to include the outputs of these analyses in a future release. Users are reminded to conduct their own comprehensive statistical QC appropriately tailored to their research question.
We note the following issues with the current data release which we aim to address in a future release:
Estimated call rate is based on all chromosomes, including both the X and Y chromosomes. Median call rates for females may be lower than for males due to Y chromosome missingness.
The date string in the batch identifier within the sample QC metrics file is the date of genotype calling and not the date on which the genotyping assay started in the laboratory. Genotype calling using the intensity data files of some samples occurred considerably later than the laboratory assay was completed, so this date cannot be used for statistical adjustment of between-day variation introduced at genotyping stage.
The presence of non-haploid genotypes ('0/1', '0/0' and '1/1') was observed for Y and MT chromosome variants, affecting ~0.25% of Y chromosome genotype calls and a smaller proportion for MT, for both female and male samples, arising from low or noisy probe intensities for some genetic variants. Non-haploid genotypes occur outside of the pseudoautosomal regions (PARs) for the Y chromosome. These non-haploid genotype calls for haploid chromosomes should be treated as missing (no call). Note that some tools, such as plink2 or qctool may error or display unexpected behaviour when processing Y or MT chromosome files, due to the presence of these non-haploid genotypes. In future releases, non-haploid calls for Y and MT genetic variants will be set to missing in both the pVCF and BGEN files prior to data release.
A small number of genetic variants (~200) were found to have been incorrectly reported in the pVCF and BGEN files, resulting from multi-mapping probes (where a probe sequence maps to multiple locations in the genome) or multi-base SNP targets which were misaligned to the reference genome during genotype calling. These genetic variants should be excluded from analysis. We provide a list of these variants by way of an indicator column "inaccurate annotation" in the CPRA variant list file to facilitate their exclusion. You can download this file from our Data and cohort page (external link). We aim to resolve this issue in future releases of genotype data.
Changes in laboratory reagents aimed at optimising genotyping, as well as continual improvement in laboratory processes, mean that some variation in the call rate distribution is evident between batches and across time. Further optimised cluster files for genotype re-calling will likely reduce the magnitude of these differences in future.
A small number of samples were estimated to have an implausibly large number of third-degree (or closer) relatives in the genotype data release. Researchers should aim to further investigate and appropriately action any sample exclusions based on calculated genetic relatedness in the cohort when conducting their own statistical QC. Similarly, any samples detected as genetic duplicates should be treated with caution and may arise from the same participant registering multiple times. These samples should not be considered to be identical twins/triplets without further confirmation. We aim to further investigate this small number of samples and provide recommendations for appropriate actions in future data releases.
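The guidance above to treat non-haploid genotype calls on the Y and MT chromosomes as missing can be sketched as a small function over VCF-style genotype strings. This is an illustration under stated assumptions, not the processing Our Future Health will apply in future releases. It assumes chromosome labels "Y" and "MT", that a valid haploid call is a single allele such as "0" or "1", and that "." denotes a missing haploid call. The function name is hypothetical.

```python
HAPLOID_CHROMOSOMES = {"Y", "MT"}
MISSING_HAPLOID = "."


def mask_non_haploid(chromosome, genotype):
    """Return `genotype` unchanged unless it has more than one allele on a
    haploid chromosome, in which case return a missing call."""
    if chromosome not in HAPLOID_CHROMOSOMES:
        return genotype
    # VCF separates alleles with "/" (unphased) or "|" (phased).
    alleles = genotype.replace("|", "/").split("/")
    if len(alleles) > 1:
        return MISSING_HAPLOID
    return genotype


for chrom, gt in [("Y", "0/1"), ("Y", "1"), ("MT", "0/0"), ("1", "0/1")]:
    print(chrom, gt, "->", mask_non_haploid(chrom, gt))
```

Applying a step like this before passing Y or MT data to downstream tools is one way to avoid the errors mentioned above, but the right approach depends on the tool and file format in use.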
What changes have been made as part of this release?
There are no additional changes to this release. Participants who withdrew from the programme have been removed from Release 8.
Linked health records data
What information does the linked health records data contain?
The HES and NDRS datasets provide a wide range of information on patient admissions to NHS facilities, including clinical, administrative, and geographic information. For more information on how these data are collected and processed, please refer to the HES Data Collection page (external link) and NDRS Access Page (external link).
The data contains linked health records from selected Hospital Episode Statistics (HES) datasets (Admitted Patient Care (APC), Accident & Emergency (A&E), and Outpatient) and selected cancer datasets from the National Disease Registration Service (NDRS) (Cancer Registry and Cancer Pathways). The HES and NDRS datasets only include records collected by NHS England (NHSE), meaning these data contain only care records from NHS providers in England. For descriptions of the HES datasets see the Release 7 page.
Cancer Pathways contains a summary of patient pathways from diagnosis to treatment and follow-up. It covers pathway events for tumours diagnosed from 1 January 2013 onwards.
Cancer Registry is a collated dataset of all registrable tumours as defined by National Cancer Registration and Analysis Service (NCRAS). The NCRAS is used to build a picture of a patient's treatment from diagnosis. The data includes information on patient diagnosis, the tumour and any treatment events.
The Linked Participants table lists all the participants who were successfully linked to an NHS number.
We used the HES data dictionary v2.02 and NDRS data dictionary v5.2 for validation on variable format and codes.
What changes have been made as part of this release?
We have added 2 new NHS-E data buckets that researchers may apply for: Cancer Registry and Cancer Pathways. We are also releasing more fields as part of the Hospital Episode Statistics (HES) datasets, including Index of Multiple Deprivation (IMD) and pseudonymised healthcare provider codes. The new provider codes are the pseudonymised version of the trust-level PROCODE3 field in the HES and NDRS datasets. For more information on how this field was pseudonymised see the Linked health records data page.
In total, we mapped over 480 providers. Seven providers (1.4% of all providers in the data) were mapped to unknown because they did not appear in either the NHS Organisation Data Service API or the Archived Closed Organisation dataset.
All linked health records data have been provided by NHS England.
We have also added a new table, participant_nhs_linked, which contains a list of all participants who were successfully linked to an NHS number. This table is available alongside any approved linked health record dataset and will help in identifying which participants have been linked to an NHS number but have no recorded healthcare contacts in any of the datasets provided. In the United Kingdom, anyone who is registered for care with the NHS is assigned an NHS number. The NHS number is assigned either at birth or when NHS care is first received. This number is valid for life and only reassigned in specific circumstances like adoption or gender reassignment. For more information on NHS numbers see the NHS website. As a result, it is possible to have an NHS number yet have no healthcare contact beyond primary care.
The participant_nhs_linked table is generated from the Demographics table provided by NHSE after linkage. The Demographics table contains only participants who were successfully linked to an NHS number. There are limitations to generating this table from the Demographics table. It is possible that a small number of participants may be missing from the Demographics table but still have linked health records data in the release. In this release, there are 12 participants with linked health records data who are not in the participant_nhs_linked table. For more information on this issue and recommendations for working with the data see the section on known data issues.
How did we de-identify the linked health records data to minimise risks of identifying participants?
For categorical fields with a higher risk of re-identification, we suppressed categories which included fewer than 10 participants. To avoid the suppressed category being deduced by elimination, the next smallest category was also suppressed. Categories were suppressed by replacing the coded entries for corresponding participants with the suppression code, -999.
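The suppression rule described above can be sketched as follows. This is one reading of the rule, not the actual Our Future Health implementation. It assumes participants are counted as distinct IDs per category, that only the single next-smallest category is suppressed in addition to those below the threshold, and that ties are broken arbitrarily. The function names and example data are hypothetical.

```python
from collections import defaultdict

SUPPRESSION_CODE = -999
MIN_PARTICIPANTS = 10


def codes_to_suppress(rows):
    """Given (participant_id, code) pairs, return the set of codes to suppress."""
    participants = defaultdict(set)
    for pid, code in rows:
        participants[code].add(pid)
    sizes = {code: len(pids) for code, pids in participants.items()}

    # Categories with fewer than 10 distinct participants.
    hidden = {code for code, n in sizes.items() if n < MIN_PARTICIPANTS}
    if hidden:
        # Also hide the next smallest category so the suppressed one
        # cannot be deduced by elimination.
        remaining = {code: n for code, n in sizes.items() if code not in hidden}
        if remaining:
            hidden.add(min(remaining, key=lambda c: (remaining[c], str(c))))
    return hidden


def suppress(rows):
    """Replace suppressed codes with the suppression code, -999."""
    rows = list(rows)
    hidden = codes_to_suppress(rows)
    return [(pid, SUPPRESSION_CODE if code in hidden else code) for pid, code in rows]


# Invented example: 50 participants with code "21", 12 with "22", 3 with "2C".
rows = (
    [(f"P{i}", "21") for i in range(50)]
    + [(f"P{i}", "22") for i in range(50, 62)]
    + [(f"P{i}", "2C") for i in range(62, 65)]
)
print(sorted(codes_to_suppress(rows)))
```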
The following fields had suppression applied: admission source (ADMISORC), admission method (ADMIMETH), and discharge destination (DISDEST) in HES Admitted Patient Care. The table below shows which codes are suppressed in each column.
In the NDRS Cancer Registry Treatment dataset, we are releasing a field which lists chemotherapy drugs received during treatment (CHEMO_ALL_DRUGS). This field also contains the name of any clinical trials a participant was enrolled in during treatment. To mitigate re-identification risk, we replaced the name of the clinical trial with 'ANONYMISED CLINICAL TRIAL'.
ADMIMETH
2C = Baby born at home as intended;
25 = Admission via Mental Health Crisis Resolution Team;
83 = Baby born outside the Health Care Provider except when born at home as intended
ADMISORC
38=Penal establishment: police station;
39=Penal establishment, court or police station / police custody suite;
40=Penal establishment;
41=Court;
42=Police Station / Police Custody Suite;
48=High security psychiatric hospital, Scotland;
49=high security psychiatric accommodation in an NHS hospital provider;
50=NHS other hospital provider: medium secure unit
DISDEST
38=Penal establishment: police station;
39=Penal establishment, court or police station / police custody suite;
40=Penal establishment;
42=Police Station / Police Custody Suite;
48=High security psychiatric hospital, Scotland;
49=high security psychiatric accommodation in an NHS hospital provider;
50=NHS other hospital provider: medium secure unit
What should I be aware of when working with the linked health records data in this release?
Please note that there are 12 participants in the data who are not listed in the Linked Participants entity. We are working with NHSE to investigate this discrepancy and aim to resolve it in the next release. We recommend either removing these 12 participants from any analyses or adding 12 to any calculation where the total number of participants is derived from the Linked Participants table.
This discrepancy can occur because each entity in the release is generated from different data sources. We are working with NHSE to ensure that these resources are harmonised.
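Both recommended options can be expressed with simple set operations on participant IDs. The following is a minimal sketch, assuming the IDs from the Linked Participants table and from the health records tables have already been loaded into Python collections; the function names and example IDs are hypothetical.

```python
def unlisted_participants(linked_pids, record_pids):
    """Participants who have health records but are absent from the linked table."""
    return set(record_pids) - set(linked_pids)


def total_linked_participants(linked_pids, record_pids):
    """Denominator that includes participants missing from the linked table."""
    return len(set(linked_pids) | set(record_pids))


# Invented IDs for illustration.
linked = ["P1", "P2", "P3"]
with_records = ["P2", "P3", "P4"]

missing = unlisted_participants(linked, with_records)
# Option 1: drop the unlisted participants from the analysis.
analysis_pids = [pid for pid in with_records if pid not in missing]
# Option 2: count them in the denominator.
denominator = total_linked_participants(linked, with_records)
print(sorted(missing), analysis_pids, denominator)
```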
Further information on known data quality issues in the NHSE datasets can be found in the NHSE HES Data Quality Reports (external link).
Are there any limitations to the data available in the current release?
The cohort browser in the Trusted Research Environment can only filter numeric or categorical data. In the NDRS Cancer datasets, some fields with diagnosis and procedure information, such as ICD-10 and OPCS-4 codes, are entered as strings. Therefore, it is not possible to use the cohort browser to filter by specific diagnosis and procedure codes in those datasets. To filter for specific ICD-10 or OPCS-4 codes, we recommend loading and filtering the data using a Jupyter Notebook. It is possible to access ICD-10 and OPCS-4 information in the cohort browser for the HES and ONS Death Registration datasets.
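Filtering string-valued codes in a notebook could look like the sketch below. It assumes the rows have already been loaded as dicts; the field name `diagnosis_code`, the helper name and the example values are hypothetical, so check the data dictionary for the real field names and code formats. Codes are matched by prefix after removing dots, so "C50" matches both "C50.9" and "C509".

```python
def matches_icd10(code, prefixes):
    """Return True if `code` starts with any of the given ICD-10 prefixes.

    Prefixes are expected without dots and in upper case, e.g. "C50".
    """
    if not code:
        return False
    normalised = code.replace(".", "").strip().upper()
    return normalised.startswith(tuple(prefixes))


# Invented rows for illustration.
rows = [
    {"pid": "P1", "diagnosis_code": "C50.9"},
    {"pid": "P2", "diagnosis_code": "C341"},
    {"pid": "P3", "diagnosis_code": None},
]

# ICD-10 C50: malignant neoplasm of breast.
breast = [row for row in rows if matches_icd10(row["diagnosis_code"], ["C50"])]
print([row["pid"] for row in breast])
```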
We currently do not have any procedures in place to prevent individuals from registering multiple times using different contact details. We are aware of a small number of instances where the same individual may have submitted multiple questionnaires. This resulted in the same participant being assigned multiple PIDs (a duplicate participant), which can be seen in the linked health records data as duplicate records.
These duplicate participants can be identified in the HES and NDRS datasets by locating duplicate row-level identifiers; for example, duplicate EPIKEY entries in Admitted Patient Care (APC).
These duplicate participants are currently being investigated, and we are not removing these individuals from the data. This affects only a small number of records in each affected dataset (<0.4% of all records in each dataset).
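One way to locate candidate duplicate participants is to group rows by the row-level identifier and report identifiers that appear under more than one PID. The following is a minimal sketch, assuming APC rows are available as dicts; the `pid` field name, the helper name and the example values are hypothetical.

```python
from collections import defaultdict


def pids_sharing_row_identifier(rows, key="EPIKEY", pid_field="pid"):
    """Map each row-level identifier to the PIDs it appears under,
    keeping only identifiers shared by more than one PID."""
    pids_by_key = defaultdict(set)
    for row in rows:
        pids_by_key[row[key]].add(row[pid_field])
    return {k: sorted(pids) for k, pids in pids_by_key.items() if len(pids) > 1}


# Invented rows for illustration.
apc_rows = [
    {"pid": "P1", "EPIKEY": "1001"},
    {"pid": "P2", "EPIKEY": "1002"},
    {"pid": "P9", "EPIKEY": "1001"},  # same episode recorded under a second PID
]
shared = pids_sharing_row_identifier(apc_rows)
print(shared)
```

PIDs that share an identifier are candidates for being the same individual and can then be handled according to the needs of the analysis.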
How are the data structured?
The data release includes 8 entities, organised as follows:
Hospital Episode Statistics
nhse_eng_inpat (Admitted Patient Care)
nhse_eng_ed (Accident and Emergency)
nhse_eng_outpat (Outpatients)
Civil Registrations of Death
nhse_engwal_deaths
National Disease Registration Service Cancer Datasets (NDRS)
nhse_eng_canpat (Cancer Pathways)
nhse_eng_canreg_pattumour (Cancer Registry Patient Tumour)
nhse_eng_canreg_treat (Cancer Registry Cancer Treatment)
Linked Participants
participant_nhs_linked
The table below summarises the available data, including number of available fields and dates for each dataset.
The entity names indicate the data source, geographic data coverage, and name of the dataset. For example, nhse_engwal_deaths indicates the data source is NHS England, the entity includes data from England and Wales, and the dataset is for deaths.
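The naming convention can be unpacked programmatically by splitting on the first two underscores. The following is a minimal sketch with a hypothetical helper name. It applies only to entities that start with "nhse_"; participant_nhs_linked does not follow this pattern and is rejected.

```python
def parse_entity_name(entity):
    """Split an entity name such as nhse_engwal_deaths into its three parts."""
    if not entity.startswith("nhse_"):
        raise ValueError(f"Entity does not follow the source_coverage_dataset pattern: {entity!r}")
    # Split on the first two underscores only; dataset names may contain underscores.
    source, coverage, dataset = entity.split("_", 2)
    return {"source": source, "coverage": coverage, "dataset": dataset}


print(parse_entity_name("nhse_engwal_deaths"))
print(parse_entity_name("nhse_eng_canreg_pattumour"))
```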
Dataset | Description | Date coverage | Number of fields
HES Admitted Patient Care | Episodes of in-patient care | 1 April 2007 to 31 March 2023 | 108
HES Accident & Emergency | Attendance of major A&E department | 1 April 2007 to 31 March 2023 | 91
HES Outpatient | Outpatient appointments | 1 April 2007 to 31 March 2023 | 55
ONS Death Registration | Death registration and mortality data | 9 June 2022 to 5 June 2024 | 19
NDRS Cancer Pathways | Cancer pathways data | 1 January 2013 to 23 August 2023 | 12
NDRS Cancer Registry Patient Tumour | Cancer treatment data at tumour-level | 3 January 1995 to 31 December 2021 | 49
NDRS Cancer Registry Cancer Treatment | Cancer data by treatment event at given tumour | 9 May 1977 to 5 July 2023 | 22
Linked Participants | Participants successfully linked to an NHS number | All participants who submitted questionnaire before 11 April 2024 | 2
What metadata is available to help document the data release?
We provide the following data files on our Data and cohort page (external link):
Data dictionary – which defines the raw data fields and metadata information, such as labels, descriptions and units of measurements
Participant and Questionnaire coding file – which contains the granular details of categorical or raw coded values for fields contained within the participant and questionnaire data
NHS England linked health records coding file - which contains raw coded values for fields within each of the linked health records data sets
CPRA variant list – which contains a list of genetic variant IDs which map to the genetic variants available in our genotype files
If using Microsoft Excel to browse these files, for an optimal viewing experience, ensure the encoding settings are set to UTF-8.
On the What type of questions did we ask? section of the Questionnaire data page we also provide:
human readable versions of both version 1 and version 2 of the questionnaire - which are text copies of the baseline health questionnaire
a questionnaire logic codebook – which represents dynamic logic implemented for v2.2 of the baseline health questionnaire and can be used in conjunction with v2 of the human readable questionnaire