Release 7
Information about the data released on 26 June 2024
What data is included in Release 7?
All 989,132 participants included in this release have completed and submitted the baseline questionnaire, and for 330,069 of these individuals we have generated genotype array data. 644,119 participants have at least one record in one or more of the linked health records data tables.
Participant data
The participant table includes information from 989,132 participants who have registered and consented to join the Our Future Health programme, on or before April 11 2024. Each record in the participant table corresponds to exactly one record in the questionnaire data.
Questionnaire data
The questionnaire table includes data from those participants who have completed either v1, v2 or v2.1 of the Our Future Health baseline questionnaire. This includes participants who joined during the initial pilot in 2021 and after the main recruitment period began in the summer of 2022.
Participants who started the questionnaire on or after 24 May 2021 will have completed v1 of the questionnaire (N = 51,906 participants).
Participants who started the questionnaire on or after 20 November 2022 will have completed v2 of the questionnaire (N = 723,653 participants).
Participants who started the questionnaire on or after 21 December 2023 will have completed v2.1 of the questionnaire (N = 213,573 participants).
A questionnaire is considered complete when a participant has answered all sections and has submitted their responses.
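As a quick consistency check, the three version counts above should add up to the 989,132 participants in this release. The sketch below uses the figures quoted in this page; the variable names are illustrative only.

```python
# Participant counts per questionnaire version, as quoted in this release note.
version_counts = {"v1": 51_906, "v2": 723_653, "v2.1": 213_573}

# Total number of participants with a completed questionnaire.
total_participants = sum(version_counts.values())

# Share of the release contributed by each version, as a percentage.
version_share = {
    version: round(100 * count / total_participants, 1)
    for version, count in version_counts.items()
}

print(total_participants)
print(version_share)
```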
Genotype array data
The genotype data release contains information on 700,138 variants for 330,069 participants. This data was obtained using a custom Illumina Infinium Excalibur beadchip array, designed by Our Future Health in collaboration with Illumina.
Linked health records data
We sent 703,066 participants, who completed their questionnaire before 3 November 2023, for matching. 681,718 (97%) of the 703,066 participants sent to NHSE received a match. 644,119 participants have at least one record in one or more of the linked health records data tables.
This first Linked Health Records data release includes participants that completed their questionnaire before 3 November 2023 and therefore contains fewer participants than the current Questionnaire data release. This is due to lag between the submission of participant details to NHSE and the data being received, quality assured and processed.
Participant and Questionnaire Data
For information on how we:
de-identify data
manage re-identification risk
version control
tailor questionnaire journeys
store the data in the TRE
name our variables
see our Participant data and Questionnaire data pages.
What are the inclusion criteria for this release?
For the current release, all participants must have submitted a complete questionnaire, on or before April 11 2024.
What has changed since previous releases?
No updates have been made to the current release. For information on updates made in previous releases regarding:
Minor versioning
Handling questions with multi-select answers
Information on conditions for dynamic questions
see our section on changes in Release 6.
What should I be aware of when working with the participant and questionnaire data in this release?
Technical data loss
A suspected system issue that occurred prior to October 2022 resulted in a small number of questionnaires submitted around that time having missing data for some questions. The missing data cannot be explained by errors in dynamic logic. We are analysing the impact and will provide further information in future releases.
Implausible age and year combinations
Responses to questions about age or year of birth are initially validated against the participant's date of birth. However, if a participant later updates their date of birth, their previous answers to these questions are not re-validated. As a result, some responses may become inconsistent with the updated date of birth. This issue affects only a small number of cases, and we plan to address it in future releases.
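Researchers who want to screen for these cases themselves could use a check along the following lines. This is a minimal sketch, not the validation used by Our Future Health: the function names and arguments are hypothetical, and it assumes that negative values are special response codes (such as "Prefer not to answer") rather than real ages or years.

```python
def is_plausible_age(reported_age, birth_year, submission_year):
    """Return True if an age answer is possible given the year of birth.

    A participant cannot report an event at an age greater than the age
    they had reached in the year they submitted the questionnaire.
    """
    # Unanswered questions and negative codes (assumed to be special
    # responses) are not checked.
    if reported_age is None or reported_age < 0:
        return True
    return reported_age <= submission_year - birth_year


def is_plausible_year(reported_year, birth_year, submission_year):
    """Return True if a year answer falls between birth and submission."""
    if reported_year is None or reported_year < 0:
        return True
    return birth_year <= reported_year <= submission_year
```

Records that fail either check are candidates for manual review rather than automatic exclusion, since the inconsistency may come from the updated date of birth and not from the answer itself.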
Multiple submissions by the same participants
We currently do not have any procedures in place to prevent individuals from registering multiple times using different contact details. We are aware of a small number of instances where the same individual may have submitted multiple questionnaires. This has not been fully investigated, but would likely affect fewer than 0.1% of submissions.
Errors in questionnaire logic
The following section contains details about questions that have had errors in dynamic logic.
For more details on dynamic logic see Do all participants respond to every question?
All issues listed below have been fixed in questionnaire version v2.1.
Questions about work status
Participants can provide multiple responses to the question "Which of the following describes your current situation?" WORK_STATUS_2_M. Choosing any of the answer options "In paid employment or self-employed" (1), "Looking after home and/or family" (3), "Doing unpaid or voluntary work" (6) or "On paid leave (e.g. parental leave, long term sick leave, furlough)" (8) will prompt additional questions about their work:
"How many years have you worked in your current job?"
WORK_YRS_1_1
"In a typical WEEK, how many hours do you spend at work?"
WORK_WK_HRS_1_1
"How many times a WEEK do you travel from home to your main work?"
WORK_WK_TRAVEL_1_1
"What types of transport do you use to get to and from work?"
WORK_TRANSPORT_1_M
"About how many miles is it between your home and your work?"
WORK_DISTANCE_1_1
"Does your work involve walking or standing for most of the time?"
WORK_WALK_STAND_1_1
"Does your work involve heavy manual or physical work?"
WORK_MANUAL_LABOUR_1_1
"Does your work involve shift work?"
WORK_SHIFTS_1_1
Due to an error in dynamic logic, if a participant selected any of the answer values 1, 3, 6 or 8 in combination with a response that includes skip logic (meaning that participants won’t be asked about their work situation), for example "Retired" (2) or "Full or part-time student" (7), all work-related fields mentioned above were skipped. This issue is specific to v2.
As part of this fix, an additional bug was introduced. See Boundary responses for questions about work status in the section below for details.
Contraceptive methods
Participants can provide multiple responses to the question "What have you used for contraception?" GYN_CONTRACEPT_METHODS_1_M. If a participant selects "Combined Pill" (1) and/or "Mini Pill" (5), they are supposed to be subsequently shown questions about the age at which they first and last took the contraceptive pill:
"About how old were you when you first went on the contraceptive pill?"
GYN_CONTRACEPT_PILL_FIRST_AGE_1_1
"How old were you when you last used the contraceptive pill?"
GYN_CONTRACEPT_PILL_LAST_AGE_1_1
Due to an error in dynamic logic, if a participant selected the answer values 1 and/or 5 in combination with a response that includes skip logic (i.e. any other response), the age-related questions above were skipped. This issue is specific to v2 of the questionnaire.
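The condition described above can be turned into a simple flag for records where the pill-age questions are likely to be missing because of the logic error. This is a sketch under assumptions: the function name is hypothetical, and it assumes the multi-select answer is available as a list of integer codes and the questionnaire version as a string such as "v2".

```python
# Answer codes for "Combined Pill" (1) and "Mini Pill" (5).
PILL_OPTIONS = {1, 5}


def may_have_skipped_pill_age_questions(contraceptive_methods, version):
    """Flag v2 records affected by the contraceptive pill logic error.

    A record is flagged when a pill option was selected together with
    at least one other response.
    """
    selected = set(contraceptive_methods)
    chose_pill = bool(selected & PILL_OPTIONS)
    chose_other = bool(selected - PILL_OPTIONS)
    return version == "v2" and chose_pill and chose_other
```

For flagged records, missing values in GYN_CONTRACEPT_PILL_FIRST_AGE_1_1 and GYN_CONTRACEPT_PILL_LAST_AGE_1_1 are better read as "not asked" than as "not answered".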
Duplication of follow-up questions on reasons for change in smoking
Participants who reported that their smoking habits had not decreased in the last 10 years (selecting answers "More nowadays" (1) or "About the same" (2) to the question "Compared to 10 years ago do you smoke..." SMOKE_CHG_2_1) were erroneously asked the following independent question pair:
"Why did you reduce your smoking?"
SMOKE_CHG_REDUCE_REASON_2_M
"In the time that you have smoked, have you ever stopped for more than 6 months?"
SMOKE_CHG_REDUCE_ABST_1_1
resulting in a duplication of data. This bug did not result in data loss, since the relevant participants were correctly presented with the required pair of follow-up questions:
"In the time that you have smoked, have you ever stopped for more than 6 months?"
SMOKE_CHG_ABST_2_1
"When you stopped smoking for more than 6 months, why did you stop?"
SMOKE_CHG_ABST_REASON_1_M
Note that though the question text is the same for one of these questions, the data is contextually different, and stored independently in the data release.
This issue is specific to v2 and was fixed in v2.1.
Questions about vaping
Participants who:
selected a combination of the options "Cigarettes" (0) and "Electronic delivery devices that can be vaped, such as e-cigarettes (e.g. UWELL, Vype, Vuse, Vapouriz, WizMix)" (1) in response to the question "Have you ever REGULARLY used any of these tobacco products?"
SMOKE_REG_1_M
AND answered "No" (0) for the question "In the time that you have smoked, have you ever stopped for more than 6 months?"
SMOKE_CHG_ABST_2_1
erroneously skipped some follow-up questions in the smoking section, including the following vaping-related questions:
"How often, on average, did you use e-cigarettes (vaping) during the past 12 months?"
SMOKE_VAPE_AVG_1_1
"What type of e-liquids/cartridges do you or did you use in your e-cigarettes?"
SMOKE_VAPE_TYPE_1_M
This issue is specific to questionnaire v2 and was fixed in v2.1.
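The two conditions above can be combined into a flag for v2 records where the vaping questions may have been skipped in error. This is an illustrative sketch: the function name is hypothetical, and it assumes SMOKE_REG_1_M is available as a list of integer codes and SMOKE_CHG_ABST_2_1 as a single integer code.

```python
def may_have_skipped_vaping_questions(smoke_reg, smoke_chg_abst, version):
    """Flag v2 records affected by the vaping logic error.

    smoke_reg: selected codes for SMOKE_REG_1_M (0 = cigarettes,
        1 = electronic delivery devices).
    smoke_chg_abst: answer code for SMOKE_CHG_ABST_2_1 (0 = "No").
    """
    products = set(smoke_reg)
    used_both = {0, 1} <= products  # both cigarettes and e-cigarettes selected
    return version == "v2" and used_both and smoke_chg_abst == 0
```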
The issues outlined below are scheduled to be fixed in a future release:
these issues currently occur in both or one of v2 and v2.1 questionnaire submissions
fixes to all issues below will be implemented in v2.2 which is planned for release in summer 2024
Questions about walking
For the question "Thinking about the last 4 weeks, in a typical WEEK, on how many days did you walk for at least 10 minutes at a time?" (ACTIVITY_WALK_DAYS_1_1 in questionnaire v1 and ACTIVITY_WALK_DAYS_2_1 in v2 and v2.1), if participants respond with the answer "Unable to walk" (-2), they should skip the following questions:
"How many minutes did you usually spend walking on a typical DAY?"
ACTIVITY_WALK_MINS_1_1 or ACTIVITY_WALK_MINS_2_1
"How would you describe your usual walking pace?"
ACTIVITY_WALK_PACE_1_1
"Do you get short of breath walking with people of your own age on level ground?"
HEALTH_RESP_SHORT_1_1
"Do you get a pain in either leg on walking?"
HEALTH_PAIN_LEG_1_1
However, this logic is currently not working as intended: participants who are unable to walk are progressing to the questions listed above.
This issue affects all versions of the questionnaire.
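One way to handle this in analysis is to blank the follow-up answers for participants who reported being unable to walk, so that the data match the intended questionnaire journey. The sketch below assumes each record is a dictionary keyed by variable name; the function name is hypothetical, and whether to blank or keep these answers is an analysis decision, not a recommendation from this release note.

```python
UNABLE_TO_WALK = -2

# The walking-days question has a different variable name in v1 and v2/v2.1.
WALK_DAYS_FIELDS = ("ACTIVITY_WALK_DAYS_1_1", "ACTIVITY_WALK_DAYS_2_1")

# Follow-up questions that should have been skipped.
WALK_FOLLOW_UPS = (
    "ACTIVITY_WALK_MINS_1_1",
    "ACTIVITY_WALK_MINS_2_1",
    "ACTIVITY_WALK_PACE_1_1",
    "HEALTH_RESP_SHORT_1_1",
    "HEALTH_PAIN_LEG_1_1",
)


def blank_walk_follow_ups(record):
    """Return a copy of the record with follow-ups set to None
    when the participant answered "Unable to walk" (-2)."""
    cleaned = dict(record)
    if any(record.get(field) == UNABLE_TO_WALK for field in WALK_DAYS_FIELDS):
        for field in WALK_FOLLOW_UPS:
            if field in cleaned:
                cleaned[field] = None
    return cleaned
```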
Regular but inconsistent smoking behaviours
Participants who respond to the question "Do you smoke cigarettes now?" SMOKE_STATUS_2_1 with "Yes, some days" (2), "Yes, but rarely" (3) or "No, not at all" (0), and respond to the question "Did you ever smoke cigarettes on most or all days?" SMOKE_PREV_REG_2_1 with "No" (0) or "Prefer not to answer" (-3), should proceed to the following questions:
"Compared to 10 years ago do you smoke..."
SMOKE_CHG_2_1
"In the time that you have smoked, have you ever stopped for more than 6 months?"
SMOKE_CHG_ABST_2_1
"When you stopped smoking for more than 6 months, why did you stop?"
SMOKE_CHG_ABST_REASON_1_M
However, due to an error in logic, the majority of participants who meet the above criteria fail to progress appropriately.
This issue affects v2 and v2.1.
Boundary responses for questions about work status
In v2.1 a fix was applied to ensure all participants who selected options with conflicting logic for the question "Which of the following describes your current situation?" WORK_STATUS_2_M were asked the correct set of following questions. However, the fix now prevents participants who select boundary responses ("Prefer not to answer" (-3) and "None of the above" (-7)) from progressing to the following questions, which is incorrect.
This issue affects a small number of participants, and impacts only v2.1 questionnaire submissions.
Questions about reasons for changes in alcohol consumption
The following questions should allow participants to select multiple responses. However, the question type is incorrect and only allows participants to select a single response. This results in potential data loss for participants who may have changed their alcohol consumption for several reasons:
"Which of the following, if any, do you think contributed to you stopping drinking alcohol?"
ALCOHOL_CHG_REDUCE_REASON_2_1
"Which of the following, if any, do you think contributed to you reducing the amount you drank?"
ALCOHOL_CHG_ABST_REASON_2_1
This issue affects all versions of the questionnaire.
Questions about sex-specific conditions
A small number of records exhibit inconsistent combinations of self-reported sex, and sex-specific questions (where participants are dynamically shown certain questions, dependent on their sex).
This inconsistency may arise when participants modify their response to the question "What sex were you registered with at birth?" (DEMOG_SEX_1_1 or DEMOG_SEX_2_1) after already providing answers to sex-specific questions. Consequently, the erroneous responses to questions intended for the opposite sex are retained, instead of being removed in accordance with the updated questionnaire journey. The discrepancy affects less than 0.1% of questionnaire submissions across all versions.
Updates to dynamic logic documentation
The first Codebooks were made available with the P6 release (March 19 2024). Minor amendments have been applied to both the human readable templates and the codebooks to better represent the logic that is applied in the current questionnaire as of 26 June 2024.
Genotype array data
There are three categories of files included in the current release: two sets of files containing participant genotypes and one file containing sample-level information. Each set is provided as a separate data ‘entity’ within the Trusted Research Environment. Each participant is represented by a single sample in each file in the genotype data. The primary identifier is the participant ID (PID) which can be used to link these files with the Participant and Questionnaire tables.
The two sets of genotype files, pVCF and BGEN files, contain the same genotypes for the same participants and genetic variants. Each file set is split by chromosome, across 25 separate files (22 autosomal chromosomes, two sex chromosomes 'X' and 'Y', and mitochondrial 'MT'). Each chromosome-specific file has its own pVCF index file (.tbi), or BGEN index file (.bgi) and accompanying BGEN sample file (.sample). The pVCF contains additional genotype metadata that is not present in the BGEN file. We provide both types of files for convenience and to improve the experience of researchers using the data. The sample-level information file contains information useful for quality control (QC) purposes, such as batch, estimated genetic sex and call rate.
| Category | File type | Format | Entity | File name | Number of files | Description |
| --- | --- | --- | --- | --- | --- | --- |
| SNV pVCF | pVCF | VCF 4.1 | snv_pvcf | ofh_snv.v3.chrZ.vcf.gz | 25 | pVCF containing SNV genotypes and metadata |
| SNV pVCF | pVCF index file | - | snv_pvcf | ofh_snv.v3.chrZ.vcf.gz.tbi | 25 | pVCF-associated index file |
| BGEN | BGEN fileset | BGEN 1.2 | snv_bgen | ofh_snv.v3.chrZ.bgen | 25 | BGEN file containing SNV genotypes |
| BGEN | BGEN fileset | BGEN 1.2 | snv_bgen | ofh_snv.v3.chrZ.sample | 25 | BGEN-associated sample file |
| BGEN | BGEN index file | - | snv_bgen | ofh_snv.v3.chrZ.bgen.bgi | 25 | BGEN-associated index file |
| Sample QC | QC metrics | - | sample_qc_metrics | ofh_sample_qc_metrics.v3.tsv | 1 | Plain-text tabular file with sample-level information |
File names in this data release include the following components:
an indicator that the data comes from Our Future Health participants (“ofh_”)
an indicator of file contents (“snv” or “sample_qc_metrics”)
the version number (“.v3”) of the genotype data release, to be incremented with each release
the chromosome number (“.chrZ”) for the genetic variants contained within that file, where Z can be any of 1-22, 'X', 'Y' or 'MT'
a suffix representing the file type, e.g. “.vcf.gz”, per the specifications of pVCF or BGEN file sets, or “.tsv” for tab-separated values.
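The naming pattern above makes it straightforward to build the list of per-chromosome file names programmatically. This is a minimal sketch; the helper name and its arguments are illustrative, and the paths under which the files sit in the Trusted Research Environment are not covered here.

```python
# Chromosome labels used in the file names: 1-22, X, Y and MT.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]


def genotype_file_name(chromosome, suffix="vcf.gz", version="v3"):
    """Build a per-chromosome genotype file name, e.g. ofh_snv.v3.chr1.vcf.gz."""
    if chromosome not in CHROMOSOMES:
        raise ValueError(f"Unknown chromosome label: {chromosome!r}")
    return f"ofh_snv.{version}.chr{chromosome}.{suffix}"


# All 25 pVCF file names and their index files.
pvcf_files = [genotype_file_name(c) for c in CHROMOSOMES]
pvcf_index_files = [genotype_file_name(c, "vcf.gz.tbi") for c in CHROMOSOMES]
```

The same helper covers the BGEN file set by passing "bgen", "sample" or "bgen.bgi" as the suffix.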
What information do the genotype files contain?
Both the pVCF and BGEN file sets contain genotypes from all participants in the release and include genotype calls for all 22 autosomal chromosomes, X, Y and MT (mitochondrial). All genotypes are for single-nucleotide polymorphisms (SNPs) or small insertion-deletions (INDELs) aligned to the GRCh38 human reference genome. Multi-allelic variants have been split into separate biallelic records. The pVCF file metadata includes GenCall score, Log R Ratio, and B-allele Frequency, all of which are available in the FORMAT field. In the BGEN fileset, estimated genetic sex is included in the BGEN-associated sample file.
For more information on the fields present in each genotype file, please refer to the genotype data tab of the Our Future Health data dictionary. For more information on the exact genetic variants present in each genotype file, refer to the CPRA variant list. Both these files can be found on the Data and cohort page of our website.
What information does the sample QC file contain?
The sample QC file contains basic sample-level information useful for QC purposes or batch effect adjustments. For more information on the fields present in the file, please refer to the genotype tab of the Our Future Health data dictionary located on our Data and cohort page (external link).
What should I be aware of when working with the genotype data in this release?
This data release does not include sample QC results, other than limited outputs from the genotype calling process including call rate, genetic sex, and related aggregate genotyping QC metrics. No variant QC results are provided. We have, however, conducted several targeted statistical checks on the data, including assessment of principal components, sample heterozygosity, sample relatedness and variant Hardy-Weinberg statistics. We plan to include the outputs of these analyses in a future release. Users are reminded to conduct their own comprehensive statistical QC appropriately tailored to their research question.
We note the following issues with the current data release which we aim to address in a future release:
Estimated call rate is based on all chromosomes, including both the X and Y chromosomes. Median call rates for females may be lower than for males due to Y chromosome missingness.
The date string in the batch identifier within the sample QC metrics file is the date of genotype calling and not the date on which the genotyping assay started in the laboratory. Genotype calling using the intensity data files of some samples occurred considerably later than the laboratory assay was completed, so this date cannot be used for statistical adjustment of between-day variation introduced at genotyping stage.
The presence of non-haploid genotypes ('0/1', '0/0' and '1/1') was observed for Y and MT chromosome variants, affecting ~0.25% of Y chromosome genotype calls and a smaller proportion for MT, for both female and male samples, arising from low or noisy probe intensities for some genetic variants. Non-haploid genotypes occur outside of the pseudoautosomal regions (PARs) for the Y chromosome. These non-haploid genotype calls for haploid chromosomes should be treated as missing (no call). Note that some tools, such as plink2 or qctool may error or display unexpected behaviour when processing Y or MT chromosome files, due to the presence of these non-haploid genotypes. In future releases, non-haploid calls for Y and MT genetic variants will be set to missing in both the pVCF and BGEN files prior to data release.
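Until non-haploid calls are set to missing in the released files, they can be masked when reading Y and MT records from the pVCF. The sketch below is one possible approach, not the pipeline used for this release. It relies on the VCF convention that GT is the first colon-separated subfield of a sample column and that diploid calls contain "/" or "|"; the function name is hypothetical.

```python
def mask_non_haploid(sample_field):
    """Set a diploid GT call to missing ('.') in one VCF sample column.

    sample_field: a sample column such as "0/1:0.91", with GT first.
    Haploid calls such as "0" or "1" are returned unchanged.
    """
    genotype, separator, rest = sample_field.partition(":")
    if "/" in genotype or "|" in genotype:
        genotype = "."  # VCF missing value for a haploid call
    return genotype + separator + rest
```

This should only be applied to Y and MT variants, since diploid calls are expected on the autosomes.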
A small number of genetic variants (~200) were found to have been incorrectly reported in the pVCF and BGEN files, resulting from multi-mapping probes (where a probe sequence maps to multiple locations in the genome) or multi-base SNP targets which were misaligned to the reference genome during genotype calling. These genetic variants should be excluded from analysis. We provide a list of these variants by way of an indicator column "inaccurate annotation" in the CPRA variant list file to facilitate their exclusion. You can download this file from our Data and cohort page (external link). We aim to resolve this issue in future releases of genotype data.
Changes in laboratory reagents aimed at optimising genotyping, as well as continual improvement in laboratory processes, mean that some variation in the call rate distribution is evident between batches and across time. Further optimised cluster files for genotype re-calling in future releases will likely reduce the magnitude of these differences.
A small number of samples were estimated to have an implausibly large number of third-degree (or closer) relatives in the genotype data release. Researchers should aim to further investigate and appropriately action any sample exclusions based on calculated genetic relatedness in the cohort when conducting their own statistical QC. Similarly, any samples detected as genetic duplicates should be treated with caution and may arise from the same participant registering multiple times. These samples should not be considered to be identical twins/triplets without further confirmation. We aim to further investigate this small number of samples and provide recommendations for appropriate actions in future data releases.
Linked health records data
What information does the linked health records contain?
The data release includes four entities, organised as follows:
Hospital Episode Statistics
nhse_eng_inpat (Admitted Patient Care)
nhse_eng_ed (Accident and Emergency)
nhse_eng_outpat (Outpatients)
Civil Registrations of Death
nhse_engwal_deaths
The HES datasets provide a wide range of information on patient admissions to NHS facilities, including clinical, administrative, and geographic information. For more information on how HES data are collected and processed, please refer to the Data Collection page on the NHS England website (external link).
The data contains linked health records from selected Hospital Episode Statistics (HES) datasets – Admitted Patient Care (APC), Accident & Emergency (A&E), and Outpatient. The HES dataset only includes records collected by NHS England (NHS-E), meaning that only care records from NHS providers in England are included. This release does not include any records from providers in the devolved nations.
Admitted Patient Care (APC) contains information about episodes of care where a participant is admitted into hospital, including regular day or night attendances in England. Details include dates and methods of admission and discharge, the main and treatment specialities of the consultant responsible for the patient during the episode, recorded diagnoses, and types of operations and associated dates.
Accident and Emergency (A&E) contains information about attendances recorded at major A&E departments, single specialty A&E departments, walk-in centres and minor injury units in England. Details include dates and times of arrival, initial treatment and departure, source of referral and attendance disposal, investigations carried out and treatments provided.
Outpatient (OP) contains information about outpatient appointments in England. Details include the appointment date, the source of referral, whether the patient attended, the main and treatment specialities of the consultant responsible for the patient and the types of procedures undertaken.
The linked health records data release also includes death registration records from the Office for National Statistics, which includes mortality data for England and Wales.
We used the HES data dictionary v1.17 for validation on variable format and codes.
The table below summarises the available data, including number of available fields and dates for each dataset.
| Dataset | Description | Date range | Number of fields |
| --- | --- | --- | --- |
| HES Admitted Patient Care | Episodes of in-patient care | 1 April 2007 to 30 September 2023 | 88 |
| HES Accident & Emergency | Attendance of major A&E department | 1 April 2007 to 31 March 2020 | 51 |
| HES Outpatient | Outpatient appointments | 1 April 2007 to 30 September 2023 | 36 |
| ONS Death Registration | Death registration and mortality data | 1 June 2021 to 6 December 2023 | 18 |
How did we de-identify the linked health records data to minimise risks of identifying participants?
Within coded fields containing higher-risk information, codes were suppressed if they applied to fewer than ten participants. We suppressed the groups with fewer than ten participants and also the next smallest group, to further decrease re-identification risk.
All suppressed codes were replaced with the suppression code, ‘-999’.
The following variables contained codes for suppression: admission source (ADMISORC) and admission method (ADMIMETH) in HES Admitted Patient Care.
ADMIMETH
2C = Baby born at home as intended;
25 = Admission via Mental Health Crisis Resolution Team;
83 = Baby born outside the Health Care Provider except when born at home as intended
ADMISORC
38=Penal establishment: police station;
39=Penal establishment, court or police station / police custody suite;
40=Penal establishment;
41=Court;
42=Police Station / Police Custody Suite;
48=High security psychiatric hospital, Scotland;
49=high security psychiatric accommodation in an NHS hospital provider;
50=NHS other hospital provider: medium secure unit
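The suppression rule described above can be illustrated with a short sketch. This is one reading of the rule, not the exact procedure used for the release: it assumes one code per participant for a given field, treats "next smallest group" as the smallest group with ten or more participants, and breaks ties arbitrarily. The function names are hypothetical.

```python
from collections import Counter

SUPPRESSION_CODE = "-999"
MIN_GROUP_SIZE = 10


def codes_to_suppress(codes):
    """Return the set of codes to suppress for one field.

    codes: one code per participant for the field being checked.
    """
    counts = Counter(codes)
    small = {code for code, n in counts.items() if n < MIN_GROUP_SIZE}
    if not small:
        return set()  # nothing is suppressed if no group is below the threshold
    remaining = {code: n for code, n in counts.items() if code not in small}
    if remaining:
        # Also suppress the next smallest group (ties broken arbitrarily).
        small.add(min(remaining, key=remaining.get))
    return small


def suppress(codes):
    """Replace suppressed codes with the suppression code '-999'."""
    hidden = codes_to_suppress(codes)
    return [SUPPRESSION_CODE if code in hidden else code for code in codes]
```

Suppressing the next smallest group as well prevents a small count from being recovered by subtracting the visible groups from a known total.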
What should I be aware of when working with the linked health records data in this release?
There are invalid codes within some of the categorical variables. These have not been removed from the data. Any invalid codes are listed in the coding file as ‘invalid.’ These codes most likely occurred from data entry errors which were not cleaned. Further information on known data quality issues in the NHS-E datasets can be found in the NHS-E HES Data Quality Reports (external link).
What exclusions were applied to the data and how many participants were excluded?
Participants were excluded if date of death as stated in the Demographics table or ONS Death Registration occurred before Our Future Health registration. In total, we excluded 108 participants from the current release (0.02% of all matched participants). These participants are excluded while we investigate the validity of these matches.
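The exclusion rule amounts to a date comparison across the available death-date sources. A minimal sketch, assuming dates are held as `datetime.date` values and that a missing date is represented as `None`; the function name is hypothetical.

```python
from datetime import date


def died_before_registration(registration_date, death_dates):
    """Return True if any recorded date of death precedes registration.

    death_dates: dates of death from each source (for example the
    Demographics table and ONS Death Registration), None where absent.
    """
    return any(
        death is not None and death < registration_date
        for death in death_dates
    )
```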
Are there any limitations to the data available in the current release?
The current linked health records data release uses the Release 5 Questionnaire data cohort. This includes all participants who were eligible for linkage before 3 November 2023. Therefore, the denominator for the linked health records cohort is the participants in Release 5 minus the number of participants without a successful linkage. It is currently not possible to distinguish participants with a failed linkage and participants with successful linkage but no health records. It is possible to estimate the number based on the cohort linkage rate (97%).
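The cohort-level figures quoted on this page give a rough sense of the sizes involved. The sketch below only rearranges those published numbers; it cannot say which individual participants fall into each group, and the variable names are illustrative.

```python
# Figures quoted in this release note.
sent_for_matching = 703_066   # participants sent to NHSE for matching
matched = 681_718             # participants who received a match
with_records = 644_119        # participants with at least one linked record

# Linkage rate across the cohort, as a percentage.
linkage_rate = 100 * matched / sent_for_matching

# Participants sent for matching who did not receive a match.
unmatched = sent_for_matching - matched

# Participants sent for matching with no record in any linked table.
# This group mixes failed linkages, excluded participants and matched
# participants who simply have no records in the included datasets.
without_records = sent_for_matching - with_records

print(round(linkage_rate, 1), unmatched, without_records)
```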
The cohort browser in the Trusted Research Environment can only filter numeric or categorical data. Some fields with diagnosis and procedures information like ICD-10 and OPCS-4 codes are entered as strings. Therefore, it is not possible to use the cohort browser to filter by specific diagnosis and procedure codes. To filter for specific ICD-10 or OPCS-4 codes, we recommend loading and filtering the data using a Jupyter Notebook.
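In a notebook, filtering by diagnosis or procedure code can be done with a simple prefix match on the string fields. The sketch below uses only the standard library and assumes rows have already been loaded as dictionaries; the function name, the column names (DIAG_01, DIAG_02) and the example rows are hypothetical, so check the data dictionary for the real field names.

```python
def has_code(row, code_prefixes, code_fields):
    """Return True if any of the given fields starts with one of the prefixes.

    row: one record as a dictionary of column name to string value.
    code_prefixes: e.g. ["I21"] to match I21 and all its sub-codes.
    code_fields: the diagnosis or procedure columns to search.
    """
    prefixes = tuple(prefix.upper() for prefix in code_prefixes)
    for field in code_fields:
        value = (row.get(field) or "").strip().upper()
        if value.startswith(prefixes):
            return True
    return False


# Hypothetical example rows, for illustration only.
admissions = [
    {"pid": "P1", "DIAG_01": "I219"},
    {"pid": "P2", "DIAG_01": "E119", "DIAG_02": "I10"},
    {"pid": "P3", "DIAG_01": None},
]

# Keep rows with an acute myocardial infarction code (ICD-10 I21).
mi_rows = [row for row in admissions if has_code(row, ["I21"], ["DIAG_01", "DIAG_02"])]
```

Prefix matching means that a three-character category such as I21 also picks up its four-character sub-codes, whether or not the stored value contains a dot.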
We currently do not have any procedures in place to prevent individuals from registering multiple times using different contact details. We are aware of a small number of instances where the same individual may have submitted multiple questionnaires. This resulted in the same participant being assigned multiple PIDs (duplicate participant) and can be seen in the linked health records data as duplicate records.
Duplicate participants are present in APC, A&E, and Outpatient datasets. Duplicate participants are not present in ONS Death Registration dataset. These duplicate participants can be identified by locating duplicate row-level identifiers; for example, duplicate EPIKEY entries in Admitted Patient Care (APC).
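A row-level identifier that appears under more than one PID points to a likely duplicate participant. The sketch below groups PIDs by EPIKEY; it assumes rows are dictionaries, and the function name and the "pid" column name are hypothetical.

```python
from collections import defaultdict


def pids_sharing_row_keys(rows, key_field="EPIKEY", pid_field="pid"):
    """Map each row-level identifier to the PIDs it appears under,
    keeping only identifiers shared by more than one PID."""
    pids_by_key = defaultdict(set)
    for row in rows:
        pids_by_key[row[key_field]].add(row[pid_field])
    return {
        key: sorted(pids)
        for key, pids in pids_by_key.items()
        if len(pids) > 1
    }
```

The PIDs returned for a shared identifier can then be handled together, for example by keeping one PID per group in a sensitivity analysis.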
These duplicate participants are currently being investigated, and we are not removing these individuals from the data. This only impacts a small number of records in each affected dataset (about 0.1% of records in each dataset).
We may take a further decision about our approach to duplicates in the future but will communicate and justify any such decision.
What metadata is available to help document the data release?
We provide the following data files on our Data and cohort page (external link):
data dictionary – which defines the raw data fields and metadata information, such as labels, descriptions and units of measurements
Participant and Questionnaire coding file – which contains the granular details of categorical or raw coded values for fields contained within the participant and questionnaire data
NHS England linked health records coding file - which contains raw coded values for fields within each of the linked health records data sets
CPRA variant list – which contains a list of genetic variant IDs which map to the genetic variants available in our genotype files
If using Microsoft Excel to browse these files, for an optimal viewing experience, ensure the encoding settings are set to UTF-8.
On the What type of questions did we ask? section of the Questionnaire data page we also provide:
human readable versions of both version 1 and version 2 of the questionnaire - which are text copies of the baseline health questionnaire
a questionnaire logic codebook – which represents dynamic logic implemented for v2.1 of the baseline health questionnaire and can be used in conjunction with v2 of the human readable questionnaire
Will future releases be compatible with this data release?
Participant and Questionnaire data
We are expecting to make changes for future releases. For example, we:
have made every effort to avoid errors in the data production process but will aim to address any identified issues in the next release.
will provide additional information on questionnaire journeys.
will make improvements to how the smoking section is currently presented.
will consider one-hot encoding for questions with multiple answers that are currently only stored as an array.
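Until one-hot encoded fields are provided, multi-select answers stored as arrays can be expanded by researchers themselves. This is a minimal sketch of the idea, not the encoding that a future release will use: the function name and output column naming are hypothetical, and the option list must be taken from the coding file for the question concerned.

```python
def one_hot(selected, options, prefix):
    """Expand a multi-select answer into one 0/1 indicator per option.

    selected: the answer codes chosen by the participant.
    options: every possible answer code for the question.
    prefix: variable name used to build the indicator column names.
    """
    chosen = set(selected)
    return {f"{prefix}_{option}": int(option in chosen) for option in options}


# Illustrative subset of WORK_STATUS_2_M answer codes mentioned on this page.
example = one_hot([1, 8], [1, 2, 3, 6, 7, 8], "WORK_STATUS_2_M")
```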
Linked health records data
We are planning to release more NHS datasets, including National Disease Registration Service (NDRS) Cancer Pathways and Cancer Registration. We are also planning to release more variables in all the HES datasets in addition to those included in this release.