Release 13
Information about the data released on 11th December 2025
What data is included in Release 13?
Data from up to 1,929,752 participants are included in this release. Of those, all 1,929,752 participants have completed and submitted the baseline questionnaire, and 1,456,410 have completed their in-person Clinic Measurements appointment. Up to 1,841,458 participants are included in our various geographies data releases. For 775,118 of these individuals we have generated genotype array data. 1,690,845 participants were successfully linked to an NHS number, of whom 1,665,668 have at least one secondary care, dispensed medication, or death registration record.
Participant data
The Participant table includes information from 1,929,744 participants who have registered and consented to join the Our Future Health programme, and submitted a complete questionnaire on or before 5 October 2025.
Participant withdrawals are now processed separately for each dataset to ensure accurate and consistent handling. As a result, small differences in participant counts may appear across assets when one dataset was generated later than another. In some cases, this means the Participant table may contain slightly fewer records than other tables.
Participant geographies data
The participant geography data are divided into four tables. The Country and Region table covers 1,841,458 participants across England, Wales, and Scotland. The Middle Layer Super Output Area (MSOA) and Lower Layer Super Output Area (LSOA) tables each contain 1,813,223 participants from England and Wales, and the Intermediate Zones table includes 28,235 participants from Scotland. These are a subset of the participants who have completed and submitted a questionnaire on or before 5 October 2025.
Questionnaire data
Release 13 of the Questionnaire table includes 1,929,752 participants who have completed either v1, v2, v2.1 or v2.2 of the Our Future Health baseline questionnaire. This includes participants who joined during the initial pilots from 2021 and after the main recruitment period began in October 2022.
participants who started the questionnaire on or after 24 May 2021 will have completed v1 of the questionnaire (N = 52,780 participants)
participants who started the questionnaire on or after 20 November 2022 will have completed v2 of the questionnaire (N = 737,768 participants)
participants who started the questionnaire on or after 21 December 2023 will have completed v2.1 of the questionnaire (N = 369,738 participants)
participants who started the questionnaire on or after 13 June 2024 will have completed v2.2 of the questionnaire (N = 769,466 participants)
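The cut-off dates above can be read as a simple date lookup. The following is a minimal sketch, assuming each questionnaire version was superseded on the start date of the next one. The function name is illustrative and is not part of the Our Future Health data dictionary, and the released data remain the authority on which version a participant completed.

```python
from bisect import bisect_right
from datetime import date
from typing import Optional

# Start dates and participant counts as listed above.
VERSION_STARTS = [
    (date(2021, 5, 24), "v1"),
    (date(2022, 11, 20), "v2"),
    (date(2023, 12, 21), "v2.1"),
    (date(2024, 6, 13), "v2.2"),
]
VERSION_COUNTS = {"v1": 52_780, "v2": 737_768, "v2.1": 369_738, "v2.2": 769_466}


def expected_version(started_on: date) -> Optional[str]:
    """Return the version expected for a start date, or None if it predates v1.

    Assumption: a version stops being issued on the day the next one starts.
    """
    starts = [start for start, _ in VERSION_STARTS]
    idx = bisect_right(starts, started_on) - 1
    return VERSION_STARTS[idx][1] if idx >= 0 else None


# The per-version counts add up to the Questionnaire table total.
total_participants = sum(VERSION_COUNTS.values())
```

The per-version counts sum to 1,929,752, which matches the number of participants in the Questionnaire table.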
Clinic measurements data
As of July 2025, over 1.5 million participants have attended an Our Future Health Clinic appointment. The current release includes a subset of 1,456,410 participants who had both attended an appointment and completed and submitted a questionnaire on or before 5 October 2025.
Genetic data
The genotype data release contains information on 701,345 variants for 775,118 participants. This data was obtained using a custom Illumina Infinium Excalibur beadchip array, designed by Our Future Health in collaboration with Illumina.
The imputed genotype data release contains information on 159,587,100 variants for 550,000 participants. Phasing and imputation were performed using the UK Biobank 200K phased whole genome sequencing data as a reference panel.
Linked health records data
In this release, all participants who were linked to the same NHS number, and thus have signed up multiple times, have been removed from the linked health records datasets.
We are aware that some individuals may have registered multiple times using different email addresses, but we do not check for uniqueness of other personal information or conduct an identity check. See the Participant data page for a full description of the issue. Participants with multiple registrations, in which they have provided identical or nearly identical personal information (name, address and date of birth), may be linked to the same NHS number, and thus may have duplicate health records.
Any participants who linked to the same NHS Number have been removed from the linked health data cohort. This resulted in 12,352 participant identifiers (PIDs), and all accompanying health records associated with those PIDs, being removed from the linked health records data.
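The removal rule described above can be sketched as follows. This is an illustration only: the identifiers are made up, the function name is hypothetical, and NHS numbers are not something researchers would handle directly. The point is that every PID sharing an NHS number is dropped, not just the later registrations.

```python
from collections import Counter


def drop_shared_nhs_links(pid_to_nhs):
    """Keep only PIDs whose linked NHS number is linked to exactly one PID."""
    link_counts = Counter(pid_to_nhs.values())
    return {pid: nhs for pid, nhs in pid_to_nhs.items() if link_counts[nhs] == 1}


# Toy example: PID_A and PID_C were linked to the same NHS number.
links = {"PID_A": "nhs_1", "PID_B": "nhs_2", "PID_C": "nhs_1", "PID_D": "nhs_3"}
kept = drop_shared_nhs_links(links)
removed = sorted(set(links) - set(kept))
```

In this toy cohort both PID_A and PID_C are removed, along with any health records attached to them.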
In total, we have attempted linkage to health records data for 1,781,135 participants, who completed their questionnaire before 9 April 2025. 1,690,845 (95.6%) of the 1,781,135 participants sent to NHSE were successfully linked to an NHS number. 1,665,668 participants (98.5% of all linked participants) have at least one secondary care, dispensed medicine, or death registration record in one or more of the linked health records data tables.
Linked Health Records data from this release includes participants that completed their questionnaire before 9 April 2025 and, therefore, contains fewer participants than the current Questionnaire data release. This is due to lag between the submission of participant details to NHSE and the data being received, quality assured and processed.
Participant and Questionnaire data
What information does the Participant and Questionnaire data contain?
For details on what information is included in the Participant and Questionnaire data, see our Participant data and Questionnaire data pages. These pages cover how we:
de-identify data
manage re-identification risk
version control
tailor questionnaire journeys
store the data in the TRE
What changes have been made as part of this release?
Participants who have withdrawn from the programme have been removed from Release 13. Version v2.2 of the questionnaire remains the active live version.
As described above, participant withdrawals are now processed independently for each dataset to ensure accurate and consistent handling. Consequently, minor differences in participant counts may appear across assets, including instances where the Participant table contains slightly fewer records than other tables. These discrepancies are expected to be very small, and for P13, specifically, there are only 10 fewer participants in the Participant table than in Questionnaire. Where such discrepancies occur, the affected participants will be removed from all assets in the subsequent release.
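When joining assets, it can help to check which participant identifiers are missing from one table before analysis. The sketch below is a minimal example, assuming the PIDs from each table have already been loaded into Python collections. The function and key names are illustrative.

```python
def pid_differences(participant_pids, questionnaire_pids):
    """Compare the PIDs present in two tables and report where they differ."""
    participant = set(participant_pids)
    questionnaire = set(questionnaire_pids)
    return {
        "only_in_questionnaire": sorted(questionnaire - participant),
        "only_in_participant": sorted(participant - questionnaire),
        "in_both": sorted(participant & questionnaire),
    }


# Toy example: P3 appears in the Questionnaire table only.
diff = pid_differences(["P1", "P2"], ["P1", "P2", "P3"])
```

Restricting an analysis to the "in_both" identifiers gives a cohort that is consistent across the two tables.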
What should I be aware of when working with the participant and questionnaire data in this release?
Technical data loss
A suspected system issue that occurred prior to October 2022 caused a small number of questionnaires submitted around that time to have missing data for some questions. The missing data cannot be explained by errors in dynamic logic. We are analysing the impact and will provide further information in future releases.
Implausible age and year combinations
Responses to questions about age or year of birth are initially validated against the participant’s recorded date of birth at the time of response. However, if a participant later updates their date of birth, these earlier responses are not re-validated. The Participant data reflects the most recent date of birth, which may lead to inconsistencies between updated birth information and previously recorded responses. This issue affects only a small number of cases, and we plan to resolve it in a future data release.
Updated responses to parent questions
Due to the current data capture process, there are cases where a participant updates their response to a parent question, which correctly overwrites the original answer. However, responses to dependent (dynamic) questions linked to the previous parent response may persist, resulting in logical inconsistencies.
One example involves sex-specific questions. In a small number of records, there are inconsistencies between the participant’s self-reported sex and their responses to sex-specific items. This can occur when a participant changes their response to "What sex were you registered with at birth?" - recorded in fields DEMOG_SEX_1_1 or DEMOG_SEX_2_1 - after having completed questions tailored to their previous response. As a result, responses to questions intended for the opposite sex may be retained in error, rather than being removed or excluded based on the updated logic path.
This issue affects a very small proportion of submissions (less than 0.1% across all versions). We are actively working on a solution.
Errors in questionnaire configuration
For comprehensive documentation on all historical bugs related to errors in the implementation of dynamic logic, please refer to Change log for Questionnaire versions. Please note that errors in logic may persist across releases, even after they have been fixed for the affected version.
Updating records between releases
In exceptional cases, a participant’s record may appear to be modified between releases. For example, if a participant mistakenly completes a questionnaire intended for their partner, the incorrect record is deleted to allow the correct individual to submit their responses. Such cases are extremely rare, affecting fewer than 0.001% of records. See our documentation for Release 9 for more details.
Participants who have registered more than once (participant and questionnaire data)
As described on the Participant data page, we are aware that some individuals may have registered multiple times. This may mean that in a small number of cases, the same person may have submitted multiple questionnaires under different registrations.
Currently, it is not possible to identify these duplicate records from the participant or questionnaire data with high confidence. A participant who submitted multiple questionnaires under different registrations in good faith might be expected to provide similar answers, but the responses are unlikely to be identical, so matching on questionnaire responses is unreliable. Matching on responses would also not detect multiple registrations where questionnaire responses are very different. However, approved study applications which include linked data could use the linked health records to identify a large proportion of the duplicate records whenever submitted personal information is the same or similar (see What should I be aware of when working with the linked health records data in this release?).
Participant geographies data
What information does the Participant geographies data contain?
The participant geographies data currently consist of four separate datasets:
Country and region for England, Wales and Scotland
Middle Layer Super Output Areas (MSOA) for England and Wales
Lower Layer Super Output Areas (LSOA) for England and Wales
Intermediate Zones (IZ) for Scotland
These data are derived from participants’ self-reported addresses, collected during their registration to the Our Future Health programme. For details on how we process Participant geographies data and how we create the Participant geographies datasets, see our Participant geographies data page.
What changes have been made as part of this release?
In the previous release, geographic coverage was limited to country- and region-level data, capturing only a subset of the earliest participants. This release significantly expands the Participant Geographies datasets by adding LSOAs, MSOAs, and Scottish Intermediate Zones, providing much finer geographic detail.
To achieve this, we changed our approach from directly mapping latitude and longitude to country and region, to a sequential method that maps participant coordinates through lower-, mid-, and higher-level geographies. Although the mapping process has changed, it has been thoroughly validated and remains fully consistent with the methodology used in the earlier release. For more information, see the Participant geographies data page.
Removal of participants from Northern Ireland
In the previous Participant geographies release, 157 participants from Northern Ireland (NI) were included at the country level. For P13, however, all NI participants (fewer than 600) have been excluded from the Participant geographies release due to small numbers and the additional processing required to extend the pipeline. NI participants are expected to be included in future releases.
What should I be aware of when working with the Participant geographies data in this release?
Users should be aware that inclusion criteria, data processing, and geographic mapping methods may be refined in future releases. As a result, information and data completeness may change over time. Researchers should take this into account when interpreting or comparing data across releases.
Data loss for partial withdrawals
Participants who have fully or partially withdrawn from the programme are excluded from all Participant Geographies datasets. For partially withdrawn participants, data collected prior to withdrawal are normally retained; however, this newly created dataset did not capture geographic linkage information for these individuals. Consequently, data for partially withdrawn participants, approximately 0.25% of the cohort, are not included in this release.
Area coverage
Participants are represented across all four UK nations and all English regions; however, Northern Ireland is excluded from the current release. Coverage at finer geographic levels is uneven, and not all MSOAs, LSOAs, or Scottish Intermediate Zones are represented.
Overall, 94.6% of MSOAs (6,878 of 7,264) include more than 10 participants, with coverage higher in England (98.2%) and lower in Wales (36%). For LSOAs, overall coverage is 90.4% (32,236 of 35,672), with 94% of areas in England and 27.5% in Wales exceeding 10 participants. Scottish Intermediate Zones have lower coverage, with 24.95% of areas (332 of 1,332) including more than 10 participants.
Coverage is influenced by several factors:
Participant density: some areas, particularly rural or sparsely populated regions, have few registered participants.
Small-number suppression: areas with ten or fewer participants are removed to protect confidentiality.
Urban clustering: in densely populated areas, participants may cluster in a few neighbourhoods, leading to uneven representation across adjacent areas.
Programme enrolment patterns: geographic coverage will improve over time as more participants join the programme.
Exclusion of participants living in Crown Dependencies
Participants living in a Crown Dependency at the time of registration have been excluded from all datasets, including the Participant Geographies datasets. Fewer than 0.005% of eligible participants were excluded for this reason.
Exclusion of participants who manually entered their address
Our registration form uses the Ideal Postcodes API to validate participant addresses. Participants who manually entered their full address have been excluded from this release due to potential quality issues, formatting inconsistencies, or data capture errors, such as cases where postcodes could not be mapped to coordinates or only approximate matches were obtained. Fewer than 0.8% of eligible participants were excluded for this reason.
Exclusion of Data Zones for Scotland
To protect participant confidentiality and ensure statistical stability, a small-number suppression protocol is applied. Participants assigned to a lower-level area (e.g., LSOA or Data Zone) with ten or fewer individuals are excluded from that area and all upstream geographies.
Approximately 41% (3,031 of 7,392) of Scottish Data Zones contain ten or fewer participants. Data Zones are generally smaller in population and households than LSOAs and include a relatively higher proportion of rural areas. As a result, Data Zones are not included in the current Participant Geographies release but may be added in a future release once participant numbers and coverage are sufficient.
Suppressions at the Data Zone level and corresponding exclusions from higher-level geographies remain in effect as previously described.
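The suppression rule can be sketched in a few lines. This is a minimal illustration, assuming one lowest-level area code per participant. The function name and area codes are hypothetical and this is not the production pipeline.

```python
from collections import Counter

SUPPRESSION_THRESHOLD = 10  # areas with ten or fewer participants are suppressed


def suppress_small_areas(pid_to_area, threshold=SUPPRESSION_THRESHOLD):
    """Drop participants whose lowest-level area has `threshold` or fewer participants."""
    area_counts = Counter(pid_to_area.values())
    return {
        pid: area
        for pid, area in pid_to_area.items()
        if area_counts[area] > threshold
    }


# Toy example: 11 participants in one area, 10 in another.
assignments = {f"A{i}": "AREA_A" for i in range(11)}
assignments.update({f"B{i}": "AREA_B" for i in range(10)})
retained = suppress_small_areas(assignments)
```

Because suppressed participants are dropped entirely, any higher-level geography built from the retained assignments also excludes them, which mirrors the upstream exclusion described above.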
Exclusion of postcode-coordinate mismatches
As part of the geographic mapping process, participant addresses are converted to latitude and longitude coordinates and assigned to a single Output Area (OA) using a point-in-polygon spatial approach. These OA assignments are then aggregated to higher-level geographies via an internally generated lookup table, ensuring consistency across all spatial layers.
To validate this methodology, coordinate-derived OA assignments were compared with the August 2025 ONS Postcode Directory (ONSPD), which links postcodes directly to OAs. In the UK, postcodes may cover multiple households, and some postcodes can cross OA boundaries, particularly in rural areas or where boundaries are irregular. Approximately 3% of UK postcodes span multiple OAs. The ONSPD handles such cases by assigning postcodes with addresses that straddle a boundary to the OA corresponding to the mean grid reference of all addresses within that postcode.
During validation, roughly 2.5% of coordinate-based OA assignments did not match the ONSPD. These discrepancies may result from legitimate postcode splits, as described above, or from mapping inconsistencies. As a precaution, additional validation is ongoing, and affected participants have been temporarily excluded from all geography releases until the discrepancies are fully resolved.
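The validation step can be illustrated with a toy example. The sketch below uses a plain ray-casting point-in-polygon test on made-up square areas and a made-up postcode lookup. A real pipeline would use official boundary files and a geospatial library, and all codes, coordinates, and function names here are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon (list of vertices)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_area(x, y, areas):
    """Return the code of the first area containing the point, or None."""
    for code, polygon in areas.items():
        if point_in_polygon(x, y, polygon):
            return code
    return None


# Two adjacent toy Output Areas and a postcode lookup (one OA per postcode).
areas = {
    "OA_1": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "OA_2": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
postcode_to_oa = {"PC_1": "OA_1", "PC_2": "OA_1"}  # PC_2 straddles the boundary
participants = [("P1", "PC_1", 0.5, 0.5), ("P2", "PC_2", 1.5, 0.5)]

mismatches = [
    pid
    for pid, postcode, x, y in participants
    if assign_area(x, y, areas) != postcode_to_oa[postcode]
]
```

Here P2 lives in OA_2 by coordinates, but the lookup places the whole postcode in OA_1, so P2 is flagged as a mismatch of the kind described above.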
Participants who have registered more than once (participant geographies data)
As described on the Participant data page, we are aware that some individuals may have registered multiple times. Participants with multiple registrations in which they have provided identical or nearly identical personal information (name, address and date of birth) may have duplicate records in the participant geographies data.
Currently, it is not possible to identify these duplicate records from the participant geographies data directly. However, approved study applications which include linked data could use the linked health records to identify a large proportion of the duplicate records whenever submitted personal information is the same or similar (see What should I be aware of when working with the linked health records data in this release?).
Clinic measurements data
What information does the Clinic measurements data contain?
For details on what information is included in the Clinic measurements data see our Clinic measurements data page. This page covers how we:
de-identify data
manage re-identification risk
version control
store the data in the TRE
For the current release, all participants must have attended a clinic appointment and have submitted a complete questionnaire on or before 5 October 2025.
What changes have been made as part of this release?
There are no changes to this release. Participants who withdrew from the programme have been removed from Release 13. For information on how appointments are conducted, see Procedure for Clinic measurements.
What should I be aware of when working with the Clinic measurements data in this release?
Un-versioned updates to the appointments process
The current versioning approach applied to the Clinic measurements data table includes only two major versions, which can be used to identify whether or not a participant had an appointment that included heart rhythm or third heart readings. Other updates to the appointments process are not versioned. These updates include things such as:
introducing XS and XL blood pressure cuffs
changes to the order of measurements collected
addition of specific instructions for obtaining readings from pregnant individuals
For more details on versioning, please refer to the section on Change log for Clinic measurements appointment processes.
Multiple measurements obtained for heart readings
During the original appointment process (version 1), the protocol for heart readings was to obtain only two measurements. However, in version 1, it was reported that clinicians occasionally took multiple readings and re-entered values for the first two measurements, attempting to achieve more typical results. To mitigate this, version 2 introduced the option for a third reading if abnormal measurements were recorded for the first two readings.
Missing data for third heart readings
Due to technical issues, software updates, or rare system failures, there may be isolated cases of data capture inconsistencies. As of appointment version 2, participants who have abnormal readings recorded for their first and second set of heart measurements are offered the opportunity to provide a third set of measurements, as described in the section Do all participants provide every measurement?
However, we note two exceptions:
criteria met but data missing (false negative data): participants who meet the criteria for a third reading, but have no data for a third reading
criteria not met but data provided (false positive data): participants who do not meet the criteria but do have data for a third reading
This discrepancy affects fewer than 0.01% of records. The vast majority of participants who meet the criteria for third readings in version 2 have data recorded as expected.
Data capture for height, weight and waist measurements
During appointments, the following ranges are allowed for height, weight, and waist measurements:
height: Between 90 and 299 centimetres
weight: Between 20 and 400 kilograms
waist circumference: Between 30 and 200 centimetres
These ranges are intentionally broad and may not always reflect biologically plausible measurements. The same boundaries are applied to both height and weight in the Our Future Health Baseline Questionnaire.
We have identified infrequent outliers in the clinic measurements data that suggest occasional human error during data capture, affecting less than 1% of observations. These errors are likely to include:
waist circumference may have been entered in inches instead of centimetres
height and weight measurements may have been reversed, with height entered in the weight field and vice versa
the same values may have been erroneously entered for multiple fields (e.g., height and weight, or height, weight, and waist)
No mitigation has been applied in the current release, meaning these issues will persist in the data.
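Because no cleaning has been applied, researchers may wish to flag suspicious records themselves. The sketch below is one possible starting point, not an Our Future Health cleaning rule. The capture ranges come from the text above, while the three heuristic thresholds (`swap_height_max`, `swap_weight_min`, `waist_inch_max`) and the function name are assumptions chosen for illustration and should be tuned to the research question.

```python
def flag_measurements(height_cm, weight_kg, waist_cm,
                      swap_height_max=120.0,   # hypothetical heuristic threshold
                      swap_weight_min=140.0,   # hypothetical heuristic threshold
                      waist_inch_max=60.0):    # hypothetical heuristic threshold
    """Return a list of flags describing possible data-entry problems."""
    flags = []
    # Capture ranges stated in the documentation.
    if not 90 <= height_cm <= 299:
        flags.append("height_outside_capture_range")
    if not 20 <= weight_kg <= 400:
        flags.append("weight_outside_capture_range")
    if not 30 <= waist_cm <= 200:
        flags.append("waist_outside_capture_range")
    # The same value entered in more than one field.
    if height_cm == weight_kg:
        flags.append("identical_height_weight")
    # Height looks like a weight and weight looks like a height.
    if height_cm < swap_height_max and weight_kg > swap_weight_min:
        flags.append("possible_height_weight_swap")
    # A very small waist value may have been entered in inches.
    if waist_cm < waist_inch_max:
        flags.append("possible_waist_in_inches")
    return flags
```

For example, `flag_measurements(95, 180, 90)` returns a swap flag, while a typical record such as `flag_measurements(170, 70, 85)` returns no flags. Flagged records are candidates for review, not confirmed errors.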
To ensure accurate measurements are recorded, our data capture application and associated Standard Operating Procedures (SOPs) are continually updated with guidelines and prompts to assist in precise data collection. We are committed to addressing these data issues and may update our data cleansing rules in future releases.
Participants who have registered more than once (clinic measurements data)
As described on the Participant data page, we are aware that some individuals may have registered multiple times. This may mean that in a very small number of cases, the same person may have attended multiple in-person appointments under different registrations.
Currently, it is not possible to identify these duplicate records from the clinic measurements data directly. Even where a participant may have attended multiple in-person appointments and had physical measurements taken, natural variation and measurement error will mean that it is unlikely that the measurements would be identical. However, approved study applications which include linked data could use the linked health records to identify a large proportion of the duplicate records whenever submitted personal information is the same or similar (see What should I be aware of when working with the linked health records data in this release?).
Genetic data
Genotype array data
There are three categories of files included in the current release: two sets of files containing participant genotypes and one file containing sample-level information. Each set is provided as a separate data ‘entity’ within the Trusted Research Environment. Each participant is represented by a single sample in each file in the genotype data. The primary identifier is the participant ID (PID) which can be used to link these files with the Participant and Questionnaire tables.
The two sets of genotype files, pVCF and BGEN files, contain the same genotypes for the same participants and genetic variants. Each file set is split across regions for each chromosome (22 autosomal chromosomes, two sex chromosomes 'X', 'Y' and mitochondrial 'MT'), across 160 separate files. Each genotype file has an associated pVCF index file (.tbi), or BGEN index file (.bgi) specific for that chromosome region, in addition to an accompanying BGEN .sample file (.sample). We provide both types of files for convenience and to improve the experience of researchers using the data. The sample-level information file contains information useful for quality control (QC) purposes, such as batch, estimated genetic sex and call rate. The regional index BED file contains the chromosome and genomic coordinates of variants present within each pVCF or BGEN file.

Table 1 - File names for array genotype data
| Data | File type | Format | Entity | File name | Number of files | Description |
|---|---|---|---|---|---|---|
| SNV pVCF | pVCF | VCF 4.1 | snv_pvcf | ofh_snv.v9.chrZ-bXXXX.vcf.gz | 160 | pVCF containing SNV genotypes and metadata |
| SNV pVCF | pVCF index file | - | snv_pvcf | ofh_snv.v9.chrZ-bXXXX.vcf.gz.tbi | 160 | pVCF-associated index file |
| BGEN | BGEN fileset | BGEN 1.2 | snv_bgen | ofh_snv.v9.chrZ-bXXXX.bgen | 160 | BGEN file containing SNV genotypes |
| BGEN | BGEN fileset | BGEN 1.2 | snv_bgen | ofh_snv.v9.chrZ-bXXXX.sample | 160 | BGEN-associated sample file |
| BGEN | BGEN index file | - | snv_bgen | ofh_snv.v9.chrZ-bXXXX.bgen.bgi | 160 | BGEN-associated index file |
| Sample QC | QC metrics | - | sample_qc_metrics | ofh_sample_qc_metrics.v9.tsv | 1 | Plain-text tabular file with sample-level information |
| Regions index BED file | BED file | - | snv_resources | ofh_snv_regions.v9.bed | 1 | Plain-text tabular file in BED file format |
File names in this data release include the following components:
an indicator that the data comes from Our Future Health participants (“ofh_”)
an indicator of file contents (“snv” or “sample_qc_metrics”)
the version number (“.v9”) of the genotype data release, to be incremented with each release
the chromosome number (“.chrZ”) for the genetic variants contained within that file, where Z can be any of 1-22, 'X', 'Y' or 'MT'
the region identifier (“-bXXXX”) which maps to the genomic coordinates in the BED file
a suffix representing the file type, e.g. “.vcf.gz”, per the specifications of pVCF or BGEN file sets, or “.tsv” for tab-separated values
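The naming components listed above can be split apart with a regular expression. This is a minimal sketch, assuming the region identifier after “-b” is an alphanumeric string (the exact format of XXXX is not stated here). The function name and the example region value "0001" are hypothetical.

```python
import re
from typing import Optional

# Pattern for per-region genotype files, e.g. ofh_snv.v9.chrZ-bXXXX.vcf.gz
GENOTYPE_FILE = re.compile(
    r"^ofh_(?P<contents>snv)"
    r"\.v(?P<version>\d+)"
    r"\.chr(?P<chromosome>[1-9]|1\d|2[0-2]|X|Y|MT)"
    r"-b(?P<region>\w+)"  # assumption: region identifier is alphanumeric
    r"\.(?P<suffix>vcf\.gz\.tbi|vcf\.gz|bgen\.bgi|bgen|sample)$"
)


def parse_genotype_filename(name: str) -> Optional[dict]:
    """Split a genotype file name into its components, or return None if it does not match."""
    match = GENOTYPE_FILE.match(name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    return parts


parsed = parse_genotype_filename("ofh_snv.v9.chrMT-b0001.bgen.bgi")
```

The parsed chromosome and region can then be matched against the regions index BED file to find the genomic coordinates covered by a given file.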
What information do the genotype files contain?
Both the pVCF and BGEN file sets contain genotypes from all participants in the release and include genotype calls for all 22 autosomal chromosomes, X, Y and MT (mitochondrial). All genotypes are for single-nucleotide polymorphisms (SNPs) or small insertion-deletions (INDELs) aligned to the GRCh38 human reference genome. Multi-allelic variants have been split into separate biallelic records. In the BGEN fileset, estimated genetic sex is included in the BGEN-associated sample file.
For more information on the fields present in each genotype file, please refer to the genotype data tab of the Our Future Health data dictionary. For more information on the exact genetic variants present in each genotype file, refer to the CPRA variant list. Both these files can be found on the Data and cohort page of our website (external link).
What information does the sample QC file contain?
The sample QC file contains basic sample-level information useful for QC purposes or batch effect adjustments. For more information on the fields present in the file, please refer to the genotype tab of the Our Future Health data dictionary located on our Data and cohort page (external link).
What should I be aware of when working with the genotype data in this release?
This data release does not include sample QC results, other than limited outputs from the genotype calling process including call rate, genetic sex, and related aggregate genotyping QC metrics. No variant QC results are provided. We have, however, conducted several targeted statistical checks on the data, including assessment of principal components, sample heterozygosity, sample relatedness and variant Hardy-Weinberg statistics. We plan to include the outputs of these analyses in a future release. Users are reminded to conduct their own comprehensive statistical QC appropriately tailored to their research question.
We note the following issues with the current data release which we aim to address in a future release:
Estimated call rate is based on all chromosomes, including both the X and Y chromosomes (median call rates for females may be lower than for males due to Y chromosome missingness).
The date string in the batch identifier within the sample QC metrics file is the date of genotype calling and not the date on which the genotyping assay started in the laboratory. Genotype calling using the intensity data files of some samples occurred considerably later than the laboratory assay was completed, so this date cannot be used for statistical adjustment of between-day variation introduced at genotyping stage.
The presence of non-haploid genotypes ('0/1', '0/0' and '1/1') was observed for Y and MT chromosome variants, affecting ~0.25% of Y chromosome genotype calls and a smaller proportion for MT, for both female and male samples, arising from low or noisy probe intensities for some genetic variants. Non-haploid genotypes occur outside of the pseudoautosomal regions (PARs) for the Y chromosome. These non-haploid genotype calls for haploid chromosomes should be treated as missing (no call). Note that some tools, such as plink2 or qctool may error or display unexpected behaviour when processing Y or MT chromosome files, due to the presence of these non-haploid genotypes. In future releases, non-haploid calls for Y and MT genetic variants will be set to missing in both the pVCF and BGEN files prior to data release.
A small number of genetic variants were found to have been incorrectly reported in the pVCF and BGEN files, resulting from multi-mapping probes (where a probe sequence maps to multiple locations in the genome) or multi-base SNP targets which were misaligned to the reference genome during genotype calling. These genetic variants should be excluded from analysis. We provide a list of these variants by way of an indicator column "inaccurate annotation" in the CPRA variant list file to facilitate their exclusion. You can download this file from our Data and cohort page (external link). We aim to resolve this issue in future releases of genotype data.
Changes in laboratory reagents aimed at optimising genotyping, as well as continual improvement in laboratory processes, mean that some variation in the call rate distribution is evident between batches and across time. Further optimised cluster files for genotype re-calling in future releases will likely reduce the magnitude of these differences.
A small number of samples (1,407) were estimated to have an implausibly large number of third-degree (or closer) relatives in the genotype data release. Researchers should investigate these samples further and apply any appropriate sample exclusions based on calculated genetic relatedness in the cohort when conducting their own statistical QC.
As described on the Participant data page, we are aware that some individuals may have registered multiple times. This may mean that in a very small number of cases, the same person may have attended multiple in-person appointments, and provided multiple blood samples, under different registrations. Samples detected as genetic triplets, quadruplets and quintuplets have been excluded from the genetic data release. However, some records may be detected as genetic duplicates. Such samples should be treated with caution, as they may have arisen due to participants registering multiple times. They should not be considered to be identical twins without further confirmation. However, approved study applications which include linked data could use the linked health records to identify a large proportion of the duplicate records whenever submitted personal information is the same or similar (see What should I be aware of when working with the linked health records data in this release?).
Due to the complexity of genotype-calling at multi-allelic loci, variants with more than one alternate allele are not currently fully supported by Illumina’s genotype caller. We have previously noted a ceiling effect for the allele frequencies for some multi-allelic variants (N=499), where these variants have exclusively heterozygous genotypes. Illumina genotype calling software currently compares homozygous clusters across assays at the multi-allelic locus and interprets any conflicting calls between the major alternate allele and the minor alternate allele(s) as ambiguous, and subsequently sets these genotypes to missing (./.). As a result, Illumina have cautioned that all multi-allelic variants may potentially have effects that are less obvious than the 499 we previously noted, especially if the frequency of the alternate alleles is quite low. Multi-allelic variants are currently retained in the genetic data release and we advise that these should be used with caution or excluded from analyses. We provide a list of these variants by way of an indicator column "multiallelic.variant" in the CPRA variant list file to facilitate their exclusion. You can download this file from our Data and cohort page (external link). We hope to resolve this issue in future releases of genotype data.
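As an illustration only, the two indicator columns described above could be used to build an exclusion list along the following lines. The column names "inaccurate annotation" and "multiallelic.variant" are taken from the text; the tab-separated layout, the "CPRA" identifier column name and the flag value "1" are assumptions made for this sketch and should be checked against the actual CPRA variant list file.

```python
import csv
import io

# Indicator column names as quoted in the release notes.
FLAG_COLUMNS = ("inaccurate annotation", "multiallelic.variant")


def flagged_variants(handle, id_column="CPRA", flag_value="1"):
    """Return identifiers of variants flagged in any indicator column.

    Assumptions (not confirmed by the release notes): the file is
    tab-separated, the identifier column is called "CPRA" and a flagged
    variant is marked with the value "1".
    """
    reader = csv.DictReader(handle, delimiter="\t")
    flagged = set()
    for row in reader:
        if any((row.get(col) or "").strip() == flag_value for col in FLAG_COLUMNS):
            flagged.add(row[id_column])
    return flagged


# Made-up rows standing in for the real variant list file.
example = io.StringIO(
    "CPRA\tinaccurate annotation\tmultiallelic.variant\n"
    "1:1000:A:G\t0\t0\n"
    "1:2000:C:T\t1\t0\n"
    "2:3000:G:A\t0\t1\n"
)
exclude = flagged_variants(example)
print(sorted(exclude))
```

The resulting set can then be passed to whichever tool is used to drop variants before analysis.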
What changes have been made as part of this release?
Change of genotype array manifest
A change in genotyping chemical reagents was made to help reduce noise in the resulting genotype data. In order to ensure optimal performance, there has been a change in the genotype array manifest and cluster files. The A1 manifest file has now been phased out and all genotyping performed for the current release has used the C2 manifest file. Samples previously called using the A1 manifest file have been re-called using the C2 manifest. This means a different number of variants are available for some samples in addition to different genotypes. This change is reflected in the sample-qc-metrics file, where the column for manifest version indicates v2 for all samples, referring to the C2 manifest.
pVCF file metadata
Metadata in the pVCF files previously included GenCall score, Log R Ratio and B-allele Frequency in the FORMAT field. These have now been removed and will no longer be available in future releases.
Imputed genetic data
What data is included in this current release?
The current release includes two sets of files containing imputed genetic data for participants, one file containing sample-level information, and an additional file with variant summary data. Each set is provided separately within the Trusted Research Environment. Each participant is represented by a single sample in each imputed genetic data file. The primary identifier is the participant ID (PID), which can be used to link these files with the Participant and Questionnaire tables.
The two sets of imputed genetic data files, pVCF and BGEN, contain the same genotypes for the same participants and genetic variants. Each file set is split across regions for each chromosome (22 autosomal chromosomes and chromosome X), totalling 809 separate genotype files per file set. Each file of imputed genetic data has an associated pVCF index file (.tbi) or BGEN index file (.bgi) specific to that chromosome region, and each BGEN file also has an accompanying sample file (.sample). We provide both types of files for convenience and to improve the experience of researchers using the data. We also provide a regional index BED file, which contains the chromosome and genomic coordinates of variants present within each pVCF or BGEN file. The sample-level information file contains information such as batch, imputation group and estimated genetic sex. In addition, the variant-level summary data file (VCF) contains information on dosage r2 and ALT allele frequencies for all variants and has an accompanying index file (.tbi). Both files are useful for quality control (QC) purposes.

Table 2 - File names for imputed genotype data
SNV pVCF | pVCF | VCF 4.2 | imputed_pvcf | ofh_imputed.v5.chrZ-bXXXX.vcf.gz | 809 | pVCF containing imputed genotypes and metadata
SNV pVCF | pVCF index file | - | imputed_pvcf | ofh_imputed.v5.chrZ-bXXXX.vcf.gz.tbi | 809 | pVCF-associated index file
BGEN | BGEN fileset | BGEN 1.2 | imputed_bgen | ofh_imputed.v5.chrZ-bXXXX.bgen | 809 | BGEN file containing imputed genotypes
BGEN | BGEN fileset | BGEN 1.2 | imputed_bgen | ofh_imputed.v5.chrZ-bXXXX.sample | 809 | BGEN-associated sample file
BGEN | BGEN index file | - | imputed_bgen | ofh_imputed.v5.chrZ-bXXXX.bgen.bgi | 809 | BGEN-associated index file
Sample QC | QC metrics | - | sample_qc_metrics | ofh_imputed_sample_qc_metrics.v5.tsv | 1 | Plain-text tabular file with sample-level information
Variant summary statistics | VCF | VCF 4.2 | imputed_resources | ofh_imputed_variant_summary_stats.v5.vcf.gz | 1 | VCF containing variant-level summary statistics
Variant summary statistics | VCF index file | - | imputed_resources | ofh_imputed_variant_summary_stats.v5.vcf.gz.bgi | 1 | VCF-associated index file
Regions index BED file | BED file | - | imputed_resources | ofh_imputed_regions.v5.bed | 1 | Plain-text tabular file in BED file format
File names in this data release include the following components:
an indicator that the data comes from Our Future Health participants (“ofh_”)
an indicator of file contents for imputed genetic data (“imputed”)
the version number (“.v5”) of the imputed genotype data release, to be incremented with each release
the chromosome number (“.chrZ”) for the genetic variants contained within that file, where Z can be any of 1-22, or 'X'
the region identifier (“-bXXXX”), which maps to the genomic coordinates in the BED file
a suffix representing the file type, e.g. “.vcf.gz”, per the specifications of pVCF or BGEN file sets, or “.tsv” for tab-separated values
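The naming components listed above can be pulled apart with a regular expression. The following is a minimal sketch rather than an official parser: it assumes the region identifier consists of "b" followed by letters or digits, and the example file name (including the region "b0012") is made up.

```python
import re

# Pattern built from the components described in the list above.
# Assumption: the region identifier is "b" followed by word characters.
IMPUTED_NAME = re.compile(
    r"^ofh_imputed"
    r"\.v(?P<version>\d+)"
    r"\.chr(?P<chromosome>[1-9]|1\d|2[0-2]|X)"
    r"-(?P<region>b\w+)"
    r"\.(?P<suffix>vcf\.gz\.tbi|vcf\.gz|bgen\.bgi|bgen|sample)$"
)


def parse_imputed_name(file_name):
    """Split an imputed genotype file name into its components.

    Returns None when the name does not follow the expected pattern.
    """
    match = IMPUTED_NAME.match(file_name)
    if match is None:
        return None
    parts = match.groupdict()
    parts["version"] = int(parts["version"])
    return parts


# Hypothetical file name used purely for illustration.
print(parse_imputed_name("ofh_imputed.v5.chr7-b0012.vcf.gz"))
```

The region component returned here is the key to look up in the regions index BED file when locating the file that covers a given genomic interval.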
What information do the imputed genotype files contain?
Both the pVCF and BGEN file sets contain genotypes from all participants in the release across all 22 autosomal chromosomes and chromosome X (non-PAR). Genotypes are provided in GT:GP format, where GT is the thresholded genotype call and GP is the imputed genotype probability. All genotypes are for SNPs or small indels aligned to the GRCh38 human reference genome. Multi-allelic variants have been split into separate biallelic records. In the BGEN fileset, estimated genetic sex is included in the BGEN-associated sample file. The fields in each genotype file have been summarised in the genotype tab of the Our Future Health data dictionary. The CPRA variant list provides a list of the genetic variants included in the imputed dataset in CHR:POS:REF:ALT (chromosome, position, reference allele and alternate allele) format. Both files can be found on the Data and cohort page of our website (external link).
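To make the GT:GP format concrete, the sketch below splits one per-sample entry into the thresholded call and the genotype probabilities, then derives the expected ALT allele count (dosage). It assumes the FORMAT field is exactly GT:GP and that GP lists probabilities in the usual VCF order (0, 1, 2 copies of the ALT allele for a diploid call); the example values are invented.

```python
def parse_gt_gp(sample_field):
    """Split a 'GT:GP' sample entry into (genotype call, probability list)."""
    gt, gp = sample_field.split(":")
    probabilities = [float(value) for value in gp.split(",")]
    return gt, probabilities


def alt_dosage(probabilities):
    """Expected number of ALT alleles: sum over k of k * P(k ALT copies).

    Works for three probabilities (diploid) and, under the same ordering
    assumption, for two probabilities (haploid calls).
    """
    return sum(k * p for k, p in enumerate(probabilities))


# Invented example entry: most probability mass on the heterozygous call.
gt, gp = parse_gt_gp("0/1:0.02,0.95,0.03")
print(gt, gp, alt_dosage(gp))
```

Working from GP rather than the thresholded GT keeps the imputation uncertainty in downstream analyses.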
What information does the imputed sample QC file contain?
The sample QC file contains basic sample-level information useful for QC purposes or potential imputation group effect adjustments. For more information on the fields present in the file, please refer to the genotype tab of the Our Future Health data dictionary located on our Data and cohort page (external link).
What information does the variant summary statistics file contain?
The variant summary statistics file contains estimated metrics on the variant level in VCF format, useful for QC purposes. This includes the dosage r2, ALT allele frequencies and the number of groups which were imputed or directly genotyped for the variant. Further information on the fields present in this file can be found in the genotype tab of the Our Future Health data dictionary located on our Data and cohort page (external link).
What should I be aware of when working with the genotype data in this release?
This data release does not include sample QC results, other than limited outputs from the genotype calling process such as genetic sex. Variant metrics are limited to those provided in the variant summary statistics file. We have, however, conducted several targeted statistical checks on the data, including assessment of principal components, sample heterozygosity, sample relatedness and variant Hardy-Weinberg statistics. We plan to include the outputs of these analyses in a future release. Users are reminded to conduct their own comprehensive statistical QC appropriately tailored to their research question.
We note the following issues with the current data release which we aim to address in a future release:
A small number of samples were estimated to have an implausibly large number of third-degree (or closer) relatives in the genotype data release. Researchers should investigate these samples further and apply any appropriate sample exclusions based on calculated genetic relatedness in the cohort when conducting their own statistical QC.
As described on the Participant data page, we are aware that some individuals may have registered multiple times. This may mean that in a very small number of cases, the same person may have attended multiple in-person appointments, and provided multiple blood samples, under different registrations. Samples detected as genetic triplets, quadruplets and quintuplets have been excluded from the genetic data release. However, some records may be detected as genetic duplicates. Such samples should be treated with caution, as they may have arisen due to participants registering multiple times. Whilst the rate of duplicates does not exceed the expected rate of genetic twins in the UK population, they should not be considered to be identical twins without further confirmation. However, approved study applications which include linked data could use the linked health records to identify a large proportion of the duplicate records whenever submitted personal information is the same or similar (see What should I be aware of when working with the linked health records data in this release?).
Please note that the PROP_TYPED field description in the variant summary statistics file should refer to the proportion of groups which were directly typed for the variant rather than imputed. This will be corrected in the next release.
Linked health records data
What information does the linked health records data contain?
This is the first release of Medicines Dispensed in Primary Care for the linked cohort. The dataset includes any medications dispensed by community pharmacists, appliance contractors and dispensing doctors in England. See the Linked data set descriptions for a full description of the dataset.
This release also contains linked health records from Hospital Episode Statistics (HES), the National Disease Registration Service (NDRS), and Office for National Statistics (ONS) Death Registration.
The HES and NDRS data sets provide a wide range of information on patient admissions to NHS facilities, including clinical, administrative, and geographic information. The HES data sets do not contain electronic patient health records or information on medicines and dosages. For more information on how these data sets are collected and processed, please refer to the HES Data Collection page (external link) and NDRS Access Page (external link). The HES and NDRS data sets only include records collected by NHS England (NHSE), meaning these data contain only care records from NHS providers in England.
The data contains linked health records from selected Hospital Episode Statistics (HES) data sets (Admitted Patient Care (APC), Accident & Emergency (A&E), Outpatient and Emergency Care Dataset (ECDS)), selected cancer data sets from the National Disease Registration Service (NDRS) (Cancer Registry and Cancer Pathways), and Dispensed Medicines in Primary Care. For more details on each data set, see the section on the linked health records data page entitled Linked data set descriptions.
We used the HES data dictionary v2.04, ECDS data dictionary v1.4, NDRS data dictionary v5.2, and Dispensed Medicines January 2024 dictionary release for validation on variable format and codes.
All linked health records data have been provided by NHS England.
What changes have been made as part of this release?
The linked health records data have been de-duplicated in this release. As described on the Participant data page, we are aware that some individuals may have registered multiple times. Participants with multiple registrations in which they have provided identical or nearly identical personal information (name, address and date of birth) may be linked to the same NHS number, and thus may have duplicate health records.
In total, 12,352 PIDs, and any health records associated with those PIDs, have been removed from the linked health records data because they have been linked to the same NHS Numbers. These PIDs are still present in the other data products.
We have updated the pseudonymised provider code list to incorporate any new providers. In total, we mapped 11,836 providers. 349 providers (2.9% of all providers in data) were mapped to unknown because they did not appear in either the NHS Organisation Data Service API or the Archived Closed Organisation data set.
How are the data sets structured?
The data release includes 10 entities, organised as follows:
Hospital Episode Statistics
nhse_eng_inpat (Admitted Patient Care)
nhse_eng_ed (Accident and Emergency)
nhse_eng_outpat (Outpatients)
nhse_eng_ecds (Emergency Care Dataset)
Dispensed Medicines in Primary Care
nhse_eng_primcare_meds
Civil Registrations of Death
nhse_engwal_deaths
National Disease Registration Service Cancer Data (NDRS)
nhse_eng_canpat (Cancer Pathways)
nhse_eng_canreg_pattumour (Cancer Registry Patient Tumour)
nhse_eng_canreg_treat (Cancer Registry Cancer Treatment)
Linked Participants
participant_nhs_linked
The table below summarises the available data, including number of available fields and dates for each data set. These dates include provisional data for the HES datasets. Please refer to the section on provisional data for more information on the dates for finalised vs provisional data. Further descriptions can be found on the Linked data set descriptions page.
The entity names indicate the data source, geographic data coverage, and name of the data set. For example, nhse_engwal_deaths indicates the data source is NHS England, the entity includes data from England and Wales, and the data set is for deaths.
Data set | Description | Dates | Provisional data dates | Number of fields
HES Admitted Patient Care | Episodes of in-patient care | 1 April 1997 to 31 March 2025 | 1 April 2025 to 31 July 2025 | 108
HES Accident & Emergency | Attendance of major A&E department | 1 April 2007 to 31 March 2020 | No provisional data | 91
HES Outpatient | Outpatient appointments | 1 April 2003 to 31 March 2025 | 1 April 2025 to 31 July 2025 | 55
HES Emergency Care Dataset | Attendances of major A&E department | 1 April 2020 to 31 March 2025 | 1 April 2025 to 31 July 2025 | 162
Dispensed Medicines in Primary Care | Medicines dispensed in England | 1 April 2018 to 1 June 2025 | No provisional data | 32
ONS Death Registration | Death registration and mortality data | 1 June 2022 to 31 July 2025 | 1 August 2025 to 27 August 2025 | 20
NDRS Cancer Pathways | Cancer pathways data | 1 January 2013 to 21 July 2024 | No provisional data | 12
NDRS Cancer Registry Patient Tumour | Cancer treatment data at tumour-level | 1 January 1995 to 31 December 2022 | No provisional data | 49
NDRS Cancer Registry Cancer Treatment | Cancer data by treatment event at given tumour | 1 January 1995 to 20 June 2024 | No provisional data | 22
Linked Participants | Participants successfully linked to an NHS number | All participants who submitted a questionnaire before 9 April 2025 | No provisional data | 2
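The entity naming convention described above (data source, geographic coverage, data set name) can be unpacked programmatically. This is a small sketch that assumes the first two underscore-separated parts are always the source and the geography; participant_nhs_linked does not follow that pattern and is not handled here.

```python
def parse_entity_name(name):
    """Split an entity name into source, geography and data set parts.

    Assumes the '<source>_<geography>_<data set>' pattern; the data set
    part may itself contain underscores (e.g. 'canreg_pattumour').
    """
    source, geography, dataset = name.split("_", 2)
    return {"source": source, "geography": geography, "dataset": dataset}


for entity in ("nhse_engwal_deaths", "nhse_eng_canreg_pattumour"):
    print(entity, parse_entity_name(entity))
```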
How did we de-identify the linked health records data to minimise risks of identifying participants?
For categorical fields with a higher risk of re-identification, we suppressed categories which included fewer than 10 participants as well as codes which indicate admissions from or discharge to mental or penal facilities.
To avoid the suppressed category being deduced by elimination, the next smallest category was also suppressed. Categories were suppressed by replacing the coded entries for corresponding participants with the suppression code, -999.
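The small-count rule described above can be sketched as follows. This is one possible reading of the rule, not the production implementation: it counts distinct participants per category, suppresses every category below the threshold of 10, and then also suppresses the smallest remaining category. The category labels "A", "B" and "C" are placeholders, and the suppression code is handled as the string "-999" on the assumption that the coded fields are stored as text.

```python
from collections import defaultdict

SUPPRESSION_CODE = "-999"  # suppression code given in the text
THRESHOLD = 10             # categories with fewer participants are suppressed


def suppress_small_categories(records, threshold=THRESHOLD):
    """Replace small categories, plus the next smallest, with the suppression code.

    records: list of (participant_id, category_code) pairs.
    """
    participants = defaultdict(set)
    for pid, code in records:
        participants[code].add(pid)
    counts = {code: len(pids) for code, pids in participants.items()}

    suppressed = {code for code, n in counts.items() if n < threshold}
    if suppressed:
        # Secondary suppression: hide the smallest remaining category too,
        # so the primary one cannot be deduced by elimination.
        remaining = sorted((n, code) for code, n in counts.items() if code not in suppressed)
        if remaining:
            suppressed.add(remaining[0][1])

    return [(pid, SUPPRESSION_CODE if code in suppressed else code) for pid, code in records]


# Placeholder data: 50 participants in "A", 20 in "B", 3 in "C".
records = [(pid, "A") for pid in range(50)]
records += [(pid, "B") for pid in range(50, 70)]
records += [(pid, "C") for pid in range(70, 73)]
result = suppress_small_categories(records)
print(sum(1 for _, code in result if code == SUPPRESSION_CODE))
```

In this example "C" falls below the threshold and "B" is suppressed as the next smallest category, so only "A" remains visible.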
The following fields had suppression applied: admission source (ADMISORC), admission method (ADMIMETH), and discharge destination (DISDEST) in HES Admitted Patient Care. The table below shows which codes are suppressed in each column.
We also suppress SNOMED codes in HES Emergency Care Dataset (ECDS) related to penal or detention centres, psychiatric admissions, homelessness, and rehabilitation. We also propose to suppress admissions requiring speciality resources, including mountain rescue, air ambulance and coastguard rescue service to further mitigate re-identification risk through spontaneous recognition and further mask the small number of participants with penal, mental health, and homelessness codes.
In the NDRS Cancer Registry Treatment data set, we are releasing a field which lists chemotherapy drugs received during treatment (CHEMO_ALL_DRUGS). This field also contains the name of any clinical trials a participant was enrolled in during treatment. To mitigate re-identification risk, we replaced the name of the clinical trial with ANONYMISED CLINICAL TRIAL.
ADMIMETH
2C = Baby born at home as intended
25 = Admission via Mental Health Crisis Resolution Team
83 = Baby born outside the Health Care Provider except when born at home as intended
84 = Admission by Admissions Panel of a High Security Psychiatric Hospital patient not entered on the HSPH Admissions Waiting List (available between 1999 and 2006)
ADMISORC
37 = Court (1999-00 to 2006-07 and from 2022-23)
38 = Penal establishment: police station
39 = Penal establishment, court or police station / police custody suite
40 = Penal establishment
41 = Court
42 = Police Station / Police Custody Suite
48 = High security psychiatric hospital, Scotland
49 = High security psychiatric accommodation in an NHS hospital provider
50 = NHS other hospital provider: medium secure unit
DISDEST
38 = Penal establishment: police station
39 = Penal establishment, court or police station / police custody suite
40 = Penal establishment
42 = Police Station / Police Custody Suite
48 = High security psychiatric hospital, Scotland
49 = High security psychiatric accommodation in an NHS hospital provider
50 = NHS other hospital provider: medium secure unit
SNOMED codes
1047991000000102 = Arrival by prison transport
1066011000000104 = Referred by Her Majesty's prison service
1079611000000109 = Place of occurrence of injury is prison
1066001000000101 = Custodial services: detention centre
61801003 = Patient referral for psychiatric aftercare
4266003 = Referral to drug addiction rehabilitation service
38670004 = Alcoholism rehabilitation
183584001 = Referral to community psychiatric nurse
61801003 = Referral to community rehabilitation
231467000 = Absinthe addiction
1077211000000104 = Homeless persons drop in centre
32911000 = Homeless
105526001 = Homeless family
1079661000000106 = Place of occurrence of injury is hostel for the homeless
1077211000000104 = Referred by homeless drop-in centre
1066051000000100 = Referred by mountain rescue service
1048081000000101 = Fixed wing / medical repatriation by air
What should I be aware of when working with the linked health records data in this release?
Further information on known data quality issues in the NHSE data sets can be found in the NHSE HES Data Quality Reports (external link).
The cohort represented in the cancer data is different to the other linked health records data.
The NDRS Cancer Pathways, Cancer Registry Patient Tumour and Cancer Registry Cancer Treatment data sets are a re-release of the cancer data available in Release 10. They include cancer records for participants who submitted a questionnaire before 15 October 2024. All other linked health records are for participants who submitted a questionnaire before 9 April 2025. This means it is likely that there are some participants in the Linked Participants data with a cancer diagnosis that we do not have cancer records for.
To mitigate this issue, we recommend using the SUBMISSION_DATE field in the questionnaire to filter the participants to those who submitted a questionnaire prior to 15 October 2024 and comparing those participants with the successfully linked participants in the Linked Participants entity. This will provide the list of participants who were eligible for linkage and could appear in the NDRS data sets.
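The recommended filter can be sketched as below. SUBMISSION_DATE is the questionnaire field named above; the "PID" column name, the use of Python date objects and the example rows are assumptions made for illustration.

```python
from datetime import date

# Cut-off for the cancer data cohort, as stated in the text.
CANCER_CUTOFF = date(2024, 10, 15)


def cancer_eligible_pids(questionnaire_rows, linked_pids, cutoff=CANCER_CUTOFF):
    """Participants who submitted before the cut-off AND were successfully linked."""
    submitted_early = {
        row["PID"] for row in questionnaire_rows if row["SUBMISSION_DATE"] < cutoff
    }
    return submitted_early & set(linked_pids)


# Made-up rows standing in for the Questionnaire table and Linked Participants entity.
questionnaire = [
    {"PID": "P1", "SUBMISSION_DATE": date(2024, 6, 1)},
    {"PID": "P2", "SUBMISSION_DATE": date(2025, 1, 20)},
    {"PID": "P3", "SUBMISSION_DATE": date(2024, 10, 14)},
]
linked = ["P1", "P2"]
eligible = cancer_eligible_pids(questionnaire, linked)
print(sorted(eligible))
```

Participants in the resulting set who have no NDRS records can then be treated as having no recorded cancer data, rather than as falling outside the linkage window.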
Discrepancies in the number of participants in the Linked Participants entity
Please note that there are 12 participants with a linked health record who do not have their PIDs listed in the Linked Participants entity. We are working with NHSE to solve this discrepancy.
The Linked Participants entity lists all the successfully linked participants who submitted a questionnaire prior to 9 April 2025.
Using medical ontologies (e.g. ICD-10, OPCS-4, SNOMED) in the cohort browser
The cohort browser in the Trusted Research Environment can only filter numeric or categorical data. In the NDRS Cancer data sets, some fields with diagnosis and procedures information like ICD-10 and OPCS-4 codes are entered as strings. Therefore, it is not possible to use the cohort browser to filter by specific diagnosis and procedure codes in those data sets. To filter for specific ICD-10 or OPCS-4 codes, we recommend loading and filtering the data using a Jupyter Notebook. It is possible to access ICD-10 and OPCS-4 information in the cohort browser for the HES and ONS Death Registration data sets. It is also possible to access SNOMED information in the cohort browser for HES Emergency Care Dataset.
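In a notebook, string-valued code fields can be filtered with ordinary prefix matching. The sketch below is illustrative only: the field name "DIAGNOSIS_ICD10", the example rows and the assumption that each value holds a single code (with or without a dot) are all hypothetical and should be checked against the data dictionary.

```python
def matches_icd10(code, prefixes):
    """True if an ICD-10 code string starts with any of the given prefixes.

    Dots and case are normalised first, so 'c50.9' and 'C509' both match
    the prefix 'C50'. Prefixes are expected in upper case without dots.
    """
    normalised = code.strip().upper().replace(".", "")
    return normalised.startswith(tuple(prefixes))


# Made-up rows; "DIAGNOSIS_ICD10" is a hypothetical field name.
rows = [
    {"PID": "P1", "DIAGNOSIS_ICD10": "C50.9"},
    {"PID": "P2", "DIAGNOSIS_ICD10": "C180"},
    {"PID": "P3", "DIAGNOSIS_ICD10": "c501"},
]
breast_cancer = [row["PID"] for row in rows if matches_icd10(row["DIAGNOSIS_ICD10"], ["C50"])]
print(breast_cancer)
```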
Participants who have registered more than once (linked health records data)
As described on the Participant data page, we are aware that some individuals may have registered multiple times. Participants with multiple registrations in which they have provided identical or nearly identical personal information (name, address and date of birth) may be linked to the same NHS number, and thus may have duplicate health records.
Any participants who link to the same NHS Number have been removed from the linked health data cohort. In total, 12,352 PIDs, and any health records associated with those PIDs, have been removed from the linked health records data because they have been linked to the same NHS Numbers. These PIDs are still present in the other data products.
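The de-duplication rule described above, in which every PID that shares an NHS number with another PID is removed, can be sketched as follows. The identifiers in the example are invented.

```python
from collections import defaultdict


def pids_sharing_nhs_number(links):
    """Return all PIDs that are linked to an NHS number shared with another PID.

    links: iterable of (pid, nhs_number) pairs.
    """
    pids_by_nhs = defaultdict(set)
    for pid, nhs_number in links:
        pids_by_nhs[nhs_number].add(pid)

    to_remove = set()
    for pids in pids_by_nhs.values():
        if len(pids) > 1:
            # Every PID in the group is removed, not just the later registrations.
            to_remove |= pids
    return to_remove


# Invented identifiers: P1 and P2 link to the same NHS number.
links = [("P1", "N100"), ("P2", "N100"), ("P3", "N200")]
removed = pids_sharing_nhs_number(links)
print(sorted(removed))
```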
HES provisional data may change between releases
The HES Admitted Patient Care, Outpatient, and Emergency Care data include some provisional records. These are the most recent admissions and appointments that were available for the cohort at the time the data was supplied by NHS England, but the records have not been finalised. Therefore, the data entered in these records could change slightly in future releases. Once a year, the latest full financial year of provisional data is finalised and made available to Our Future Health by NHS England.
In the current release:
any appointments in the Outpatient data that occurred from 1 April 2024 onwards are likely provisional data and subject to change in future releases
any hospital episodes in the Admitted Patient Care data that finished after 31 March 2024 are likely provisional data and subject to change in future releases
any appointments that occurred before 1 April 2024 or hospital episodes which finished prior to 1 April 2024 are likely finalised data and are not subject to change
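A simple way to apply the rules above in an analysis is to flag each record against a cut-off date. The sketch below takes the cut-off as a parameter because it moves with each release; the value used in the example is the 1 April 2024 date quoted in the list above.

```python
from datetime import date


def is_likely_provisional(event_date, cutoff):
    """True when a record's date falls on or after the provisional cut-off.

    event_date: appointment date (Outpatient) or episode end date
    (Admitted Patient Care).
    """
    return event_date >= cutoff


cutoff = date(2024, 4, 1)
print(is_likely_provisional(date(2024, 3, 31), cutoff))
print(is_likely_provisional(date(2024, 4, 1), cutoff))
```

Flagged records can be excluded, or analysed separately, when results need to be reproducible across releases.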