> For the complete documentation index, see [llms.txt](https://ourfuturehealth.gitbook.io/our-future-health/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://ourfuturehealth.gitbook.io/our-future-health/data-releases/2026-data-releases/release-14.md).

# Release 14

### What data is included in Release 14?

The current release, published on 28 May 2026, contains the following:

* **2,022,814** participants are included in the Participant data.
* **2,021,810** participants are included in the Questionnaire data.
* **1,518,202** participants have provided in-person Clinic Measurements data.
  * Of those, **1,159,273** are also included in the POCT Lipid Profile data.
* Up to **1,983,038** participants are included in our various geographies data releases.
* **755,000** participants have genotype array data and imputed genetic data.
* **1,690,704** participants were successfully linked to an NHS number.
  * Of those, **1,666,336** participants have at least one secondary care, dispensed medication, or death registration record.

Participant withdrawals are now processed separately for each dataset to ensure accurate and consistent handling. As a result, small differences in participant counts may appear across assets when one dataset was generated later than another. In some cases, this means the Participant table may contain slightly fewer records than other tables.

#### Participant data

The Participant table includes information from 2,022,814 participants who have registered and consented to join the Our Future Health programme, and submitted a complete questionnaire on or before 24 March 2026.&#x20;

#### Questionnaire data

Release 14 of the Questionnaire table includes 2,021,810 participants who have completed either v1, v2, v2.1 or v2.2 of the Our Future Health baseline questionnaire. This includes participants who joined during the initial pilots from 2021 and after the main recruitment period began in October 2022.

* Participants who started the questionnaire on or after 24 May 2021 will have completed v1 of the questionnaire (N = 52,871 participants)
* Participants who started the questionnaire on or after 20 November 2022 will have completed v2 of the questionnaire (N = 739,323 participants)
* Participants who started the questionnaire on or after 21 December 2023 will have completed v2.1 of the questionnaire (N = 370,407 participants)
* Participants who started the questionnaire on or after 13 June 2024 will have completed v2.2 of the questionnaire (N = 859,209 participants)&#x20;
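The version boundaries above can be expressed as a simple date lookup. The following is a minimal sketch, assuming that each version applies from its start date until the day before the next version's start date; the function name `expected_version` is hypothetical and is not part of the released data. The sketch also checks that the per-version counts add up to the Questionnaire table total.

```python
from bisect import bisect_right
from datetime import date

# Start dates and participant counts as stated in the release notes.
VERSIONS = [
    (date(2021, 5, 24), "v1", 52_871),
    (date(2022, 11, 20), "v2", 739_323),
    (date(2023, 12, 21), "v2.1", 370_407),
    (date(2024, 6, 13), "v2.2", 859_209),
]


def expected_version(start_date):
    """Return the questionnaire version expected for a given start date.

    Assumption: a version applies from its start date up to the day before
    the next version's start date. Returns None for dates before the pilots.
    """
    starts = [start for start, _, _ in VERSIONS]
    idx = bisect_right(starts, start_date) - 1
    return VERSIONS[idx][1] if idx >= 0 else None


# The per-version counts should sum to the Questionnaire table total.
total_participants = sum(count for _, _, count in VERSIONS)
```

Comparing the expected version with the version recorded for a participant is one way to spot records worth a closer look, although a mismatch is not necessarily an error.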

#### Participant geographies data

The participant geography data are divided into four tables.

* The Country and Region table covers 1,983,038 participants across England, Wales, Scotland and Northern Ireland.
* The Middle Layer Super Output Area (MSOA) and Lower Layer Super Output Area (LSOA) tables each contain 1,914,754 participants from England and Wales.
* The Intermediate Zones table includes 40,031 participants from Scotland.

These are a subset of the participants who have completed and submitted a questionnaire on or before 24 March 2026.

#### Clinic Measurements data

As of May 2026, nearly 1.6 million participants have attended an Our Future Health Clinic appointment. The current release includes a subset of 1,518,202 participants who had both submitted a complete questionnaire and attended an appointment on or before 24 March 2026.

#### Point-of-Care Testing (POCT) Lipid Profile data

Point-of-Care Testing (POCT) Lipid Profile measurements are a dataset comprising cholesterol data collected during clinic appointments between July 2022 and December 2024. During this period, lipid measurements were obtained using the Mission® Cholesterol Monitoring System device as part of routine clinic procedures. These data are available for 1,159,273 participants, a subset of the participants within the Clinic Measurements dataset, and are provided as a separate table.

#### Genetic data

The genetic data release includes harmonised genotype array data, imputed genotype data and genetic ancestry data, all of which are available for a common set of 755,000 participants.

The genotype array data release contains information on 686,416 variants. This data was obtained using a custom Illumina Infinium Excalibur beadchip array, designed by Our Future Health in collaboration with Illumina. Genetic kinship data has also been inferred for these participants.

The imputed genotype data release contains information on 159,587,100 variants. Imputation was performed using the UK Biobank 200k phased whole genome sequencing data as a reference panel.

Genetic ancestry data has been inferred from genotype data by applying a Global Ancestry Estimation (GEA) workflow developed by Genomics Ltd.

#### Linked health records data

The linked health records data contain information on 1,666,336 participants who have at least one record in one or more of the linked health data tables:

**Primary care**

* Medicines Dispensed in Primary Care

**Secondary care**

* Accident and Emergency (HES A\&E)
* Emergency Care Dataset (HES ECDS)
* Admitted Patient Care (HES APC)
* Outpatient (HES OP)

**Cancer**

* National Disease Registration Service (NDRS) Cancer Registry Patient Tumour
* NDRS Cancer Registry Treatment
* NDRS Cancer Registry pre-1995\*
* NDRS Cancer Pathways

**Death**

* Office for National Statistics (ONS) Death Registration

\*This is the first release of the historic pre-1995 cancer registry dataset from the National Disease Registration Service, which includes registrable tumours of individuals diagnosed between 1 January 1985 and 31 December 1994.&#x20;

***

### Participant and Questionnaire data

#### What information does the Participant and Questionnaire data contain?

For details on what information is included in the Participant and Questionnaire data, see our [Participant data](/our-future-health/data-types/participant-data.md) and [Questionnaire data](/our-future-health/data-types/questionnaire-data.md) pages. These pages cover how we:

* de-identify data
* manage re-identification risk
* apply version control
* tailor questionnaire journeys
* store the data in the TRE

#### What changes have been made as part of this release?

Participants who have withdrawn from the programme have been removed from Release 14.

As described above, participant withdrawals are now processed independently for each dataset to ensure accurate and consistent handling. Consequently, minor differences in participant counts may appear across assets, including instances where the Participant table contains slightly fewer records than other tables. These discrepancies are expected to be very small. Where such discrepancies occur, the affected participants will be removed from all assets in the subsequent release.

Version v2.2 of the questionnaire remains the active live version.

#### What should I be aware of when working with the participant and questionnaire data in this release?

**Technical data loss**

A suspected system issue that occurred prior to October 2022 caused a small number of questionnaires submitted around that time to have missing data for some questions. The missing data cannot be explained by errors in dynamic logic. We are analysing the impact and will provide further information in future releases.

**Implausible age and year combinations**

Responses to questions about age or year of birth are initially validated against the participant’s recorded date of birth at the time of response. However, if a participant later updates their date of birth, these earlier responses are not re-validated. The Participant data reflects the most recent date of birth, which may lead to inconsistencies between updated birth information and previously recorded responses. This issue affects only a small number of cases, and we plan to resolve it in a future data release.

**Updated responses to parent questions**

Due to the current data capture process, there are cases where a participant updates their response to a parent question, which correctly overwrites the original answer. However, responses to dependent (dynamic) questions linked to the previous parent response may persist, resulting in logical inconsistencies.

One example involves sex-specific questions. In a small number of records, there are inconsistencies between the participant’s self-reported sex and their responses to dependent, sex-specific questions. This can occur when a participant changes their response to "What sex were you registered with at birth?" (recorded in fields `DEMOG_SEX_1_1` or `DEMOG_SEX_2_1`) after having completed later questions that were dependent on their previous answer. As a result, responses to questions that depended on the previous answer may be retained, even though the dependency no longer applies, rather than being removed or excluded based on the updated logic path.

This issue affects a very small proportion of submissions (less than 0.1% across all versions). We intend to address this issue in a future release.
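Until the issue is addressed, analysts may wish to flag affected records themselves. The sketch below is illustrative only: `DEMOG_SEX_1_1` is the field named above, but the dependent field names (`FEMALE_ONLY_Q`, `MALE_ONLY_Q`) and the value labels (`"Female"`, `"Male"`) are hypothetical placeholders and will need to be replaced with the real column names and codings from the data dictionary.

```python
# Hypothetical mapping from a dependent question to the parent answer that
# should have unlocked it. Real field names and value codings will differ.
SEX_SPECIFIC_FIELDS = {
    "FEMALE_ONLY_Q": "Female",
    "MALE_ONLY_Q": "Male",
}


def inconsistent_fields(record, sex_field="DEMOG_SEX_1_1"):
    """List dependent fields that hold a response even though the current
    parent answer would not have routed the participant to them."""
    sex = record.get(sex_field)
    return [
        field
        for field, required_sex in SEX_SPECIFIC_FIELDS.items()
        if record.get(field) is not None and sex != required_sex
    ]
```

A non-empty result indicates a retained response from an earlier logic path; whether to exclude the response or the whole record is an analysis decision.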

**Errors in questionnaire configuration**

For comprehensive documentation on all historical bugs related to errors in the implementation of dynamic logic, please refer to [Change log for questionnaire versions](/our-future-health/data-types/questionnaire-data/change-log-for-questionnaire-versions.md). Please note that errors in logic may persist across releases, even after they have been fixed for the affected version.

**Updating records between releases**

In exceptional cases, a participant’s record may appear to be modified between releases. For example, if a participant mistakenly completes a questionnaire intended for their partner, the incorrect record is deleted to allow the correct individual to submit their responses. Such cases are extremely rare, affecting fewer than 0.001% of records. See our documentation for [Release 9](/our-future-health/data-releases/2024-data-releases/release-9.md) for more details.

**Participants without a questionnaire record**

In the current release, due to timing and synchronisation issues during asset creation, 1,004 participants are present in the Participant table but not in the Questionnaire table. Although this discrepancy is larger than in previous releases, all participants included in this release have completed and submitted a questionnaire, and their data will be included in the next release.

Additionally, there are a small number of cases where participants may be present in other assets (e.g. linked datasets) but are not included in the Participant asset.&#x20;
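When joining tables, it can help to check which participants appear in one asset but not another before analysis. This is a minimal sketch using plain Python sets; the function name `compare_ids` and the example identifiers are hypothetical, and the inputs are assumed to be iterables of participant identifiers drawn from each table.

```python
def compare_ids(participant_ids, other_ids):
    """Return the identifiers found in only one of two assets."""
    participant_ids = set(participant_ids)
    other_ids = set(other_ids)
    return {
        "participant_only": participant_ids - other_ids,
        "other_only": other_ids - participant_ids,
    }


# Hypothetical identifiers for illustration only.
result = compare_ids(["P1", "P2", "P3"], ["P2", "P3", "P4"])

# Difference between the published Participant and Questionnaire counts.
count_gap = 2_022_814 - 2_021_810
```

The gap between the two published counts matches the 1,004 participants reported as missing from the Questionnaire table.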

#### Participants who have registered more than once (participant and questionnaire data)

As described on the [Participant data](/our-future-health/data-types/participant-data.md) page, we are aware that some individuals may have registered multiple times. This may mean that in a small number of cases, the same person may have submitted multiple questionnaires under different registrations.&#x20;

Currently, it is not possible to identify these duplicate records from the participant or questionnaire data with high confidence. Although a participant who submitted multiple questionnaires under different registrations might do so in good faith and be expected to provide similar answers, responses are unlikely to be identical, so comparing responses is not a reliable way to find duplicates. Comparing responses would also fail to detect multiple registrations where questionnaire responses are very different. However, approved study applications which include linked data could use the linked health records to identify a large proportion of the duplicate records whenever submitted personal information is the same or similar (see [#what-should-i-be-aware-of-when-working-with-the-linked-health-records-data-in-this-release](#what-should-i-be-aware-of-when-working-with-the-linked-health-records-data-in-this-release "mention")).

***

### Participant geographies data

#### What information does the Participant geographies data contain?

The participant geographies data currently consists of four separate datasets:

1. Country and region for England, Wales, Scotland and Northern Ireland
2. Middle Layer Super Output Areas (MSOA) for England and Wales
3. Lower Layer Super Output Areas (LSOA) for England and Wales
4. Intermediate Zones (IZ) for Scotland

These data are obtained from participants’ self-reported addresses, collected during their registration to the Our Future Health programme. For details on how we process Participant geographies data and how we create the Participant geographies datasets, see our [#participant-geographies-data](#participant-geographies-data "mention").

#### What changes have been made as part of this release?

Participants who withdrew from the programme have been removed from Release 14.

The previous release expanded geographic coverage from country- and region-level data to include LSOAs, MSOAs, and Scottish Intermediate Zones, providing much finer geographic detail.

**Inclusion of Northern Ireland at the country and region level**

This release includes Northern Ireland (NI) at country and region level. At present, lower- and mid-level NI data are not available in the TRE due to limited coverage and cohort population distribution. These geographies are defined on smaller population and household counts than equivalent geographies in other UK nations. Therefore, adequate coverage must be ensured prior to safe release.

**Suppression rules**

We have updated our de-identification protocol. Small group sizes continue to be identified at the lowest geographical level (LSOAs and Data Zones), with suppression then applied at both low- and mid-level geographies (LSOAs/MSOAs and Intermediate Zones). Suppression is no longer carried through to country or region level, where data are available for all valid participants. Previously, suppression was applied at the lowest level and extended to country and region level.

**Data processing enhancements**

We have made minor improvements to our geocoding pipeline to enhance spatial accuracy and improve processing of large datasets. As a result, a small number of participants present in both releases may appear to have changed assigned areas between versions. Fewer than 10 participants are affected.

**Removal of exclusions based on postcode–coordinate mismatches**

In previous releases, coordinate-derived Output Area (OA) assignments were validated against the ONS Postcode Directory (ONSPD), and participants with mismatches were excluded from geography outputs. In this release, these exclusions have been removed. Observed discrepancies are expected due to differences in methodology: postcode-based assignment relies on postcode centroids, which can assign all addresses within a postcode to a single OA. In contrast, coordinate-based point-in-polygon methods assign individuals based on their exact spatial location. These approaches can therefore differ for postcodes that span OA boundaries, particularly in rural or irregularly shaped areas. For more information, see [Release 13](/our-future-health/data-releases/2025-data-releases/release-13.md#what-should-i-be-aware-of-when-working-with-the-participant-geographies-data-in-this-release)
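The difference between the two assignment methods can be illustrated with a toy example. This is a sketch under simplifying assumptions, not the production geocoding pipeline: the two square areas, the address coordinates, and the function names are all hypothetical, and the mean of the address coordinates stands in for a postcode centroid.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_area(x, y, areas):
    """Return the name of the first area containing the point, else None."""
    for name, polygon in areas.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


# Two adjacent hypothetical Output Areas sharing a boundary at x = 10.
AREAS = {
    "OA_WEST": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "OA_EAST": [(10, 0), (20, 0), (20, 10), (10, 10)],
}

# One hypothetical postcode whose addresses span the boundary.
addresses = [(7, 5), (9, 5), (12, 5)]
centroid = (
    sum(x for x, _ in addresses) / len(addresses),
    sum(y for _, y in addresses) / len(addresses),
)

centroid_area = assign_area(*centroid, AREAS)                 # postcode-based
coordinate_areas = [assign_area(x, y, AREAS) for x, y in addresses]
```

Here the centroid places every address in `OA_WEST`, while the coordinate-based method places the third address in `OA_EAST`, which is the kind of mismatch that previously led to exclusion.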

#### What should I be aware of when working with the Participant geographies data in this release?

Users should be aware that inclusion criteria, data processing, and geographic mapping methods may be refined in future releases. As a result, information and data completeness may change over time. Researchers should take this into account when interpreting or comparing data across releases.

**Missing data for partially withdrawn participants**

Participants who have fully or partially withdrawn from the programme are excluded from all Participant Geographies datasets. For partially withdrawn participants, data that have been collected and linked prior to withdrawal (such as address) are normally retained; however, the Participant Geographies datasets do not capture geographic information for these individuals. Consequently, data for a subset of partially withdrawn participants are not included in this release, affecting approximately 0.25% of eligible participants.

**Area coverage**

Participants are represented across all four devolved nations and all English regions. Coverage at finer geographic levels is uneven, and not all MSOAs, LSOAs, or Scottish Intermediate Zones are represented.

Overall, 95.3% of MSOAs (6,921 of 7,264) include more than 10 participants, with coverage higher in England (98.7%) and lower in Wales (37.7%).&#x20;

For LSOAs, overall coverage is 91.7% (32,696 of 35,672), with 95.1% of areas in England and 30.9% in Wales exceeding 10 participants.&#x20;

Scottish Intermediate Zones have lower coverage, with 30.6% of areas (408 of 1,332) including more than 10 participants.
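The coverage figures above are the share of areas that contain more than 10 participants. The sketch below shows that calculation and reproduces the published percentages from the stated counts; the function names and the small example dictionary are hypothetical.

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


def coverage(participants_per_area, total_areas, threshold=10):
    """Count areas with more than `threshold` participants and express
    them as a percentage of all areas of that type."""
    covered = sum(1 for n in participants_per_area.values() if n > threshold)
    return covered, pct(covered, total_areas)


# Published counts from this release.
msoa_pct = pct(6_921, 7_264)
lsoa_pct = pct(32_696, 35_672)
iz_pct = pct(408, 1_332)

# Hypothetical counts: two of four areas exceed the threshold.
example = coverage({"A": 11, "B": 10, "C": 250}, total_areas=4)
```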

Coverage is influenced by several factors:

* Participant density: some areas, particularly rural or sparsely populated regions, have few registered participants.
* Small-number suppression: areas with ten or fewer participants are removed to protect confidentiality.
* Urban clustering: in densely populated areas, participants may cluster in a few neighbourhoods, leading to uneven representation across adjacent areas.
* Programme enrolment patterns: geographic coverage will improve over time as more participants join the programme.

**Exclusion of participants living in Crown Dependencies**

Participants living in a Crown Dependency at the time of registration have been excluded from all datasets, including the Participant Geographies datasets. Fewer than 0.005% of eligible participants were excluded for this reason.

**Exclusion of participants who manually entered their address**

Our registration form uses the Ideal Postcodes API to validate participant addresses. Participants who manually entered their full address have been excluded from this release due to potential quality issues, formatting inconsistencies, or data capture errors, such as cases where postcodes could not be mapped to coordinates or only approximate matches were obtained. Fewer than 0.8% of eligible participants were excluded for this reason.

**Exclusion of Data Zones for Scotland**

To protect participant confidentiality and ensure statistical stability, a small-number suppression protocol is applied. Participants assigned to a lower-level area (e.g., LSOA or Data Zone) with ten or fewer individuals are excluded from that area and associated higher-level geographies.

Approximately 46% (3,422 of 7,392) of Scottish Data Zones contain ten or fewer participants. As Data Zones are smaller in population and household size than LSOAs and include a higher proportion of rural areas, they are not included in the current Participant Geographies release but may be considered for future releases once participant numbers and coverage are sufficient.
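The small-number suppression protocol described above can be sketched as follows. This is a simplified illustration rather than the de-identification pipeline itself: the function name, the input structure, and the area codes are hypothetical. Counts are taken at the lowest geographic level, and participants in areas at or below the threshold are dropped from both the low- and mid-level outputs; country and region outputs are not affected.

```python
from collections import Counter


def apply_suppression(assignments, threshold=10):
    """Drop participants whose low-level area has `threshold` or fewer people.

    assignments: dict mapping participant id -> (low_level_area, mid_level_area)
    Returns the retained assignments for the low- and mid-level tables.
    """
    counts = Counter(low for low, _ in assignments.values())
    return {
        pid: areas
        for pid, areas in assignments.items()
        if counts[areas[0]] > threshold
    }


# Hypothetical data: 11 participants in one area, 10 in another.
assignments = {f"A{i}": ("LSOA_A", "MSOA_1") for i in range(11)}
assignments.update({f"B{i}": ("LSOA_B", "MSOA_1") for i in range(10)})

retained = apply_suppression(assignments)
```

In this example the ten participants in `LSOA_B` are suppressed, so they are also absent from the `MSOA_1` output even though the MSOA as a whole exceeds the threshold.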

**Exclusion of Data Zones and Super Data Zones for Northern Ireland**

Participants from NI in the OFH cohort cover only approximately 40% of all Northern Ireland Data Zones (1,502 of 3,780).

At a threshold of ten or fewer participants, 1,364 of 1,502 Data Zones (90.7%) would be excluded, removing around 64% of participants. This would leave approximately 9% of zones and 36% of participants, making Data Zone level release impractical.

Because mid-level geographies are derived using suppression rules based on low-level counts, applying the same threshold would also substantially reduce availability at Super Data Zone level, leaving only around 2,000 participant records for release.

As outlined above, Northern Ireland statistical geographies are designed with smaller population counts, and OFH coverage is currently insufficient to support safe release at these finer levels. NI Data Zones will therefore not be included in the current Participant Geographies release, but may be considered in a future release once participant numbers and coverage are sufficient.

#### Participants who have registered more than once (participant geographies data)

As described on the [#participant-data](#participant-data "mention") page, we are aware that some individuals may have registered multiple times. Participants with multiple registrations in which they have provided identical or nearly identical personal information (name, address and date of birth) may have duplicate records in the participant geographies data.

Currently, it is not possible to identify these duplicate records from the participant geographies data directly. However, approved study applications which include linked data could use the linked health records to identify a large proportion of the duplicate records whenever submitted personal information is the same or similar (see [What should I be aware of when working with the linked health records data in this release?](https://ourfuturehealth.gitbook.io/our-future-health/data/data-releases/2025-data-releases/release-11#what-should-i-be-aware-of-when-working-with-the-linked-health-records-data-in-this-release)).

***

### Clinic measurements data

#### What information does the Clinic Measurements data contain?

For details on what information is included in the Clinic Measurements data see our [Clinic measurements data page](/our-future-health/data-types/clinic-measurements-data.md). This page covers how we:&#x20;

* de-identify data
* manage re-identification risk
* apply version control
* store the data in the TRE

For the current release, all participants must have attended a clinic appointment and have submitted a complete questionnaire on or before 24 March 2026.

#### What changes have been made as part of this release?

Participants who withdrew from the programme have been removed from Release 14.

**POCT lipid profile data**

Point-of-Care Testing (POCT) lipid profile measurements (Mission® Cholesterol Monitoring System) were collected during appointments between July 2022 and December 2024. These data are now available for a subset of participants within the Clinic Measurements dataset and are stored as a separate dataset from the anthropometric measurements. For more information see [POCT Lipid Profile data](/our-future-health/data-types/clinic-measurements-data/poct-lipid-profile-data.md).

**Clinic location and references**

We have added a set of three clinic-level variables to the Clinic Measurements dataset to enable grouping of appointments occurring within the same clinic and under similar operational conditions. All variables are pseudocoded and de-identified to remove any geographically identifiable information.

Together, these variables provide a hierarchical framework for assessing variation in the clinic measurements across organisational and operational levels, and may support partial adjustment for clustering in statistical analyses.

* `CLINIC_SITE_REF`**:** each clinic is assigned a unique clinic site reference. This variable captures overall between-clinic variation driven by differences such as staff, throughput, equipment, and operational conditions. It can be used alongside appointment date to approximate clinic volume and temporal variation.
* `CLINIC_PROVIDER`**:** this variable identifies the provider responsible for delivering the clinic service. It captures differences in delivery, workflows, resource allocation, and service design.
* `IS_MOBILE_CLINIC`**:**  a boolean variable indicating whether a clinic is delivered via a mobile unit (1 = True) or a static site (0 = False). Static sites are fixed physical clinic locations, while mobile clinics are delivered in purpose-built vehicles that operate across multiple locations.
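As a first look at clustering, measurements can be summarised within each level of the clinic hierarchy. The sketch below groups rows by the three clinic-level variables and computes a mean per group. The variable names `CLINIC_SITE_REF`, `CLINIC_PROVIDER` and `IS_MOBILE_CLINIC` come from this release; the measurement field `WAIST_CM`, the pseudocoded values, and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def site_means(rows, value_field):
    """Mean of `value_field` per (provider, site, mobile flag) group.

    rows: iterable of dicts, one per appointment. Missing values are skipped.
    """
    groups = defaultdict(list)
    for row in rows:
        value = row.get(value_field)
        if value is not None:
            key = (
                row["CLINIC_PROVIDER"],
                row["CLINIC_SITE_REF"],
                row["IS_MOBILE_CLINIC"],
            )
            groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical rows for illustration only.
rows = [
    {"CLINIC_PROVIDER": "P01", "CLINIC_SITE_REF": "S001", "IS_MOBILE_CLINIC": 0, "WAIST_CM": 80.0},
    {"CLINIC_PROVIDER": "P01", "CLINIC_SITE_REF": "S001", "IS_MOBILE_CLINIC": 0, "WAIST_CM": 90.0},
    {"CLINIC_PROVIDER": "P02", "CLINIC_SITE_REF": "S002", "IS_MOBILE_CLINIC": 1, "WAIST_CM": 100.0},
    {"CLINIC_PROVIDER": "P02", "CLINIC_SITE_REF": "S002", "IS_MOBILE_CLINIC": 1, "WAIST_CM": None},
]
means = site_means(rows, "WAIST_CM")
```

Group summaries like these are descriptive only; a formal analysis would more likely include the clinic variables as grouping factors in a multilevel model.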

**Limitations for mobile clinics**

Mobile clinics operate across multiple geographic locations; however, individual mobile units cannot be consistently tracked across locations due to the absence of a stable unit-level identifier. Clinic site references are assigned per clinic instance rather than per physical mobile unit, which limits linkage of repeated activity for the same unit across sites. As a result, for mobile clinics `CLINIC_SITE_REF` primarily captures between-location variation (e.g. differences in population demographics, accessibility, and setting), rather than variation attributable to specific mobile units or repeated site-level effects.

**Booking type**

Some of our mobile clinics also operate different booking models, including pre-booked appointments, walk-ins, or a combination of both at the same site. This information is not currently captured in the dataset, so these effects cannot be directly accounted for in analyses but may contribute to residual variation in outcomes. For instance, booking type may affect engagement, completion, and clinic workflow, with pre-booked participants typically more prepared and walk-ins showing more variable completion.

**Future improvements**

We recognise that additional information on geographic setting and operational structure could further improve characterisation of within- and between-site variation. Granular indicators such as clinic size, staffing composition (e.g., number and mix of staff), equipment and calibration procedures, and other local operational characteristics may help explain residual variation not captured by current variables. We are exploring options to incorporate additional location- and setting-related indicators in future releases, where available and appropriate.

#### What should I be aware of when working with the Clinic Measurements data in this release?

**Un-versioned updates to the appointments process**

The current versioning approach applied to the Clinic Measurements data table includes only two major versions, which can be used to identify whether or not a participant had an appointment that included heart rhythm or third heart readings. Smaller updates to the appointment process are not reflected in the version number. These un-versioned updates include:

* introducing XS and XL blood pressure cuffs
* changes to the order of measurements collected
* addition of specific instructions for obtaining readings from pregnant individuals

For more details on versioning, please refer to the section on [Change log for Clinic Measurements appointment processes](/our-future-health/data-types/clinic-measurements-data/change-log-for-clinic-measurements-appointment-processes.md).

**Multiple measurements obtained for heart readings**

During the original appointment process (version 1), the protocol for heart readings was to obtain only two measurements. However, in version 1, it was reported that staff occasionally took multiple readings and re-entered values for the first two measurements, attempting to achieve more typical results. To mitigate this, version 2 introduced the option for a third reading if abnormal measurements were recorded for the first two readings.

**Missing data for third heart readings**

Due to technical issues, software updates, or rare system failures, there may be isolated cases of data capture inconsistencies. As of appointment version 2, participants who have abnormal readings recorded for their first and second sets of heart measurements are offered the opportunity to provide a third set of measurements, as described in the section [Clinic Measurements data](/our-future-health/data-types/clinic-measurements-data.md#do-all-participants-provide-every-measurement).

However, we note two exceptions:

1. criteria met but data missing (false negative data): participants who meet the criteria for a third reading, but have no data for a third reading
2. criteria not met but data provided (false positive data): participants who do *not* meet the criteria but do have data for a third reading

This discrepancy affects fewer than 0.01% of records. The vast majority of participants who meet the criteria for third readings in version 2 have data recorded as expected.
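The two exceptions can be identified by comparing whether a record meets the third-reading criteria with whether third-reading data are present. The sketch below assumes the analyst has already derived both as booleans for version 2 appointments; the function name and return labels are hypothetical, and the criteria themselves are defined on the Clinic Measurements data page rather than here.

```python
def third_reading_status(criteria_met, has_third_reading):
    """Classify a version 2 record by third-reading eligibility and presence.

    criteria_met: True if the first two readings met the criteria for a third.
    has_third_reading: True if any third-reading data are recorded.
    """
    if criteria_met and not has_third_reading:
        return "criteria met, data missing"       # false negative
    if not criteria_met and has_third_reading:
        return "criteria not met, data provided"  # false positive
    return "as expected"
```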

**Data capture for height, weight and waist measurements**

During appointments, the following ranges are allowed for height, weight, and waist measurements:

* height: Between 90 and 299 centimetres
* weight: Between 20 and 400 kilograms
* waist circumference: Between 30 and 200 centimetres

These ranges are intentionally broad and may not always reflect biologically plausible measurements. The same boundaries are applied to both height and weight in the Our Future Health Baseline Questionnaire.&#x20;

We have identified infrequent outliers in the Clinic Measurements data that suggest occasional human error during data capture, affecting less than 1% of observations. These errors are likely to include:

* waist circumference may have been entered in inches instead of centimetres
* height and weight measurements may have been reversed, with height entered in the weight field and vice versa
* the same values may have been erroneously entered for multiple fields (e.g., height and weight, or height, weight, and waist)

No mitigation has been applied in the current release, meaning these issues will persist in the data.
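Because no cleaning has been applied, analysts may want to flag suspect records before use. The following sketch checks the capture ranges listed above and two of the error patterns. The range limits come from this page; the function name, flag labels, and the rule that a height value lower than the weight value suggests a swap are assumptions made for illustration. No rule is included for waist values entered in inches, since any cut-off would be arbitrary.

```python
# Capture ranges as documented for clinic appointments.
RANGES = {
    "height_cm": (90, 299),
    "weight_kg": (20, 400),
    "waist_cm": (30, 200),
}


def quality_flags(height_cm, weight_kg, waist_cm):
    """Return a list of data-quality flags for one set of measurements."""
    values = {"height_cm": height_cm, "weight_kg": weight_kg, "waist_cm": waist_cm}
    flags = []
    for name, (low, high) in RANGES.items():
        value = values[name]
        if value is not None and not (low <= value <= high):
            flags.append(f"{name}_out_of_range")
    if height_cm is not None and weight_kg is not None:
        if height_cm == weight_kg:
            # Same value entered in both fields.
            flags.append("height_equals_weight")
        elif height_cm < weight_kg:
            # Assumed heuristic: adult height in cm usually exceeds weight in kg.
            flags.append("possible_height_weight_swap")
    return flags
```

Flags like these identify candidates for review rather than confirmed errors, so thresholds should be adapted to the study population.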

To ensure accurate measurements are recorded, our data capture application and associated Standard Operating Procedures (SOPs) are continually updated with guidelines and prompts to assist in precise data collection. We are committed to addressing these data issues and may update our data cleansing rules in future releases.

#### Participants who have registered more than once (Clinic Measurements data)

As described on the [Participant data](/our-future-health/data-types/participant-data.md) page, we are aware that some individuals may have registered multiple times. This may mean that in a very small number of cases, the same person may have attended multiple in-person appointments under different registrations.&#x20;

Currently, it is not possible to identify these duplicate records from the Clinic Measurements data directly. Even where a participant may have attended multiple in-person appointments and had physical measurements taken, natural variation and measurement error will mean that it is unlikely that the measurements would be identical. However, approved study applications which include linked data could use the linked health records to identify a large proportion of the duplicate records whenever submitted personal information is the same or similar (see [#what-should-i-be-aware-of-when-working-with-the-linked-health-records-data-in-this-release](#what-should-i-be-aware-of-when-working-with-the-linked-health-records-data-in-this-release "mention")).

***

### Point-of-Care Testing (POCT) Lipid Profile data

This is the first release of the POCT Lipid Profile data, provided as a separate table alongside the Clinic Measurements dataset. All participants in the POCT dataset are also present in the Clinic Measurements dataset.&#x20;

POCT lipid data collection is no longer ongoing and was removed from the appointment process on 23 December 2024. The dataset therefore represents a historical collection of lipid measurements obtained during clinic appointments. The dataset may be updated over time, including the application of withdrawals, and revisions as data processing and quality control procedures are refined.

#### What information does the POCT Lipid Profile data contain?

Currently, the POCT Lipid Profile data contain measurements of blood lipids, which reflect cholesterol balance and cardiovascular risk. Variables are included for the following:

* total cholesterol (TC): overall cholesterol concentration in blood
* high-density lipoprotein cholesterol (HDL-C): cholesterol fraction involved in reverse cholesterol transport; higher levels are protective
* triglycerides (TG): circulating fats used for energy; elevated levels are associated with increased cardiovascular and metabolic risk
* low-density lipoprotein cholesterol (LDL-C): cholesterol fraction that transports cholesterol to tissues; higher levels increase cardiovascular risk
* non-HDL cholesterol: represents all atherogenic cholesterol
* TC:HDL-C ratio: reflects balance between atherogenic and protective lipoproteins, with higher values indicating higher risk

Further details on the POCT data collection process, including data de-identification, re-identification risk management, version control, and secure storage within the Trusted Research Environment (TRE), are provided here [#point-of-care-testing-poct-lipid-profile-data](#point-of-care-testing-poct-lipid-profile-data "mention").&#x20;

Additional information on the validation study is also available here: [POCT validation study](/our-future-health/data-types/clinic-measurements-data/poct-lipid-profile-data/poct-validation-study.md). The validation work assessed the accuracy, reliability, and comparability of POCT lipid measurements against established reference standards, ensuring suitability for research use.&#x20;

Details on clinic appointment procedures are provided here: [Procedure for Clinic Measurements](/our-future-health/data-types/clinic-measurements-data/procedure-for-clinic-measurements.md).

For the current release, all participants must have attended a clinic appointment and completed a questionnaire on or before 24 March 2026.

#### What should I be aware of when working with the POCT Lipid Profile data in this release?

**Changes to the appointments process (updating boundaries)**

The POCT Lipid Profile data is a subset of the Clinic Measurements dataset and uses the same appointment version indicator to describe substantive differences in the appointment process over time. Currently, there are two appointment versions, v1 and v2, which can be derived directly from `APPOINTMENT_VERSION`. For more details, see [Clinic Measurements data](/our-future-health/data-types/clinic-measurements-data.md#how-do-we-use-major-and-minor-versioning).

For POCT data specifically, the appointment versions indicate whether a participant’s appointment included repeat measurements triggered by elevated initial total cholesterol values:

* ≥ 7.5 mmol/L for participants aged < 30 years, or
* ≥ 9.0 mmol/L for participants aged ≥ 30 years
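As a minimal sketch of the repeat-measurement criteria listed above, the check below applies the two age-specific thresholds to a first total cholesterol reading. The function and parameter names are hypothetical and are not field names from the dataset.

```python
def meets_repeat_criteria(age_years: float, total_cholesterol_mmol_l: float) -> bool:
    """Return True if a first TC reading meets the repeat-measurement criteria."""
    # Threshold is 7.5 mmol/L for participants under 30, otherwise 9.0 mmol/L.
    threshold = 7.5 if age_years < 30 else 9.0
    return total_cholesterol_mmol_l >= threshold
```

For example, `meets_repeat_criteria(25, 7.5)` is true, while the same reading for a 30-year-old participant is not.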

**Changes to data capture system input ranges**

The input range limits within the data capture system used to record measurements (the Clinical Staff Application, CSA) have changed over time. A very small proportion (approximately 0.8%) of measurements were recorded on early versions of the CSA which had input ranges that were, in some cases, narrower than the analytical ranges of the POCT device. These limits were later revised to ensure that the full analytical range of the device could be captured in the CSA.&#x20;

Note that these changes are not reflected in the `APPOINTMENT_VERSION` indicator provided with the dataset. Researchers can distinguish between the pre- and post-change periods using the implementation date of **9 November 2022**, with all appointments on or after that date using the wider input ranges. Further details can be found [here](/our-future-health/data-types/clinic-measurements-data/poct-lipid-profile-data.md#how-did-we-process-the-data-for-each-release).

**Exclusion of records with values outside the analytical range**

A small number of records contained lipid values outside the range of the Mission® POCT device. This reflects updates to the CSA, which allowed entry of values beyond the device’s supported range as described above. The absolute limits of the device are included in the table here.&#x20;

*Mission® POCT device minimum and maximum values (mmol/L)*&#x20;

| Field             | Min, mmol/L | Max, mmol/L |
| ----------------- | ----------- | ----------- |
| Total Cholesterol | 2.59        | 12.93       |
| HDL-C             | 0.39        | 2.59        |
| Triglycerides     | 0.51        | 7.34        |

Values outside these ranges are considered unreliable and may reflect device limitations or data entry errors. Records were excluded if any lipid measurement (from first or repeat readings) fell outside these ranges. Values exactly at the limits were retained.&#x20;

This affected a very small proportion of records (<0.05%).
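The exclusion rule above can be sketched as follows. This is an illustration only: the dictionary keys (`tc`, `hdl_c`, `tg`) are hypothetical names, and skipping missing values is an assumption rather than a documented processing step.

```python
# Analytical limits of the POCT device in mmol/L, copied from the table above.
DEVICE_LIMITS_MMOL_L = {
    "tc": (2.59, 12.93),
    "hdl_c": (0.39, 2.59),
    "tg": (0.51, 7.34),
}


def record_within_range(readings):
    """Return True if every non-missing value in every reading is within limits.

    `readings` is a list of dicts: the first reading, then any repeat reading.
    """
    for reading in readings:
        for analyte, (low, high) in DEVICE_LIMITS_MMOL_L.items():
            value = reading.get(analyte)
            if value is None:
                continue  # assumption: missing values do not trigger exclusion
            if value < low or value > high:
                return False  # strictly outside the limits
    return True  # values exactly at a limit are retained
```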

**Exclusion of records where HDL-C is more than TC**

Records were excluded where high-density lipoprotein cholesterol (HDL-C) exceeded total cholesterol (TC), as this is not plausible and is indicative of potential data capture, data entry, or processing errors. This was assessed across all available readings (first and, where applicable, repeat measurements). Where this occurred, the entire participant record was excluded from the dataset.&#x20;

This affected a very small proportion of records (<0.05%).

**Handling of boundary values**

Boundary values are defined as lipid measurements recorded at exactly the minimum or maximum analytical limits of the Mission® POCT device for total cholesterol (TC), high-density lipoprotein cholesterol (HDL-C), and triglycerides (TG).

When a measurement falls outside the measurable range of the device (e.g., 2.59 - 12.93 mmol/L for TC), the result is displayed on the device as the minimum or maximum boundary value together with a 'less than' (<) or 'greater than' (>) indicator, respectively, rather than as a precise numeric value. For example, a measurement of total cholesterol that is above the analytical range maximum of 12.93 mmol/L would be displayed as ">12.93". We observed an excess of these boundary values in the POCT data, likely due to out-of-range measurements being recorded as the boundary value alone. Consequently, although boundary values may sometimes be specific measurements, they will more often correspond to values outside the measurable range.

Overall, 4.8% of participants had at least one lipid measurement at a boundary in first or repeat readings, with a smaller proportion observed in repeat measurements (0.7%). Boundary values were more common in first readings and varied by analyte, with HDL-C showing the greatest clustering at the upper limit and TG and TC more often clustering at lower limits.

Boundary values should therefore be interpreted with caution as they may not reflect precise lipid concentrations.
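Researchers who want to flag boundary values before analysis could use something like the sketch below. The analyte keys are hypothetical, and comparing with a small tolerance is an assumption made to avoid floating-point surprises, not a description of how the data were processed.

```python
import math

# Analytical limits of the POCT device in mmol/L (minimum, maximum).
DEVICE_LIMITS_MMOL_L = {
    "tc": (2.59, 12.93),
    "hdl_c": (0.39, 2.59),
    "tg": (0.51, 7.34),
}


def boundary_flags(reading):
    """Map each analyte to 'lower', 'upper', or None for a single reading."""
    flags = {}
    for analyte, (low, high) in DEVICE_LIMITS_MMOL_L.items():
        value = reading.get(analyte)
        if value is None:
            flags[analyte] = None
        elif math.isclose(value, low, abs_tol=1e-9):
            flags[analyte] = "lower"
        elif math.isclose(value, high, abs_tol=1e-9):
            flags[analyte] = "upper"
        else:
            flags[analyte] = None
    return flags
```

A flagged value may be a censored measurement ("<" or ">" on the device), so sensitivity analyses that exclude or separately model flagged records may be worth considering.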

**Derived variables**&#x20;

The Mission® device provides its own outputs for LDL-C, non-HDL cholesterol, and the TC:HDL-C ratio. In this dataset, however, these variables were independently recalculated from the primary lipid measurements (TC, HDL-C, and TG), as described in [POCT Lipid Profile data](/our-future-health/data-types/clinic-measurements-data/poct-lipid-profile-data.md#what-additional-variables-are-derived-from-the-poct-lipid-profile).

**Inclusion criteria for calculating derived variables**

In light of the issues described above, derived lipid variables (including LDL-C, non-HDL-C, and the TC:HDL-C ratio) are not calculated for participants with any lipid measurement recorded exactly at the minimum or maximum boundary allowed by the device. This exclusion is applied where any available reading (first and, where applicable, repeat measurements) falls at a boundary value.

In doing so, we have excluded derived variables for approximately 4.8% of participants (as described above). This approach is used to reduce the risk of incorporating potentially misclassified boundary entries into derived calculations. Where required, users may recalculate derived variables independently using the raw lipid measurements.
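A minimal sketch of recalculating two of the derived variables from raw values is shown below. It uses the standard definitions (non-HDL cholesterol as TC minus HDL-C, and the ratio as TC divided by HDL-C) and returns nothing when any input sits on a device boundary, mirroring the rule above. LDL-C is left out because its formula is documented on the linked page rather than in this section. All names are hypothetical.

```python
import math

# Analytical limits of the POCT device in mmol/L (minimum, maximum).
DEVICE_LIMITS_MMOL_L = {
    "tc": (2.59, 12.93),
    "hdl_c": (0.39, 2.59),
    "tg": (0.51, 7.34),
}


def at_boundary(analyte, value):
    """Return True if a value equals the device minimum or maximum."""
    return any(math.isclose(value, limit, abs_tol=1e-9)
               for limit in DEVICE_LIMITS_MMOL_L[analyte])


def derive_lipid_variables(tc, hdl_c, tg):
    """Return non-HDL cholesterol and TC:HDL-C ratio, or None at a boundary."""
    raw = {"tc": tc, "hdl_c": hdl_c, "tg": tg}
    if any(at_boundary(name, value) for name, value in raw.items()):
        return None  # derived variables are not calculated for boundary values
    return {
        "non_hdl_c": tc - hdl_c,      # mmol/L
        "tc_hdl_ratio": tc / hdl_c,   # unitless
    }
```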

**False positive data**&#x20;

As of appointment version 2, participants with a high POCT total cholesterol reading are offered a second set of measurements, as described above.

However, we note two exceptions:

1. criteria met but data missing (false negative data): participants who meet the criteria for a second POCT reading, but have no data for second readings
2. criteria not met but data provided (false positive data): participants who do *not* meet the criteria but do have data for a second reading

This discrepancy affects fewer than 0.05% of records. The vast majority of participants who meet the criteria for second POCT measurements in version 2 have data recorded as expected.
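The two exceptions above can be identified with a simple cross-check between eligibility and data presence. This is a sketch with hypothetical names; how eligibility and presence are determined from the dataset fields is left to the analyst.

```python
def classify_second_reading(criteria_met: bool, has_second_reading: bool) -> str:
    """Label a version 2 record by whether repeat data match the criteria."""
    if criteria_met and not has_second_reading:
        return "false_negative"  # criteria met but data missing
    if not criteria_met and has_second_reading:
        return "false_positive"  # criteria not met but data provided
    return "as_expected"
```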

**Agreement between first and second readings**

Agreement between first and repeat measurements was tested using Spearman’s rank correlation coefficient. Among the 8,510 participants with both initial and repeat POCT measurements, most showed good agreement between readings. The proportion of participants with less than 10% variation between measurements was 76% for total cholesterol, 84% for HDL, and 62% for triglycerides. TG demonstrated the greatest variability, consistent with known biological and analytical variation.&#x20;

Repeat measurements are not randomly distributed and are enriched for participants with elevated total cholesterol levels. As a result, they tend to be more variable and extreme. No direct corrections were applied to reconcile differences between initial and repeat measurements; however, implausible discrepancies are largely addressed through other exclusion criteria, such as removal of out-of-range values.

**Missing reason for skipped second readings**

Participants may skip first or second readings. When a measurement is skipped, a reason is usually recorded (e.g., device failure, insufficient sample, or participant refusal). However, for skipped second readings, the reason is not currently captured in the dataset. This information will be included in a future release.

**Other data entry errors**

It is possible that, in a very small number of instances, errors occurred during manual entry of values into the data capture system. For POCT lipid measurements obtained directly from the device (TC, HDL-C, and TG), errors may include:

* incorrect numeric entry (e.g., misplaced decimal points)
* selection of incorrect input fields during data entry or device interface navigation (e.g., entering a derived value such as a ratio instead of a raw lipid measurement)
* transcription errors during transfer of device-reported values (i.e., incorrect copying of values from the device screen into the data capture system)
* duplication of values across multiple fields (e.g., identical values entered for more than one lipid measure)

To ensure accurate measurements were recorded, our data capture application and associated Standard Operating Procedures (SOPs) were continually updated with guidelines and prompts to assist in precise data collection. We are committed to addressing these data issues and may update our data cleansing rules in future releases.

#### Participants who have registered more than once (POCT lipid profile data)

Participants who have registered more than once are handled in the same way as in the Clinic Measurements dataset (see [#participants-who-have-registered-more-than-once-clinic-measurements-data](#participants-who-have-registered-more-than-once-clinic-measurements-data "mention")).

***

### Genetic data&#x20;

#### Genotype array data

There are two main categories of files included in the current release:

* Two sets of files containing participant genotypes (SNV pVCF and BGEN), one region index file, and one PCA loadings file for variants used in the PCA and kinship analysis.
* Two files containing sample-level information (sample QC metrics and kinship table).&#x20;

Each set is provided as a separate data ‘entity’ within the Trusted Research Environment. Each participant is represented by a single sample in each file in the genotype data. The primary identifier is the participant ID (PID) which can be used to link these files with the Participant and Questionnaire tables.

The two sets of genotype files, pVCF and BGEN, contain the same genotypes for the same participants and genetic variants. Each file set is split by region within each chromosome (22 autosomal chromosomes, the two sex chromosomes 'X' and 'Y', and mitochondrial 'MT'), giving 160 separate files per set. Each genotype file has an associated pVCF index file (.tbi) or BGEN index file (.bgi) specific to that chromosome region, and each BGEN file has an accompanying sample file (.sample). We provide both file types for the convenience of researchers using the data. Loadings for 40 principal components (PCs) are provided per variant in VCF format with an associated index file. The sample-level information file contains information useful for quality control (QC) purposes, such as batch, estimated genetic sex, principal components and call rate. The regional index BED file contains the chromosome and genomic coordinates of variants present within each pVCF or BGEN file. The kinship file lists the pairwise related samples.

<figure><img src="/files/AUnvCL7qlm5s1fIC5b4C" alt=""><figcaption><p>Overview of genotype array data files</p></figcaption></figure>

| Category               | Type            | Version  | Entity              | File name                          | Number of files | Description                                                  |
| ---------------------- | --------------- | -------- | ------------------- | ---------------------------------- | --------------- | ------------------------------------------------------------ |
| SNV pVCF               | pVCF            | VCF 4.1  | snv\_pvcf           | ofh\_snv.chrZ-bXXXX.vcf.gz         | 160             | pVCF containing SNV genotypes and metadata                   |
| SNV pVCF               | pVCF index file | -        | snv\_pvcf           | ofh\_snv.chrZ-bXXXX.vcf.gz.tbi     | 160             | pVCF-associated index file                                   |
| BGEN                   | BGEN fileset    | BGEN 1.2 | snv\_bgen           | ofh\_snv.chrZ-bXXXX.bgen           | 160             | BGEN file containing SNV genotypes                           |
| BGEN                   | BGEN fileset    | BGEN 1.2 | snv\_bgen           | ofh\_snv.chrZ-bXXXX.sample         | 160             | BGEN-associated sample file                                  |
| BGEN                   | BGEN index file | -        | snv\_bgen           | ofh\_snv.chrZ-bXXXX.bgen.bgi       | 160             | BGEN-associated index file                                   |
| Sample QC              | QC metrics      | -        | sample\_qc\_metrics | ofh\_sample\_qc\_metrics.tsv       | 1               | Plain-text tabular file with sample-level information        |
| Regions index BED file | BED file        | -        | snv\_resources      | ofh\_snv\_regions.bed              | 1               | Plain-text tabular file in BED file format                   |
| Kinship                |                 | -        | Sample\_qc\_metrics | ofh\_snv\_kinship.txt              | 1               | Plain-text tabular file with sample-level information        |
| PCA Loadings           | VCF file        | VCF 4.2  | Sample\_qc\_metrics | ofh\_snv\_pca\_loadings.vcf.gz     | 1               | VCF file with 40 PC loadings per variant                     |
| PCA Loadings           | VCF index file  |          |                     | ofh\_snv\_pca\_loadings.vcf.gz.tbi | 1               | vcf index file                                               |

File names in this data release include the following components:

* an indicator that the data comes from Our Future Health participants (“ofh\_”)
* an indicator of file contents (“snv” or “sample\_qc\_metrics”)
* the chromosome number (“.chrZ”) for the genetic variants contained within that file, where Z can be any of 1-22, 'X', 'Y' or 'MT'
* the region identifier (-bXXXX) which maps to the genomic coordinates in the BED file
* a suffix representing the file type, e.g. “.vcf.gz”, per the specifications of pVCF or BGEN file sets, or “.tsv” for tab-separated values

#### What information do the genotype files contain?

Both the pVCF and BGEN file sets contain genotypes from all participants in the release and include genotype calls for all 22 autosomal chromosomes, X, Y and MT (mitochondrial). All genotypes are for single-nucleotide polymorphisms (SNPs) or small insertion-deletions (INDELs) aligned to the GRCh38 human reference genome. In the BGEN file set, estimated genetic sex is included in the BGEN-associated sample file.

For more information on the fields present in each genotype file, please refer to the genotype data tab of the Our Future Health data dictionary. For more information on the exact genetic variants present in each genotype file, refer to the CPRA variant list (GRCh38). Both these files can be found on the [Data and cohort page of our website (external link)](https://research.ourfuturehealth.org.uk/data-and-cohort/).

#### What information does the sample QC file contain?

The sample QC file contains basic sample-level information useful for QC purposes or batch effect adjustments and the first 40 principal components. For more information on the fields present in the file, please refer to the genotype tab of the Our Future Health data dictionary located on our [Data and cohort page (external link)](https://research.ourfuturehealth.org.uk/data-and-cohort/).

#### What information does the kinship file contain?

The kinship file contains the following columns:

<table><thead><tr><th width="95.00390625">Field</th><th width="97.0703125">Type</th><th>Description</th></tr></thead><tbody><tr><td>ID1</td><td>string</td><td>Sample_id for individual 1 in related pair</td></tr><tr><td>ID2</td><td>string</td><td>Sample_id for individual 2 in related pair</td></tr><tr><td>HetHet</td><td>numeric</td><td>Fraction of markers for which the pair both have a heterozygous genotype (output from KING software)</td></tr><tr><td>IBS0</td><td>numeric</td><td>Fraction of markers for which the pair shares zero alleles (output from KING software)</td></tr><tr><td>Kinship</td><td>numeric</td><td>Estimate of the kinship coefficient for this pair based on the set of markers used in the kinship inference (Output from KING software). The set of markers is indicated by the field used in kinship inference.</td></tr></tbody></table>
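As an illustration of working with these columns, the sketch below labels each pair with a degree of relatedness from its `Kinship` value. The cut-offs are the conventional inference ranges from the KING documentation (>0.354, 0.177 to 0.354, 0.0884 to 0.177, 0.0442 to 0.0884); they are an assumption here and may not match the thresholds applied in this release. Tab-delimited input with a header row is also an assumption about the file layout.

```python
import csv
import io


def relatedness_degree(kinship: float) -> str:
    """Map a KING kinship coefficient to a conventional relatedness label."""
    if kinship > 0.354:
        return "duplicate_or_mz_twin"
    if kinship >= 0.177:
        return "first_degree"
    if kinship >= 0.0884:
        return "second_degree"
    if kinship >= 0.0442:
        return "third_degree"
    return "more_distant_or_unrelated"


def classify_pairs(text: str):
    """Parse tab-delimited kinship rows and label each pair."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        (row["ID1"], row["ID2"], relatedness_degree(float(row["Kinship"])))
        for row in reader
    ]
```

Pairs labelled as duplicates deserve particular care, since they may reflect repeat registrations rather than identical twins.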

#### What should I be aware of when working with the genotype data in this release?

This data release includes sample QC results for kinship and principal components, and limited outputs from the genotype calling process, including call rate, genetic sex, and related aggregate genotyping QC metrics. No variant QC results are provided. In this release, we are providing principal components, sample heterozygosity, kinship and variant Hardy-Weinberg statistics. Users are reminded to conduct their own comprehensive statistical QC appropriately tailored to their research question.&#x20;

We note the following issues with the current data release which we aim to address in future releases:

* Estimated call rate is based on all chromosomes, including both the X and Y chromosomes (median call rates for females may be lower than for males due to Y chromosome missingness).
* The date string in the batch identifier within the sample QC metrics file is the date of genotype calling and not the date on which the genotyping assay started in the laboratory. Genotype calling using the intensity data files of some samples occurred considerably later than the laboratory assay was completed, so this date cannot be used for statistical adjustment of between-day variation introduced at genotyping stage.
* The presence of non-haploid genotypes ('0/1', '0/0' and '1/1') was observed for Y and MT chromosome variants, affecting \~0.25% of Y chromosome genotype calls and a smaller proportion for MT, for both female and male samples, arising from low or noisy probe intensities for some genetic variants. Non-haploid genotypes occur outside of the pseudoautosomal regions (PARs) for the Y chromosome. These non-haploid genotype calls for haploid chromosomes should be treated as missing (no call). Note that some tools, such as plink2 or qctool may error or display unexpected behaviour when processing Y or MT chromosome files, due to the presence of these non-haploid genotypes. In future releases, non-haploid calls for Y and MT genetic variants will be set to missing in both the pVCF and BGEN files prior to data release.
* A small number of genetic variants were found to have been incorrectly reported in the pVCF and BGEN files, resulting from multi-mapping probes (where a probe sequence maps to multiple locations in the genome) or multi-base SNP targets which were misaligned to the reference genome during genotype calling. All genetic variants that were found to be incorrectly reported have been removed from this release. We hope to correct the reporting of these variants in a future release but in the meantime, please refer to the imputed data release to determine if a removed variant is available.
* Changes in laboratory reagents aimed at optimising genotyping as well as continual improvement in laboratory processes mean that some variation in the call rate distribution is evident between batches and across time. Future further optimised cluster files for genotype re-calling will likely reduce the magnitude of these differences.
* A small number of samples (78) were estimated to have an implausibly large number of third-degree (or closer) relatives. We further identified 129 sample pairs with IBS0 = 0, a pattern typically compatible with a parent–offspring relationship, but whose kinship coefficients were below the first-degree threshold, falling instead within the second-degree range. These pairs were therefore classified as second-degree relatives in the genotype data release. Researchers should aim to further investigate and appropriately action any sample exclusions based on calculated genetic relatedness in the cohort when conducting their own statistical QC.
* As described on the Participant data page, we are aware that some individuals may have registered multiple times. In a very small number of cases, this may mean that the same person attended multiple in-person appointments and provided multiple blood samples under different registrations. Samples detected as genetic triplets, quadruplets and quintuplets have been excluded from the genetic data release. However, some records may be detected as genetic duplicates. Such samples should be treated with caution, as they may have arisen from participants registering multiple times, and should not be considered identical twins without further confirmation. Approved study applications which include linked data could use the linked health records to identify a large proportion of the duplicate records whenever submitted personal information is the same or similar (see [#what-should-i-be-aware-of-when-working-with-the-linked-health-records-data-in-this-release](#what-should-i-be-aware-of-when-working-with-the-linked-health-records-data-in-this-release "mention")).
* Due to the complexity of genotype calling at multi-allelic loci, variants with more than one alternate allele are not currently fully supported by Illumina’s genotype caller. We previously noted a ceiling effect in the allele frequencies of some multi-allelic variants (N=499), where these variants had exclusively heterozygous genotypes. Illumina genotype calling software currently compares homozygous clusters across assays at the multi-allelic locus, interprets any conflicting calls between the major alternate allele and the minor alternate allele(s) as ambiguous, and sets these genotypes to missing (./.). Illumina have cautioned that all multi-allelic variants may be affected in ways that are less obvious than for the 499 we previously noted, especially if the frequency of the alternate alleles is low. To avoid ambiguous and inaccurate calls, we therefore decided to remove data for all multi-allelic loci in this genotype array data release. Should a solution become available, we will revisit multi-allelic loci calling in future releases. In the meantime, we recommend that researchers interested in multi-allelic loci use our imputed data release instead.
* Eight pathogenic variants included in the array appeared as homozygous for the alternate allele in almost every case. Investigation by Illumina confirmed that these variants were being called incorrectly. Five of the eight probes mapped to at least one additional locus, two of the probes were found to be inaccurate (likely due to high GC content in the probe and the surrounding region), and one variant remains under investigation, but may be the result of a manifest-related issue. The affected variants are labelled as inaccurate in the CPRA variant list. You can download this file from our [Data and cohort page (external link)](https://research.ourfuturehealth.org.uk/data-and-cohort/). We aim to resolve this issue in future releases of genotype data.
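For the non-haploid Y and MT calls described in the list above, a minimal sketch of setting such calls to missing is shown below. It operates on the text of a single pVCF sample field and assumes that GT is the first colon-separated subfield, as the VCF specification requires when GT is present. The function names are hypothetical, and dedicated tools may represent missing haploid calls differently.

```python
def haploid_or_missing(gt: str) -> str:
    """Return a haploid call unchanged; turn any diploid-style call into '.'."""
    if "/" in gt or "|" in gt:
        return "."  # e.g. '0/1', '0/0', '1/1' on Y or MT become missing
    return gt


def mask_sample_field(sample_field: str) -> str:
    """Apply the masking to the GT subfield of one pVCF sample column."""
    subfields = sample_field.split(":")
    subfields[0] = haploid_or_missing(subfields[0])
    return ":".join(subfields)
```

This should only be applied to Y and MT chromosome files; autosomal and X chromosome calls are expected to be diploid in at least some samples.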

#### Imputed genetic data

The current release includes one file containing sample-level QC information, two sets of files containing imputed genetic data for participants in different formats, and an additional file with variant summary data. Each set is provided separately within the Trusted Research Environment. Each participant is represented by a single sample in each file in the imputed genetic data. The primary identifier is the participant ID (PID) which can be used to link these files with the Participant and Questionnaire tables.

The two sets of imputed genetic data files, pVCF and BGEN, contain the same genotypes for the same participants and genetic variants. Each file set is split across regions for each chromosome (22 autosomal chromosomes and chromosome X), totalling 809 separate files per set. Each file of imputed genetic data has an associated pVCF index file (.tbi) or BGEN index file (.bgi) specific to that chromosome region, and each BGEN file has an accompanying sample file (.sample). We provide both file types for the convenience of researchers using the data. We also provide a regional index BED file which contains the chromosome and genomic coordinates of variants present within each pVCF or BGEN file. Sample- and variant-level files provide additional information for quality control (QC) purposes. The sample-level information file contains information such as batch, imputation group and estimated genetic sex. The variant-level summary data file (VCF) contains information on dosage r<sup>2</sup> and ALT allele frequencies for all variants and has an accompanying index file (.tbi).&#x20;

<figure><img src="/files/z85v0zhE3uffyKpRKNYi" alt=""><figcaption><p>Overview of imputed genotype data files</p></figcaption></figure>

**Table - File names for imputed genotype data**

<table><thead><tr><th valign="top">Category</th><th valign="top">Type</th><th valign="top">Version</th><th valign="top">Entity</th><th valign="top">File name</th><th valign="top">Number of files</th><th valign="top">Description</th></tr></thead><tbody><tr><td valign="top">SNV pVCF</td><td valign="top">pVCF</td><td valign="top">VCF 4.1</td><td valign="top">imputed_pvcf</td><td valign="top">ofh_imputed.v6.chrZ-bXXXX.vcf.gz</td><td valign="top">809</td><td valign="top">pVCF containing imputed genotypes and metadata</td></tr><tr><td valign="top">SNV pVCF</td><td valign="top">pVCF index file</td><td valign="top">-</td><td valign="top">imputed_pvcf</td><td valign="top">ofh_imputed.v6.chrZ-bXXXX.vcf.gz.tbi</td><td valign="top">809</td><td valign="top">pVCF-associated index file</td></tr><tr><td valign="top">BGEN</td><td valign="top">BGEN fileset</td><td valign="top">BGEN 1.2</td><td valign="top">imputed_bgen</td><td valign="top">ofh_imputed.v6.chrZ-bXXXX.bgen</td><td valign="top">809</td><td valign="top">BGEN file containing imputed genotypes</td></tr><tr><td valign="top">BGEN</td><td valign="top">BGEN fileset</td><td valign="top">BGEN 1.2</td><td valign="top">imputed_bgen</td><td valign="top">ofh_imputed.v6.chrZ-bXXXX.sample</td><td valign="top">809</td><td valign="top">BGEN-associated sample file</td></tr><tr><td valign="top">BGEN</td><td valign="top">BGEN index file</td><td valign="top">-</td><td valign="top">imputed_bgen</td><td valign="top">ofh_imputed.v6.chrZ-bXXXX.bgen.bgi</td><td valign="top">809</td><td valign="top">BGEN-associated index file</td></tr><tr><td valign="top">Sample QC</td><td valign="top">QC metrics</td><td valign="top">-</td><td valign="top">sample_qc_metrics</td><td valign="top">ofh_imputed_sample_qc_metrics.v6.tsv</td><td valign="top">1</td><td valign="top">Plain-text tabular file with sample-level information</td></tr><tr><td valign="top">Variant summary statistics</td><td valign="top">VCF</td><td valign="top">VCF 4.2</td><td valign="top">imputed_resources</td><td valign="top">ofh_imputed_variant_summary_stats.v6.vcf.gz</td><td valign="top">1</td><td valign="top">VCF containing variant-level summary statistics</td></tr><tr><td valign="top">Variant summary statistics</td><td valign="top">VCF index file</td><td valign="top">-</td><td valign="top">imputed_resources</td><td valign="top">ofh_imputed_variant_summary_stats.v6.vcf.gz.tbi</td><td valign="top">1</td><td valign="top">VCF-associated index file</td></tr><tr><td valign="top">Regions index BED file</td><td valign="top">BED file</td><td valign="top">-</td><td valign="top">imputed_resources</td><td valign="top">ofh_imputed_regions.v6.bed</td><td valign="top">1</td><td valign="top">Plain-text tabular file in BED file format</td></tr></tbody></table>

File names in this data release include the following components:

* an indicator that the data comes from Our Future Health participants (“ofh\_”)
* an indicator of file contents for imputed genetic data (“imputed”)
* the version number (“.v6”) of the imputed genotype data release, to be incremented with each release
* the chromosome number (“.chrZ”) for the genetic variants contained within that file, where Z can be any of 1-22, or 'X'
* the region identifier (-bXXXX) which maps to the genomic coordinates in the BED file
* a suffix representing the file type, e.g. “.vcf.gz”, per the specifications of pVCF or BGEN file sets, or “.tsv” for tab-separated values
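The naming components listed above can be pulled apart with a regular expression, which is handy when looping over many regional files. The sketch below covers the per-region genotype files for both the array (`snv`) and imputed releases. The exact form of the region identifier is not specified here, so the pattern simply accepts any characters up to the next dot; the example file names in the usage note are made up for illustration.

```python
import re

# Components: ofh_<content>[.vN].chr<Z>-b<region>.<suffix>
FILE_NAME_PATTERN = re.compile(
    r"ofh_(?P<content>snv|imputed)"
    r"(?:\.(?P<version>v\d+))?"
    r"\.chr(?P<chrom>[0-9]{1,2}|X|Y|MT)"
    r"-b(?P<region>[^.]+)"
    r"\.(?P<suffix>vcf\.gz(?:\.tbi)?|bgen(?:\.bgi)?|sample)"
)


def parse_genotype_file_name(name: str):
    """Return the name components as a dict, or None if the name does not match."""
    match = FILE_NAME_PATTERN.fullmatch(name)
    return match.groupdict() if match else None
```

For instance, `parse_genotype_file_name("ofh_imputed.v6.chr7-b0012.vcf.gz.tbi")` yields the content type `imputed`, version `v6`, chromosome `7`, region `0012` and suffix `vcf.gz.tbi`.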

**What information do the imputed genotype files contain?**

Both the pVCF and BGEN file sets contain genotypes from all participants in the release across all 22 autosomal chromosomes and chromosome X (non-PAR). Genotypes are provided in GT:GP format, where GT is the thresholded genotype call and GP is the imputed genotype probability. All genotypes are for SNPs or small indels aligned to the GRCh38 human reference genome. Multi-allelic variants have been split into separate biallelic records. In the BGEN fileset, estimated genetic sex is included in the BGEN-associated sample file. The fields in each genotype file have been summarised in the genotype tab of the Our Future Health data dictionary. The CPRA variant list provides a list of the genetic variants included in the imputed dataset in CHR:POS:REF:ALT (chromosome, position, reference allele and alternate allele) format (GRCh38). Both files can be found on the [Data and cohort page of our website (external link)](https://research.ourfuturehealth.org.uk/data-and-cohort/).
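Many analyses use an expected alternate-allele dosage rather than the thresholded call. As a sketch, the dosage can be computed from the GP subfield as the probability-weighted count of alternate alleles, assuming the genotype probabilities are ordered by increasing alternate-allele count (as in the VCF specification for biallelic records). The function name is hypothetical, and dedicated tools are preferable for whole-file processing.

```python
def alt_dosage(sample_field: str, format_keys: str = "GT:GP") -> float:
    """Expected ALT allele count from the GP subfield of one sample column."""
    values = dict(zip(format_keys.split(":"), sample_field.split(":")))
    probabilities = [float(p) for p in values["GP"].split(",")]
    # Index 0 is zero ALT alleles, index 1 is one, index 2 (if present) is two.
    return sum(n_alt * p for n_alt, p in enumerate(probabilities))
```

A field with only two probabilities, which may occur for haploid calls, gives a dosage between 0 and 1 under the same formula.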

**What information does the imputed sample QC file contain?**

The sample QC file contains basic sample-level information useful for QC purposes or potential imputation group effect adjustments. For more information on the fields present in the file, please refer to the genotype tab of the Our Future Health data dictionary located on our [Data and cohort page (external link)](https://research.ourfuturehealth.org.uk/data-and-cohort/).

**What information does the variant summary statistics file contain?**

The variant summary statistics file contains estimated metrics on the variant level in VCF format, useful for QC purposes. This includes the dosage r<sup>2</sup>, ALT allele frequencies and the number of groups which were imputed or directly genotyped for the variant. Further information on the fields present in this file can be found in the genotype tab of the Our Future Health data dictionary located on our [Data and cohort page (external link)](https://research.ourfuturehealth.org.uk/data-and-cohort/).

**What should I be aware of when working with the imputed genotype data in this release?**

Sample QC results are limited to those provided in the sample QC file and additional resource files including principal components and sample relatedness (kinship) data. Variant metrics are limited to those provided in the variant summary statistics file. We have conducted several targeted statistical checks on the data, including assessment of principal components, sample heterozygosity, sample relatedness and variant Hardy-Weinberg statistics. We plan to include additional outputs of these analyses in a future release. Users are reminded to conduct their own comprehensive statistical QC appropriately tailored to their research question.

#### Genetic ancestry data

**How did we process the data for this release?**

Inferred genetic ancestry data was received from Genomics Ltd for 147 groups of samples in the form of a .tsv file for each group. Each file was merged to create a single .tsv file with genetic ancestry information for all samples. Evidence of batch effects between the groupings was assessed by plotting principal components derived from OFH genetic data and colouring according to the Genomics group ID of the participants. This was also done separately for each of the 25 sub-regions for the top 3 PCs. In doing so, no individual clusters that could indicate evidence of batch effects were observed.

<figure><img src="/files/WhgtJj4KsdV6qHh1AXXl" alt=""><figcaption><p>Our Future Health top 6 principal components coloured by Genomics Group ID</p></figcaption></figure>

<figure><img src="/files/TqOVoknVqNfyKhPdfB8U" alt=""><figcaption><p>PC1 vs PC2 plotted by ancestry and coloured by Genomics Group ID</p></figcaption></figure>

<figure><img src="/files/GegzM94u7UmfAVpbm1OC" alt=""><figcaption><p>PC2 vs PC3 plotted by ancestry and coloured by Genomics Group ID</p></figcaption></figure>

The mean was estimated across all groupings for the proportion of individuals assigned to each sub-continental region, and plotted with error bars (±1 SD) to further assess variability between groupings. There was very little evidence of variability, with most variability observed with the C\_S\_UK sub-region. This is most likely due to the sub-region having the highest number of ancestry assignments.

<figure><img src="/files/isvwlLZsjcUoZzJ7cHBd" alt=""><figcaption><p>Mean Ancestry Proportions across all groupings (±1 SD)</p></figcaption></figure>

Standardised z-scores were generated for each ancestry sub-region so that variation could be compared across sub-continental regions. The resulting violin plots are closely centred around z = 0, showing that the ancestry assignments are mostly consistent across groupings. Larger sub-regions show slightly wider violins and z-score spreads than the rarer ancestry groups, which are more tightly centred.

<figure><img src="/files/bNQXyzTQMQ2T8bER61ka" alt=""><figcaption><p>Variability Across Groupings by Ancestry (z-score distribution). Wider/taller violins indicate greater between-batch variability; dotted lines at ±2 SD</p></figcaption></figure>
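
The between-grouping checks described above (mean proportion ±1 SD, then standardised z-scores) can be sketched roughly as follows. This is an illustration only and not the pipeline used to produce the figures: the function names, the grouping labels `G1` to `G3`, the `OTHER` region label and all counts are made up, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def region_proportions(counts: dict[str, dict[str, int]]) -> dict[str, dict[str, float]]:
    """For each grouping, convert per-region assignment counts to proportions."""
    props = {}
    for group, by_region in counts.items():
        total = sum(by_region.values())
        props[group] = {region: n / total for region, n in by_region.items()}
    return props


def region_z_scores(props: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """Per region, z-score each grouping's proportion against the mean and
    sample SD of that region's proportions across all groupings."""
    regions = sorted({r for by_region in props.values() for r in by_region})
    z = {}
    for region in regions:
        values = {g: by_region.get(region, 0.0) for g, by_region in props.items()}
        mu = mean(values.values())
        sd = stdev(values.values())  # needs at least two groupings
        z[region] = {g: (v - mu) / sd if sd > 0 else 0.0 for g, v in values.items()}
    return z


# Hypothetical counts of hard-call assignments per grouping.
counts = {
    "G1": {"C_S_UK": 80, "OTHER": 20},
    "G2": {"C_S_UK": 70, "OTHER": 30},
    "G3": {"C_S_UK": 90, "OTHER": 10},
}
z_scores = region_z_scores(region_proportions(counts))
```

A grouping whose z-score for a region falls well outside ±2 would be a candidate for closer inspection as a possible batch effect.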

**What data is included in the current release?**

The P14 release includes genetic ancestry data for 755,000 participants computed across 142 groups. Each participant has been assigned a single hard-call ancestry label in addition to admixture proportions for each of the 25 sub-continental regions.

**How strong is the concordance between genetically inferred ancestry and self-reported ethnicity?**

Genetically inferred ancestry and self-reported ethnicity capture related but non-equivalent information, and their categories do not map one-to-one. Concordance was therefore assessed pairwise: for each genetic ancestry group, we estimated the proportion of individuals whose self-reported ethnicity fell into each of six aggregated major continental populations. These are sub-Saharan Africa (AFR\_SS), East Asian (EAS), South Asian (SAS), East and South Asian (EAS\_SAS), Middle East and West Asian (M\_EAST\_W\_ASIA) and Europe including the UK (EUR).

Genetic ancestry and aggregated self-reported ethnicity were found to be strongly associated (Cramér’s V = 0.85; chi-squared p-value < 2×10<sup>-16</sup>). For most combinations of genetic ancestry and aggregated self-reported ethnicity, concordance is very high for a single aggregated group of ethnicities and low for all others.

<figure><img src="/files/hbIt3hX8bmldqTeSFfqw" alt=""><figcaption><p>Agreement heat-map between genetic ancestry and self-reported ethnicity</p></figcaption></figure>
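
For readers who want to reproduce this kind of association measure on their own cross-tabulation, the sketch below computes Cramér's V using its standard definition, V = sqrt(χ² / (n × (min(r, c) − 1))), where χ² is the Pearson chi-squared statistic, n is the total count, and r and c are the numbers of rows and columns. It applies no bias correction, and whether the figure reported above used a corrected variant is not stated here. The function name and the example tables are made up, and the code assumes every row and column has a non-zero total.

```python
from math import sqrt


def cramers_v(table: list[list[int]]) -> float:
    """Cramér's V for an r x c contingency table of counts (no bias correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (min(len(table), len(table[0])) - 1)))


# Made-up tables: rows are genetic ancestry groups, columns are aggregated
# self-reported ethnicity groups.
perfect_agreement = cramers_v([[10, 0], [0, 10]])
no_association = cramers_v([[5, 5], [5, 5]])
```

V ranges from 0 (no association) to 1 (each row maps to exactly one column), so it summarises the strength of association without implying that the categories are interchangeable.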

**What should I be aware of when working with the ancestry data?**

In the current release, participants with genetic ancestry data also have genetic array data and genetic imputed data. Please note that “ancestry” is defined as genetically inferred ancestry rather than self-reported ethnicity. This data is suitable for research purposes only.

***

### Linked health records data

#### **What information does the linked health records data contain?**

This release contains linked health records from Hospital Episode Statistics (HES), the National Disease Registration Service (NDRS), Office for National Statistics (ONS) Death Registration, and NHS Business Services Authority (NHSBSA).

The linked datasets provide a wide range of information on patient admissions and events in NHS facilities, including clinical, administrative, and geographic information. The HES and NDRS data sets do not contain electronic patient health records or information on medicines and dosages. The Medicines Dispensed in Primary Care data set contains information on medicines dispensed and their dosage. Private prescriptions are not included. For more information on how these data sets are collected and processed, please refer to:

* [HES Data Collection page (external link)](https://digital.nhs.uk/data-and-information/data-tools-and-services/data-services/hospital-episode-statistics#data-collection)
* [NDRS Access Page (external link)](https://digital.nhs.uk/ndrs/data/access-to-data)
* [NHSBSA Data collection page (external link)](https://digital.nhs.uk/data-and-information/data-tools-and-services/data-services/medicines-dispensed-in-primary-care-nhsbsa-data#about-the-data-product)

The linked datasets only include records collected by NHS England (NHSE), meaning these data contain only care records from NHS providers in England.

See the [Linked data set descriptions](https://ourfuturehealth.gitbook.io/our-future-health/data/linked-health-records-data#linked-data-set-descriptions) for names and descriptions of all available datasets.

We used the HES data dictionary v2.04, ECDS data dictionary v1.4, NDRS data dictionary v5.2, and Dispensed Medicines January 2024 dictionary release for validation on variable format and codes.

All linked health records data have been provided by NHS England.

#### What changes have been made as part of this release?

This is the first release to include the historic pre-1995 cancer registry dataset from the National Disease Registration Service, which includes registrable tumours diagnosed between 1 January 1985 and 31 December 1994.

The linked health records data have been de-duplicated. In total, 12,352 `PID`s, and any health records associated with those `PID`s, have been removed from the linked health records data because more than one `PID` was linked to the same NHS number. These `PID`s are still present in the other data products.

We have updated the pseudonymised provider code list to incorporate any new providers. In total, we mapped 11,844 providers. 349 providers (2.9% of all providers in data) were mapped to unknown because they did not appear in either the [NHS Organisation Data Service API](https://digital.nhs.uk/services/organisation-data-service/export-data-files/csv-downloads/other-nhs-organisations) or the [Archived Closed Organisation data set](https://digital.nhs.uk/services/organisation-data-service/export-data-files/csv-downloads/miscellaneous).

#### Date ranges available in linked data

The table below summarises the finalised and provisional dates available for each dataset. Please refer to [the section on provisional data](#hes-provisional-data-may-change-between-releases) for more information on provisional data. More detailed descriptions of the datasets can be found on the [Linked data set descriptions page](/our-future-health/data-types/linked-health-records-data.md#linked-data-set-descriptions).&#x20;

<table><thead><tr><th width="249">Entity</th><th>Dates available (finalised data)</th><th>Dates available (provisional data)</th></tr></thead><tbody><tr><td><strong>Primary care</strong></td><td></td><td></td></tr><tr><td>Dispensed Medicines in Primary Care</td><td>1 April 2018 to 1 June 2025</td><td>No provisional data</td></tr><tr><td><strong>Secondary care</strong></td><td></td><td></td></tr><tr><td>HES Accident &#x26; Emergency</td><td>1 April 2007 to 31 March 2020</td><td>No provisional data</td></tr><tr><td>HES Emergency Care Dataset</td><td>1 April 2020 to 31 March 2025</td><td>1 April 2025 to 31 October 2025</td></tr><tr><td>HES Admitted Patient Care</td><td>1 April 1997 to 31 March 2025</td><td>1 April 2025 to 31 October 2025</td></tr><tr><td>HES Outpatient</td><td>1 April 2003 to 31 March 2025</td><td>1 April 2025 to 31 October 2025</td></tr><tr><td><strong>Cancer</strong></td><td></td><td></td></tr><tr><td>NDRS Cancer Registry Patient Tumour</td><td>1 January 1995 to 31 December 2023</td><td>No provisional data</td></tr><tr><td>NDRS Cancer Registry Cancer Treatment</td><td>1 January 1995 to 12 June 2025</td><td>No provisional data</td></tr><tr><td>NDRS Cancer Registry pre-1995</td><td>1 January 1985 to 31 December 1994</td><td>No provisional data</td></tr><tr><td>NDRS Cancer Pathways</td><td>1 January 2013 to 26 July 2024</td><td>No provisional data</td></tr><tr><td><strong>Death</strong></td><td></td><td></td></tr><tr><td>ONS Death Registration</td><td>1 June 2022 to 31 October 2025</td><td>1 November 2025 to 26 November 2025</td></tr><tr><td><strong>Derived datasets</strong></td><td></td><td></td></tr><tr><td>Linked Participants</td><td>All participants who submitted a questionnaire before 9 April 2025</td><td>No provisional data</td></tr></tbody></table>

#### **Discrepancies in the number of participants in the Linked Participants entity**

Please note that there are 12 participants with a linked health record whose `PID`s are not listed in the Linked Participants entity. We are working with NHSE to resolve this discrepancy.&#x20;

The Linked Participants entity lists all the successfully linked participants who submitted a questionnaire prior to 9 April 2025.

#### **Provisional data may change between releases**

The HES Admitted Patient Care, Outpatient, and Emergency Care data include some provisional records. These are the most recent admissions and appointments that were available for the cohort at the time the data was supplied by NHS England, but the records have not been finalised. Therefore, the data entered in these records could change slightly in future releases. Once a year, the latest full financial year of provisional data is finalised and made available to Our Future Health by NHS England.&#x20;

ONS Death Registration data may also contain incomplete records for more recent entries, which can be updated in future releases. Refer to the table above for date ranges for finalised and provisional data.

In the current release:

* any appointments in the Outpatient data that occurred from 1 April 2025 onwards are likely provisional data and subject to change in future releases
* any hospital episodes in the Admitted Patient Care data that finished on or after 1 April 2025 are likely provisional data and subject to change in future releases
* any appointments that occurred before 1 April 2025 or hospital episodes which finished prior to 1 April 2025 are likely finalised data and are not subject to change
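
As a rough sketch, the rule in the bullets above could be applied to a record date as follows. The function and constant names are hypothetical, and which date field to pass in (appointment date for Outpatient, episode end date for Admitted Patient Care) depends on the dataset, so please check the data dictionary for the actual field names. The cutoff is a parameter because other datasets use different dates, for example 1 November 2025 for ONS Death Registration in the table above.

```python
from datetime import date

# First date treated as likely provisional for HES Outpatient and
# Admitted Patient Care in this release (taken from the bullets above).
PROVISIONAL_FROM = date(2025, 4, 1)


def likely_status(record_date: date, provisional_from: date = PROVISIONAL_FROM) -> str:
    """Return 'provisional' for dates on or after the cutoff, else 'finalised'."""
    return "provisional" if record_date >= provisional_from else "finalised"


# Illustrative dates only.
status_march = likely_status(date(2025, 3, 31))
status_april = likely_status(date(2025, 4, 1))
status_death = likely_status(date(2025, 10, 31), provisional_from=date(2025, 11, 1))
```

Flagging records in this way makes it easier to run a sensitivity analysis that excludes provisional data, since those records may change in a later release.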

