
# Participant data

***

### What information does the Participant table contain?

Participant data is provided in a single table in the TRE. It has one row per participant. The Participant table includes the following information:

* sex
* gender
* ethnicity
* month and year of birth
* consent version
* month and year of consent
* month and year of registration
* whether a blood sample of sufficient quality and volume for extraction of buffy coat, DNA, and plasma has been obtained

Sex, gender, and ethnicity are collected in the Our Future Health baseline questionnaire. Some questions have changed across questionnaire versions, as described in [Questionnaire data](/our-future-health/data-types/questionnaire-data.md#version-changes-and-developments). In these cases, each version of the question is stored as a separate column with a unique field name. Participants will only have data in the column for the question version they completed. For more details, see [How are different versions of the same questions stored?](#how-are-different-versions-of-the-same-questions-stored).
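As a rough illustration of working with version-specific columns, the sketch below picks out the single populated column for a participant. The field names `ETHNICITY_V1` and `ETHNICITY_V2` are hypothetical placeholders, not real field names; the actual names are listed in the data dictionary. It also assumes that unanswered versions are stored as null (`None` in Python).

```python
from typing import Optional

# Hypothetical field names for two versions of the same question.
# Look up the real names in the data dictionary.
ETHNICITY_VERSION_COLUMNS = ["ETHNICITY_V1", "ETHNICITY_V2"]


def completed_version(row: dict, columns: list[str]) -> Optional[tuple[str, object]]:
    """Return (column, value) for the one version column that holds data."""
    populated = [(col, row[col]) for col in columns if row.get(col) is not None]
    if not populated:
        return None  # no response recorded in any version of the question
    if len(populated) > 1:
        # Each participant should only have data for the version they completed.
        raise ValueError(f"expected one populated version, found {len(populated)}")
    return populated[0]
```

Returning the column name alongside the value keeps track of which question version the answer came from, which matters if the response options differ between versions.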

***

### How do we process the data for each release?

We process the raw data from all participants who were in the programme on or before the cut-off date for the most recent release (e.g. 24 March 2026 for [release 14](/our-future-health/data-releases/2026-data-releases/release-14.md)). Data for participants who have fully withdrawn from Our Future Health are deleted routinely after they request to withdraw. Any participants who have fully withdrawn from the programme since the last data release will not be included in the current data release.

#### How do we de-identify the data?

To protect participants’ privacy, we have removed all sensitive, directly identifiable information including name, address, postcode, telephone number, and email address. We generated a set of unique participant identifiers (PIDs) and randomly assigned them to participants.

Suppressed values are replaced with a unique code in the datasets: `-999`. This means that researchers can differentiate between suppressed values and records without a response, which are coded as null. Where relevant, the coding file contains two fields, `code` (numeric value) and `meaning` (textual label), which define suppressed values. To access the data dictionary and coding files, see our [Data and cohort page (external link)](https://research.ourfuturehealth.org.uk/data-and-cohort).
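The distinction between suppressed and missing values could be handled along these lines. This is a minimal sketch: the function name `classify` is illustrative, and treating an empty string as null is an assumption about how a table might look after export to a text format.

```python
SUPPRESSED_CODE = -999  # code used for suppressed values, as described above


def classify(value) -> str:
    """Label a raw cell as 'missing', 'suppressed', or 'observed'."""
    # Assumption: nulls arrive as None, or as "" if the table was exported to text.
    if value is None or value == "":
        return "missing"
    try:
        if int(value) == SUPPRESSED_CODE:
            return "suppressed"
    except (TypeError, ValueError):
        pass  # non-numeric values such as text labels are ordinary observations
    return "observed"
```

Keeping the two cases separate avoids counting deliberately suppressed values as non-response when summarising completeness.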

We have restricted the dates in the Participant table to include only the month and year of birth. For participants identified as being over the age of 95 on the date at which the release is created, their birth month and year are fully suppressed and replaced with a unique code. This measure is implemented to protect their identity, given the small number of individuals in this age group. Additionally, all questionnaire fields related to dates or ages were reviewed and suppressed accordingly to ensure they do not unintentionally reveal a participant’s age when birth details have been suppressed.

#### What exclusions were applied to the data?

We have excluded from the release a very small number of records that were inadvertently duplicated due to technical issues and data capture processes.

#### How is the participant data organised in the Trusted Research Environment (TRE)?

The data release includes 14 variables in the Participant entity, which is organised into a single table in the TRE. Each entity can be linked to another entity (for example, the Questionnaire table) using a unique identifier called the Participant Identifier (`PID`). Other than `PID`, variable names are unique within and between all entities.
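Linking on `PID` can be sketched in plain Python as follows. The function name `link_on_pid` and the representation of each table as a list of dictionaries are assumptions made for illustration; in the TRE you would more likely use the platform's own query tools.

```python
def link_on_pid(participants: list[dict], questionnaire: list[dict]) -> list[dict]:
    """Inner-join two lists of rows on their shared PID field."""
    # The Participant table has one row per participant, so PID is a safe key.
    by_pid = {row["PID"]: row for row in participants}
    linked = []
    for q_row in questionnaire:
        p_row = by_pid.get(q_row["PID"])
        if p_row is not None:
            # Names other than PID are unique across entities,
            # so merging the two rows cannot overwrite a field.
            linked.append({**p_row, **q_row})
    return linked
```

Because variable names other than `PID` do not repeat between entities, the merged rows need no prefixes or suffixes to disambiguate columns.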

#### How do I interpret the structured field names?

Field names are short, descriptive, and often abbreviated names used to describe the contents of a particular column. In the Participant table, for example, `PID` refers to a unique participant record identifier, while `BIRTH_YEAR` refers to the participant’s year of birth in YYYY format and `BIRTH_MONTH` refers to the month of birth in MM format.
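For example, `BIRTH_YEAR` and `BIRTH_MONTH` could be combined into an approximate age. The sketch below is illustrative only: the function name `approximate_age` is hypothetical, the reference date is arbitrary, and it assumes suppressed birth details carry the `-999` code described earlier.

```python
from datetime import date
from typing import Optional

SUPPRESSED_CODE = -999  # assumed code for suppressed birth details


def approximate_age(birth_year, birth_month, on: date) -> Optional[int]:
    """Age in whole years on `on`, or None if birth details are null or suppressed."""
    if birth_year is None or birth_month is None:
        return None
    year, month = int(birth_year), int(birth_month)  # accepts "1980" and "06"
    if SUPPRESSED_CODE in (year, month):
        return None
    # Day of birth is not released, so a birthday falling in the
    # reference month is treated as already reached.
    return on.year - year - (1 if on.month < month else 0)
```

Since only the month and year are available, the result can differ from the true age by up to one year for participants born in the reference month.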

### How do we handle participants who register more than once?

The intention of the Our Future Health programme is that participants register only once, and that each participant has exactly one unique participant identifier (PID) to identify their records in our data releases. We are aware, however, that a very small proportion of participants have registered to join Our Future Health multiple times. This is possible if they enter a unique email address for each registration, because we do not check for uniqueness of other personal information or conduct an identity check.

If a participant registers more than once, then each of their registered accounts will be associated with a different PID. Although most participants who register more than once do not complete multiple questionnaires, nor attend multiple in-person appointments to give a blood sample, there is a small proportion who have done so. We estimate that approximately 0.5% of participants completed multiple questionnaires, and that approximately 0.1% attended more than one in-person appointment.

Our current policy is to include participants in data releases if they have at least submitted a baseline questionnaire. However, we are not currently filtering out known or suspected multiple registrations from our data releases. A very small proportion of records in our data releases are therefore ‘duplicates’, meaning that they appear to be distinct, but pertain to the same person. Note that this does not mean that the records will be identical, because data will have been collected for each duplicate registration on different occasions and will therefore differ due to measurement and reporting error and other sources of variation.

We expect the practical consequences of duplicate records to be very small for most researchers, but we aim to ensure that the quality of data in our data releases is as high as possible. We are therefore working to ensure that we can accurately identify duplicate records, and in future data releases either flag them in the data or remove them entirely.

We further describe how this issue affects each dataset in the documentation for the [current data release](/our-future-health/data-releases/current-data-release.md). In brief, we do not believe that duplicate records can reliably be identified from questionnaire or clinic measurements data. Duplicates in Linked health records data for active participants have been [removed](/our-future-health/data-types/linked-health-records-data/linked-data-content-and-processing.md#method-for-de-duplicating-the-linked-data-cohort). These records can be detected from duplicated administrative identifiers (such as `EPIKEY` in HES Admitted Patient Care) provided with linked health records. They would also appear to be genetically identical, although a proportion of genetically identical records are likely to come from identical twins (higher-order genetically identical participants have had their genetic data excluded).

### What metadata is available to help document the data release?

We provide the following data files on the [current data release](/our-future-health/data-releases/current-data-release.md) page and on our [Data and cohort page (external link)](https://research.ourfuturehealth.org.uk/data-and-cohort):

* Data dictionary – which defines the raw data fields and metadata information, such as labels, descriptions and units of measurement
* Coding file – which contains the granular details of categorical or raw coded values

If you browse these files in Microsoft Excel, set the file encoding to UTF-8 so that all characters display correctly.

