Participant data

Information about the participant data in the Our Future Health resource, including the scope and structure of these data and how the data were generated and processed


What information does the Participant table contain?

Participant data is provided in a single table. It has one row per participant. The participant table includes the following information:

  • sex

  • gender

  • ethnicity

  • month and year of birth

  • consent version

  • month and year of consent

  • month and year of registration

  • blood sample

Sex, gender, and ethnicity are collected in the Our Future Health baseline questionnaire. Some questions have changed across questionnaire versions, as described in Version changes and developments. In these cases, each version of the question is stored as a separate column with a unique field name. Participants will only have data in the column for the version they completed. For more details, see Participant data.

The blood sample variable indicates whether we obtained a blood sample of sufficient quality and volume for extraction of buffy coat, DNA, and plasma.


How did we process the data for each release?

We processed the raw data from all participants who were in the programme on or before the cut-off date for the most recent release. Data for participants who have fully withdrawn from Our Future Health are deleted routinely after they request to withdraw. Any participants who have fully withdrawn from the programme since the last data release will not be included in the current data release.

How do we de-identify the data to minimise risks of identifying participants?

To protect participants’ anonymity, we have removed ("suppressed") all sensitive, directly identifiable information including name, address, postcode, telephone number, and email address. We generated a set of unique participant identifiers (PIDs) and randomly assigned them to participants.

Suppressed values were replaced in the data sets with a unique code, "-999". This means that researchers can differentiate between suppressed values and records without a response (coded as null). Where relevant, the numeric values (code) and textual labels (meaning) for suppressed values can be found in the coding file. To access the data dictionary and coding files, see our Data and cohort page (external link).
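As a minimal sketch of how this distinction might be handled in analysis code, the Python below classifies a raw field value as missing, suppressed, or observed. The ETHNICITY column name, the PID values, and the sample rows are hypothetical, and the suppression code for any given field should be confirmed in the coding file.

```python
# Sketch only: the suppression code follows the text above; confirm it in the coding file.
SUPPRESSED_CODE = -999


def is_suppressed(value):
    """Return True if the value equals the suppression code, stored as text or as a number."""
    try:
        return float(value) == SUPPRESSED_CODE
    except (TypeError, ValueError):
        return False


def classify(value):
    """Return 'missing', 'suppressed', or 'observed' for one raw field value."""
    if value is None or str(value).strip() == "":
        return "missing"      # no response recorded (null)
    if is_suppressed(value):
        return "suppressed"   # value removed for de-identification
    return "observed"


# Hypothetical rows: the PID values and the ETHNICITY column name are illustrative.
rows = [
    {"PID": "P001", "ETHNICITY": "3"},
    {"PID": "P002", "ETHNICITY": "-999"},
    {"PID": "P003", "ETHNICITY": None},
]

status = {row["PID"]: classify(row["ETHNICITY"]) for row in rows}
print(status)
```

Keeping the two cases separate lets an analysis report suppression and non-response as different kinds of unavailable data rather than merging them.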

We have restricted the dates in the Participant table to include only the month and year of birth. For participants identified as being over the age of 95 on the production date, their birth dates were fully suppressed and replaced with a unique code. This measure was implemented to protect their identity, given the small number of individuals in this age group. Additionally, all questionnaire fields related to dates or ages were reviewed to ensure they do not unintentionally reveal a participant’s age when birth details have been suppressed.

What exclusions were applied to the data?

We have excluded from the release a very small number of records that were inadvertently duplicated due to technical issues and data capture processes.

How is the participant data organised in the Trusted Research Environment (TRE)?

The data release includes 14 variables for the participant data. In the TRE, the data is organised into a single entity which we refer to as the "Participant" table. Each entity can be linked to another entity (for example, the questionnaire) using a unique identifier (called Participant Identifier or PID). Other than the PID, variable names are unique within and between all entities.
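The linkage described above can be sketched in plain Python as a left join on PID. This is an illustration under stated assumptions, not the TRE's own tooling: the rows are made up, EXAMPLE_QUESTION is a hypothetical field name, and the sketch assumes at most one questionnaire row per PID.

```python
# Hypothetical extracts: EXAMPLE_QUESTION is an invented field name for illustration.
participants = [
    {"PID": "P001", "BIRTH_YEAR": "1980"},
    {"PID": "P002", "BIRTH_YEAR": "1975"},
]
questionnaire = [
    {"PID": "P001", "EXAMPLE_QUESTION": "yes"},
]


def left_join_on_pid(left_rows, right_rows):
    """Attach the matching right-hand row (if any) to each left-hand row, keyed by PID."""
    right_by_pid = {row["PID"]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_pid.get(row["PID"], {})
        # Names other than PID are unique across entities, so merging does not overwrite fields.
        joined.append({**row, **match})
    return joined


linked = left_join_on_pid(participants, questionnaire)
for row in linked:
    print(row)
```

Participants without a matching questionnaire row are kept with only their Participant table fields.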

How do I interpret the structured field names?

Field names are short, descriptive, and often abbreviated names used to describe the contents of a particular column. For the Participant table, for example, PID refers to a unique participant record identifier, while BIRTH_YEAR refers to the participant’s year of birth in YYYY format and BIRTH_MONTH refers to the month of birth in MM format.
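As an illustration of working with BIRTH_YEAR and BIRTH_MONTH, the sketch below derives an approximate age on a chosen date. It assumes that suppressed birth details carry the code -999 (confirm this in the coding file) and treats the birthday as reached once the birth month begins, because the day of birth is not released. The function name approximate_age is hypothetical.

```python
from datetime import date

# Assumption: suppressed birth fields use -999; confirm in the coding file.
SUPPRESSED_CODE = -999


def approximate_age(birth_year, birth_month, on_date):
    """Return age in whole years on on_date, or None if birth details are unavailable.

    The day of birth is not released, so the result can be one year too high
    when on_date falls within the birth month.
    """
    try:
        year, month = int(birth_year), int(birth_month)
    except (TypeError, ValueError):
        return None  # null or non-numeric value
    if SUPPRESSED_CODE in (year, month):
        return None  # birth details suppressed
    age = on_date.year - year
    if on_date.month < month:
        age -= 1  # birth month not yet reached in this calendar year
    return age


print(approximate_age("1980", "06", date(2024, 5, 1)))
print(approximate_age("-999", "-999", date(2024, 5, 1)))
```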

How do we handle participants who register more than once?

The intention of the Our Future Health programme is that participants register only once, and that each participant has exactly one unique participant identifier (PID) to identify their records in our data releases. We are aware, however, that a very small proportion of participants have registered to join Our Future Health multiple times. This is possible if they enter a unique email address for each registration, because we do not check for uniqueness of other personal information or conduct an identity check.

If a participant registers more than once, then each of their registered accounts will be associated with a different PID. Although most participants who register more than once do not complete multiple questionnaires or attend multiple in-person appointments to give a blood sample, a small proportion have done so. We estimate that approximately 0.5% of participants completed multiple questionnaires, and that approximately 0.1% attended more than one in-person appointment.

Our current policy is to include participants in data releases if they have at least submitted a baseline questionnaire. However, we are not currently filtering out known or suspected multiple registrations from our data releases. A very small proportion of records in our data releases are therefore ‘duplicates’, meaning that they appear to be distinct, but pertain to the same person. Note that this does not mean that the records will be identical, because data will have been collected for each duplicate registration on different occasions and will therefore differ due to measurement and reporting error and other sources of variation.

We expect the practical consequences of duplicate records to be very small for most researchers, but we aim to ensure that the quality of data in our data releases is as high as possible. We are therefore working to ensure that we can accurately identify duplicate records, and in future data releases to either flag them in the data or remove them entirely.

We further describe how this issue affects each dataset in the documentation for our latest Data releases. In brief, we do not believe that duplicate records can reliably be identified from questionnaire or clinic measurements data, but they may be detected from duplicated administrative identifiers (such as EPIKEY in HES Admitted Patient Care) provided with linked health records. They would also appear to be genetically identical, although a proportion of these are likely to be identical twins (higher-order genetically identical participants have had their genetic data excluded).
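One way a researcher might screen for candidate duplicates in linked records is to look for an administrative identifier that appears under more than one PID. The sketch below does this for EPIKEY; the rows, PID values, and EPIKEY values are made up, and the function name pids_sharing_identifier is hypothetical. A shared identifier flags records for review and does not confirm that two PIDs belong to the same person.

```python
from collections import defaultdict

# Hypothetical linked-record rows: all values are invented for illustration.
hes_rows = [
    {"PID": "P001", "EPIKEY": "500000000001"},
    {"PID": "P007", "EPIKEY": "500000000001"},  # same episode recorded under a second PID
    {"PID": "P002", "EPIKEY": "500000000002"},
    {"PID": "P002", "EPIKEY": "500000000003"},
]


def pids_sharing_identifier(rows, key="EPIKEY"):
    """Map each identifier value to its PIDs, keeping only values seen under more than one PID."""
    pids_by_key = defaultdict(set)
    for row in rows:
        pids_by_key[row[key]].add(row["PID"])
    return {k: sorted(pids) for k, pids in pids_by_key.items() if len(pids) > 1}


candidates = pids_sharing_identifier(hes_rows)
print(candidates)
```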

What metadata is available to help document the data release?

We provide the following data files on our Data and cohort page (external link):

  • Data dictionary – which defines the raw data fields and metadata information, such as labels, descriptions and units of measurements

  • Coding file – which contains the granular details of categorical or raw coded values

If using Microsoft Excel to browse these files, for an optimal viewing experience, ensure the encoding settings are set to UTF-8.
