Participant geographies data

Information on the national and regional geographic data associated with participants who have joined the Our Future Health programme, including the scope, structure and processing of these data.

The Participant geographies data provides geographic information derived from participants’ self-reported address at the time of registration to the Our Future Health programme. Currently, Participant geographies data consists of a single dataset containing both country and region data. This page outlines the structure, scope, and methodology used to generate and process the data for release.


What information is included in the participant geographies data?

The participant geographies data currently consists of a single dataset called the Country and Region table. This table includes the country and region linked to each participant’s self-reported address collected during their registration for the Our Future Health programme.

Country and region

The United Kingdom (UK) is made up of four countries:

  • England

  • Scotland

  • Wales

  • Northern Ireland

England itself is divided into nine official regions:

  • North East

  • North West

  • Yorkshire and the Humber

  • East Midlands

  • West Midlands

  • East of England

  • London

  • South East

  • South West

These nine regions, together with Scotland, Wales and Northern Ireland, form the political and geographical makeup of the UK.

Why country and region matters for health research

Understanding the UK’s countries and regions is important in health research because it helps identify demographic differences that affect health needs, outcomes, and service provision.

Future releases

Additional geographic levels will be introduced in future releases. Each level will be released as a distinct dataset.

Country and region data processing and release

We are releasing the geographic data in stages. The initial releases include a subset of participants who joined the programme earliest, specifically those who registered between 2021 and 2022. Future releases will expand to include all participants who:

  1. Registered before the cut-off date for that release, and

  2. Have fully completed and submitted their questionnaire.

Data for participants who have fully withdrawn from Our Future Health are not included, because those data are routinely deleted after a participant requests to withdraw. Participants who have fully withdrawn from the programme since the last data release are likewise excluded from the current data release.
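The inclusion rules above can be read as a simple filter. The following is a minimal sketch only: the record fields (registered_on, questionnaire_submitted, fully_withdrawn), the example participants and the cut-off date are all hypothetical and are not taken from the Our Future Health systems.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Participant:
    # All field names are hypothetical, chosen for illustration only.
    pid: str
    registered_on: date
    questionnaire_submitted: bool
    fully_withdrawn: bool = False


def eligible_for_release(p: Participant, cutoff: date) -> bool:
    """Return True if the participant meets the inclusion rules described above."""
    return (
        p.registered_on < cutoff        # registered before the release cut-off date
        and p.questionnaire_submitted   # questionnaire fully completed and submitted
        and not p.fully_withdrawn       # fully withdrawn participants are excluded
    )


# Made-up participants and cut-off date.
participants = [
    Participant("A1B2C3D", date(2021, 11, 3), True),
    Participant("M4N5K6L", date(2022, 6, 14), False),
    Participant("R7T9LQ2", date(2022, 2, 1), True, fully_withdrawn=True),
    Participant("Z8Y7X6W", date(2023, 1, 9), True),
]
cutoff = date(2023, 1, 1)

released = [p.pid for p in participants if eligible_for_release(p, cutoff)]
print(released)
```

In this made-up data only the first participant passes all three checks. The others fail on questionnaire completion, withdrawal status and registration date respectively.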

Only registration address is used. No subsequent address changes or participant relocations are reflected.

How do we map participant addresses to geographic areas?

During registration, participants provide their postcode and address. Our registration system integrates the Ideal Postcodes lookup service, which returns additional location data associated with the address, including geocoded latitude and longitude coordinates. These coordinates are used to generate point geometries that represent each participant's location.

We use these coordinates to assign each participant to standard UK geographic boundaries, such as country and region. This is accomplished using a point-in-polygon spatial mapping technique, whereby each participant’s point geometry is overlaid onto official boundary shapefiles that divide the UK into discrete polygonal areas representing defined geographic units. For details of the exact shapefiles, see the section Geographic boundary file sources and versioning below.

Each point is evaluated for spatial containment within a given polygon. Once a match is identified, the spatial information is transformed into a structured tabular format. In this output, each participant is linked to the relevant geographic unit, including both the unit’s official code (e.g. country code as E92000001) and its corresponding label (e.g. country label as “England”).
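To make the containment test concrete, here is a minimal sketch of point-in-polygon assignment using the standard ray-casting method in plain Python. It is not the programme's actual implementation, which is not described here and would more plausibly rely on a GIS library and the official shapefiles. The two rectangles below are crude stand-ins and are NOT real boundaries. The function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]   # (longitude, latitude)
Polygon = Sequence[Point]     # ring of vertices; the closing edge is implied


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray casting: count how many edges a ray running east from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Toy rectangles standing in for boundary polygons (assumption: NOT real boundaries).
BOUNDARIES = [
    {"code": "E92000001", "label": "England",
     "polygon": [(-3.0, 50.0), (1.8, 50.0), (1.8, 55.8), (-3.0, 55.8)]},
    {"code": "W92000004", "label": "Wales",
     "polygon": [(-5.3, 51.4), (-3.0, 51.4), (-3.0, 53.4), (-5.3, 53.4)]},
]


def assign_unit(point: Point) -> Optional[Tuple[str, str]]:
    """Return (code, label) of the first polygon containing the point, else None."""
    for unit in BOUNDARIES:
        if point_in_polygon(point, unit["polygon"]):
            return unit["code"], unit["label"]
    return None


print(assign_unit((-0.1276, 51.5072)))  # a point in central London
print(assign_unit((-3.18, 51.48)))      # a point in Cardiff
print(assign_unit((2.5, 53.0)))         # a point outside both toy rectangles
```

A point that falls in no polygon returns None, which corresponds to a participant who cannot be assigned to a geographic unit.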

Geographic boundary file sources and versioning

Country-level boundaries

Region-level boundaries

We chose International Territorial Levels (ITLs) because they provide a harmonised structure for regional geography across the UK, explicitly recognising the devolved nations as distinct regions. This framework is increasingly used in health, policy, and economic research and supports greater comparability across studies by aligning with national and international geographic standards. The Office for National Statistics (ONS), the UK’s official statistical agency, adopts ITLs as the standard for regional data collection and reporting, ensuring consistency across official statistics.

The shapefiles used are fully clipped, meaning that they include only the precise geographic extents of official administrative areas, excluding any extraneous or overlapping spatial features. This ensures accurate and unambiguous assignment of participants to geographic units, particularly for those located near administrative boundaries.

All geographic assignments are based on fixed versions of official boundary shapefiles. These files are used consistently across all data releases to ensure temporal stability and prevent inconsistencies that might arise from administrative boundary changes over time. Even when newer versions of boundary datasets become available, we do not adopt dynamic updates.

This decision is made to preserve longitudinal comparability and to reduce the risk of participant re-identification through geographic triangulation or shifts in classification over successive data releases.

Licensing

Geographic boundary data are © Crown copyright and database rights 2024. They contain public sector information licensed under the Open Government Licence v3.0.

Why use latitude and longitude coordinates?

Our Future Health uses latitude and longitude coordinates derived from individual participant residential addresses to map to administrative geographies.

The main alternative approach would be to use the locations of postcode centroids. Postcode centroids represent postcode areas using a single, central point and are more commonly used for aggregated spatial analysis. While they offer simplicity, postcode centroids are inherently less precise and are subject to change due to updates in postal geography, such as the creation of new postcodes, boundary shifts, or service reorganisation by postal authorities. These changes can introduce inconsistencies across time and reduce the reliability of geographic classification for individual records.

In contrast, latitude and longitude coordinates of individual addresses offer superior accuracy for participant-level spatial assignment, particularly in cases where participants are located near the edges of administrative boundaries. This precision allows us to assign participants to defined geographic units through point-in-polygon mapping techniques. This approach enhances spatial accuracy and consistency, supporting the scientific objectives of the Our Future Health programme.
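The boundary-edge effect can be illustrated with a deliberately simplified, one-dimensional toy example. Everything in it is an assumption made for illustration: the two regions, the boundary at x = 0, the address positions, and the approximation of the postcode centroid as the mean of its address coordinates.

```python
from statistics import mean


def region_for(x: float) -> str:
    """Toy boundary: a line at x = 0 separates two made-up regions."""
    return "Region A" if x < 0 else "Region B"


# Hypothetical addresses that share one postcode straddling the boundary.
addresses = {"addr_1": -0.4, "addr_2": -0.2, "addr_3": 0.3}

# Assumption: the postcode centroid is approximated by the mean address position.
centroid_x = mean(addresses.values())

# Centroid-based assignment gives every address in the postcode the same region.
by_centroid = {name: region_for(centroid_x) for name in addresses}

# Address-level assignment uses each address's own coordinate.
by_address = {name: region_for(x) for name, x in addresses.items()}

mismatched = [name for name in addresses if by_centroid[name] != by_address[name]]
print(mismatched)
```

Here the centroid falls on the Region A side, so the one address on the Region B side would be misclassified under the centroid approach but is assigned correctly from its own coordinates.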

Limitations and caveats

A small proportion of participants could not be assigned to a geographic area due to incomplete, invalid, or unresolvable address data.

Additionally, all assignments are based solely on the residential address provided at the time of registration. Subsequent changes of address are not reflected in these data.

The quality and precision of geographic assignment are contingent upon the accuracy of the self-reported registration address and the reliability of the Ideal Postcodes service.

Data access and de-identification

Data access

Participant geographies datasets are stored and maintained independently from other participant datasets within the Trusted Research Environment (TRE).

Access to any participant geographies dataset is restricted and requires a dedicated request. Requests are reviewed by an expert panel to ensure that the geographic information is necessary and appropriate for the intended research purpose.

How do we de-identify the data to minimise risks of identifying participants?

Participant geography datasets exclude all participant-level address and postcode data and contain only non-identifiable geographic classifications derived from coordinate-based mapping.

How is the data organised in the Trusted Research Environment (TRE)?

In the TRE, all participant geographies datasets will be maintained as separate entities.

For the current release, the Country and Region table is organised as a single entity, containing one row per participant with three variables: Participant ID (PID), country at registration, and region at registration.

Each entity can be linked to other entities (for example, the questionnaire dataset) using the PID, which is a unique participant identifier. Aside from the PID, variable names are unique both within and across all entities.
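As a rough sketch of what linking on the PID looks like, the example below joins made-up Country and Region rows to made-up questionnaire rows using only the Python standard library. The questionnaire variable EXAMPLE_ANSWER and all row values are hypothetical. How entities are actually loaded inside the TRE is not covered here.

```python
# Made-up rows from the Country and Region entity.
geographies = [
    {"PID": "A1B2C3D", "COUNTRY_AT_REG": "E92000001", "REGION_AT_REG": "TLF"},
    {"PID": "M4N5K6L", "COUNTRY_AT_REG": "E92000001", "REGION_AT_REG": "TLI"},
    {"PID": "R7T9LQ2", "COUNTRY_AT_REG": "W92000004", "REGION_AT_REG": "TLL"},
]

# Made-up rows from another entity; EXAMPLE_ANSWER is a hypothetical variable.
questionnaire = [
    {"PID": "A1B2C3D", "EXAMPLE_ANSWER": "yes"},
    {"PID": "R7T9LQ2", "EXAMPLE_ANSWER": "no"},
    {"PID": "Q0Q0Q0Q", "EXAMPLE_ANSWER": "yes"},  # no geography row for this PID
]

# Index one entity by PID, then look up each row of the other entity.
geo_by_pid = {row["PID"]: row for row in geographies}

# Inner join: keep only participants present in both entities.
linked = [
    {**q_row, **geo_by_pid[q_row["PID"]]}
    for q_row in questionnaire
    if q_row["PID"] in geo_by_pid
]

for row in linked:
    print(row)
```

Because variable names other than the PID are unique across entities, merging two rows for the same participant does not overwrite any values. An inner join is used here, so questionnaire rows without a matching geography row are dropped. A left join would keep them with missing geography values.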

For the geographies data, the release datasets always store participant information using the relevant codes. The codings file includes these codes along with their full textual labels (referred to as "meanings" in the codings file), which correspond to the codes and labels used in the shapefiles from which the data are derived, as described in the section Geographic boundary file sources and versioning.

Below is an example of the release candidate:

PID        COUNTRY_AT_REG    REGION_AT_REG
A1B2C3D    E92000001         TLF
M4N5K6L    E92000001         TLI
R7T9LQ2    W92000004         TLL
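The coded values in the example rows can be translated into labels with a lookup built from the codings file. The sketch below assumes a CSV layout with "code" and "meaning" columns, which is a guess at the file structure rather than its documented format. The meanings shown are illustrative and the released codings file remains the authoritative source.

```python
import csv
import io

# Hypothetical extract of a codings file; the column names are assumptions.
CODINGS_CSV = """code,meaning
E92000001,England
W92000004,Wales
TLF,East Midlands (England)
TLI,London
TLL,Wales
"""

# Build a code -> meaning lookup from the (in-memory) codings file.
meanings = {
    row["code"]: row["meaning"]
    for row in csv.DictReader(io.StringIO(CODINGS_CSV))
}

# The example rows from the Country and Region table.
rows = [
    {"PID": "A1B2C3D", "COUNTRY_AT_REG": "E92000001", "REGION_AT_REG": "TLF"},
    {"PID": "M4N5K6L", "COUNTRY_AT_REG": "E92000001", "REGION_AT_REG": "TLI"},
    {"PID": "R7T9LQ2", "COUNTRY_AT_REG": "W92000004", "REGION_AT_REG": "TLL"},
]

# Replace each code with its meaning, keeping the PID unchanged.
decoded = [
    {
        "PID": row["PID"],
        "country": meanings[row["COUNTRY_AT_REG"]],
        "region": meanings[row["REGION_AT_REG"]],
    }
    for row in rows
]

for row in decoded:
    print(row)
```

When reading a real file from disk instead of the in-memory string, opening it with open(path, encoding="utf-8", newline="") keeps the labels intact.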

How do I interpret the structured field names?

Field names are short, descriptive, and often abbreviated labels used to indicate the contents of each column in the dataset.

These data are not versioned and contain only one value per column. The field names reflect the geographic level followed by the context or time point from which they were obtained, for example:

  • country_at_reg - the derived country at the time of registration.

  • region_at_reg - the derived region at the time of registration.

What metadata is available to help document the participant geographies releases?

We provide the following data files on our Data and cohort page (external link):

  • data dictionary - which defines the raw data fields and metadata information, such as labels, descriptions and units of measurement

  • coding file - which contains the granular details of categorical or raw coded values

If using Microsoft Excel to browse these files, for an optimal viewing experience, ensure the encoding settings are set to UTF-8.
