Analyzing FHIR Data in a Tabular Format With Python

Learning objectives
  1. Understand the high-level approaches for converting FHIR-formatted data into a tabular format for analysis in Python.
  2. Learn how the FHIR-PYrate library facilitates requesting data from a FHIR server and creating tidy data tables.
Relevant roles:
  • Informaticist

Data analysis approaches in Python often use Pandas DataFrames to store tabular data. There are two primary approaches to loading FHIR-formatted data into Pandas DataFrames:

  1. Writing Python code to manually convert FHIR instances in JSON format into DataFrames.

    This does not require any special skills beyond data manipulation in Python, but in practice it can be laborious (especially with a large number of data elements) and prone to bugs.

  2. Using a purpose-built library like FHIR-PYrate to automatically convert FHIR instances into DataFrames.

    It is recommended to try this approach first, and only fall back to (1) if needed.
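
For comparison, approach (1) might look like the following minimal sketch. The two Patient instances are hand-written, heavily trimmed illustrations (not data from the test server), and `patient_to_row` is a hypothetical helper name. Every element has to be extracted explicitly, with a guard for each value that may be missing, which is where the effort and the bugs tend to come from.

```python
import pandas as pd

# Hand-written, trimmed Patient instances (illustrative only)
patients = [
    {
        "resourceType": "Patient",
        "id": "example-1",
        "gender": "female",
        "birthDate": "1971-01-13",
        "maritalStatus": {"coding": [{"code": "S"}]},
    },
    {
        "resourceType": "Patient",
        "id": "example-2",
        "gender": "male",
        "birthDate": "1958-04-05",
        # maritalStatus is deliberately absent from this instance
    },
]


def patient_to_row(patient):
    """Map one Patient instance (a dict parsed from JSON) onto a flat row."""
    codings = patient.get("maritalStatus", {}).get("coding", [])
    return {
        "id": patient.get("id"),
        "gender": patient.get("gender"),
        "date_of_birth": patient.get("birthDate"),
        "marital_status": codings[0].get("code") if codings else None,
    }


manual_df = pd.DataFrame([patient_to_row(p) for p in patients])
print(manual_df)
```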

To use FHIR-PYrate, you will need a Python 3 runtime with FHIR-PYrate and Pandas installed.
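
One way to confirm that the environment is ready is to check that both packages can be imported. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper name. If a package is reported as missing, it can usually be installed with pip (the distribution name is assumed here to be `fhir-pyrate`; check the project README to confirm).

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the module names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(["fhir_pyrate", "pandas"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("FHIR-PYrate and Pandas are both available.")
```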

1 FHIR testing server

The examples in this module use a FHIR testing server populated with Synthea data in FHIR R4 format: the public HAPI Test Server operated by HAPI FHIR.

The endpoint for this testing server is:

https://hapi.fhir.org/baseR4

However, any FHIR server loaded with testing data can be used. See Standing up a FHIR Testing Server for instructions to set up your own test server.

Each code block in the following sections is immediately followed by its sample output, much like a code cell and its result in a Jupyter notebook.

2 Retrieving FHIR data

Once your environment is set up, you can run the following Python code to retrieve instances of the Patient resource from a test server:

# Load dependencies
from fhir_pyrate import Pirate
import pandas as pd

# Instantiate a Pirate object using the FHIR-PYrate library to query a test FHIR server
search = Pirate(
    auth=None,
    base_url="https://hapi.fhir.org/baseR4",
    print_request_url=True,
)

# Use the whimsically named `steal_bundles()` method to instantiate a search interaction
#
# For more information, see https://github.com/UMEssen/FHIR-PYrate/#pirate
bundles = search.steal_bundles(
    resource_type="Patient",
    request_params={
        "_count": 10,  # Get 10 instances per page
        "identifier": "https://github.com/synthetichealth/synthea|",
    },
    num_pages=1,  # Get 1 page (so a total of 10 instances)
)

# Execute the search and convert to a Pandas DataFrame
df = search.bundles_to_dataframe(bundles)

df.head(5)
https://hapi.fhir.org/baseR4/Patient?_count=10&identifier=https://github.com/synthetichealth/synthea|
Query (Patient): 100%|██████████| 1/1 [00:00<00:00, 898.52it/s]
resourceType id meta_versionId meta_lastUpdated meta_source text_status text_div extension_0_url extension_0_extension_0_url extension_0_extension_0_valueCoding_system ... maritalStatus_coding_0_code maritalStatus_coding_0_display maritalStatus_text multipleBirthBoolean communication_0_language_coding_0_system communication_0_language_coding_0_code communication_0_language_coding_0_display communication_0_language_text address_0_postalCode communication_0_preferred
0 Patient 258974 4 2023-09-28T18:03:34.414+00:00 #0oBHuipVVwzUjQdl generated <div xmlns="http://www.w3.org/1999/xhtml">Gene... http://hl7.org/fhir/us/core/StructureDefinitio... ombCategory urn:oid:2.16.840.1.113883.6.238 ... S Single S False urn:ietf:bcp:47 en-US English English NaN NaN
1 Patient 298666 5 2023-09-28T22:01:11.961+00:00 #CLEMnh2cjZt823TI generated <div xmlns="http://www.w3.org/1999/xhtml">Gene... http://hl7.org/fhir/us/core/StructureDefinitio... ombCategory urn:oid:2.16.840.1.113883.6.238 ... S Married S False urn:ietf:bcp:47 en-US English English 78945 True
2 Patient 597991 1 2020-02-06T13:54:01.532+00:00 #EF8l4i0AB5VVT4EV generated <div xmlns="http://www.w3.org/1999/xhtml">Gene... http://hl7.org/fhir/us/core/StructureDefinitio... ombCategory urn:oid:2.16.840.1.113883.6.238 ... S S S False urn:ietf:bcp:47 en-US English English 02125 NaN
3 Patient 599723 1 2020-02-07T07:49:49.310+00:00 #kaVgc0UOS8gvNzmQ generated <div xmlns="http://www.w3.org/1999/xhtml">Gene... http://hl7.org/fhir/us/core/StructureDefinitio... ombCategory urn:oid:2.16.840.1.113883.6.238 ... S Never Married Never Married False urn:ietf:bcp:47 en-US English English 01730 NaN
4 Patient 599918 1 2020-02-07T10:50:01.703+00:00 #Zto9Eq5TphGgz8ZD generated <div xmlns="http://www.w3.org/1999/xhtml">Gene... http://hl7.org/fhir/us/core/StructureDefinitio... ombCategory urn:oid:2.16.840.1.113883.6.238 ... S Never Married Never Married False urn:ietf:bcp:47 en-US English English 01001 NaN

5 rows × 90 columns
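
The URL printed at the top of the output shows how the search is expressed on the wire: the resource type is appended to the base URL, and the request parameters become the query string. The sketch below rebuilds that URL with the standard library to make the mapping explicit. `build_search_url` is a hypothetical helper, not part of FHIR-PYrate, and the choice of characters left unescaped is an assumption made to match the printed URL.

```python
from urllib.parse import urlencode


def build_search_url(base_url, resource_type, params):
    """Build a FHIR search URL of the form [base]/[type]?[parameters]."""
    # Leave ":", "/" and "|" unescaped so the result matches the printed URL
    query = urlencode(params, safe=":/|")
    return f"{base_url.rstrip('/')}/{resource_type}?{query}"


url = build_search_url(
    "https://hapi.fhir.org/baseR4",
    "Patient",
    {"_count": 10, "identifier": "https://github.com/synthetichealth/synthea|"},
)
print(url)
```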

It is easier to see the contents of this DataFrame by printing out its first row vertically:

# Print the first row of the DataFrame vertically for easier reading.
pd.set_option("display.max_rows", 100)  # Raise the display limit so all 90 rows are shown
df.head(1).T
0
resourceType Patient
id 258974
meta_versionId 4
meta_lastUpdated 2023-09-28T18:03:34.414+00:00
meta_source #0oBHuipVVwzUjQdl
text_status generated
text_div <div xmlns="http://www.w3.org/1999/xhtml">Gene...
extension_0_url http://hl7.org/fhir/us/core/StructureDefinitio...
extension_0_extension_0_url ombCategory
extension_0_extension_0_valueCoding_system urn:oid:2.16.840.1.113883.6.238
extension_0_extension_0_valueCoding_code 2106-3
extension_0_extension_0_valueCoding_display White
extension_0_extension_1_url text
extension_0_extension_1_valueString White
extension_1_url http://hl7.org/fhir/us/core/StructureDefinitio...
extension_1_extension_0_url ombCategory
extension_1_extension_0_valueCoding_system urn:oid:2.16.840.1.113883.6.238
extension_1_extension_0_valueCoding_code 2186-5
extension_1_extension_0_valueCoding_display Not Hispanic or Latino
extension_1_extension_1_url text
extension_1_extension_1_valueString Not Hispanic or Latino
extension_2_url http://hl7.org/fhir/StructureDefinition/patien...
extension_2_valueString Ying817 Eichmann909
extension_3_url http://hl7.org/fhir/us/core/StructureDefinitio...
extension_3_valueCode F
extension_4_url http://hl7.org/fhir/StructureDefinition/patien...
extension_4_valueAddress_city Worcester
extension_4_valueAddress_state Massachusetts
extension_4_valueAddress_country US
extension_5_url http://synthetichealth.github.io/synthea/disab...
extension_5_valueDecimal 7.222524
extension_6_url http://synthetichealth.github.io/synthea/quali...
extension_6_valueDecimal 39.777476
identifier_0_system https://github.com/synthetichealth/synthea
identifier_0_value bf23e283-4791-46e1-9d79-9e0ad9edd436
identifier_1_type_coding_0_system http://terminology.hl7.org/CodeSystem/v2-0203
identifier_1_type_coding_0_code MR
identifier_1_type_coding_0_display Medical Record Number
identifier_1_type_text Medical Record Number
identifier_1_system http://hospital.smarthealthit.org
identifier_1_value bf23e283-4791-46e1-9d79-9e0ad9edd436
identifier_2_type_coding_0_system http://terminology.hl7.org/CodeSystem/v2-0203
identifier_2_type_coding_0_code SS
identifier_2_type_coding_0_display Social Security Number
identifier_2_type_text Social Security Number
identifier_2_system http://hl7.org/fhir/sid/us-ssn
identifier_2_value 999-21-6325
identifier_3_type_coding_0_system http://terminology.hl7.org/CodeSystem/v2-0203
identifier_3_type_coding_0_code DL
identifier_3_type_coding_0_display Driver's License
identifier_3_type_text Driver's License
identifier_3_system urn:oid:2.16.840.1.113883.4.3.25
identifier_3_value S99948444
identifier_4_type_coding_0_system http://terminology.hl7.org/CodeSystem/v2-0203
identifier_4_type_coding_0_code PPN
identifier_4_type_coding_0_display Passport Number
identifier_4_type_text Passport Number
identifier_4_system http://standardhealthrecord.org/fhir/Structure...
identifier_4_value X30821805X
active True
name_0_use official
name_0_family Keebler
name_0_given_0 Kina
name_0_prefix_0 Ms.
telecom_0_system phone
telecom_0_value 555-939-7778
telecom_0_use home
gender female
birthDate 1971-01-13
deceasedBoolean False
address_0_extension_0_url http://hl7.org/fhir/StructureDefinition/geoloc...
address_0_extension_0_extension_0_url latitude
address_0_extension_0_extension_0_valueDecimal 42.5917
address_0_extension_0_extension_1_url longitude
address_0_extension_0_extension_1_valueDecimal -70.641346
address_0_line_0 1038 Harvey Green
address_0_city Gloucester
address_0_state Massachusetts
address_0_country US
maritalStatus_coding_0_system http://terminology.hl7.org/CodeSystem/v3-Marit...
maritalStatus_coding_0_code S
maritalStatus_coding_0_display Single
maritalStatus_text S
multipleBirthBoolean False
communication_0_language_coding_0_system urn:ietf:bcp:47
communication_0_language_coding_0_code en-US
communication_0_language_coding_0_display English
communication_0_language_text English
address_0_postalCode NaN
communication_0_preferred NaN

In the output above, you can see that FHIR-PYrate collapsed the hierarchical FHIR data structure into DataFrame columns. It does this by taking an element from the FHIR-formatted data, like Patient.identifier[0].value, and converting it to an underscore-delimited column name like identifier_0_value. (Note that Patient.identifier has multiple values in the FHIR data, so there are multiple identifier_N_... columns in the DataFrame.)
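
To make the naming scheme concrete, the sketch below flattens a nested dictionary into underscore-delimited keys in the same style. It is an illustration of the idea only, not FHIR-PYrate's actual implementation; `flatten` is a hypothetical helper and the Patient fragment is hand-written.

```python
def flatten(value, prefix=""):
    """Recursively flatten nested dicts and lists into {column_name: scalar}."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # list positions become the N in name_N_...
    else:
        return {prefix: value}

    flat = {}
    for key, child in items:
        name = f"{prefix}_{key}" if prefix else str(key)
        flat.update(flatten(child, name))
    return flat


# Hand-written, trimmed Patient fragment (illustrative only)
patient = {
    "id": "example-1",
    "identifier": [
        {"system": "https://github.com/synthetichealth/synthea", "value": "abc"},
        {"type": {"coding": [{"code": "MR"}]}, "value": "abc"},
    ],
    "name": [{"family": "Keebler", "given": ["Kina"]}],
}

flat_patient = flatten(patient)
for column, cell in flat_patient.items():
    print(column, cell)
```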

3 Selecting specific columns

An analysis rarely needs every value in a FHIR instance. There are two ways to get a more concise DataFrame:

  1. Use the approach above to load all elements into a DataFrame, remove the unneeded columns, and rename the remaining columns as needed. The process_function capability in FHIR-PYrate allows you to integrate this approach into the bundles_to_dataframe() method call.
  2. Use FHIRPath to select specific elements and map them onto column names.
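
As a rough sketch of the first approach, the column selection and renaming can be done with plain Pandas. The small DataFrame below is hand-built to stand in for the wide, flattened output shown earlier, and `column_map` is a hypothetical name. How this would be wired into process_function is not shown here.

```python
import pandas as pd

# Stand-in for the wide DataFrame produced by bundles_to_dataframe() (illustrative data)
wide_df = pd.DataFrame(
    {
        "resourceType": ["Patient", "Patient"],
        "identifier_0_value": ["example-1", "example-2"],
        "gender": ["female", "male"],
        "birthDate": ["1971-01-13", "1958-04-05"],
        "maritalStatus_coding_0_code": ["S", "M"],
        "text_status": ["generated", "generated"],
    }
)

# Flattened column name -> analysis-friendly column name
column_map = {
    "identifier_0_value": "id",
    "gender": "gender",
    "birthDate": "date_of_birth",
    "maritalStatus_coding_0_code": "marital_status",
}

# Keep only the mapped columns, then rename them
concise_df = wide_df[list(column_map)].rename(columns=column_map)
print(concise_df)
```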

The second approach is typically more concise. For example, to generate a DataFrame like this…

id gender date_of_birth marital_status

…you could use the following code:

# Instantiate and perform the FHIR search interaction in a single function call
df = search.steal_bundles_to_dataframe(
    resource_type="Patient",
    request_params={
        "_count": 10,  # Get 10 instances per page
        "identifier": "https://github.com/synthetichealth/synthea|",
    },
    num_pages=1,  # Get 1 page (so a total of 10 instances)
    fhir_paths=[
        ("id", "identifier[0].value"),
        ("gender", "gender"),
        ("date_of_birth", "birthDate"),
        ("marital_status", "maritalStatus.coding[0].code"),
    ],
)
df
https://hapi.fhir.org/baseR4/Patient?_count=10&identifier=https://github.com/synthetichealth/synthea|
Query & Build DF (Patient): 100%|██████████| 1/1 [00:00<00:00, 272.23it/s]
id gender date_of_birth marital_status
0 bf23e283-4791-46e1-9d79-9e0ad9edd436 female 1971-01-13 S
1 bf23e283-4791-46e1-9d79-9e0ad9edd436 female 1971-01-14 S
2 7a2886c6-67cd-40bb-87a9-13423e051102 male 1958-04-05 S
3 b9a32653-9fde-401f-bb32-9932e680c456 female 2019-09-06 S
4 0eb8992d-42cb-4828-b743-77246ce98f97 female 2012-06-20 S
5 0eb8992d-42cb-4828-b743-77246ce98f97 female 2012-06-20 S
6 0eb8992d-42cb-4828-b743-77246ce98f97 female 2012-06-20 S
7 0eb8992d-42cb-4828-b743-77246ce98f97 female 2012-06-20 S
8 0eb8992d-42cb-4828-b743-77246ce98f97 female 2012-06-20 S
9 6ad7ab9c-5609-4da7-861e-657127e2d210 female 2013-11-30 S

While FHIRPath can be quite complex, its use in FHIR-PYrate is often straightforward. Nested elements are separated with ., and elements with multiple sub-values are identified by [N], where N is an integer starting at 0. The element paths can typically be constructed by loading all elements into a DataFrame and then manually deriving the FHIRPaths from the column names, or by looking at the element hierarchy on the resource pages in the FHIR specification (see Key FHIR Resources for more information on reading the FHIR specification).
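
The toy evaluator below illustrates how such simple paths resolve against a FHIR instance. It supports only dotted element names with an optional [N] index, which is a tiny subset of FHIRPath, and it is not the engine FHIR-PYrate uses; `select` is a hypothetical helper and the Patient fragment is hand-written. As in FHIRPath, the result is always a collection, so a path without an index can return several values.

```python
import re

# One path segment: an element name, optionally followed by [N]
_SEGMENT = re.compile(r"^(\w+)(?:\[(\d+)\])?$")


def select(resource, path):
    """Evaluate a dotted path with optional [N] indices; always returns a list."""
    nodes = [resource]
    for segment in path.split("."):
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"Unsupported path segment: {segment!r}")
        name, index = match.group(1), match.group(2)

        # Collect the named child of every current node, unpacking repeated elements
        children = []
        for node in nodes:
            if isinstance(node, dict) and name in node:
                child = node[name]
                children.extend(child if isinstance(child, list) else [child])

        # [N] keeps only the N-th item of the collection (empty if out of range)
        if index is not None:
            position = int(index)
            children = children[position:position + 1]
        nodes = children
    return nodes


# Hand-written, trimmed Patient fragment (illustrative only)
patient = {
    "gender": "female",
    "identifier": [{"value": "abc"}, {"value": "999-00-0001"}],
    "maritalStatus": {"coding": [{"code": "S"}]},
}

print(select(patient, "identifier[0].value"))
print(select(patient, "identifier.value"))
print(select(patient, "maritalStatus.coding[0].code"))
```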

4 Elements with multiple sub-values

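There are multiple identifier[N].value values for each instance of Patient in this dataset. Omitting the index from the FHIRPath (identifier.value rather than identifier[0].value) returns all of them as a list in a single column: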

# Instantiate and perform the FHIR search interaction in a single function call
df = search.steal_bundles_to_dataframe(
    resource_type="Patient",
    request_params={
        "_count": 10,  # Get 10 instances per page
        "identifier": "https://github.com/synthetichealth/synthea|",
    },
    num_pages=1,  # Get 1 page (so a total of 10 instances)
    fhir_paths=[("id", "identifier[0].value"), ("identifiers", "identifier.value")],
)
df
https://hapi.fhir.org/baseR4/Patient?_count=10&identifier=https://github.com/synthetichealth/synthea|
Query & Build DF (Patient): 100%|██████████| 1/1 [00:00<00:00, 500.75it/s]
id identifiers
0 bf23e283-4791-46e1-9d79-9e0ad9edd436 [bf23e283-4791-46e1-9d79-9e0ad9edd436, bf23e28...
1 bf23e283-4791-46e1-9d79-9e0ad9edd436 [bf23e283-4791-46e1-9d79-9e0ad9edd436, bf23e28...
2 7a2886c6-67cd-40bb-87a9-13423e051102 [7a2886c6-67cd-40bb-87a9-13423e051102, 7a2886c...
3 b9a32653-9fde-401f-bb32-9932e680c456 [b9a32653-9fde-401f-bb32-9932e680c456, b9a3265...
4 0eb8992d-42cb-4828-b743-77246ce98f97 [0eb8992d-42cb-4828-b743-77246ce98f97, 0eb8992...
5 0eb8992d-42cb-4828-b743-77246ce98f97 [0eb8992d-42cb-4828-b743-77246ce98f97, 0eb8992...
6 0eb8992d-42cb-4828-b743-77246ce98f97 [0eb8992d-42cb-4828-b743-77246ce98f97, 0eb8992...
7 0eb8992d-42cb-4828-b743-77246ce98f97 [0eb8992d-42cb-4828-b743-77246ce98f97, 0eb8992...
8 0eb8992d-42cb-4828-b743-77246ce98f97 [0eb8992d-42cb-4828-b743-77246ce98f97, 0eb8992...
9 6ad7ab9c-5609-4da7-861e-657127e2d210 [6ad7ab9c-5609-4da7-861e-657127e2d210, 6ad7ab9...

To split the list into separate columns, you can do the following:

df.join(pd.DataFrame(df.pop("identifiers").values.tolist()).add_prefix("identifier_"))
id identifier_0 identifier_1 identifier_2 identifier_3 identifier_4
0 bf23e283-4791-46e1-9d79-9e0ad9edd436 bf23e283-4791-46e1-9d79-9e0ad9edd436 bf23e283-4791-46e1-9d79-9e0ad9edd436 999-21-6325 S99948444 X30821805X
1 bf23e283-4791-46e1-9d79-9e0ad9edd436 bf23e283-4791-46e1-9d79-9e0ad9edd436 bf23e283-4791-46e1-9d79-9e0ad9edd436 999-21-6325 S99948444 X30821805X
2 7a2886c6-67cd-40bb-87a9-13423e051102 7a2886c6-67cd-40bb-87a9-13423e051102 7a2886c6-67cd-40bb-87a9-13423e051102 999-91-9486 S99942298 X72218123X
3 b9a32653-9fde-401f-bb32-9932e680c456 b9a32653-9fde-401f-bb32-9932e680c456 b9a32653-9fde-401f-bb32-9932e680c456 999-62-8542 None None
4 0eb8992d-42cb-4828-b743-77246ce98f97 0eb8992d-42cb-4828-b743-77246ce98f97 0eb8992d-42cb-4828-b743-77246ce98f97 999-31-3830 None None
5 0eb8992d-42cb-4828-b743-77246ce98f97 0eb8992d-42cb-4828-b743-77246ce98f97 0eb8992d-42cb-4828-b743-77246ce98f97 999-31-3830 None None
6 0eb8992d-42cb-4828-b743-77246ce98f97 0eb8992d-42cb-4828-b743-77246ce98f97 0eb8992d-42cb-4828-b743-77246ce98f97 999-31-3830 None None
7 0eb8992d-42cb-4828-b743-77246ce98f97 0eb8992d-42cb-4828-b743-77246ce98f97 0eb8992d-42cb-4828-b743-77246ce98f97 999-31-3830 None None
8 0eb8992d-42cb-4828-b743-77246ce98f97 0eb8992d-42cb-4828-b743-77246ce98f97 0eb8992d-42cb-4828-b743-77246ce98f97 999-31-3830 None None
9 6ad7ab9c-5609-4da7-861e-657127e2d210 6ad7ab9c-5609-4da7-861e-657127e2d210 6ad7ab9c-5609-4da7-861e-657127e2d210 999-70-8572 None None

This will give you separate identifier_0, identifier_1, … columns for each Patient.identifier[N] value.
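
The one-liner above can be easier to follow when broken into steps. The sketch below does that on a small hand-built DataFrame (illustrative values, not server data). It also shows an alternative long layout with one row per identifier, which may be a better fit for tidy-data workflows; the long layout is a suggestion that goes beyond the example above.

```python
import pandas as pd

# Stand-in for the DataFrame with a list-valued "identifiers" column (illustrative data)
ids_df = pd.DataFrame(
    {
        "id": ["example-1", "example-2"],
        "identifiers": [
            ["example-1", "999-00-0001", "X0001"],
            ["example-2", "999-00-0002"],
        ],
    }
)

# Wide layout: one column per list position; shorter lists are padded with missing values
lists = ids_df["identifiers"].tolist()
wide_part = pd.DataFrame(lists, index=ids_df.index).add_prefix("identifier_")
wide_ids = ids_df.drop(columns="identifiers").join(wide_part)
print(wide_ids)

# Long layout: one row per (id, identifier) pair
long_ids = (
    ids_df.explode("identifiers")
    .rename(columns={"identifiers": "identifier"})
    .reset_index(drop=True)
)
print(long_ids)
```

Unlike the one-liner, this version leaves the original DataFrame unchanged, because it uses drop() to return a copy rather than pop() to remove the column in place.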