Understand the high-level approaches for converting FHIR-formatted data into tabular format for analysis in R.
Learn how the fhircrackr library facilitates requesting data from a FHIR server and creating tidy data tables.
Relevant roles:
Informaticist
Data analysis in R typically uses data frames to store tabular data. There are two primary approaches to loading FHIR-formatted data into R data frames:
Writing R code to manually convert FHIR instances in JSON format into data frames.
Using a purpose-built library like fhircrackr to automatically convert FHIR instances into data frames.
Trying fhircrackr first is recommended. If fhircrackr does not fit your use case, it may be easier to convert the data from FHIR to tabular format in Python and then export it to an R-readable format than to do the conversion entirely in R. The reticulate package may facilitate this by allowing Python and R code to share data objects within RStudio.
To use fhircrackr, you will need an R runtime with fhircrackr installed. R users typically work in the RStudio IDE, but this is not strictly necessary.
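The examples below request data from the test server at https://api.logicahealth.org/FHIRResearchSynthea/open. However, any FHIR server loaded with testing data can be used. See Standing up a FHIR Testing Server for instructions to set up your own test server.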
The code blocks in the following section show sample output immediately after. This is similar to the code blocks and results in a rendered RMarkdown file.
2 Retrieving FHIR data
Once your environment is set up, you can run the following R code to retrieve instances of the Patient resource from a test server:
# Load dependencies
library(fhircrackr)
library(tidyverse) # Not strictly necessary, but helpful for working with data in R
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Define the URL of the FHIR server and the request that will be made
request <- fhir_url(url = "https://api.logicahealth.org/FHIRResearchSynthea/open", resource = "Patient")
# Perform the request
patient_bundle <- fhir_search(request = request, max_bundles = 1, verbose = 0)
# This method defines the mapping from FHIR to data frame columns.
# If the `cols` argument is omitted, all data elements will be included in the data frame.
table_desc_patient <- fhir_table_description(resource = "Patient")
# Convert to R data frame
df_patient <- fhir_crack(bundles = patient_bundle, design = table_desc_patient, verbose = 0)
df_patient %>% head(5)
It is easier to see the contents of this data frame by printing its first row vertically:
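As a minimal sketch of printing a row vertically, the base R snippet below transposes the first row of a data frame so that each column appears on its own line. The `df_example` data frame and its values are hypothetical stand-ins for `df_patient`, used only so the snippet runs without a FHIR server.

```r
# Hypothetical stand-in for the first rows of a cracked Patient data frame
df_example <- data.frame(
  id = c("p1", "p2"),
  gender = c("female", "male"),
  birthDate = c("1980-04-12", "1975-11-03"),
  stringsAsFactors = FALSE
)

# Transpose the first row so each column becomes its own line of output
first_row_vertical <- t(head(df_example, 1))
print(first_row_vertical)
```

With the real data, the same idea would apply to `df_patient` in place of `df_example`.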
If you look at the output above, you can see fhircrackr collapsed the hierarchical FHIR data structure into data frame columns, with multiple values delimited by ::: by default. For example, Patient.identifier has multiple values that appear in the data frame as:
Column name             Example values
identifier.type.text    Medical Record Number:::Social Security Number:::Driver's License:::Passport Number
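To illustrate what a collapsed cell contains, the sketch below splits one such value on the default `:::` delimiter using base R. The `cell` variable simply copies the example value shown above; it is not read from the server.

```r
# One collapsed cell as it might appear in the identifier.type.text column
cell <- "Medical Record Number:::Social Security Number:::Driver's License:::Passport Number"

# fixed = TRUE treats ":::" as a literal string rather than a regular expression
identifier_types <- strsplit(cell, ":::", fixed = TRUE)[[1]]
print(identifier_types)
```

This is only meant to show the structure of the cell contents. For whole data frames, the fhircrackr tools described later in this section are likely a better fit than splitting by hand.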
Usually not every single value from a FHIR instance is needed for analysis. There are two ways to get a more concise data frame:
Use the approach above to load all elements into a data frame, remove the unneeded columns, and rename the remaining columns as needed.
Use XPath to select specific elements and map them onto column names.
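The first approach can be sketched in base R as follows. The `df_all` data frame is a hypothetical stand-in for a data frame containing all elements, and the column names in it are assumptions chosen for illustration.

```r
# Hypothetical wide data frame standing in for a data frame with all elements
df_all <- data.frame(
  id = c("p1", "p2"),
  gender = c("female", "male"),
  birthDate = c("1980-04-12", "1975-11-03"),
  meta.versionId = c("1", "3"),
  stringsAsFactors = FALSE
)

# Keep only the columns needed for analysis
df_concise <- df_all[, c("id", "gender", "birthDate")]

# Rename the FHIR element name to a more analysis-friendly column name
names(df_concise)[names(df_concise) == "birthDate"] <- "date_of_birth"
print(df_concise)
```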
The second approach is typically more concise. For example, to generate a data frame like this…
id    gender    date_of_birth    marital_status
…     …         …                …
…you could use the following code:
table_desc_patient <- fhir_table_description(
  resource = "Patient",
  cols = c(
    id = "id",
    gender = "gender",
    date_of_birth = "birthDate",
    # Rather than having fhircrackr concatenate all `Patient.maritalStatus` values
    # into one cell, you can select a specific value with XPath:
    marital_status = "maritalStatus/coding[1]/code"
  )
)
df_patient <- fhir_crack(bundles = patient_bundle, design = table_desc_patient, verbose = 0)
df_patient %>% head(5)
While XPath expressions can be quite complex, their use in fhircrackr is often straightforward. Nested elements are separated with /, and elements with multiple sub-values are identified by [N], where N is an integer starting at 1.
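To make the path syntax concrete, the sketch below evaluates XPath expressions against a small hand-written Patient fragment using the xml2 package. The XML content is hypothetical and simplified (no namespace declaration). Because FHIR XML stores values in `value` attributes, the expressions here end in `/@value` when used directly with xml2; the fhircrackr example above does not need that suffix.

```r
library(xml2)

# Hypothetical, simplified Patient instance with two maritalStatus codings
patient_xml <- read_xml('<Patient>
  <id value="example-1"/>
  <gender value="female"/>
  <maritalStatus>
    <coding>
      <system value="http://terminology.hl7.org/CodeSystem/v3-MaritalStatus"/>
      <code value="M"/>
    </coding>
    <coding>
      <system value="http://example.org/local-codes"/>
      <code value="married"/>
    </coding>
  </maritalStatus>
</Patient>')

# [1] selects the first coding; XPath indices start at 1
first_code <- xml_text(xml_find_first(patient_xml, "maritalStatus/coding[1]/code/@value"))

# Without an index, every matching coding is returned
all_codes <- xml_text(xml_find_all(patient_xml, "maritalStatus/coding/code/@value"))

print(first_code)
print(all_codes)
```

Experimenting on a small synthetic fragment like this is one way to check an expression locally without sending data to an online tool.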
There are two approaches to identifying element paths to construct XPath expressions:
Print out the raw data returned by the FHIR server. Fhircrackr uses XML-formatted data, and the following code will print out one of the instances of Patient requested above:
In some cases, you may need to construct more complex expressions like the one to extract marital_status from Patient.maritalStatus.coding[0].code. You can use a tool like this XPath tester to help generate XPath expressions, though online tools such as these should not be used with real patient data. For more information on XPath, see this guide.
4 Elements with multiple sub-values
There are multiple identifier[N].value values for each instance of Patient in this dataset. By default, fhircrackr will concatenate these into a single cell per row, delimited with ::: (this is configurable; use fhir_table_description(..., sep = ' | ', ...) to delimit with | instead).
Fhircrackr provides some tools to split up multiple values stored in the same cell into separate rows in a “long” data frame:
table_desc_patient <- fhir_table_description(
  resource = "Patient",
  # Prefix values in cells with indices to facilitate handling cells that contain
  # multiple values
  brackets = c("[", "]")
)
df_patient_indexed <- fhir_crack(bundles = patient_bundle, design = table_desc_patient, verbose = 0)
df_patient_identifiers <- fhir_melt(
  indexed_data_frame = df_patient_indexed,
  columns = c("identifier.type.text", "identifier.value"),
  brackets = c("[", "]"),
  sep = ":::",
  all_columns = FALSE
)
df_patient_identifiers %>% head(10)
The df_patient_identifiers data frame printed above has one row for each value of Patient.identifier for each instance of Patient. The in-cell indices (surrounded by [ ]) can be removed:
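fhircrackr's `fhir_rm_indices()` function is designed for this step. To show the effect without depending on a server response, the base R sketch below strips bracketed index prefixes with a regular expression. The `df_indexed` data frame, its values, and the `strip_indices()` helper are hypothetical and only approximate what indexed cells look like.

```r
# Hypothetical data frame with index prefixes in each cell
df_indexed <- data.frame(
  identifier.type.text = c("[1.1]Medical Record Number", "[1.1]Social Security Number"),
  identifier.value = c("[1]abc-123", "[1]999-00-1234"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove any "[...]" prefix made of digits and dots
strip_indices <- function(x) gsub("\\[[0-9.]+\\]", "", x)

# Apply the helper to every column while keeping the data frame structure
df_clean <- df_indexed
df_clean[] <- lapply(df_indexed, strip_indices)
print(df_clean)
```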
These can then be merged back into the original data frame as needed. For example, if you want to include the synthetic “Social Security Number” in the original data:
df_patient %>%
  # Add in row numbers for joining
  mutate(row_number = row_number()) %>%
  left_join(
    df_patient_identifiers %>%
      # Note: this assumes there is just one social security number for each patient in the data.
      # If this was not true, it would be necessary to remove extra data before joining so there
      # was one row per patient.
      filter(`identifier.type.text` == "Social Security Number") %>%
      rename("ssn" = "identifier.value") %>%
      # Exclude the `identifier.type.text` column so it doesn't appear in the joined data frame
      select(resource_identifier, ssn) %>%
      # Fhircrackr generates the `resource_identifier` column as a string, but it needs to be
      # an integer for joining.
      mutate(resource_identifier = as.integer(resource_identifier)),
    by = c("row_number" = "resource_identifier")
  ) %>%
  head(5)
You can see that the synthetic SSNs are now split out into a separate column.
5 Retrieving related data
To retrieve instances of related resources, additional search parameters can be added through the parameters argument of fhir_url(). See Using the FHIR API to Access Data for more information on constructing the parameters for FHIR search interactions.
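As a rough illustration of what search parameters amount to, the base R sketch below assembles a search URL by hand from a named vector of parameters. This is not how fhircrackr builds requests internally, and the exact string that `fhir_url()` produces may differ; the variable names are hypothetical.

```r
base_url <- "https://api.logicahealth.org/FHIRResearchSynthea/open"
params <- c("_revinclude" = "Observation:patient", "_count" = "10")

# URL-encode each value; reserved = TRUE also encodes characters such as ":" and "|"
encoded_values <- vapply(params, URLencode, character(1), reserved = TRUE)

# Join name=value pairs with "&" and append them to the resource endpoint
query <- paste0(names(params), "=", encoded_values, collapse = "&")
search_url <- paste0(base_url, "/Patient?", query)
print(search_url)
```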
In the example below, instances of Patient and instances of related Observation resources are requested:
request <- fhir_url(
  url = "https://api.logicahealth.org/FHIRResearchSynthea/open",
  resource = "Patient",
  parameters = c(
    "_revinclude" = "Observation:patient",
    "_count" = "10" # Limit the number of patients returned to 10
  )
)
response <- fhir_search(request = request, max_bundles = 1, verbose = 0)
table_desc_patient <- fhir_table_description(
  resource = "Patient",
  cols = c(
    id = "id",
    gender = "gender",
    date_of_birth = "birthDate",
    marital_status = "maritalStatus/coding[1]/code"
  )
)
# Crack the Patients from the new response (not the earlier `patient_bundle`)
df_patient <- fhir_crack(bundles = response, design = table_desc_patient, verbose = 0)
table_desc_observation <- fhir_table_description(resource = "Observation")
df_observation <- fhir_crack(bundles = response, design = table_desc_observation, verbose = 0)
df_observation %>% glimpse()
This includes many different kinds of observations. The Observation.code element identifies the type of Observation. For example, http://loinc.org|72166-2 is the LOINC code for smoking status. To get smoking status records identified by this LOINC code:
request <- fhir_url(
  url = "https://api.logicahealth.org/FHIRResearchSynthea/open",
  resource = "Observation",
  parameters = c(
    "_include" = "Observation:patient",
    "code" = "http://loinc.org|72166-2"
  )
)
# `max_bundles = 1` limits the responses to a subset of Observations for the purposes of
# this example -- this argument can be removed to get all relevant Observations/Patients (but
# the query takes longer to run)
response <- fhir_search(request = request, max_bundles = 1, verbose = 0)
table_desc_patient <- fhir_table_description(
  resource = "Patient",
  cols = c(
    id = "id",
    gender = "gender",
    date_of_birth = "birthDate",
    marital_status = "maritalStatus/coding[1]/code"
  )
)
# Crack the Patients from the new response (not the earlier `patient_bundle`)
df_patient <- fhir_crack(bundles = response, design = table_desc_patient, verbose = 0)
table_desc_observation <- fhir_table_description(resource = "Observation")
df_observation <- fhir_crack(bundles = response, design = table_desc_observation, verbose = 0)
df_observation %>% glimpse()
The df_observation data frame contains just smoking status Observations. The df_patient data frame contains the Patients referenced by the Observations in df_observation.
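One way to link the two data frames is to extract the patient id from each Observation's subject reference and join on it. The base R sketch below uses small hypothetical data frames (`df_obs`, `df_pat`) and assumes the reference column is named `subject.reference` with values of the form `Patient/<id>`. Real servers may return other reference formats, such as full URLs, which would need different handling.

```r
# Hypothetical stand-ins for df_observation and df_patient
df_obs <- data.frame(
  observation_id = c("o1", "o2", "o3"),
  subject.reference = c("Patient/p1", "Patient/p2", "Patient/p1"),
  smoking_status = c("Never smoker", "Former smoker", "Never smoker"),
  stringsAsFactors = FALSE
)
df_pat <- data.frame(
  id = c("p1", "p2"),
  gender = c("female", "male"),
  stringsAsFactors = FALSE
)

# Extract the patient id from references of the form "Patient/<id>"
df_obs$patient_id <- sub("^Patient/", "", df_obs$subject.reference)

# Join each Observation to the Patient it refers to
df_joined <- merge(df_obs, df_pat, by.x = "patient_id", by.y = "id")
print(df_joined)
```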
NIH’s Office of Data Science Strategy has online exercises for converting FHIR-formatted data into tabular format for further analysis. These exercises include implementations in both Python and R. The R exercises go into greater depth on using fhircrackr to access FHIR data in R, including integrating FHIR data with data from other web APIs.