Workshop attendees will learn how to query FHIR resources in various ways, to enable visualizing and analyzing data.
What will participants do as part of the exercise?
📘 A link to a useful external reference related to the section the icon appears in
🖐 A hands-on section where you will code something or interact with the server
In this exercise we’re going to explore how to access the data needed to generate the summary information from the Kids First dashboard in a few different ways. A snapshot of the Kids First dashboard is shown below:
The Kids First Data Portal is accessible at https://portal.kidsfirstdrc.org/explore (login required, though signup is free with any Google account)
For this exercise we’ll be focusing on the following 4 graphs:
- Demographics
- Most frequent diagnoses
- Age at diagnosis
- Overall survival
(Note that the image shown depicts the statistics for the entire Kids First population, whereas all graphs in this exercise will be based on specific sub-cohorts of the population, so the graphs we generate today will look a little different.)
Load needed libraries:
library(fhircrackr)
source("exercise_2_fhircrackr_patch.R") # Support Kids First cookie
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.1.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(skimr)
library(summarytools)
##
## Attaching package: 'summarytools'
## The following object is masked from 'package:tibble':
##
## view
library(table1)
##
## Attaching package: 'table1'
## The following objects are masked from 'package:summarytools':
##
## label, label<-
## The following objects are masked from 'package:base':
##
## units, units<-
# Used for direct RESTful queries against the FHIR server
library(httr)
library(jsonlite)
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
# Visualizations
library(ggthemes)
theme_set(ggthemes::theme_economist_white())
# Survival analysis
library(survival)
library(survminer)
## Loading required package: ggpubr
Kids First uses an HTTP cookie for authentication, which isn’t supported natively by `fhircrackr`. The setup block above loads a patched version of `fhircrackr` to support this.
If you see the message “Could not authenticate with Kids First. The cookie may need to be updated” when running the code block above, then let the instructors know ASAP so they can fetch a new cookie, or see these instructions to fetch a cookie and then re-run the setup block above.
Our first step will be to show how to review basic demographic information for a patient cohort. Let’s explore a few approaches for constructing a patient cohort.
For the simplest example, let’s just query for the first set of Patients on the server and see what that looks like.
🖐 Knowledge Check: Fill in the query to select Patients on the server.
(Note that there are over 10,000 Patient resources on this server, so we don’t want to query them all or follow all the pagination. For performance reasons, the examples in this notebook fetch only the first page, or the first few pages, of results. In a real-world use case, you would follow the pagination as shown in the previous exercise to make sure you fetched all the requested data for a given query.)
fhir_server <- "https://kf-api-fhir-service.kidsfirstdrc.org"
request <- fhir_url(url = fhir_server, resource = "Patient")
patient_bundle <- fhir_search(request = request, max_bundles = 10)
## Starting download of 10 bundles of resource type https://kf-api-fhir-service.kidsfirstdrc.org/Patient from FHIR base URL https://kf-api-fhir-service.kidsfirstdrc.org/Patient.
## This may take a while...
## Patched 'get_bundle' in use...
## Patched 'get_bundle' in use...
## Patched 'get_bundle' in use...
## Patched 'get_bundle' in use...
## Patched 'get_bundle' in use...
## Patched 'get_bundle' in use...
## Patched 'get_bundle' in use...
## Patched 'get_bundle' in use...
## Patched 'get_bundle' in use...
## Patched 'get_bundle' in use...
##
## Download completed. Number of downloaded bundles was limited to 10 bundles, this is less than the total number of bundles available.
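As a reminder of what “following the pagination” means, here is a minimal sketch of how a client finds the next page in a search result. The Bundle below is a hypothetical, trimmed-down example (the `example.org` URLs and the `_page` parameter are made up; real servers return their own opaque links), and `next_page_url` is an illustrative helper, not part of `fhircrackr`. `fhir_search` does this link-following for you, up to `max_bundles`.

```r
library(jsonlite)

# Hypothetical searchset Bundle with the resources left out
bundle_json <- '{
  "resourceType": "Bundle",
  "type": "searchset",
  "link": [
    {"relation": "self", "url": "https://example.org/fhir/Patient?_count=50"},
    {"relation": "next", "url": "https://example.org/fhir/Patient?_count=50&_page=2"}
  ]
}'

# Return the URL of the next page, or NA when there is no "next" link
next_page_url <- function(bundle) {
  for (link in bundle$link) {
    if (identical(link$relation, "next")) return(link$url)
  }
  NA_character_
}

bundle <- fromJSON(bundle_json, simplifyVector = FALSE)
next_url <- next_page_url(bundle)
print(next_url)
```

A client would keep requesting `next_url` until the helper returns `NA`.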
Let’s filter the bundle down to just the first Patient resource to see what it contains:
xml2::xml_find_first(x = patient_bundle[[1]], xpath = "./entry[1]/resource") %>%
paste0 %>%
cat
## <resource>
## <Patient>
## <id value="103070"/>
## <meta>
## <versionId value="2"/>
## <lastUpdated value="2021-11-16T09:49:02.048+00:00"/>
## <source value="#yOJAbnQcyXm5DGen"/>
## <profile value="http://hl7.org/fhir/StructureDefinition/Patient"/>
## <tag>
## <code value="SD_0TYVY1TW"/>
## </tag>
## </meta>
## <extension url="http://hl7.org/fhir/us/core/StructureDefinition/us-core-race">
## <extension url="text">
## <valueString value="Not Reported"/>
## </extension>
## </extension>
## <extension url="http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity">
## <extension url="text">
## <valueString value="Not Reported"/>
## </extension>
## </extension>
## <identifier>
## <value value="2-4F"/>
## </identifier>
## <identifier>
## <system value="https://kf-api-dataservice.kidsfirstdrc.org/participants/"/>
## <value value="PT_803DN7MS"/>
## </identifier>
## <identifier>
## <system value="urn:kids-first:unique-string"/>
## <value value="Patient-SD_0TYVY1TW-PT_803DN7MS"/>
## </identifier>
## <gender value="male"/>
## </Patient>
## </resource>
Looking at this XML, it appears to contain the data to construct a data frame of patients with some basic demographics:
id | gender | race | ethnicity |
---|---|---|---|
103070 | male | Not Reported | Not Reported |
… | … | … | … |
Gender is relatively easy to extract, but race and ethnicity are a little trickier because they are recorded as extensions. Extensions are used to represent information that is not part of the basic definition of a resource.
Every element in a resource or data type includes an optional “extension” child element that may be present any number of times. Extensions contain a defining `url` and either a `value[x]` or sub-extensions (but not both).
This also leads into choice types, such as that `value[x]`. Choice types allow different instances to use different data types as appropriate. Only one of the choices is allowed at a time on a given resource instance.
A simple example of choice types is the `Patient.deceased[x]` field indicating if the individual is deceased or not. `deceased[x]` is allowed to be either a `boolean` or `dateTime`.
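In JSON, a choice type shows up as a field whose name is the base name plus the capitalized type, for example `deceasedBoolean` or `deceasedDateTime`. The sketch below shows one way to detect which variant is present. The two Patient fragments are hypothetical, and `choice_field` is an illustrative helper name rather than anything from a FHIR library.

```r
library(jsonlite)

# Two hypothetical Patient fragments using different deceased[x] variants
patient_a <- fromJSON('{"resourceType": "Patient", "deceasedBoolean": false}')
patient_b <- fromJSON('{"resourceType": "Patient", "deceasedDateTime": "2015-02-14T13:42:00+10:00"}')

# Find which variant of a choice element is present by matching its name prefix
choice_field <- function(resource, prefix) {
  hits <- grep(paste0("^", prefix, "[A-Z]"), names(resource), value = TRUE)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

variant_a <- choice_field(patient_a, "deceased")
variant_b <- choice_field(patient_b, "deceased")
print(c(variant_a, variant_b))
```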
Note that extensions are also allowed on primitive types. If you are looking at the JSON representation of FHIR resources (see Exercise 1), extensions on primitive types are represented by prepending the field name with an underscore (`_`) to create a new object-type field where the extension field can be added. The following example demonstrates the “birthTime” extension on the `Patient.birthDate` field:
{
"resourceType": "Patient",
...
"birthDate": "1987-06-05",
"_birthDate": {
"extension": [
{
"url": "http://hl7.org/fhir/StructureDefinition/patient-birthTime",
"valueDateTime": "1987-06-05T04:32:01Z"
}
]
}
}
The XML version looks like this:
<birthDate value="1987-06-05">
<extension url="http://hl7.org/fhir/StructureDefinition/patient-birthTime">
<valueDateTime value="1987-06-05T04:32:01Z"/>
</extension>
</birthDate>
We’ll see more instances like this later in the exercise.
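If you are working with the JSON form directly, the extension on a primitive can be read from the underscore-prefixed sibling field. The following is a minimal sketch using the same birthTime example as above. It uses plain `jsonlite` list handling rather than `fhircrackr`, and the variable names are only illustrative.

```r
library(jsonlite)

patient <- fromJSON('{
  "resourceType": "Patient",
  "birthDate": "1987-06-05",
  "_birthDate": {
    "extension": [
      {
        "url": "http://hl7.org/fhir/StructureDefinition/patient-birthTime",
        "valueDateTime": "1987-06-05T04:32:01Z"
      }
    ]
  }
}', simplifyVector = FALSE)

birth_time_url <- "http://hl7.org/fhir/StructureDefinition/patient-birthTime"

# Keep only the extension(s) whose url matches, then pull out the value
matches <- Filter(
  function(ext) identical(ext$url, birth_time_url),
  patient[["_birthDate"]]$extension
)
birth_time <- if (length(matches) > 0) matches[[1]]$valueDateTime else NA_character_

print(patient$birthDate)
print(birth_time)
```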
📘Read more about Extensions in FHIR
Getting back to race and ethnicity, these extensions are defined within US Core, an implementation guide that defines the base set of requirements for FHIR implementations in the US and reflects the ONC U.S. Core Data for Interoperability required data fields. Further details about US Core are outside the scope of this exercise; for now, just understand that nearly all FHIR data within the US will use US Core.
Both the Race and Ethnicity extensions use sub-extensions to represent the concept in 3 possible ways:
- OMB Category, based on the OMB standards (https://www.govinfo.gov/content/pkg/FR-1997-10-30/pdf/97-28653.pdf)
  - `url` is “ombCategory”
  - `valueCoding` from the OMB Race Categories ValueSet or OMB Ethnicity Categories ValueSet
- Detailed, based on CDC Race and Ethnicity codes
  - `url` is “detailed”
  - `valueCoding` from the Detailed race ValueSet or Detailed ethnicity ValueSet
- Text, free text (required)
  - `url` is “text”
  - `valueString` is free text
📘Read more about the US Core Race Extension
📘Read more about the US Core Ethnicity Extension
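To make the nesting concrete, here is a small sketch that pulls both the text and the OMB category code out of a race extension with `xml2`. The Patient fragment is a hypothetical example (the Kids First resources we saw above only carry the “text” sub-extension), and it omits the FHIR XML namespace so the XPath expressions stay simple.

```r
library(xml2)

# Hypothetical Patient fragment with both "ombCategory" and "text" sub-extensions
patient_xml <- read_xml('<Patient>
  <extension url="http://hl7.org/fhir/us/core/StructureDefinition/us-core-race">
    <extension url="ombCategory">
      <valueCoding>
        <system value="urn:oid:2.16.840.1.113883.6.238"/>
        <code value="2106-3"/>
        <display value="White"/>
      </valueCoding>
    </extension>
    <extension url="text">
      <valueString value="White"/>
    </extension>
  </extension>
</Patient>')

race <- "extension[@url='http://hl7.org/fhir/us/core/StructureDefinition/us-core-race']"

# Free-text value: outer extension -> "text" sub-extension -> valueString
race_text <- xml_attr(
  xml_find_first(patient_xml, paste0(race, "/extension[@url='text']/valueString")),
  "value"
)

# Coded value: outer extension -> "ombCategory" sub-extension -> valueCoding/code
race_code <- xml_attr(
  xml_find_first(patient_xml, paste0(race, "/extension[@url='ombCategory']/valueCoding/code")),
  "value"
)

print(c(race_text, race_code))
```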
Given the above let’s define functions to find the Race and Ethnicity on a Patient resource.
🖐 Fill in the blank XPath queries below to extract the race and ethnicity values out of the extensions on a Patient resource:
# Identify which elements of the FHIR resource we want to capture in our data frame - see Exercise 0 for details
table_desc_patient <- fhir_table_description(
resource = "Patient",
cols = c(
id = "id",
gender = "gender",
race_string = str_c(
"extension[@url=\"http://hl7.org/fhir/us/core/StructureDefinition/us-core-race\"]",
"/extension[@url=\"text\"]",
"/valueString"
),
# The resources we are working with store race and ethnicity as strings rather than
# codes. If you did need to extract the codes, they would be in the "ombCategory"
# (or "detailed") sub-extension, and the XPath queries would look like this:
#
# race_coding_display = str_c(
# "extension[@url=\"http://hl7.org/fhir/us/core/StructureDefinition/us-core-race\"]",
# "/extension[@url=\"ombCategory\"]",
# "/valueCoding",
# "/display"
# ),
# race_coding_code = str_c(
# "extension[@url=\"http://hl7.org/fhir/us/core/StructureDefinition/us-core-race\"]",
# "/extension[@url=\"ombCategory\"]",
# "/valueCoding",
# "/code"
# ),
# 🖐 Fill in the XPath query to extract the ethnicity from the `valueString` of the extension:
ethnicity_string = str_c(
"extension[@url=\"http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity\"]",
"/extension[@url=\"text\"]",
"/valueString"
)
)
)
# Convert to R data frame
df_patient <- fhir_crack(bundles = patient_bundle, design = table_desc_patient, verbose = 0)
## Warning in fhir_crack(bundles = patient_bundle, design = table_desc_patient, : In the cols element of the design, you specified XPath expressions containing '//' which point to an arbitrary level in the resource.
## This can result in unexpected behaviour, e.g. when the searched element appears on different levels of the resource.
## We strongly advise to only use the fully specified relative XPath in the cols element, e.g. 'ingredient/strength/numerator/code' instead of search paths like '//code'.
## This warning is thrown for the following data.frame descriptions: race_string, ethnicity_string
df_patient
Let’s look at some descriptive statistics:
df_patient %>% freq(gender)
## Frequencies
## df_patient$gender
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ------------ ------ --------- -------------- --------- --------------
## female 240 48.00 48.00 48.00 48.00
## male 260 52.00 100.00 52.00 100.00
## <NA> 0 0.00 100.00
## Total 500 100.00 100.00 100.00 100.00
df_patient %>% freq(race_string)
## Frequencies
## df_patient$race_string
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## -------------------------------------- ------ --------- -------------- --------- --------------
## American Indian or Alaska Native 19 3.82 3.82 3.80 3.80
## Asian 11 2.21 6.04 2.20 6.00
## Black or African American 40 8.05 14.08 8.00 14.00
## More Than One Race 18 3.62 17.71 3.60 17.60
## Not Allowed To Collect 109 21.93 39.64 21.80 39.40
## Not Reported 39 7.85 47.48 7.80 47.20
## White 261 52.52 100.00 52.20 99.40
## <NA> 3 0.60 100.00
## Total 500 100.00 100.00 100.00 100.00
df_patient %>% freq(ethnicity_string)
## Frequencies
## df_patient$ethnicity_string
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ---------------------------- ------ --------- -------------- --------- --------------
## Hispanic or Latino 183 42.46 42.46 36.60 36.60
## Not Allowed To Collect 12 2.78 45.24 2.40 39.00
## Not Hispanic or Latino 217 50.35 95.59 43.40 82.40
## Not Reported 19 4.41 100.00 3.80 86.20
## <NA> 69 13.80 100.00
## Total 500 100.00 100.00 100.00 100.00
This data frame can also easily produce charts:
ggplot(df_patient, aes(x="", y=factor(1), fill=gender)) +
geom_bar(stat="identity", width=1) +
coord_polar("y", start=0) +
theme_void() +
scale_fill_brewer(palette="Blues")
ggplot(df_patient, aes(x="", y=factor(1), fill=race_string)) +
geom_bar(stat="identity", width=1) +
coord_polar("y", start=0) +
theme_void() +
scale_fill_brewer(palette="Blues")
ggplot(df_patient, aes(x="", y=factor(1), fill=ethnicity_string)) +
geom_bar(stat="identity", width=1) +
coord_polar("y", start=0) +
theme_void() +
scale_fill_brewer(palette="Blues")
In the previous steps we reviewed what is essentially a random set of Patients, just the first set that the server returned when we asked for all Patients. Now let’s get more targeted and query for just patients who have a diagnosis of a particular Condition. Then we can use the same process and functions we’ve already defined to analyze/visualize it.
Kids First uses the Mondo Disease Ontology for describing Conditions. Other servers may use different ontologies or code systems, such as SNOMED CT or ICD-10. A simple browser for finding Mondo codes by description is available at https://www.ebi.ac.uk/ols/ontologies/mondo . Using this browser, we can look at a few sample codes:
code | description |
---|---|
MONDO:0005015 | diabetes mellitus |
MONDO:0005961 | sinusitis |
MONDO:0008903 | lung cancer |
MONDO:0021640 | grade III glioma |
Let’s use grade III glioma as our condition of interest, with MONDO:0021640 as our code of interest going forward.
In Exercise 1 we saw an instance of basic querying, when we searched for MedicationRequests associated with a given Patient. (Reminder: `"{FHIR_SERVER}/MedicationRequest?patient=10098"`) This is one of the most basic and fundamental types of query, where we get resources from a server, filtered by some aspect of the resource itself. In the previous example with medications, the MedicationRequest resource has a reference back to the Patient in the `patient` field, so we can query that directly. But what if we want to go in the other direction? For example, find all Patients that are taking a given Medication, or Patients that have been diagnosed with a given Condition?
Enter “chaining” and “reverse chaining”. These are capabilities of FHIR that allow for more complex queries that can save a client and/or server from having to perform a series of operations.
The FHIR documentation offers the following examples of chaining:
In order to save a client from performing a series of search operations, reference parameters may be “chained” by appending them with a period (.) followed by the name of a search parameter defined for the target resource. This can be done recursively, following a logical path through a graph of related resources, separated by `.`. For instance, given that the resource `DiagnosticReport` has a search parameter named `subject`, which is usually a reference to a `Patient` resource, and the `Patient` resource includes a parameter `name` which searches on patient name, then the search
GET [base]/DiagnosticReport?subject.name=peter
is a request to return all the lab reports that have a `subject` whose `name` includes “peter”. Because the Diagnostic Report subject can be one of a set of different resources, it’s necessary to limit the search to a particular type:
GET [base]/DiagnosticReport?subject:Patient.name=peter
This request returns all the lab reports that have a subject which is a patient, whose name includes “peter”.
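The pattern in these examples can be sketched as a small URL builder. This is only an illustration of the syntax: `chained_search_url` is a hypothetical helper and `https://example.org/fhir` is a placeholder base URL.

```r
# Build a chained search URL of the form
#   [base]/[resource]?[reference](:[type]).[param]=[value]
chained_search_url <- function(base, resource, reference, param, value, type = NULL) {
  # Optionally restrict the reference to one target resource type
  ref <- if (is.null(type)) reference else paste0(reference, ":", type)
  paste0(
    base, "/", resource, "?", ref, ".", param, "=",
    URLencode(value, reserved = TRUE)
  )
}

base <- "https://example.org/fhir" # hypothetical server

url_any <- chained_search_url(base, "DiagnosticReport", "subject", "name", "peter")
url_patient <- chained_search_url(base, "DiagnosticReport", "subject", "name", "peter",
                                  type = "Patient")
print(url_any)
print(url_patient)
```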
In the case of “Patients diagnosed with a given Condition”, we want the opposite direction: search for resources based on what links back to them. This is done with the `_has` search parameter.
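The `_has` search parameter uses the colon character (`:`) to separate fields, and requires a few sub-parameters:
- The resource type that refers back to the resource being searched (e.g. Observation)
- The search parameter on that resource that holds the reference back (e.g. patient)
- The search parameter to filter the referring resource on (e.g. code)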
A complete example is:
[base]/Patient?_has:Observation:patient:code=1234-5
This requests the server to return Patient resources, where the patient resource is referred to by at least one Observation where the observation has a code of 1234-5, and where the Observation refers to the patient resource in the patient search parameter.
Admittedly, the syntax is a little confusing. It may be easiest to read this query as “Get Patients that have an Observation that links back to this Patient having a code of 1234-5”.
📘 Read more about FHIR Search Chaining and Reverse Chaining
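One way to keep the pieces of a reverse-chained parameter straight is to assemble the name from its parts. The sketch below uses a hypothetical helper, `has_param`, that only builds the parameter name; it does not contact a server.

```r
# Assemble a reverse-chaining parameter name:
#   _has:[referring resource]:[reference search parameter]:[search parameter]
has_param <- function(referring_resource, reference_param, search_param) {
  paste("_has", referring_resource, reference_param, search_param, sep = ":")
}

obs_param <- has_param("Observation", "patient", "code")
print(obs_param)

# The result can be used as the name in a named list of search parameters
params <- setNames(list("1234-5"), obs_param)
print(names(params))
```

A named list like `params` has the same shape as the `parameters` argument that `fhir_url` accepts in this notebook.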
Let’s use this approach to find Patients based on a diagnosis.
🖐 Fill in the search query (in the `parameters` argument) to find Patients that have a Condition of grade III glioma.
request <- fhir_url(url = fhir_server, resource = "Patient", parameters = list("_has:Condition:patient:code" = "MONDO:0021640"))
patient_bundle <- fhir_search(request = request, max_bundles = 1)
## Starting download of 1 bundles of resource type Patient from FHIR base URL https://kf-api-fhir-service.kidsfirstdrc.org.
## Patched 'get_bundle' in use...
##
## Download completed. Number of downloaded bundles was limited to 1 bundles, this is less than the total number of bundles available.
# Can use the same table description as we set up above
df_patient_glioma <- fhir_crack(bundles = patient_bundle, design = table_desc_patient, verbose = 0)
## Warning in fhir_crack(bundles = patient_bundle, design = table_desc_patient, : In the cols element of the design, you specified XPath expressions containing '//' which point to an arbitrary level in the resource.
## This can result in unexpected behaviour, e.g. when the searched element appears on different levels of the resource.
## We strongly advise to only use the fully specified relative XPath in the cols element, e.g. 'ingredient/strength/numerator/code' instead of search paths like '//code'.
## This warning is thrown for the following data.frame descriptions: race_string, ethnicity_string
Let’s look at the descriptive statistics for the first 50 glioma patients. We’ll use the excellent `table1` library this time:
table1(~ gender + race_string + ethnicity_string, data = df_patient_glioma, overall = "Glioma")
 | Glioma (N=50) |
---|---|
gender | |
female | 19 (38.0%) |
male | 28 (56.0%) |
unknown | 3 (6.0%) |
race_string | |
Asian | 3 (6.0%) |
Black or African American | 4 (8.0%) |
Not Available | 1 (2.0%) |
Other | 15 (30.0%) |
White | 27 (54.0%) |
ethnicity_string | |
Hispanic or Latino | 7 (14.0%) |
Not Available | 1 (2.0%) |
Not Hispanic or Latino | 31 (62.0%) |
Not Reported | 11 (22.0%) |
The Kids First portal is composed of multiple research studies. See more at https://portal.kidsfirstdrc.org/studies or https://www.notion.so/Studies-and-Access-a5d2f55a8b40461eac5bf32d9483e90f
In this step we’ll explore how to query for patients specifically associated with one of these research studies. Let’s pick the “Pediatric Brain Tumor Atlas: CBTTC” as an example, because it has a large number of participants.
First let’s find the study we are interested in as a ResearchStudy. There are a few possible ways we can do this, for example a search on ResearchStudy.title, but we can’t assume that the title on the FHIR resource will match the name used in those other lists.
Let’s list all the ResearchStudies on the server and see what we can find.
request <- fhir_url(url = fhir_server, resource = "ResearchStudy")
research_study_bundle <- fhir_search(request = request)
## Starting download of ALL! bundles of resource type https://kf-api-fhir-service.kidsfirstdrc.org/ResearchStudy from FHIR base URL https://kf-api-fhir-service.kidsfirstdrc.org/ResearchStudy.
## This may take a while...
## Patched 'get_bundle' in use...
##
## Download completed. All available bundles were downloaded.
Let’s look at the XML for the first ResearchStudy resource instance returned:
xml2::xml_find_first(x = research_study_bundle[[1]], xpath = "./entry[1]/resource") %>%
paste0 %>%
cat
## <resource>
## <ResearchStudy>
## <id value="276195"/>
## <meta>
## <versionId value="2"/>
## <lastUpdated value="2022-01-19T01:38:53.070+00:00"/>
## <source value="#Vi7u1eZZ5de8QLJp"/>
## <profile value="http://hl7.org/fhir/StructureDefinition/ResearchStudy"/>
## </meta>
## <identifier>
## <system value="https://kf-api-dataservice.kidsfirstdrc.org/studies/"/>
## <value value="SD_9PYZAHHE"/>
## </identifier>
## <identifier>
## <system value="urn:kids-first:unique-string"/>
## <value value="ResearchStudy-SD_9PYZAHHE"/>
## </identifier>
## <identifier>
## <system value="https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id="/>
## <value value="phs001168.v2.p2"/>
## </identifier>
## <title value="Genomic Studies of Orofacial Cleft Birth Defects"/>
## <status value="completed"/>
## <category>
## <coding>
## <system value="http://snomed.info/sct"/>
## <code value="276720006"/>
## <display value="Dysmorphism (disorder)"/>
## </coding>
## <text value="BIRTHDEFECT"/>
## </category>
## <keyword>
## <coding>
## <code value="Kids First"/>
## </coding>
## </keyword>
## <keyword>
## <coding>
## <code value="KF-OCEA"/>
## </coding>
## </keyword>
## <principalInvestigator>
## <reference value="PractitionerRole/117866"/>
## </principalInvestigator>
## </ResearchStudy>
## </resource>
Based on this, we can construct the XPath queries to pull these resources into a data frame:
table_desc_research_study <- fhir_table_description(
resource = "ResearchStudy",
cols = c(
id = "id",
title = "title"
)
)
# Convert to R data frame
df_study <- fhir_crack(bundles = research_study_bundle, design = table_desc_research_study, verbose = 0)
df_study
We want ID 76758, which actually has the title “Pediatric Brain Tumor Atlas - Children’s Brain Tumor Tissue Consortium”. We’ll continue to use this ResearchStudy for future steps in this exercise.
df_study %>% filter(id == 76758)
Patients are linked to a ResearchStudy through ResearchSubject resources: each ResearchSubject references the ResearchStudy in its `study` field and a Patient in its `individual` field. We can therefore query for Patient resources by ResearchStudy via those ResearchSubjects, and again run our same analysis. (Hint: sounds like reverse chaining again!)
🖐 Fill in the query to find Patients that are associated with ResearchStudy 76758
request <- fhir_url(url = fhir_server, resource = "Patient", parameters = list("_has:ResearchSubject:individual:study" = "76758"))
patient_bundle <- fhir_search(request = request, max_bundles = 1)
## Starting download of 1 bundles of resource type Patient from FHIR base URL https://kf-api-fhir-service.kidsfirstdrc.org.
## Patched 'get_bundle' in use...
##
## Download completed. Number of downloaded bundles was limited to 1 bundles, this is less than the total number of bundles available.
# Can use the same table description as we set up above
df_patient_study <- fhir_crack(bundles = patient_bundle, design = table_desc_patient, verbose = 0)
## Warning in fhir_crack(bundles = patient_bundle, design = table_desc_patient, : In the cols element of the design, you specified XPath expressions containing '//' which point to an arbitrary level in the resource.
## This can result in unexpected behaviour, e.g. when the searched element appears on different levels of the resource.
## We strongly advise to only use the fully specified relative XPath in the cols element, e.g. 'ingredient/strength/numerator/code' instead of search paths like '//code'.
## This warning is thrown for the following data.frame descriptions: race_string, ethnicity_string
table1(~ gender + race_string + ethnicity_string, data = df_patient_study, overall = "Study 76758")
 | Study 76758 (N=50) |
---|---|
gender | |
female | 24 (48.0%) |
male | 23 (46.0%) |
unknown | 3 (6.0%) |
race_string | |
Asian | 3 (6.0%) |
Black or African American | 2 (4.0%) |
Native Hawaiian or Other Pacific Islander | 2 (4.0%) |
Not Available | 1 (2.0%) |
Other | 13 (26.0%) |
White | 29 (58.0%) |
ethnicity_string | |
Hispanic or Latino | 8 (16.0%) |
Not Available | 1 (2.0%) |
Not Hispanic or Latino | 35 (70.0%) |
Not Reported | 6 (12.0%) |
Our second step will be to show how to perform queries that enable basic prevalence analysis. Again, there are a few different ways we can build a cohort for this. In this step we’ll be looking at diagnoses, which are represented by the Condition resource.
📘 Read more about the FHIR Condition resource.
As before, let’s start with the simplest possible approach of just selecting an unfiltered and unsorted set of Condition resources. This time, let’s tell the server we want 250 Conditions.
(Why 250? In this case it’s the most the server will return in one response.)
📘 Refresher: read more about requesting a certain number of resources.
🖐 Fill in the query to select 250 Condition resources from the server
request <- fhir_url(url = fhir_server, resource = "Condition", parameters = list("_count" = "250"))
condition_bundle <- fhir_search(request = request, max_bundles = 1)
## Starting download of 1 bundles of resource type Condition from FHIR base URL https://kf-api-fhir-service.kidsfirstdrc.org.
## Patched 'get_bundle' in use...
##
## Download completed. Number of downloaded bundles was limited to 1 bundles, this is less than the total number of bundles available.
# The first few entries only have `code.text`. Increase the index n in `entry[n]` until
# you see the expected nested `code.coding.code` structure
xml2::xml_find_first(x = condition_bundle[[1]], xpath = "./entry[4]/resource") %>%
paste0 %>%
cat
## <resource>
## <Condition>
## <id value="105028"/>
## <meta>
## <versionId value="2"/>
## <lastUpdated value="2021-11-16T09:49:55.772+00:00"/>
## <source value="#2aibRHvPubS7Bc9Y"/>
## <profile value="https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/phenotype"/>
## <tag>
## <code value="SD_0TYVY1TW"/>
## </tag>
## </meta>
## <identifier>
## <system value="https://kf-api-dataservice.kidsfirstdrc.org/phenotypes/"/>
## <value value="PH_5EX7F9JV"/>
## </identifier>
## <identifier>
## <system value="urn:kids-first:unique-string"/>
## <value value="Condition-SD_0TYVY1TW-PH_5EX7F9JV"/>
## </identifier>
## <verificationStatus>
## <coding>
## <system value="http://terminology.hl7.org/CodeSystem/condition-ver-status"/>
## <code value="confirmed"/>
## <display value="Confirmed"/>
## </coding>
## <text value="Positive"/>
## </verificationStatus>
## <code>
## <coding>
## <system value="http://purl.obolibrary.org/obo/hp.owl"/>
## <code value="HP:0002575"/>
## </coding>
## <text value="Tracheoesophageal fistula"/>
## </code>
## <subject>
## <reference value="Patient/102997"/>
## </subject>
## </Condition>
## </resource>
The key to what this Condition represents is nested within the `code` field, but there’s a lot of information there. Let’s dig into three very important types in FHIR: `code`, `Coding`, and `CodeableConcept`.
`code` is a FHIR primitive based on string. `code`s are generally taken from a controlled set of strings defined elsewhere, and are restricted in that `code`s may not contain leading whitespace, trailing whitespace, or more than 1 consecutive whitespace character. `"9283-4"` is an example of a `code`.
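The whitespace rules can be checked with a regular expression. The sketch below uses the pattern `[^\s]+(\s[^\s]+)*`, which is the regex the FHIR R4 datatypes page gives for `code`; the helper name `is_valid_code` is just for illustration.

```r
# TRUE when x has no leading/trailing whitespace and no run of 2+ whitespace characters
is_valid_code <- function(x) {
  grepl("^[^\\s]+(\\s[^\\s]+)*$", x, perl = TRUE)
}

candidates <- c("9283-4", " 9283-4", "9283-4 ", "grade  III", "grade III")
validity <- is_valid_code(candidates)
print(setNames(validity, candidates))
```

Only the first and last candidates pass: the others have a leading space, a trailing space, or two consecutive spaces.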
`Coding` is a general purpose datatype that builds on top of `code`. A `Coding` is a representation of a defined concept using a symbol from a defined code system. `Coding` includes fields for `code`, the code `system` it comes from, the `version` of the system, a human-readable `display`, and `userSelected` to indicate if this coding was chosen directly by the user. An example `Coding`:
In JSON:
{
"system": "http://snomed.info/sct",
"code": "444814009",
"display": "Viral sinusitis (disorder)"
}
In XML:
<coding>
<system value="http://snomed.info/sct"/>
<code value="444814009"/>
<display value="Viral sinusitis (disorder)"/>
</coding>
`CodeableConcept` is a general purpose datatype that builds further on top of `Coding`. A `CodeableConcept` represents a value that is usually supplied by providing a reference to one or more terminologies or ontologies but may also be defined by the provision of text. Most resources that are defined by specific clinical concepts will include a `CodeableConcept` type field. `CodeableConcept` includes fields for an array of `coding`s and optional `text`.
An example `CodeableConcept` in JSON:
{
"coding": [
{
"system": "http://snomed.info/sct",
"code": "260385009",
"display": "Negative"
}, {
"system": "https://acme.lab/resultcodes",
"code": "NEG",
"display": "Negative"
}
],
"text": "Negative for Chlamydia Trachomatis rRNA"
}
And in XML:
<valueCodeableConcept>
<coding>
<system value="http://snomed.info/sct"/>
<code value="260385009"/>
<display value="Negative"/>
</coding>
<coding>
<system value="https://acme.lab/resultcodes"/>
<code value="NEG"/>
<display value="Negative"/>
</coding>
<text value="Negative for Chlamydia Trachomatis rRNA"/>
</valueCodeableConcept>
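Because a `CodeableConcept` can carry text, several codings, or both, code that needs one human-readable label has to choose an order of preference. Here is one possible sketch: prefer `text`, then the first `display`, then the first `code`. That ordering is an assumption made for illustration, not a rule from the FHIR specification, and `concept_label` is a hypothetical helper.

```r
library(jsonlite)

concept <- fromJSON('{
  "coding": [
    {"system": "http://snomed.info/sct", "code": "260385009", "display": "Negative"},
    {"system": "https://acme.lab/resultcodes", "code": "NEG", "display": "Negative"}
  ],
  "text": "Negative for Chlamydia Trachomatis rRNA"
}', simplifyVector = FALSE)

# Pick a single label: text first, then any coding display, then any coding code
concept_label <- function(cc) {
  if (!is.null(cc$text)) return(cc$text)
  for (coding in cc$coding) {
    if (!is.null(coding$display)) return(coding$display)
  }
  for (coding in cc$coding) {
    if (!is.null(coding$code)) return(coding$code)
  }
  NA_character_
}

label_full <- concept_label(concept)

# A concept with only a bare code, like some of the Kids First Conditions
label_code_only <- concept_label(list(coding = list(list(code = "HP:0002575"))))

print(c(label_full, label_code_only))
```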
In this case all we really want is a consistent human-readable display, so let’s get these into a data frame and map that `code` field into something appropriate.
🖐 Fill in the XPath queries below to extract the `text` of the CodeableConcept, and the `code`, `display`, and `system` of the contained Coding.
table_desc_condition <- fhir_table_description(
resource = "Condition",
cols = c(
id = "id",
patient_id = "subject/reference",
codeableconcept_text = "code/text",
coding_code = "code/coding/code",
coding_display = "code/coding/display",
coding_system = "code/coding/system"
)
)
# Convert to R data frame
df_condition <- fhir_crack(bundles = condition_bundle, design = table_desc_condition, verbose = 0)
df_condition
Now let’s create a table of the top 10 most prevalent conditions:
df_condition %>% count(codeableconcept_text, sort = TRUE) %>% head(10)
Now let’s create a graph of the top 10 most prevalent conditions:
ggplot(
df_condition %>% count(codeableconcept_text, sort = TRUE) %>% head(10),
aes(x = reorder(codeableconcept_text, n), y = n)
) +
geom_bar(stat="identity") +
coord_flip() +
xlab("Condition") +
scale_y_continuous(breaks=c(0,2,4,6,8,10))
In the previous steps, we looked at just a random sampling of Conditions: the first 250 that the server happened to return. Now let’s return to the Research Study and see how we can query for just those Conditions.
One might expect we can just chain even further, for example:
/Condition?subject._has:ResearchSubject:individual:study=76758
However, that’s not going to work here. (It seems to hang the entire server for about 2 minutes, so please don’t actually run it.)
Instead, let’s combine two search concepts:
- Get the Patients by ResearchStudy, as we saw before (“reverse chaining”)
- Include the Conditions that reference back to each Patient
We’ve seen how to find a resource based on another resource that references it, but we haven’t yet seen how to include multiple resource types in a single search. This leads us to new search parameters we haven’t seen before: `_include` and `_revinclude`.
`_include` allows for including resources that the queried resource references out to. (For example, Condition references out to a Patient and Encounter.) `_revinclude`, i.e., “reverse include”, allows for including resources that reference back to the queried resource. (For example, Patient is referenced by Condition.)
These parameters specify a search parameter to search on, which includes 3 parts:
- The name of the source resource where the reference field exists
- The field name of the reference
- (Optionally) a specific type of target resource, for cases when multiple resource types are allowed
Some simple examples:
GET [base]/MedicationRequest?_include=MedicationRequest:patient
GET [base]/MedicationRequest?_revinclude=Provenance:target
The first search requests all matching MedicationRequests and asks the server to include any Patient that the prescriptions in the result set refer to. The second search requests all matching MedicationRequests and asks the server to return all the Provenance resources that refer to them.
📘Read more about including other resources in search results
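A search that uses `_include` or `_revinclude` returns a Bundle with more than one resource type, and each entry is flagged with a search mode of `match` or `include`. The sketch below uses a small hypothetical Bundle (the ids are made up) to show how the entries can be told apart.

```r
library(jsonlite)

# Hypothetical result of Patient?_revinclude=Condition:subject
bundle <- fromJSON('{
  "resourceType": "Bundle",
  "type": "searchset",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "p1"},
     "search": {"mode": "match"}},
    {"resource": {"resourceType": "Condition", "id": "c1",
                  "subject": {"reference": "Patient/p1"}},
     "search": {"mode": "include"}},
    {"resource": {"resourceType": "Condition", "id": "c2",
                  "subject": {"reference": "Patient/p1"}},
     "search": {"mode": "include"}}
  ]
}', simplifyVector = FALSE)

# One resource type and one search mode per entry
resource_types <- vapply(bundle$entry, function(e) e$resource$resourceType, character(1))
search_modes <- vapply(bundle$entry, function(e) e$search$mode, character(1))

print(table(resource_types, search_modes))
```

With `fhircrackr`, the `resource` argument of the table description plays a similar role: a description for `"Condition"` picks the Conditions out of a mixed bundle.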
🖐 Implement the query to select Patients within the ResearchStudy of interest and include their Conditions
Reminder: the ResearchStudy id = 76758
request <- fhir_url(url = fhir_server, resource = "Patient", parameters = list("_has:ResearchSubject:individual:study" = "76758", "_revinclude" = "Condition:subject"))
bundle <- fhir_search(request = request, max_bundles = 1)
## Starting download of 1 bundles of resource type Patient from FHIR base URL https://kf-api-fhir-service.kidsfirstdrc.org.
## Patched 'get_bundle' in use...
##
## Download completed. Number of downloaded bundles was limited to 1 bundles, this is less than the total number of bundles available.
# Can use the same table description as we set up above
df_condition_study <- fhir_crack(bundles = bundle, design = table_desc_condition, verbose = 0)
df_condition_study %>% count(codeableconcept_text, sort = TRUE)
Here’s the graph version:
ggplot(
df_condition_study %>% count(codeableconcept_text, sort = TRUE) %>% head(10),
aes(x = reorder(codeableconcept_text, n), y = n)
) +
geom_bar(stat="identity") +
coord_flip() +
xlab("Condition")
Now we have a more useful graph - the most common diagnoses among a research study cohort. (Note however that this represents only the first page of results from the server, not necessarily the entire cohort. Pagination, as seen in the previous exercise, may be necessary to fetch the entire cohort.)
Our third step will be to see how we can recreate the Age at Diagnosis chart.
To calculate age at diagnosis, we need two pieces of information:
- Date of Birth
- Date of Diagnosis
However, in order to de-identify the data, Kids First has removed date of birth information from Patient resources. Instead, it uses relative dates via an extension.
In FHIR these may be captured in different resources that we may need to cross-reference:
Patient.birthDate
Condition.onset[x]
Condition.recordedDate
Let’s take a look at how the Kids First server represents these important concepts.
Let’s start by querying for Conditions of a given code. We’ll stick with MONDO:0021640 (grade III glioma) as our condition of interest.
🖐 Fill in the query to select Conditions by this code
Then we’ll look at one instance to see what it contains.
request <- fhir_url(url = fhir_server, resource = "Condition", parameters = list("code" = "MONDO:0021640"))
bundle <- fhir_search(request = request, max_bundles = 1)
## Starting download of 1 bundles of resource type Condition from FHIR base URL https://kf-api-fhir-service.kidsfirstdrc.org.
## Patched 'get_bundle' in use...
##
## Download completed. Number of downloaded bundles was limited to 1 bundles, this is less than the total number of bundles available.
xml2::xml_find_first(x = bundle[[1]], xpath = "./entry[1]/resource") %>%
paste0 %>%
cat
## <resource>
## <Condition>
## <id value="89562"/>
## <meta>
## <versionId value="1"/>
## <lastUpdated value="2021-10-14T21:22:47.805+00:00"/>
## <source value="#XfvdJU5OrVvL5tZt"/>
## <profile value="https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/disease"/>
## </meta>
## <identifier>
## <system value="https://kf-api-dataservice.kidsfirstdrc.org/diagnoses/"/>
## <value value="DG_4YHT9XXH"/>
## </identifier>
## <identifier>
## <system value="urn:kids-first:unique-string"/>
## <value value="Condition-SD_BHJXBDQK-DG_4YHT9XXH"/>
## </identifier>
## <clinicalStatus>
## <coding>
## <system value="http://terminology.hl7.org/CodeSystem/condition-clinical"/>
## <code value="active"/>
## <display value="Active"/>
## </coding>
## <text value="Active"/>
## </clinicalStatus>
## <verificationStatus>
## <coding>
## <system value="http://terminology.hl7.org/CodeSystem/condition-ver-status"/>
## <code value="confirmed"/>
## <display value="Confirmed"/>
## </coding>
## <text value="True"/>
## </verificationStatus>
## <category>
## <coding>
## <system value="http://terminology.hl7.org/CodeSystem/condition-category"/>
## <code value="encounter-diagnosis"/>
## <display value="Encounter Diagnosis"/>
## </coding>
## </category>
## <code>
## <coding>
## <system value="http://purl.obolibrary.org/obo/mondo.owl"/>
## <code value="MONDO:0021640"/>
## </coding>
## <text value="High-grade glioma/astrocytoma (WHO grade III/IV)"/>
## </code>
## <bodySite>
## <text value="Thalamus"/>
## </bodySite>
## <subject>
## <reference value="Patient/76734"/>
## </subject>
## <recordedDate>
## <extension url="http://hl7.org/fhir/StructureDefinition/relative-date">
## <extension url="event">
## <valueCodeableConcept>
## <coding>
## <system value="http://snomed.info/sct"/>
## <code value="3950001"/>
## <display value="Birth"/>
## </coding>
## </valueCodeableConcept>
## </extension>
## <extension url="relationship">
## <valueCode value="after"/>
## </extension>
## <extension url="offset">
## <valueDuration>
## <value value="7004"/>
## <unit value="days"/>
## <system value="http://unitsofmeasure.org"/>
## <code value="d"/>
## </valueDuration>
## </extension>
## </extension>
## </recordedDate>
## </Condition>
## </resource>
What we see here is that the Condition has a recordedDate field with an extension “http://hl7.org/fhir/StructureDefinition/relative-date”, then nested below that are 3 sub-extensions representing the parts of a “relative date”:
- The event that this Condition is relative to
- The relationship (before/after)
- The numerical offset
See more about the relative-date extension here: http://hl7.org/fhir/R4/extension-relative-date.html
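To make the nesting concrete, here is a minimal sketch that pulls the three parts out of a trimmed copy of the recordedDate element shown above, using xml2 directly. The helper get_value is hypothetical, and the snippet assumes the XML carries no namespace declaration, matching the printed output:

```r
library(xml2)

# A trimmed copy of the recordedDate element from the Condition above
recorded_date_xml <- paste0(
  '<recordedDate>',
  '<extension url="http://hl7.org/fhir/StructureDefinition/relative-date">',
  '<extension url="event"><valueCodeableConcept><coding>',
  '<system value="http://snomed.info/sct"/><code value="3950001"/><display value="Birth"/>',
  '</coding></valueCodeableConcept></extension>',
  '<extension url="relationship"><valueCode value="after"/></extension>',
  '<extension url="offset"><valueDuration>',
  '<value value="7004"/><unit value="days"/>',
  '<system value="http://unitsofmeasure.org"/><code value="d"/>',
  '</valueDuration></extension>',
  '</extension>',
  '</recordedDate>'
)

doc <- read_xml(recorded_date_xml)

# Hypothetical helper: the "value" attribute of the first node matching an XPath
get_value <- function(node, xpath) xml_attr(xml_find_first(node, xpath), "value")

# The outer relative-date extension, then its three sub-extensions
rel <- xml_find_first(
  doc,
  "./extension[@url='http://hl7.org/fhir/StructureDefinition/relative-date']"
)
event        <- get_value(rel, "./extension[@url='event']/valueCodeableConcept/coding/display")
relationship <- get_value(rel, "./extension[@url='relationship']/valueCode")
offset_value <- as.numeric(get_value(rel, "./extension[@url='offset']/valueDuration/value"))
offset_unit  <- get_value(rel, "./extension[@url='offset']/valueDuration/unit")

cat(event, relationship, offset_value, offset_unit, "\n")
```

The XPath expressions mirror the element nesting one level at a time, which is the same idea used when describing columns for fhir_crack.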
Now let’s put this into a data frame:
🖐 Fill in the blank parts of the XPath query to extract the value and units.
table_desc_condition_glioma <- fhir_table_description(
resource = "Condition",
cols = c(
id = "id",
patient_id = "subject/reference",
recorded_duration = str_c(
"recordedDate",
"/extension[@url=\"http://hl7.org/fhir/StructureDefinition/relative-date\"]",
"/extension[@url=\"offset\"]",
"/valueDuration",
"/value"
),
recorded_duration_units = str_c(
"recordedDate",
"/extension[@url=\"http://hl7.org/fhir/StructureDefinition/relative-date\"]",
"/extension[@url=\"offset\"]",
"/valueDuration",
"/unit"
)
)
)
df_condition_glioma <- fhir_crack(bundles = bundle, design = table_desc_condition_glioma, verbose = 0)
## Warning in fhir_crack(bundles = bundle, design = table_desc_condition_glioma, : In the cols element of the design, you specified XPath expressions containing '//' which point to an arbitrary level in the resource.
## This can result in unexpected behaviour, e.g. when the searched element appears on different levels of the resource.
## We strongly advise to only use the fully specified relative XPath in the cols element, e.g. 'ingredient/strength/numerator/code' instead of search paths like '//code'.
## This warning is thrown for the following data.frame descriptions: recorded_duration, recorded_duration_units
df_condition_glioma
Note: for data aggregated from multiple sources, you may encounter data in very different forms. For the dataset we are working with in this step, we can safely assume that all recordedDate extensions will be of this form if present: relative to birth, after birth, and recorded in days.
Given this assumption, convert the recorded_duration column into age in years:
df_condition_glioma <- df_condition_glioma %>%
mutate(
onsetAgeInYears = as.numeric(recorded_duration) / 365
)
df_condition_glioma
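The conversion above relies on the assumption that every offset is recorded in days. When that assumption is less certain, a defensive variant might check the unit before dividing. The helper below is a hypothetical sketch (days_to_years is not a fhircrackr or tidyverse function), and the 365.25 default is an assumption that averages in leap years:

```r
# Hypothetical helper: convert day offsets to years, refusing unexpected units.
days_to_years <- function(value, unit, days_per_year = 365.25) {
  value <- as.numeric(value)
  # Missing units are tolerated (the value will be NA as well in that case)
  ok <- is.na(unit) | unit %in% c("d", "day", "days")
  if (!all(ok)) {
    stop("Unexpected duration unit(s): ", paste(unique(unit[!ok]), collapse = ", "))
  }
  value / days_per_year
}

# Toy values (invented for illustration); fhir_crack returns character columns
toy_value <- c("7305", "3652.5", NA)
toy_unit  <- c("days", "day", NA)
print(days_to_years(toy_value, toy_unit))
```

Failing loudly on an unexpected unit is a design choice: for aggregated data it is usually safer than silently treating months or years as days.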
Now let’s graph the ages with a basic histogram:
ggplot(df_condition_glioma, aes(onsetAgeInYears)) +
geom_histogram(binwidth = 1)
## Warning: Removed 1 rows containing non-finite values (stat_bin).
Now let’s go back to our selected Research Study and see how we can get the Conditions for those Patients in the study. We’ve seen before that doubly-nested references may not work, so instead we can combine multiple approaches as we saw in section 2.2, to fetch Patients by ResearchStudy, and then include their diagnosed Conditions.
(Note: this is the same query we did back in Section 2.2.)
request <- fhir_url(url = fhir_server, resource = "Patient", parameters = list("_has:ResearchSubject:individual:study" = "76758", "_revinclude" = "Condition:subject"))
bundle <- fhir_search(request = request, max_bundles = 1)
## Starting download of 1 bundles of resource type Patient from FHIR base URL https://kf-api-fhir-service.kidsfirstdrc.org.
## Patched 'get_bundle' in use...
##
## Download completed. Number of downloaded bundles was limited to 1 bundles, this is less than the total number of bundles available.
# Can use the same table description as we set up above
df_condition_study <- fhir_crack(bundles = bundle, design = table_desc_condition_glioma, verbose = 0)
## Warning in fhir_crack(bundles = bundle, design = table_desc_condition_glioma, : In the cols element of the design, you specified XPath expressions containing '//' which point to an arbitrary level in the resource.
## This can result in unexpected behaviour, e.g. when the searched element appears on different levels of the resource.
## We strongly advise to only use the fully specified relative XPath in the cols element, e.g. 'ingredient/strength/numerator/code' instead of search paths like '//code'.
## This warning is thrown for the following data.frame descriptions: recorded_duration, recorded_duration_units
df_condition_study
Note that this query also gets the Patient resources, but we don’t need these for our analysis so we can ignore them.
Not all Conditions may have a recordedDate, so filter to just those that do and convert to age at onset:
df_condition_study <- df_condition_study %>%
mutate(
recorded_duration = as.numeric(recorded_duration)
) %>%
filter(
!is.na(recorded_duration)
) %>%
mutate(
onsetAgeInYears = recorded_duration / 365
)
Now let’s graph the ages again with a basic histogram:
ggplot(df_condition_study, aes(onsetAgeInYears)) +
geom_histogram(binwidth = 1)
Our final step in this exercise will be to reproduce the Overall Survival graph. The data requirements for this graph build on top of the previous steps, so now we need to know the relationship between date of death, or last recorded survival, and date of onset.
As before, Kids First data has been deidentified so there generally are no absolute dates, but relative dates are enough as long as there is a common reference point. Fortunately most of KF uses dates relative to birth or enrollment into a clinical trial.
First let’s see how KF reports death information. One possibility is in the Patient.deceased[x] field, so let’s see if anything on the server has that populated.
request <- fhir_url(url = fhir_server, resource = "Patient", parameters = c("deceased" = "true"))
bundle <- fhir_search(request = request, max_bundles = 1)
## Starting download of 1 bundles of resource type Patient from FHIR base URL https://kf-api-fhir-service.kidsfirstdrc.org.
## Patched 'get_bundle' in use...
##
## Download completed. All available bundles were downloaded.
# Show total records returned by the query
xml2::xml_find_first(x = bundle[[1]], xpath = "./total") %>%
paste0 %>%
cat
## <total value="0"/>
Looks like that’s a no. That’s fine, there are other options. We’ll spare the reader the full exploration process, but we know that in this case, Clinical Status of “Alive” or “Dead” is captured as an Observation with SNOMED code “263493007”. Observations can be thought of as a clinical question of sorts, where the question is captured as the code and the answer is captured as the value.
📘 Read more about the Observation resource
Let’s look at an example of one of these:
request <- fhir_url(url = fhir_server, resource = "Observation", parameters = c("code" = "263493007"))
bundle <- fhir_search(request = request, max_bundles = 1)
## Starting download of 1 bundles of resource type Observation from FHIR base URL https://kf-api-fhir-service.kidsfirstdrc.org.
## Patched 'get_bundle' in use...
##
## Download completed. Number of downloaded bundles was limited to 1 bundles, this is less than the total number of bundles available.
xml2::xml_find_first(x = bundle[[1]], xpath = "./entry[1]/resource") %>%
paste0 %>%
cat
## <resource>
## <Observation>
## <id value="22368"/>
## <meta>
## <versionId value="2"/>
## <lastUpdated value="2021-11-16T08:17:35.829+00:00"/>
## <source value="#oBSdLxVf5YJRPPRO"/>
## <profile value="https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/vital-status"/>
## <tag>
## <code value="SD_ZXJFFMEF"/>
## </tag>
## </meta>
## <identifier>
## <system value="https://kf-api-dataservice.kidsfirstdrc.org/outcomes/"/>
## <value value="OC_NTDP26AN"/>
## </identifier>
## <identifier>
## <system value="urn:kids-first:unique-string"/>
## <value value="Observation-SD_ZXJFFMEF-OC_NTDP26AN"/>
## </identifier>
## <status value="final"/>
## <code>
## <coding>
## <system value="http://snomed.info/sct"/>
## <code value="263493007"/>
## <display value="Clinical status"/>
## </coding>
## <text value="Clinical status"/>
## </code>
## <subject>
## <reference value="Patient/21975"/>
## </subject>
## <effectiveDateTime>
## <extension url="http://hl7.org/fhir/StructureDefinition/relative-date">
## <extension url="event">
## <valueCodeableConcept>
## <coding>
## <system value="http://snomed.info/sct"/>
## <code value="3950001"/>
## <display value="Birth"/>
## </coding>
## </valueCodeableConcept>
## </extension>
## <extension url="relationship">
## <valueCode value="after"/>
## </extension>
## <extension url="offset">
## <valueDuration>
## <value value="19249"/>
## <unit value="day"/>
## <system value="http://unitsofmeasure.org"/>
## <code value="d"/>
## </valueDuration>
## </extension>
## </extension>
## </effectiveDateTime>
## <valueCodeableConcept>
## <coding>
## <system value="http://snomed.info/sct"/>
## <code value="419099009"/>
## <display value="Dead"/>
## </coding>
## <text value="Deceased"/>
## </valueCodeableConcept>
## </Observation>
## </resource>
Note: there are other ways this query could have been run. For example the code system could have been specified like fhir_url(url = fhir_server, resource = "Observation", parameters = c("code" = "http://snomed.info/sct|263493007")).
📘 Read more about token search
As with the Condition resources in previous examples, we see an effectiveDateTime with extensions describing a date relative to birth. There’s our common reference point, so let’s gather all our data and put it together.
In this case, we want Patients, Conditions, and Observations. There are multiple possible approaches we could take here. One possible approach is to make 1 query to find Patients, 1 query to find all Conditions, then 1 query to find all Observations, then join the results and drop any mismatches. In this case let’s see if we can do it in one single query.
🖐 Fill in the query to fetch Patients, Conditions, and Observations, for Patients in our ResearchStudy of interest.
Reminder: the ResearchStudy id = 76758
request <- fhir_url(url = fhir_server, resource = "Patient", parameters = list("_has:ResearchSubject:individual:study" = "76758", "_revinclude" = "Observation:subject", "_revinclude" = "Condition:subject"))
bundle <- fhir_search(request = request, max_bundles = 1)
## Starting download of 1 bundles of resource type Patient from FHIR base URL https://kf-api-fhir-service.kidsfirstdrc.org.
## Patched 'get_bundle' in use...
##
## Download completed. Number of downloaded bundles was limited to 1 bundles, this is less than the total number of bundles available.
Note that this query we just ran selected ALL Conditions and Observations linked to the selected Patients. If we need to filter the results further, we can only do that by post-processing and not within the FHIR query itself.
Fortunately there only appears to be one Observation per Patient in this dataset, so there is no need to filter further.
Let’s break this Bundle out into separate data frames.
We’ll inspect an Observation resource first, since this is the first time we’re seeing Observations returned alongside Patients and Conditions.
xml2::xml_find_first(x = bundle[[1]], xpath = "//*[contains (name(), \"Observation\")]") %>%
paste0 %>%
cat
## <Observation>
## <id value="94597"/>
## <meta>
## <versionId value="1"/>
## <lastUpdated value="2021-10-14T21:37:55.941+00:00"/>
## <source value="#gUyoHXQC5nIdnOvI"/>
## <profile value="https://nih-ncpi.github.io/ncpi-fhir-ig/StructureDefinition/vital-status"/>
## </meta>
## <identifier>
## <system value="https://kf-api-dataservice.kidsfirstdrc.org/outcomes/"/>
## <value value="OC_7DSFMZ60"/>
## </identifier>
## <identifier>
## <system value="urn:kids-first:unique-string"/>
## <value value="Observation-SD_BHJXBDQK-OC_7DSFMZ60"/>
## </identifier>
## <status value="final"/>
## <code>
## <coding>
## <system value="http://snomed.info/sct"/>
## <code value="263493007"/>
## <display value="Clinical status"/>
## </coding>
## <text value="Clinical status"/>
## </code>
## <subject>
## <reference value="Patient/76695"/>
## </subject>
## <effectiveDateTime>
## <extension url="http://hl7.org/fhir/StructureDefinition/relative-date">
## <extension url="event">
## <valueCodeableConcept>
## <coding>
## <system value="http://snomed.info/sct"/>
## <code value="3950001"/>
## <display value="Birth"/>
## </coding>
## </valueCodeableConcept>
## </extension>
## <extension url="relationship">
## <valueCode value="after"/>
## </extension>
## <extension url="offset">
## <valueDuration>
## <value value="2522"/>
## <unit value="days"/>
## <system value="http://unitsofmeasure.org"/>
## <code value="d"/>
## </valueDuration>
## </extension>
## </extension>
## </effectiveDateTime>
## <valueCodeableConcept>
## <coding>
## <system value="http://snomed.info/sct"/>
## <code value="438949009"/>
## <display value="Alive"/>
## </coding>
## <text value="Alive"/>
## </valueCodeableConcept>
## </Observation>
To calculate survival time, we have to subtract the onset date from the latest clinical status date (Observation). As with Condition recordedDate, these Observations use a relative date via an extension on effectiveDateTime.
Let’s break that out into a single number. Fortunately the format is exactly the same as before, so we can reuse the same approach we used earlier with Condition.
table_desc_observation <- fhir_table_description(
resource = "Observation",
cols = c(
id = "id",
patient_id = "subject/reference",
effectiveDateTime_duration = str_c(
"effectiveDateTime",
"/extension[@url=\"http://hl7.org/fhir/StructureDefinition/relative-date\"]",
"/extension[@url=\"offset\"]",
"/valueDuration",
"/value"
),
effectiveDateTime_duration_units = str_c(
"effectiveDateTime",
"/extension[@url=\"http://hl7.org/fhir/StructureDefinition/relative-date\"]",
"/extension[@url=\"offset\"]",
"/valueDuration",
"/unit"
),
# Get the code identifying the Observation as well
code = "code/coding/code",
code_display = "code/coding/display",
code_system = "code/coding/system",
# Get the value for the observation
value_code = "valueCodeableConcept/coding/code",
value_display = "valueCodeableConcept/coding/display",
value_system = "valueCodeableConcept/coding/system"
)
)
df_observation <- fhir_crack(bundles = bundle, design = table_desc_observation, verbose = 0)
## Warning in fhir_crack(bundles = bundle, design = table_desc_observation, : In the cols element of the design, you specified XPath expressions containing '//' which point to an arbitrary level in the resource.
## This can result in unexpected behaviour, e.g. when the searched element appears on different levels of the resource.
## We strongly advise to only use the fully specified relative XPath in the cols element, e.g. 'ingredient/strength/numerator/code' instead of search paths like '//code'.
## This warning is thrown for the following data.frame descriptions: effectiveDateTime_duration, effectiveDateTime_duration_units
df_observation
We expect all observations to have code=263493007 (Clinical status). Let’s verify:
df_observation$code %>% freq
## Frequencies
## df_observation$code
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## --------------- ------ --------- -------------- --------- --------------
## 263493007 48 100.00 100.00 100.00 100.00
## <NA> 0 0.00 100.00
## Total 48 100.00 100.00 100.00 100.00
And we expect all observations to have either alive or deceased as the value (stored in valueCodeableConcept):
ctable(df_observation$value_code, df_observation$value_display)
## Cross-Tabulation, Row Proportions
## value_code * value_display
## Data Frame: df_observation
##
## ------------ --------------- ------------- ------------ -------------
## value_display Alive Dead Total
## value_code
## 419099009 0 ( 0.0%) 6 (100.0%) 6 (100.0%)
## 438949009 42 (100.0%) 0 ( 0.0%) 42 (100.0%)
## Total 42 ( 87.5%) 6 ( 12.5%) 48 (100.0%)
## ------------ --------------- ------------- ------------ -------------
We also expect only one observation per Patient – let’s verify:
(df_observation %>% count(patient_id))$n %>% max
## [1] 1
Looks like this is true, so we can use df_observation both to calculate the time under observation and the endpoint for the survival analysis.
For time under observation, we will use the effectiveDateTime_duration variable, which is time since birth. Let’s verify the units are consistent:
df_observation$effectiveDateTime_duration_units %>% freq
## Frequencies
## df_observation$effectiveDateTime_duration_units
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## days 47 100.00 100.00 97.92 97.92
## <NA> 1 2.08 100.00
## Total 48 100.00 100.00 100.00 100.00
If there are any NA values for the units, that means no effectiveDateTime is recorded. Let’s drop any such records for simplicity of this exercise, but for research this would warrant deeper investigation.
df_observation <- df_observation %>%
filter(!is.na(effectiveDateTime_duration_units))
df_observation$effectiveDateTime_duration_units %>% freq
## Frequencies
## df_observation$effectiveDateTime_duration_units
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## days 47 100.00 100.00 100.00 100.00
## <NA> 0 0.00 100.00
## Total 47 100.00 100.00 100.00 100.00
For easier interpretability, let’s change this from days to years:
df_observation <- df_observation %>%
mutate(
# Convert the character column to numeric once, then reuse it
effectiveDateTime_duration = as.numeric(effectiveDateTime_duration),
observationEndAgeInYears = effectiveDateTime_duration / 365.25
)
df_observation
The Condition resource gives us the age at which observation began, so let’s extract what we need:
table_desc_condition <- fhir_table_description(
resource = "Condition",
cols = c(
id = "id",
patient_id = "subject/reference",
recorded_duration = str_c(
"recordedDate",
"/extension[@url=\"http://hl7.org/fhir/StructureDefinition/relative-date\"]",
"/extension[@url=\"offset\"]",
"/valueDuration",
"/value"
),
recorded_duration_units = str_c(
"recordedDate",
"/extension[@url=\"http://hl7.org/fhir/StructureDefinition/relative-date\"]",
"/extension[@url=\"offset\"]",
"/valueDuration",
"/unit"
),
code_code = "code/coding/code",
code_display = "code/text"
)
)
df_condition <- fhir_crack(bundles = bundle, design = table_desc_condition, verbose = 0)
## Warning in fhir_crack(bundles = bundle, design = table_desc_condition, verbose = 0): In the cols element of the design, you specified XPath expressions containing '//' which point to an arbitrary level in the resource.
## This can result in unexpected behaviour, e.g. when the searched element appears on different levels of the resource.
## We strongly advise to only use the fully specified relative XPath in the cols element, e.g. 'ingredient/strength/numerator/code' instead of search paths like '//code'.
## This warning is thrown for the following data.frame descriptions: recorded_duration, recorded_duration_units
df_condition
There are multiple Conditions for each Patient. For the purposes of this analysis, we will assume the Condition with the smallest recorded_duration (i.e., closest to birth) represents the beginning of observed time.
First, let’s sanity check the units:
df_condition$recorded_duration_units %>% freq
## Frequencies
## df_condition$recorded_duration_units
## Type: Character
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## days 81 100.00 100.00 90.00 90.00
## <NA> 9 10.00 100.00
## Total 90 100.00 100.00 100.00 100.00
df_condition_min_recorded_duration <- df_condition %>%
mutate(
recorded_duration = as.numeric(recorded_duration)
) %>%
# Remove rows with null recorded duration
filter(!is.na(recorded_duration_units)) %>%
# Get the minimum recorded duration for each patient_id
group_by(patient_id) %>%
summarize(
min_recorded_duration_years = min(recorded_duration) / 365.25
)
df_condition_min_recorded_duration
Now we can merge with the observations to get our final data frame for input into the survival analysis:
df_survival <- df_observation %>%
left_join(
df_condition_min_recorded_duration,
by = "patient_id"
)
df_survival
Let’s sanity check the two key variables we need for time in observation:
df_survival %>% select(min_recorded_duration_years, observationEndAgeInYears) %>% skim
Name | Piped data |
Number of rows | 47 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
min_recorded_duration_years | 0 | 1 | 8.43 | 6.19 | 0.35 | 2.65 | 7.96 | 13.28 | 22.37 | ▇▇▅▃▂ |
observationEndAgeInYears | 0 | 1 | 11.46 | 6.73 | 0.39 | 6.59 | 11.56 | 16.85 | 30.05 | ▇▇▇▃▁ |
Recall that we checked to make sure there was one row per patient id in df_observation, which had this many rows:
df_observation %>% nrow
## [1] 47
If that matches the number of rows in df_survival, we know that we haven’t introduced any extra rows. And if there are no missing data in the skim() output above, we know we have the data we need.
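The reason the row count is worth checking is that a left join repeats a left-hand row whenever the right-hand table has more than one match for its key. A minimal sketch with invented toy data, using base R merge(..., all.x = TRUE) as a stand-in for left_join:

```r
# Toy data (invented for illustration): one vital-status row per patient
obs <- data.frame(
  patient_id = c("Patient/1", "Patient/2"),
  observationEndAgeInYears = c(10.5, 7.2)
)

# Right-hand side with one row per patient: the row count is preserved
cond_unique <- data.frame(
  patient_id = c("Patient/1", "Patient/2"),
  min_recorded_duration_years = c(8.0, 6.5)
)
joined_ok <- merge(obs, cond_unique, by = "patient_id", all.x = TRUE)

# Right-hand side with a duplicated patient: that patient's row is repeated
cond_dup <- data.frame(
  patient_id = c("Patient/1", "Patient/1", "Patient/2"),
  min_recorded_duration_years = c(8.0, 9.1, 6.5)
)
joined_dup <- merge(obs, cond_dup, by = "patient_id", all.x = TRUE)

print(c(obs = nrow(obs), joined_ok = nrow(joined_ok), joined_dup = nrow(joined_dup)))
```

Summarizing Conditions to one row per patient before joining is what keeps the survival input at one row per patient.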
We can now run the survival analysis:
df_surv_input <- df_survival %>%
select(patient_id, min_recorded_duration_years, observationEndAgeInYears, value_code, value_display) %>%
mutate(
observation_time_years = observationEndAgeInYears - min_recorded_duration_years,
event = case_when(
value_code == "438949009" ~ 0, # Alive
value_code == "419099009" ~ 1, # Dead
TRUE ~ NA_real_ # missing for all other values
)
)
df_surv_input
Let’s sanity check our event variable:
ctable(df_surv_input$event, df_surv_input$value_display)
## Cross-Tabulation, Row Proportions
## event * value_display
## Data Frame: df_surv_input
##
## ------- --------------- ------------- ------------ -------------
## value_display Alive Dead Total
## event
## 0 42 (100.0%) 0 ( 0.0%) 42 (100.0%)
## 1 0 ( 0.0%) 5 (100.0%) 5 (100.0%)
## Total 42 ( 89.4%) 5 ( 10.6%) 47 (100.0%)
## ------- --------------- ------------- ------------ -------------
Looks good! Now we can run the survival analysis and generate a Kaplan-Meier survival curve:
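# Surv() and survfit() come from the survival package; ggsurvplot() from survminer
library(survival)
library(survminer)
surv_obj <- Surv(time = df_surv_input$observation_time_years, event = df_surv_input$event)
fit <- survfit(surv_obj ~ 1, data = df_surv_input)
ggsurvplot(fit, data = df_surv_input)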
We’re only using a small set of patients here so the graph is going to show a wide area of uncertainty. Consider changing the most recent FHIR query above to select 250 records, and running through the steps again to get here and seeing how the result changes. This is left as an exercise for the reader.
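To see how the event indicator drives the curve, here is a small sketch on invented toy data (not Kids First data). Patients with event = 0 are censored: they leave the at-risk set without lowering the survival estimate, while each death multiplies the estimate by (1 - deaths / number at risk):

```r
library(survival)

# Toy data (invented): follow-up time in years and event indicator
# 1 = dead (event observed), 0 = alive at last observation (censored)
toy <- data.frame(
  time  = c(1, 2, 3, 4, 5),
  event = c(1, 0, 1, 0, 0)
)

toy_fit <- survfit(Surv(time, event) ~ 1, data = toy)

# summary() lists only the times at which an event occurred
toy_summary <- summary(toy_fit)
print(data.frame(
  time     = toy_summary$time,
  n_risk   = toy_summary$n.risk,
  n_event  = toy_summary$n.event,
  survival = toy_summary$surv
))

# The same estimate by hand: 5 at risk with 1 death, then 3 at risk with 1 death
manual_surv <- cumprod(c(1 - 1 / 5, 1 - 1 / 3))
```

With so few events, each death produces a large step in the curve, which is also why the confidence band in the plot above is wide.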
Well done! We’ve just walked through eight different sample queries to build out content for four fundamental concepts.
In this exercise, you practiced: