Synthea Synthetic Data Overview
Accessing real healthcare data for research purposes is difficult for many reasons including:
- Data format and processing challenges
- Privacy regulations
- Required approvals from bodies like internal review boards (IRBs)
These factors and more limit what researchers can do with real healthcare data.
Synthetic data is an alternative to real healthcare data that avoids these challenges. Synthetic data is artificially generated, by computer or by hand, rather than collected from the real world. It is a good option when using real healthcare data isn’t feasible due to privacy, cost, or other restrictions.
Researchers often use real healthcare data that has had personal identifiers removed. This includes:
- Deidentified data
- Anonymized data
- Pseudonymized data
Because this is real data, it is valuable for research. However, this data also carries the risk of re-identification.1
In contrast, synthetic data is constructed so there is no privacy risk. When no individual’s data was used to create a dataset, no individual’s data can be in the dataset.
1 Synthea
Synthea™ is a synthetic data generator that models the life and medical history of synthetic patients. It creates realistic, but not real, synthetic electronic health records. The records are intended to be realistic at the individual level and population level.
Synthea is open source and built from publicly available information, so the resulting records are free of cost and free of privacy restrictions.
Synthea starts with demographic information for a region based on the US Census. Using these demographics, Synthea randomly creates individuals with realistic race, sex, target age, etc., for the region.
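The demographic step can be pictured as weighted random sampling. The sketch below is only an illustration of that idea: the `REGION_DEMOGRAPHICS` table, its weights, and the `sample_person` helper are hypothetical and are not taken from Synthea’s code or from Census data.

```python
import random

# Hypothetical demographic weights for one region. Synthea derives its
# distributions from US Census data; these numbers are made up.
REGION_DEMOGRAPHICS = {
    "sex": {"M": 0.49, "F": 0.51},
    "race": {"white": 0.70, "black": 0.10, "asian": 0.08, "other": 0.12},
    "target_age": {(0, 17): 0.20, (18, 64): 0.60, (65, 100): 0.20},
}


def sample_person(rng: random.Random) -> dict:
    """Draw one synthetic individual's demographics by weighted random choice."""
    person = {}
    for attribute, weights in REGION_DEMOGRAPHICS.items():
        options = list(weights.keys())
        probabilities = list(weights.values())
        person[attribute] = rng.choices(options, weights=probabilities, k=1)[0]
    # Turn the sampled age bracket into a concrete target age.
    low, high = person["target_age"]
    person["target_age"] = rng.randint(low, high)
    return person


rng = random.Random(123)  # fixed seed so repeated runs give the same population
population = [sample_person(rng) for _ in range(1000)]
```

Seeding the random number generator makes the sampled population reproducible, which mirrors the person and population seeds that appear in Synthea’s exported records.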
Synthea simulates each individual independently from birth until their death or the current day. As each individual lives out their synthetic life, they flow through disease modules that represent the progression and treatment of various diseases. Disease modules are built from publicly available incidence and prevalence statistics, along with care guidelines from medical institutions. No real person’s data is ever used to create a Synthea module.
Once the simulation is complete, the patient record is exported into industry-standard formats such as FHIR®, C-CDA®, CSV, or plain text.
{
"resourceType": "Patient",
"id": "2497ee24-c7c3-5d9a-3425-85da2e9e8b23",
"meta": {
"profile": [ "http://hl7.org/fhir/us/core/StructureDefinition/us-core-patient" ]
},
"text": {
"status": "generated",
"div": "<div xmlns=\"http://www.w3.org/1999/xhtml\">Generated by <a href=\"https://github.com/synthetichealth/synthea\">Synthea</a>.Version identifier: v3.1.0-354-g3a6a93487\n . Person seed: -2317076407365535282 Population seed: 123</div>"
},
"extension": [ {
"url": "http://hl7.org/fhir/us/core/StructureDefinition/us-core-race",
"extension": [ {
"url": "ombCategory",
"valueCoding": {
"system": "urn:oid:2.16.840.1.113883.6.238",
"code": "2106-3",
"display": "White"
}
}, {
"url": "text",
"valueString": "White"
} ]
}, {
"url": "http://hl7.org/fhir/us/core/StructureDefinition/us-core-ethnicity",
"extension": [ {
"url": "ombCategory",
"valueCoding": {
"system": "urn:oid:2.16.840.1.113883.6.238",
"code": "2186-5",
"display": "Not Hispanic or Latino"
}
}, {
"url": "text",
"valueString": "Not Hispanic or Latino"
} ]
}, {
"url": "http://hl7.org/fhir/StructureDefinition/patient-mothersMaidenName",
"valueString": "Nadine465 Wunsch504"
}, {
"url": "http://hl7.org/fhir/us/core/StructureDefinition/us-core-birthsex",
"valueCode": "M"
}, {
"url": "http://hl7.org/fhir/StructureDefinition/patient-birthPlace",
"valueAddress": {
"city": "Boston",
"state": "Massachusetts",
"country": "US"
}
}, {
"url": "http://synthetichealth.github.io/synthea/disability-adjusted-life-years",
"valueDecimal": 0.0
}, {
"url": "http://synthetichealth.github.io/synthea/quality-adjusted-life-years",
"valueDecimal": 18.0
} ],
"identifier": [ {
"system": "https://github.com/synthetichealth/synthea",
"value": "2497ee24-c7c3-5d9a-3425-85da2e9e8b23"
}, {
"type": {
"coding": [ {
"system": "http://terminology.hl7.org/CodeSystem/v2-0203",
"code": "MR",
"display": "Medical Record Number"
} ],
"text": "Medical Record Number"
},
"system": "http://hospital.smarthealthit.org",
"value": "2497ee24-c7c3-5d9a-3425-85da2e9e8b23"
}, {
"type": {
"coding": [ {
"system": "http://terminology.hl7.org/CodeSystem/v2-0203",
"code": "SS",
"display": "Social Security Number"
} ],
"text": "Social Security Number"
},
"system": "http://hl7.org/fhir/sid/us-ssn",
"value": "999-32-6148"
}, {
"type": {
"coding": [ {
"system": "http://terminology.hl7.org/CodeSystem/v2-0203",
"code": "DL",
"display": "Driver's license number"
} ],
"text": "Driver's license number"
},
"system": "urn:oid:2.16.840.1.113883.4.3.25",
"value": "S99930905"
} ],
"name": [ {
"use": "official",
"family": "Stoltenberg489",
"given": [ "Mitchell808" ],
"prefix": [ "Mr." ]
} ],
"telecom": [ {
"system": "phone",
"value": "555-726-6485",
"use": "home"
} ],
"gender": "male",
"birthDate": "2004-05-11",
"address": [ {
"extension": [ {
"url": "http://hl7.org/fhir/StructureDefinition/geolocation",
"extension": [ {
"url": "latitude",
"valueDecimal": 42.40293333299843
}, {
"url": "longitude",
"valueDecimal": -71.68746648659892
} ]
} ],
"line": [ "352 Bailey Neck Apt 40" ],
"city": "Clinton",
"state": "MA",
"postalCode": "01510",
"country": "US"
} ],
"maritalStatus": {
"coding": [ {
"system": "http://terminology.hl7.org/CodeSystem/v3-MaritalStatus",
"code": "S",
"display": "Never Married"
} ],
"text": "Never Married"
},
"multipleBirthBoolean": false,
"communication": [ {
"language": {
"coding": [ {
"system": "urn:ietf:bcp:47",
"code": "en-US",
"display": "English (United States)"
} ],
"text": "English (United States)"
}
} ]
}
{
"resourceType": "Observation",
"id": "f83286e0-5797-e51d-a7e6-6708d8085623",
"status": "final",
"category": [ {
"coding": [ {
"system": "http://terminology.hl7.org/CodeSystem/observation-category",
"code": "laboratory",
"display": "Laboratory"
} ]
} ],
"code": {
"coding": [ {
"system": "http://loinc.org",
"code": "4548-4",
"display": "Hemoglobin A1c/Hemoglobin.total in Blood"
} ],
"text": "Hemoglobin A1c/Hemoglobin.total in Blood"
},
"subject": {
"reference": "urn:uuid:3daf29a9-f7b1-9d9f-45ba-4be258308a75"
},
"encounter": {
"reference": "urn:uuid:91fe93c0-52ae-98bc-e14c-29df89c8119d"
},
"effectiveDateTime": "2013-10-05T08:13:20-04:00",
"issued": "2013-10-05T08:13:20.014-04:00",
"valueQuantity": {
"value": 6.38,
"unit": "%",
"system": "http://unitsofmeasure.org",
"code": "%"
}
}
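Because the exported resources are plain JSON, they can be read with ordinary dictionary access. The following is a minimal sketch, assuming trimmed copies of the Patient and Observation resources shown above; the helper names (`find_extension`, `race_text`, `official_name`, `loinc_value`) are hypothetical and not part of Synthea or any FHIR library.

```python
# Trimmed copies of the resources shown above, written as Python dicts.
patient = {
    "resourceType": "Patient",
    "extension": [
        {
            "url": "http://hl7.org/fhir/us/core/StructureDefinition/us-core-race",
            "extension": [
                {"url": "ombCategory",
                 "valueCoding": {"code": "2106-3", "display": "White"}},
                {"url": "text", "valueString": "White"},
            ],
        },
        {
            "url": "http://hl7.org/fhir/us/core/StructureDefinition/us-core-birthsex",
            "valueCode": "M",
        },
    ],
    "name": [{"use": "official", "family": "Stoltenberg489",
              "given": ["Mitchell808"]}],
    "birthDate": "2004-05-11",
}

observation = {
    "resourceType": "Observation",
    "code": {"coding": [{"system": "http://loinc.org", "code": "4548-4"}]},
    "valueQuantity": {"value": 6.38, "unit": "%"},
}

US_CORE = "http://hl7.org/fhir/us/core/StructureDefinition/"


def find_extension(element, url):
    """Return the first extension on `element` with the given URL, or None."""
    for ext in element.get("extension", []):
        if ext.get("url") == url:
            return ext
    return None


def race_text(patient_resource):
    """Read the human-readable race string from the US Core race extension."""
    race = find_extension(patient_resource, US_CORE + "us-core-race")
    text = find_extension(race, "text") if race else None
    return text.get("valueString") if text else None


def official_name(patient_resource):
    """Join given and family names of the entry marked as 'official'."""
    for name in patient_resource.get("name", []):
        if name.get("use") == "official":
            return " ".join(name.get("given", []) + [name.get("family", "")]).strip()
    return None


def loinc_value(observation_resource, loinc_code):
    """Return (value, unit) if the Observation carries the given LOINC code."""
    codings = observation_resource.get("code", {}).get("coding", [])
    for coding in codings:
        if coding.get("system") == "http://loinc.org" and coding.get("code") == loinc_code:
            quantity = observation_resource.get("valueQuantity", {})
            return quantity.get("value"), quantity.get("unit")
    return None
```

Extensions are matched by URL rather than by position, since the order of entries in the `extension` array is not guaranteed.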
This page gives an overview of Synthea’s key features. For more information, visit the Synthea wiki on GitHub.
2 Generic Modules
At Synthea’s core is a set of disease modules, representing the progression and treatment of various conditions. Below is a small snippet of the Appendicitis module.
Disease modules are state transition machines where each individual flows through the modules based on logical conditions and weighted randomness. Behind the scenes, modules are stored as JavaScript Object Notation (JSON) files, but nearly all users view or edit modules exclusively using a graphical interface called the Synthea Module Builder.
Every synthetic patient starts in each module’s Initial state at birth and immediately begins progressing through the module’s states. Each state represents a point where something happens. There are two broad categories of states.
Control states drive the flow of a patient through the module, for example:
- Delay: Wait a certain amount of time before progressing, commonly used to represent how the risk of certain conditions changes with age.
- Guard: Wait until given criteria become true before progressing.
Clinical states add entries to a patient’s health record, for example:
- ConditionOnset: Represent the spot where the patient acquires a given condition, not necessarily where it is diagnosed.
- Procedure: Represent the point in time in a healthcare encounter that a procedure is performed.
For a full list of state types, see the Synthea wiki.
Each state has a transition, which points to the state the patient will progress to next:
- Direct transitions: Point to a single state.
- Distributed transitions: Point to multiple states, each with a weighted probability. A patient progresses to a randomly chosen state.
- Conditional transitions: Include logical rules showing which path to follow.
- Complex transitions: Are a combination of conditional and distributed transitions.
Modules will run until either the simulation ends (at patient death or when it reaches the current date) or until the module reaches a Terminal state.2
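The state and transition concepts above can be sketched as a small state machine. This is a toy illustration under stated assumptions, not Synthea’s engine or module schema: `TOY_MODULE`, its field names (`to`, `weight`, `min_age`, `label`), the yearly time step, and the `next_state` and `run_module` helpers are all hypothetical simplifications.

```python
import random

# A toy module loosely shaped like a JSON module file. State names,
# weights, and field names are illustrative only.
TOY_MODULE = {
    "Initial": {"type": "Initial", "direct_transition": "Age_Guard"},
    "Age_Guard": {"type": "Guard", "min_age": 5, "direct_transition": "Risk_Check"},
    "Risk_Check": {
        "type": "Simple",
        "distributed_transition": [
            {"to": "Onset", "weight": 0.1},
            {"to": "Wait_A_Year", "weight": 0.9},
        ],
    },
    "Wait_A_Year": {"type": "Delay", "years": 1, "direct_transition": "Risk_Check"},
    "Onset": {
        "type": "ConditionOnset",
        "label": "example condition",
        "conditional_transition": [
            {"min_age": 18, "to": "Adult_Procedure"},
            {"min_age": 0, "to": "Pediatric_Procedure"},
        ],
    },
    "Adult_Procedure": {"type": "Procedure", "label": "adult procedure",
                        "direct_transition": "Terminal"},
    "Pediatric_Procedure": {"type": "Procedure", "label": "pediatric procedure",
                            "direct_transition": "Terminal"},
    "Terminal": {"type": "Terminal"},
}


def next_state(state, age, rng):
    """Pick the next state name from whichever transition the state defines."""
    if "direct_transition" in state:
        return state["direct_transition"]
    if "distributed_transition" in state:
        options = state["distributed_transition"]
        names = [option["to"] for option in options]
        weights = [option["weight"] for option in options]
        return rng.choices(names, weights=weights, k=1)[0]
    if "conditional_transition" in state:
        # Rules are checked in order; the first rule that matches wins.
        for rule in state["conditional_transition"]:
            if age >= rule["min_age"]:
                return rule["to"]
    raise ValueError("state has no usable transition")


def run_module(module, end_age, rng):
    """Walk one patient through the module in yearly steps and return the record."""
    record = []
    name, age = "Initial", 0
    while age <= end_age:
        state = module[name]
        kind = state["type"]
        if kind == "Terminal":
            break
        if kind == "Guard" and age < state["min_age"]:
            age += 1  # criteria not met yet: stay in this state for another year
            continue
        if kind == "Delay":
            age += state["years"]
        elif kind in ("ConditionOnset", "Procedure"):
            record.append((age, kind, state["label"]))
        name = next_state(state, age, rng)
    return record


record = run_module(TOY_MODULE, end_age=80, rng=random.Random(1))
```

The loop ends for the same two reasons described above: the simulated patient reaches the end of the simulation (`end_age` here), or the module reaches its Terminal state.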
Combining these simple concepts allows module developers to build robust and detailed models of disease progression and treatment.
Because Synthea is open source and accepts contributions from a global user base, the level of detail varies across modules. For instance, the Appendicitis module was the first to be created, and its level of detail is minimal. On the other hand, the COVID-19 module and its submodules were designed to replicate the disease’s progression as closely as possible, and together they are probably the largest and most detailed module.
Further, the number of disease modules is limited. Early efforts focused on the “top ten” causes of premature death and reasons people see their primary care provider. Later contributions have added a large number of modules representing common conditions, but rarer and more complex conditions may not be represented.
The Synthea community encourages and welcomes users to create new modules representing conditions of interest or to improve the detail and realism of existing modules.
You can view, modify, and create Synthea modules with no programming experience using the Synthea Module Builder. For more information, read Customizing Synthea. There is also a short video introduction to the Module Builder, and a tutorial on the Synthea wiki.
3 FHIR Resources
Synthea generates basic FHIR resources: it includes required fields but rarely populates optional fields. If you require fields that Synthea doesn’t populate, you can customize Synthea to add those fields. Read Customizing Synthea for more information.
By default, Synthea exports one file per patient, as a Bundle with type: transaction. This Bundle contains a single Patient resource as the first entry, followed by other patient-specific resources such as Conditions, Observations, Procedures, etc., roughly grouped by Encounter in chronological order. Synthea exports Organizations and Practitioners separately since these resources may be referenced by multiple patients’ resources. Synthea may also be configured to export FHIR Bulk Data.
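The Bundle layout described here can be explored with a few lines of Python. This is a minimal sketch, assuming a tiny hand-written Bundle in place of a real exported file; the `BUNDLE_JSON` contents and the `summarize_bundle` helper are hypothetical, and real Synthea bundles contain many more entries and fields.

```python
import json
from collections import Counter, defaultdict

# A tiny stand-in for one exported patient file (made-up IDs).
BUNDLE_JSON = """
{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    {"fullUrl": "urn:uuid:p1",
     "resource": {"resourceType": "Patient", "id": "p1"}},
    {"fullUrl": "urn:uuid:e1",
     "resource": {"resourceType": "Encounter", "id": "e1"}},
    {"fullUrl": "urn:uuid:c1",
     "resource": {"resourceType": "Condition", "id": "c1",
                  "encounter": {"reference": "urn:uuid:e1"}}},
    {"fullUrl": "urn:uuid:o1",
     "resource": {"resourceType": "Observation", "id": "o1",
                  "encounter": {"reference": "urn:uuid:e1"}}},
    {"fullUrl": "urn:uuid:e2",
     "resource": {"resourceType": "Encounter", "id": "e2"}},
    {"fullUrl": "urn:uuid:o2",
     "resource": {"resourceType": "Observation", "id": "o2",
                  "encounter": {"reference": "urn:uuid:e2"}}}
  ]
}
"""


def summarize_bundle(bundle):
    """Return the Patient, per-type resource counts, and resources per encounter."""
    if bundle.get("resourceType") != "Bundle":
        raise ValueError("not a FHIR Bundle")
    resources = [entry["resource"] for entry in bundle.get("entry", [])]
    # The Patient is expected to be the first entry in the Bundle.
    patient = resources[0] if resources else None
    type_counts = Counter(resource["resourceType"] for resource in resources)
    by_encounter = defaultdict(list)
    for resource in resources:
        reference = resource.get("encounter", {}).get("reference")
        if reference:
            by_encounter[reference].append(resource["resourceType"])
    return patient, type_counts, dict(by_encounter)


bundle = json.loads(BUNDLE_JSON)
patient, type_counts, by_encounter = summarize_bundle(bundle)
```

To apply the same function to an exported file, the string could be replaced with the file’s contents, for example `json.load(open(path))` for a path of your choosing.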
As of April 2023, Synthea can produce the following resource types:
AllergyIntolerance
Bundle
CarePlan
CareTeam
Claim
Condition
Coverage
Device
DiagnosticReport
DocumentReference
Encounter
ExplanationOfBenefit
Goal
ImagingStudy
Immunization
Location
Medication
MedicationRequest
MedicationAdministration
Observation
Organization
Patient
Practitioner
PractitionerRole
Procedure
Provenance
Note that not all patient records will contain instances of every resource type, and certain resource types will only be produced if certain settings are enabled. See Customizing Synthea for more information on settings.
4 Pre-generated Datasets
Instead of running Synthea yourself, you can use a pre-generated dataset. Pre-generated datasets are available at the following locations:
- https://synthea.mitre.org/downloads
  - Centralized location for datasets created by the Synthea core development team
- https://confluence.hl7.org/display/COD/mCODE+Test+Data
  - Sample cancer patients with data conformant to the mCODE FHIR IG
- https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QDXLWR
  - 10,000 synthetic Medicare patients spanning the entire United States
References
Footnotes
1. Perhaps the most well-known example was an instance in 1997 where Latanya Sweeney re-identified the record belonging to then-Governor of Massachusetts William Weld from a dataset where identifiers had been removed. See Ohm (2009) for details.
2. Terminal here means “the end of this module”, not “the patient has a terminal condition and died”. For instance, the Appendicitis module terminates after the patient has an appendectomy. Compare to the Sore Throat module, which does not have a Terminal state since people are always at risk of common viral conditions that present as sore throat.