The CDMConnector package provides tools for working with OMOP Common
Data Model (CDM) tables using familiar dplyr syntax. After connecting to
a database containing data mapped to the OMOP CDM, we can use
cdm_from_con() from CDMConnector to create a CDM reference.
This CDM reference is a single object that contains table references
along with specific metadata.
For this example, we’ll use the Eunomia data contained in a duckdb database. First, if you haven’t already done so, download the data. Once it is downloaded, add its path to your .Renviron file.
library(CDMConnector)
library(dplyr)
downloadEunomiaData(
  pathToData = here::here(), # change to the location you want to save the data
  overwrite = TRUE
)
# once downloaded, save path to your Renviron: EUNOMIA_DATA_FOLDER="......"
# (and then restart R)
With the Eunomia data now downloaded, we can connect to the database and create our reference.
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = eunomia_dir())
cdm <- cdm_from_con(con, cdm_schema = "main")
cdm
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: person, observation_period, visit_occurrence, condition_occurrence, drug_exposure, procedure_occurrence, measurement, observation, death, location, care_site, provider, drug_era, dose_era, condition_era, cdm_source, concept, vocabulary, concept_relationship, concept_synonym, concept_ancestor, drug_strength
Individual CDM table references can be accessed using $ and piped to dplyr verbs.
cdm$person %>%
  glimpse()
#> Rows: ??
#> Columns: 18
#> Database: DuckDB 0.6.1 [root@Darwin 21.6.0:R 4.2.2//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpwF0Kwa/kwcvrphg]
#> $ person_id <dbl> 6, 123, 129, 16, 65, 74, 42, 187, 18, 111,…
#> $ gender_concept_id <dbl> 8532, 8507, 8507, 8532, 8532, 8532, 8532, …
#> $ year_of_birth <dbl> 1963, 1950, 1974, 1971, 1967, 1972, 1909, …
#> $ month_of_birth <dbl> 12, 4, 10, 10, 3, 1, 11, 7, 11, 5, 8, 3, 3…
#> $ day_of_birth <dbl> 31, 12, 7, 13, 31, 5, 2, 23, 17, 2, 19, 13…
#> $ birth_datetime <dttm> 1963-12-31, 1950-04-12, 1974-10-07, 1971-…
#> $ race_concept_id <dbl> 8516, 8527, 8527, 8527, 8516, 8527, 8527, …
#> $ ethnicity_concept_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ location_id <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ provider_id <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ care_site_id <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ person_source_value <chr> "001f4a87-70d0-435c-a4b9-1425f6928d33", "0…
#> $ gender_source_value <chr> "F", "M", "M", "F", "F", "F", "F", "M", "F…
#> $ gender_source_concept_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ race_source_value <chr> "black", "white", "white", "white", "black…
#> $ race_source_concept_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ ethnicity_source_value <chr> "west_indian", "italian", "polish", "ameri…
#> $ ethnicity_source_concept_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
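The "Rows: ??" in the output above reflects that the table reference is lazy: dplyr verbs are translated to SQL and run in the database, and rows only reach R when requested. The sketch below illustrates this with a small, made-up person table in an in-memory duckdb database rather than the Eunomia data; the names toy_con, person, and gender_counts are hypothetical, and the dbplyr package is assumed to be installed.

```r
library(DBI)
library(dplyr)

# Toy stand-in for a CDM person table (made-up rows, not Eunomia data)
toy_con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
dbWriteTable(toy_con, "person", data.frame(
  person_id = 1:4,
  gender_concept_id = c(8532, 8507, 8507, 8532),
  year_of_birth = c(1963, 1950, 1974, 1971)
))

# A lazy table reference, playing the role of cdm$person
person <- tbl(toy_con, "person")

# Build a query; nothing is pulled into R at this point
gender_counts <- person %>%
  count(gender_concept_id) %>%
  arrange(gender_concept_id)

# Inspect the SQL that dplyr generated
show_query(gender_counts)

# collect() runs the query and returns a local tibble
gender_counts_local <- collect(gender_counts)

dbDisconnect(toy_con, shutdown = TRUE)
```

The same pattern should apply to any table in the CDM reference: build the query with dplyr verbs, then call collect() only on the (ideally small) result.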
If you do not need references to all tables, you can select
only a subset of tables to include in the CDM reference. The
cdm_tables argument of cdm_from_con() supports the tidyselect
selection language and provides a new selection helper: tbl_group().
cdm_from_con(con, cdm_tables = c("person", "observation_period")) # character vector
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: person, observation_period
cdm_from_con(con, cdm_tables = starts_with("concept")) # tables that start with 'concept'
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: concept, concept_class, concept_relationship, concept_synonym, concept_ancestor
cdm_from_con(con, cdm_tables = contains("era")) # tables that contain the substring 'era'
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: drug_era, dose_era, condition_era
cdm_from_con(con, cdm_tables = matches("person|period")) # regular expression
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: person, observation_period, payer_plan_period
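To see how tidyselect helpers resolve against a set of names independently of any database, the sketch below evaluates a selection with tidyselect::eval_select() against a named vector of table names. This is only an illustration of the selection language, not CDMConnector's internal code; the vector table_names is a hypothetical, shortened list, and the rlang package is assumed to be installed.

```r
library(tidyselect)

# Hypothetical, shortened list of table names; the names attribute is
# what tidyselect matches against
table_names <- c("person", "observation_period", "drug_exposure",
                 "drug_era", "dose_era", "condition_era", "concept")
names(table_names) <- table_names

# Resolve a selection to positions; names() of the result gives the matches
era_sel <- eval_select(rlang::expr(contains("era")), data = table_names)
era_tables <- names(era_sel)

# Selections can be combined, as in other tidyselect interfaces
combined_sel <- eval_select(
  rlang::expr(c(starts_with("drug"), matches("person|period"))),
  data = table_names
)
combined_tables <- names(combined_sel)
```

Here era_tables holds the three era tables, and combined_tables holds the drug tables followed by person and observation_period.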
Predefined sets of tables can also be selected using tbl_group(),
which supports several subsets of the CDM: “all”,
“clinical”, “vocab”, “derived”, and “default”.
# pre-defined groups
cdm_from_con(con, cdm_tables = tbl_group("clinical"))
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: person, observation_period, visit_occurrence, visit_detail, condition_occurrence, drug_exposure, procedure_occurrence, device_exposure, measurement, observation, death, note, note_nlp, specimen, fact_relationship
cdm_from_con(con, cdm_tables = tbl_group("vocab"))
#> # OMOP CDM reference (tbl_duckdb_connection)
#>
#> Tables: concept, vocabulary, domain, concept_class, concept_relationship, relationship, concept_synonym, concept_ancestor, source_to_concept_map, drug_strength
The default set of CDM tables included in a CDM object is:
tbl_group("default")
#> [1] "person" "observation_period" "visit_occurrence"
#> [4] "condition_occurrence" "drug_exposure" "procedure_occurrence"
#> [7] "measurement" "observation" "death"
#> [10] "location" "care_site" "provider"
#> [13] "drug_era" "dose_era" "condition_era"
#> [16] "cdm_source" "concept" "vocabulary"
#> [19] "concept_relationship" "concept_synonym" "concept_ancestor"
#> [22] "drug_strength"
It is common to use one or more cohort tables along with the CDM. We can include existing cohort tables by specifying the schema in which they reside and their name like so:
cdm <- cdm_from_con(con,
                    cdm_tables = c("person", "observation_period"),
                    write_schema = "write_schema",
                    cohort_tables = "cohort")
cdm$cohort
#> # Source: table<write_schema.cohort> [2 x 4]
#> # Database: DuckDB 0.6.1 [root@Darwin 21.6.0:R 4.2.2//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpwF0Kwa/kwcvrphg]
#> cohort_definition_id subject_id cohort_start_date cohort_end_date
#> <int> <int> <date> <date>
#> 1 1 1 2023-05-05 2023-05-05
#> 2 1 2 2020-02-03 2020-11-04
There are two ways to extract subsets of the CDM: collect() pulls data into R, and stow() saves the CDM subset to a set of files on disk in parquet, feather, or csv format.
Note that in either case you should think carefully about the data you are extracting and make sure to get only the data that you require (and that will fit into memory!).
local_cdm <- cdm %>%
  collect()
# The cdm tables are now data frames
local_cdm$person[1:4, 1:4]
#> # A tibble: 4 × 4
#> person_id gender_concept_id year_of_birth month_of_birth
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6 8532 1963 12
#> 2 123 8507 1950 4
#> 3 129 8507 1974 10
#> 4 16 8532 1971 10
save_path <- file.path(tempdir(), "tmp")
dir.create(save_path)
cdm %>%
  stow(path = save_path)
list.files(save_path)
#> [1] "cohort.parquet" "observation_period.parquet"
#> [3] "person.parquet"
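Files saved this way can be read back in a later session without a database connection. The sketch below shows the round trip with the arrow package, assuming the stowed parquet files are ordinary parquet files that arrow can read. It writes its own small, made-up person table to a separate folder so that it runs on its own; demo_path and toy_person are hypothetical names.

```r
library(arrow)

# Stand-in for a stowed table: a tiny, made-up person table written to parquet
demo_path <- file.path(tempdir(), "stow_demo")
dir.create(demo_path, showWarnings = FALSE)
toy_person <- data.frame(
  person_id = c(6L, 123L),
  year_of_birth = c(1963L, 1950L)
)
write_parquet(toy_person, file.path(demo_path, "person.parquet"))

# Later: read the file back into R as a data frame
person_local <- read_parquet(file.path(demo_path, "person.parquet"))
```

With real stowed output, the path would point at a file in save_path, such as person.parquet.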
Close the database connection with dbDisconnect(). After
the connection is closed, the cdm object can no longer be used.
DBI::dbDisconnect(con, shutdown = TRUE)