Getting Started

Creating a reference to the OMOP CDM

The CDMConnector package provides tools for working with OMOP Common Data Model (CDM) tables using familiar dplyr syntax. After connecting to a database containing data mapped to the OMOP CDM, we can use cdm_from_con from CDMConnector to create a CDM reference. This CDM reference is a single object that contains table references along with specific metadata.

For this example, we’ll use the Eunomia data contained in a duckdb database. If you haven’t done so already, first download the data, then add the path to the download folder to your .Renviron file.

library(CDMConnector)
library(dplyr)
downloadEunomiaData(
  pathToData = here::here(), # change to the location you want to save the data
  overwrite = TRUE
)
# once downloaded, save path to your Renviron: EUNOMIA_DATA_FOLDER="......"
# (and then restart R)
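
If editing .Renviron is inconvenient, the variable can also be set for the current session only. The following is a minimal sketch using base R. The folder name is a hypothetical placeholder, and the sketch assumes the package reads EUNOMIA_DATA_FOLDER from the environment at the time it is called.

```r
# Hypothetical location for the Eunomia data; replace with your own folder
eunomia_folder <- file.path(tempdir(), "eunomia")
dir.create(eunomia_folder, showWarnings = FALSE, recursive = TRUE)

# Set the variable for this R session only (it is not written to .Renviron)
Sys.setenv(EUNOMIA_DATA_FOLDER = eunomia_folder)

# Confirm the value that R now sees
Sys.getenv("EUNOMIA_DATA_FOLDER")
```

A value set this way is lost when R restarts, so the .Renviron approach remains the more durable option.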

With the Eunomia data now downloaded, we can connect to the database and create our reference.

con <- DBI::dbConnect(duckdb::duckdb(), dbdir = eunomia_dir())
cdm <- cdm_from_con(con, cdm_schema = "main")
cdm
#> # OMOP CDM reference (tbl_duckdb_connection)
#> 
#> Tables: person, observation_period, visit_occurrence, condition_occurrence, drug_exposure, procedure_occurrence, measurement, observation, death, location, care_site, provider, drug_era, dose_era, condition_era, cdm_source, concept, vocabulary, concept_relationship, concept_synonym, concept_ancestor, drug_strength

Individual CDM table references can be accessed using $ and piped to dplyr verbs.

cdm$person %>% 
  glimpse()
#> Rows: ??
#> Columns: 18
#> Database: DuckDB 0.6.1 [root@Darwin 21.6.0:R 4.2.2//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpwF0Kwa/kwcvrphg]
#> $ person_id                   <dbl> 6, 123, 129, 16, 65, 74, 42, 187, 18, 111,…
#> $ gender_concept_id           <dbl> 8532, 8507, 8507, 8532, 8532, 8532, 8532, …
#> $ year_of_birth               <dbl> 1963, 1950, 1974, 1971, 1967, 1972, 1909, …
#> $ month_of_birth              <dbl> 12, 4, 10, 10, 3, 1, 11, 7, 11, 5, 8, 3, 3…
#> $ day_of_birth                <dbl> 31, 12, 7, 13, 31, 5, 2, 23, 17, 2, 19, 13…
#> $ birth_datetime              <dttm> 1963-12-31, 1950-04-12, 1974-10-07, 1971-…
#> $ race_concept_id             <dbl> 8516, 8527, 8527, 8527, 8516, 8527, 8527, …
#> $ ethnicity_concept_id        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ location_id                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ provider_id                 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ care_site_id                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ person_source_value         <chr> "001f4a87-70d0-435c-a4b9-1425f6928d33", "0…
#> $ gender_source_value         <chr> "F", "M", "M", "F", "F", "F", "F", "M", "F…
#> $ gender_source_concept_id    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ race_source_value           <chr> "black", "white", "white", "white", "black…
#> $ race_source_concept_id      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ ethnicity_source_value      <chr> "west_indian", "italian", "polish", "ameri…
#> $ ethnicity_source_concept_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
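
Because each table reference is a lazy database table (note the "Rows: ??" above), dplyr verbs are translated to SQL and run in the database until the result is collected. The sketch below illustrates this with a toy person table in an in-memory duckdb database standing in for cdm$person. The four rows are copied from the output above, and the object names are illustrative.

```r
library(dplyr)

# In-memory duckdb database used as a stand-in for the CDM database
toy_con <- DBI::dbConnect(duckdb::duckdb())

# Four rows copied from the glimpse() output above
DBI::dbWriteTable(toy_con, "person", data.frame(
  person_id         = c(6, 123, 129, 16),
  gender_concept_id = c(8532, 8507, 8507, 8532),
  year_of_birth     = c(1963, 1950, 1974, 1971)
))

# Count persons per gender concept; the query runs in the database
# and only the small summary is brought into R by collect()
gender_counts <- tbl(toy_con, "person") %>%
  count(gender_concept_id) %>%
  arrange(gender_concept_id) %>%
  collect()

gender_counts

DBI::dbDisconnect(toy_con, shutdown = TRUE)
```

The same pipeline should work with cdm$person in place of tbl(toy_con, "person").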

Selecting a subset of CDM tables

If you do not need references to all tables, you can select just a subset of tables to include in the CDM reference. The cdm_tables argument of cdm_from_con supports the tidyselect selection language and provides a new selection helper: tbl_group.

cdm_from_con(con, cdm_tables = c("person", "observation_period")) # character vector
#> # OMOP CDM reference (tbl_duckdb_connection)
#> 
#> Tables: person, observation_period
cdm_from_con(con, cdm_tables = starts_with("concept")) # tables that start with 'concept'
#> # OMOP CDM reference (tbl_duckdb_connection)
#> 
#> Tables: concept, concept_class, concept_relationship, concept_synonym, concept_ancestor
cdm_from_con(con, cdm_tables = contains("era")) # tables that contain the substring 'era'
#> # OMOP CDM reference (tbl_duckdb_connection)
#> 
#> Tables: drug_era, dose_era, condition_era
cdm_from_con(con, cdm_tables = matches("person|period")) # regular expression
#> # OMOP CDM reference (tbl_duckdb_connection)
#> 
#> Tables: person, observation_period, payer_plan_period

Predefined sets of tables can also be selected using tbl_group, which supports several subsets of the CDM: “all”, “clinical”, “vocab”, “derived”, and “default”.

# pre-defined groups
cdm_from_con(con, cdm_tables = tbl_group("clinical")) 
#> # OMOP CDM reference (tbl_duckdb_connection)
#> 
#> Tables: person, observation_period, visit_occurrence, visit_detail, condition_occurrence, drug_exposure, procedure_occurrence, device_exposure, measurement, observation, death, note, note_nlp, specimen, fact_relationship
cdm_from_con(con, cdm_tables = tbl_group("vocab")) 
#> # OMOP CDM reference (tbl_duckdb_connection)
#> 
#> Tables: concept, vocabulary, domain, concept_class, concept_relationship, relationship, concept_synonym, concept_ancestor, source_to_concept_map, drug_strength

The default set of CDM tables included in a CDM object is:

tbl_group("default")
#>  [1] "person"               "observation_period"   "visit_occurrence"    
#>  [4] "condition_occurrence" "drug_exposure"        "procedure_occurrence"
#>  [7] "measurement"          "observation"          "death"               
#> [10] "location"             "care_site"            "provider"            
#> [13] "drug_era"             "dose_era"             "condition_era"       
#> [16] "cdm_source"           "concept"              "vocabulary"          
#> [19] "concept_relationship" "concept_synonym"      "concept_ancestor"    
#> [22] "drug_strength"

Including existing cohort tables in the CDM reference

It is common to use one or more cohort tables along with the CDM. We can include existing cohort tables by specifying the schema in which they reside and their name like so:

cdm <- cdm_from_con(con, 
                    cdm_tables = c("person", "observation_period"), 
                    write_schema = "write_schema",
                    cohort_tables = "cohort") 

cdm$cohort
#> # Source:   table<write_schema.cohort> [2 x 4]
#> # Database: DuckDB 0.6.1 [root@Darwin 21.6.0:R 4.2.2//var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpwF0Kwa/kwcvrphg]
#>   cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                  <int>      <int> <date>            <date>         
#> 1                    1          1 2023-05-05        2023-05-05     
#> 2                    1          2 2020-02-03        2020-11-04
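
The example above assumes that a table called cohort already exists in a schema called write_schema. How that table was created is not shown here. As a rough sketch, a table with the same four columns could be set up with plain SQL through DBI. The in-memory database and the connection name are illustrative, and the two rows are copied from the output above.

```r
# In-memory duckdb database used only for illustration
toy_con <- DBI::dbConnect(duckdb::duckdb())

# Create the schema and an empty cohort table with the four columns shown above
DBI::dbExecute(toy_con, "CREATE SCHEMA IF NOT EXISTS write_schema")
DBI::dbExecute(toy_con, "
  CREATE TABLE write_schema.cohort (
    cohort_definition_id INTEGER,
    subject_id           INTEGER,
    cohort_start_date    DATE,
    cohort_end_date      DATE
  )")

# Insert the two example rows
DBI::dbExecute(toy_con, "
  INSERT INTO write_schema.cohort VALUES
    (1, 1, DATE '2023-05-05', DATE '2023-05-05'),
    (1, 2, DATE '2020-02-03', DATE '2020-11-04')")

# Read the table back to check its contents
cohort <- DBI::dbGetQuery(
  toy_con,
  "SELECT * FROM write_schema.cohort ORDER BY subject_id"
)
cohort

DBI::dbDisconnect(toy_con, shutdown = TRUE)
```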

Extracting data

There are two ways to extract subsets of the CDM: collect, which brings the tables into R as local data frames, and stow, which saves them to disk as parquet files.

In either case, think carefully about the data you are extracting and make sure to get only the data you require (and that it will fit into memory!).

local_cdm <- cdm %>% 
  collect()

# The cdm tables are now dataframes
local_cdm$person[1:4, 1:4] 
#> # A tibble: 4 × 4
#>   person_id gender_concept_id year_of_birth month_of_birth
#>       <dbl>             <dbl>         <dbl>          <dbl>
#> 1         6              8532          1963             12
#> 2       123              8507          1950              4
#> 3       129              8507          1974             10
#> 4        16              8532          1971             10

The stow function instead saves each table in the CDM reference as a parquet file in the chosen folder.

save_path <- file.path(tempdir(), "tmp")
dir.create(save_path)

cdm %>% 
  stow(path = save_path)

list.files(save_path)
#> [1] "cohort.parquet"             "observation_period.parquet"
#> [3] "person.parquet"
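
In line with the note above about extracting only what is required, one option is to narrow a table with dplyr verbs before calling collect, so that only the needed rows and columns reach R. This is a minimal sketch that uses a toy in-memory table as a stand-in for cdm$person. The rows are copied from the person output shown earlier, and the object names are illustrative.

```r
library(dplyr)

# In-memory duckdb database standing in for the CDM database
toy_con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(toy_con, "person", data.frame(
  person_id         = c(6, 123, 129, 16),
  gender_concept_id = c(8532, 8507, 8507, 8532),
  year_of_birth     = c(1963, 1950, 1974, 1971)
))

# Filter rows and select columns in the database, then collect the result
recent <- tbl(toy_con, "person") %>%
  filter(year_of_birth >= 1970) %>%
  select(person_id, year_of_birth) %>%
  arrange(person_id) %>%
  collect()

recent

DBI::dbDisconnect(toy_con, shutdown = TRUE)
```

Only the filtered rows and the two selected columns are transferred, which keeps memory use proportional to what the analysis needs.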

Closing connections

Close the database connection with dbDisconnect. Once the connection is closed, the cdm object can no longer be used.

DBI::dbDisconnect(con, shutdown = TRUE)
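
To make sure a connection is closed even when an error occurs, the connect and disconnect steps can be wrapped in a function that registers the disconnect with on.exit. This is a minimal sketch with an in-memory duckdb database. Note that with_duckdb is a hypothetical helper written for this example, not part of CDMConnector.

```r
# Hypothetical helper: opens a connection, runs f(con), and always disconnects
with_duckdb <- function(f, dbdir = ":memory:") {
  con <- DBI::dbConnect(duckdb::duckdb(), dbdir = dbdir)
  # Runs when the function exits, whether normally or because of an error
  on.exit(DBI::dbDisconnect(con, shutdown = TRUE), add = TRUE)
  f(con)
}

# Usage: the connection only exists while the inner function runs
result <- with_duckdb(function(con) {
  DBI::dbGetQuery(con, "SELECT 1 + 1 AS two")
})
result$two
```

Anything that depends on the connection, such as a cdm reference, would need to be created and fully used inside the function passed to the helper.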