Overview

wranglEHR is a data wrangling and cleaning tool for CC-HIC. It is designed to run against the CC-HIC EAV table structure (which at present exists in PostgreSQL and SQLite flavours). We are about to undergo a major rewrite to OHDSI CDM version 6, so this package will be in flux. Please see the R vignettes for further details on how to use the package to perform the most common tasks:

This package is designed to work in concert with inspectEHR which provides data quality evaluation for the CC-HIC.

Installation

# install directly from github with
remotes::install_github("DocEd/wranglEHR")
library(wranglEHR)

Usage

# Connect to the database (will use the internal test db)
ctn <- setup_dummy_db()

# Extract static variables. Rename on the fly.
dtb <- extract_demographics(
  connection = ctn,
  episode_ids = 1:10, # specify for episodes
  code_names = c("NIHR_HIC_ICU_0017", "NIHR_HIC_ICU_0019"),
  rename = c("height", "weight")
)

head(dtb)
#> # A tibble: 6 × 2
#>   episode_id height
#>        <int>  <dbl>
#> 1          1  2.34 
#> 2          2  2.01 
#> 3          3  4.00 
#> 4          4 -0.318
#> 5          5  2.44 
#> # … with 1 more row

# Extract time varying variables. Rename on the fly.
ltb <- extract_timevarying(
  ctn,
  episode_ids = 1:10,
  code_names = "NIHR_HIC_ICU_0108",
  rename = "hr")
#> 3e-04 hours to process
#> WEE! How sublime was that?!

head(ltb)
#> # A tibble: 6 × 3
#>   episode_id  time    hr
#>        <int> <dbl> <int>
#> 1          1     0    91
#> 2          1     1    78
#> 3          1     2   102
#> 4          1     3    94
#> 5          1     4    69
#> # … with 1 more row

# Pull out to any arbitrary temporal resolution and custom define the
# behaviour for information recorded at resolution higher than you are sampling.
# only extract the first 24 hours of data

ltb_2 <- extract_timevarying(
  ctn,
  episode_ids = 1:10,
  code_names = "NIHR_HIC_ICU_0108",
  rename = "hr",
  cadence = 2, # 1 row every 2 hours
  coalesce_rows = mean, # use mean to downsample to our 2 hour cadence
  time_boundaries = c(0, 24)
  )
#> 0.00026 hours to process
#> HUZZAH! How cat's meow was that?!

head(ltb_2)
#> # A tibble: 6 × 3
#>   episode_id  time    hr
#>        <int> <dbl> <dbl>
#> 1          1     0  84.5
#> 2          1     2 102  
#> 3          1     4  81.3
#> 4          1     6  80  
#> 5          1     8  80.3
#> # … with 1 more row

## Don't forget to turn the lights out as you leave.
DBI::dbDisconnect(ctn)

Getting help

If you find a bug, please file a minimal reproducible example on github.


  1. https://www.ohdsi.org/analytic-tools/achilles-for-data-characterization/
  2. Kahn, Michael G.; Callahan, Tiffany J.; Barnard, Juliana; Bauck, Alan E.; Brown, Jeff; Davidson, Bruce N.; Estiri, Hossein; Goerg, Carsten; Holve, Erin; Johnson, Steven G.; Liaw, Siaw-Teng; Hamilton-Lopez, Marianne; Meeker, Daniella; Ong, Toan C.; Ryan, Patrick; Shang, Ning; Weiskopf, Nicole G.; Weng, Chunhua; Zozus, Meredith N.; and Schilling, Lisa (2016) “A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data,” eGEMs (Generating Evidence & Methods to improve patient outcomes): Vol. 4: Iss. 1, Article 18.