epitrix implements small helper functions usefull in infectious disease modelling and epidemics analysis. This vignette provides a quick overview of the package’s features.
The main features of the package include:
gamma_shapescale2mucv
: convert
shape and scale of a Gamma distribution to mean and CV
gamma_mucv2shapescale
: convert mean
and CV of a Gamma distribution to shape and scale
gamma_log_likelihood
: Gamma
log-likelihood using mean and CV
r2R0
: convert growth rate into a
reproduction number
lm2R0_sample
: generates a
distribution of R0 from a log-incidence linear model
fit_disc_gamma
: fits a discretised
Gamma distribution to data (typically useful for describing
delays)
hash_names
: generate unique,
anonymised, reproducible labels from various data fields (e.g. First
name, Last name, Date of birth).
In this example, we simulate data which replicate the serial interval (SI), i.e. the delays between primary and secondary symptom onsets, in Ebola Virus Disease (EVD). We start by converting previously estimates of the mean and standard deviation of the SI (WHO Ebola Response Team (2014) NEJM 371:1481–1495) to the parameters of a Gamma distribution:
library(epitrix)
mu <- 15.3 # mean in days days
sigma <- 9.3 # standard deviation in days
cv <- sigma / mu # coefficient of variation
cv
#> [1] 0.6078431
param <- gamma_mucv2shapescale(mu, cv) # convertion to Gamma parameters
param
#> $shape
#> [1] 2.706556
#>
#> $scale
#> [1] 5.652941
The shape and scale are parameters of a Gamma
distribution we can use to generate delays. However, delays are
typically reported per days, which implies a discretisation (from
continuous time to discrete numbers). We use the package distcrete to
achieve this discretisation. It generates a list of functions, including
one to simulate data ($r
), which we use to simulate 500
delays:
si <- distcrete::distcrete("gamma", interval = 1,
shape = param$shape,
scale = param$scale, w = 0)
si
#> A discrete distribution
#> name: gamma
#> parameters:
#> shape: 2.70655567117586
#> scale: 5.65294117647059
set.seed(1)
x <- si$r(500)
head(x, 10)
#> [1] 8 10 15 28 7 27 32 17 16 4
hist(x, col = "grey", border = "white",
xlab = "Days between primary and secondary onset",
main = "Simulated serial intervals")
x
contains simulated data, for illustrative purpose. In
practice, one would use real data from an ongoing outbreaks. Now we use
fit_disc_gamma
to estimate the parameters of a dicretised
Gamma distribution from the data:
The package incidence can
fit a log-linear model to incidence curves (function fit
),
which produces a growth rate (r). This growth rate can in turn be
translated into a basic reproduction number (R0) using
r2R0
. We illustrate this using simulated Ebola data from
the outbreaks
package, and using the serial interval from the previous example:
library(outbreaks)
library(incidence)
i <- incidence(ebola_sim$linelist$date_of_onset)
i
#> <incidence object>
#> [5888 cases from days 2014-04-07 to 2015-04-30]
#>
#> $counts: matrix with 389 rows and 1 columns
#> $n: 5888 cases in total
#> $dates: 389 dates marking the left-side of bins
#> $interval: 1 day
#> $timespan: 389 days
#> $cumulative: FALSE
f <- fit(i[1:150]) # fit on first 150 days
#> Warning in fit(i[1:150]): 22 dates with incidence of 0 ignored for fitting
plot(i[1:200], fit = f, color = "#9fc2fc")
#> Warning: The `guide` argument in `scale_*()` cannot be `FALSE`. This was deprecated in
#> ggplot2 3.3.4.
#> ℹ Please use "none" instead.
#> ℹ The deprecated feature was likely used in the incidence package.
#> Please report the issue at <https://github.com/reconhub/incidence/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
r2R0(f$info$r, si$d(1:100))
#> [1] 1.348624
r2R0(f$info$r.conf, si$d(1:100))
#> 2.5 % 97.5 %
#> [1,] 1.314055 1.383674
In addition, we can also use the function lm2R0_sample
to generate samples of R0 values compatible with a model fit:
If you want to use labels that will work across different computers,
independent of local encoding and operating systems,
clean_labels
will make your life easier. The function
transforms character strings by replacing diacritic symbols with their
closest alphanumeric matches, setting all characters to lower case, and
replacing various separators with a single, consistent one.
For instance:
x <- " Thîs- is A wêïrD LäBeL .."
x
#> [1] " Thîs- is A wêïrD LäBeL .."
clean_labels(x)
#> [1] "this_is_a_weird_label"
variables <- c("Date.of.ONSET ",
"/ date of hôspitalisation /",
"-DäTÈ--OF___DîSCHARGE-",
"GEndèr/",
" Location. ")
variables
#> [1] "Date.of.ONSET " "/ date of hôspitalisation /"
#> [3] "-DäTÈ--OF___DîSCHARGE-" "GEndèr/"
#> [5] " Location. "
clean_labels(variables)
#> [1] "date_of_onset" "date_of_hospitalisation"
#> [3] "date_of_discharge" "gender"
#> [5] "location"
If you happen to have informative labels in your data that are not
alphanumeric, you will want to protect them with the
protect
argument:
vars <- c("Death in Structure > 4h", "death in Structure < 4h")
clean_labels(vars, protect = "><")
#> [1] "death_in_structure_>_4h" "death_in_structure_<_4h"
If you don’t use the protect = "><"
, the two
variables above would appear to be exactly the same.
hash_names
can be used to generate hashed labels from
linelist data. Based on pre-defined fields, it will generate anonymous
labels. This system has the following desirable features:
given the same input, the output will always be the same, so this encoding system generates labels which can be used by different people and organisations
given different inputs, the output will always be different; even minor differences in input will result in entirely different outputs
given an output, it is very hard to infer the input (it requires hacking skills); if security is challenged, the hashing algorithm can be ‘salted’ to strengthen security
first_name <- c("Jane", "Joe", "Raoul", "Raoul")
last_name <- c("Doe", "Smith", "Dupont", "Dupond")
age <- c(25, 69, 36, 36)
## detailed output by default
hash_names(first_name, last_name, age)
#> label hash_short hash
#> 1 jane_doe_25 6485f2 6485f29654c5a9d55625cd6efeb96d569917e1c272790959ad3fa132c6d51648
#> 2 joe_smith_69 ea1ccc ea1cccce320aa45a0d694ea12c30ff6b4b52c67f69d58b23dad5441ea17c5807
#> 3 raoul_dupont_36 f60676 f60676d1c11ae5badc0e5ec4dfde06eaba817a78f3d54eb327a25df485ec1efd
#> 4 raoul_dupond_36 cd7104 cd7104e7e7009bfd988d5a4b46a930424908736065573e51a85d16575ed7c2a5
## short labels for practical use, using a faster (but less secure) algorithm
hash_names(first_name, last_name, age,
size = 8, full = FALSE, hashfun = sodium::sha256)
#> [1] "f3bb3fb9" "1a62b1ef" "3f2a0e20" "3aad952f"
## adding a salt for extra security
hash_names(first_name, last_name, age,
salt = "Keep it secret")
#> label hash_short hash
#> 1 jane_doe_25 374951 374951505153661b59661942f27736bd86f93779662c552beed1e00b3d47d47a
#> 2 joe_smith_69 4ca079 4ca079f3e3338ede974d8ecbcf474a2cf5c400da32a5f6ae116817f5d294083b
#> 3 raoul_dupont_36 1980c7 1980c7a52740917865d613d2250d549833ec82245b105229f06dc6c9df792b95
#> 4 raoul_dupond_36 faf491 faf491d52d05d2ae1cef7ef5277903cdadfee9b9d6aae88c60f3400ab08dc98f