Title: | Small Helpers and Tricks for Epidemics Analysis |
---|---|
Description: | A collection of small functions useful for epidemics analysis and infectious disease modelling. This includes computation of basic reproduction numbers from growth rates, generation of hashed labels to anonymize data, and fitting discretized Gamma distributions. |
Authors: | Thibaut Jombart [aut, cre], Anne Cori [aut], Zhian N. Kamvar [ctb], Dirk Schumacher [ctb], Flavio Finger [aut], Charlie Whittaker [ctb] |
Maintainer: | Thibaut Jombart <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.4.0 |
Built: | 2025-01-02 02:40:58 UTC |
Source: | https://github.com/reconhub/epitrix |
Title Calculate basic reproduction number from attack rate
AR2R0(AR)
AR2R0(AR)
AR |
the attack rate; a value or vector of values between 0 and 1 |
R0, the basic reproduction number, calculated as -log(1-AR)/AR
## Calculate R0 for an attack rate of 50% AR2R0(0.5) ## plot the relationship between R0 and attack rate x <- seq(0.01, 1, 0.01) plot(AR2R0(x), x, type = "l", xlab = "R0", ylab = "Attack rate")
## Calculate R0 for an attack rate of 50% AR2R0(0.5) ## plot the relationship between R0 and attack rate x <- seq(0.01, 1, 0.01) plot(AR2R0(x), x, type = "l", xlab = "R0", ylab = "Attack rate")
This function standardises labels e.g. used as variable names or character
string values, removing non-ascii characters, replacing diacritics (e.g. é,
ô) with their closest ascii equivalents, and standardises separating
characters. See details for more information on label transformation.
clean_labels( x, sep = "_", transformation = "Any-Latin; Latin-ASCII", protect = "" )
clean_labels( x, sep = "_", transformation = "Any-Latin; Latin-ASCII", protect = "" )
x |
A vector of labels, normally provided as characters. |
sep |
A character string used as separator, defaulting to '_'. |
transformation |
a string to be passed on to |
protect |
a character string defining the punctuation that should be protected. This helps prevent meaninful symbols like > and < from being removed. |
The following changes are performed:
all non-ascii characters are removed
all diacritics are replaced with their non-accentuated equivalents, e.g. 'é', 'ê' and 'è' become 'e'.
all characters are set to lower case
separators are standardised to the use of a single character provided
in sep
(defaults to '_'); heading and trailing separators are removed.
Because of differences between the underlying transliteration engine
(ICU), the default transformations will not transilierate German umlaute
correctly. You can add them by specifying "de-ASCII" in the transformation
string after "Any-Latin".
Thibaut Jombart [email protected], Zhian N. Kamvar
## Not run: clean_labels("-_-This is; A WeÏrD**./sêntënce...") clean_labels("-_-This is; A WeÏrD**./sêntënce...", sep = ".") input <- c("Peter and stëven", "peter-and.stëven", "pëtêr and stëven _-") input clean_labels(input) # Don't transliterate non-latin words clean_labels(input, transformation = "Latin-ASCII") # protect useful symbols clean_labels(c("energy > 9000", "energy < 9000"), protect = "><") # if you only want to clean accents, transform to lower, and transliterate, # you can specify "[:punct:][:space:]" for protect: clean_labels(input, protect = "[:punct:][:space:]") # appropriately transliterate Germanic umlaute if (stringi::stri_info()$ICU.system) { # This will only be true if you have the correct version of ICU installed clean_labels("'é', 'ê' and 'è' become 'e', 'ö' becomes 'oe', etc.", transformation = "Any-Latin; de-ASCII; Latin-ASCII") } ## End(Not run)
## Not run: clean_labels("-_-This is; A WeÏrD**./sêntënce...") clean_labels("-_-This is; A WeÏrD**./sêntënce...", sep = ".") input <- c("Peter and stëven", "peter-and.stëven", "pëtêr and stëven _-") input clean_labels(input) # Don't transliterate non-latin words clean_labels(input, transformation = "Latin-ASCII") # protect useful symbols clean_labels(c("energy > 9000", "energy < 9000"), protect = "><") # if you only want to clean accents, transform to lower, and transliterate, # you can specify "[:punct:][:space:]" for protect: clean_labels(input, protect = "[:punct:][:space:]") # appropriately transliterate Germanic umlaute if (stringi::stri_info()$ICU.system) { # This will only be true if you have the correct version of ICU installed clean_labels("'é', 'ê' and 'è' become 'e', 'ö' becomes 'oe', etc.", transformation = "Any-Latin; de-ASCII; Latin-ASCII") } ## End(Not run)
This function takes in a linelist data frame and extracts the empirical incubation period distribution and can take into account uncertainty in the dates of exposure.
empirical_incubation_dist(x, date_of_onset, exposure, exposure_end = NULL)
empirical_incubation_dist(x, date_of_onset, exposure, exposure_end = NULL)
x |
the linelist data (data.frame or linelist object) containing at least a column containing the exposure dates and one containing the onset dates. |
date_of_onset |
the name of the column containing the onset dates (bare variable name or in quotes) |
exposure |
the name of the column containing the exposure dates (bare variable name or in quotes) |
exposure_end |
the name of a column containing dates representing the end of the exposure period. This is 'NULL' by default, indicating all exposures are known and in the 'exposure' column. |
a data frame containing a column with the different incubation periods and a column containing their relative frequency
For exposure dates, each element can be a vector containing several possible exposure dates. Note that if the same exposure date appears twice in the list it is given twice as much weight.
Flavio Finger, [email protected], Zhian N. Kamvar
if (require(tibble)) { random_dates <- as.Date("2020-01-01") + sample(0:30, 50, replace = TRUE) x <- tibble(date_of_onset = random_dates) # Linelist with a list column of potential exposure dates ------------------ mkexposures <- function(x) x - round(rgamma(sample.int(5, size = 1), shape = 12, rate = 3)) exposures <- sapply(x$date_of_onset, mkexposures) x$date_exposure <- exposures incubation_period_dist <- empirical_incubation_dist(x, date_of_onset, date_exposure) incubation_period_dist # Linelist with exposure range --------------------------------------------- start_exposure <- round(rgamma(nrow(x), shape = 12, rate = 3)) end_exposure <- round(rgamma(nrow(x), shape = 12, rate = 7)) x$exposure_end <- x$date_of_onset - end_exposure x$exposure_start <- x$exposure_end - start_exposure incubation_period_dist <- empirical_incubation_dist(x, date_of_onset, exposure_start, exposure_end) incubation_period_dist plot(incubation_period_dist, type = "h", lwd = 10, lend = 2, col = "#49D193", xlab = "Days since exposure", ylab = "Probability", main = "Incubation time distribution") }
if (require(tibble)) { random_dates <- as.Date("2020-01-01") + sample(0:30, 50, replace = TRUE) x <- tibble(date_of_onset = random_dates) # Linelist with a list column of potential exposure dates ------------------ mkexposures <- function(x) x - round(rgamma(sample.int(5, size = 1), shape = 12, rate = 3)) exposures <- sapply(x$date_of_onset, mkexposures) x$date_exposure <- exposures incubation_period_dist <- empirical_incubation_dist(x, date_of_onset, date_exposure) incubation_period_dist # Linelist with exposure range --------------------------------------------- start_exposure <- round(rgamma(nrow(x), shape = 12, rate = 3)) end_exposure <- round(rgamma(nrow(x), shape = 12, rate = 7)) x$exposure_end <- x$date_of_onset - end_exposure x$exposure_start <- x$exposure_end - start_exposure incubation_period_dist <- empirical_incubation_dist(x, date_of_onset, exposure_start, exposure_end) incubation_period_dist plot(incubation_period_dist, type = "h", lwd = 10, lend = 2, col = "#49D193", xlab = "Days since exposure", ylab = "Probability", main = "Incubation time distribution") }
These functions performs maximum-likelihood (ML) fitting of a discretised
distribution. This is typically useful for describing delays between
epidemiological events, such as incubation period (infection to onset) or
serial intervals (primary to secondary onsets). The function
optim
is used internally for fitting.
fit_disc_gamma(x, mu_ini = NULL, cv_ini = NULL, interval = 1, w = 0, ...)
fit_disc_gamma(x, mu_ini = NULL, cv_ini = NULL, interval = 1, w = 0, ...)
x |
A vector of numeric data to fit; NAs will be removed with a warning. |
mu_ini |
The initial value for the mean 'mu', defaulting to the empirically calculated value. |
cv_ini |
The initial value for the coefficient of variation 'cv', defaulting to the empirically calculated value. |
interval |
The interval used for discretisation; see |
w |
The centering of the interval used for discretisation; see
|
... |
Further arguments passed to |
The function returns a list with human-readable parametrisation of
the discretised Gamma distibution (mean, sd, cv), convergence indicators,
and the discretised Gamma distribution itself as a distcrete
object
(from the distcrete
package).
Thibaut Jombart [email protected]
Charlie Whittaker [email protected]
The distcrete
package for discretising distributions, and
optim
for details on available optimisation procedures.
## generate data mu <- 15.3 # days sigma <- 9.3 # days cv <- sigma / mu cv param <- gamma_mucv2shapescale(mu, cv) if (require(distcrete)) { w <- distcrete("gamma", interval = 1, shape = param$shape, scale = param$scale, w = 0) x <- w$r(100) x fit_disc_gamma(x) }
## generate data mu <- 15.3 # days sigma <- 9.3 # days cv <- sigma / mu cv param <- gamma_mucv2shapescale(mu, cv) if (require(distcrete)) { w <- distcrete("gamma", interval = 1, shape = param$shape, scale = param$scale, w = 0) x <- w$r(100) x fit_disc_gamma(x) }
A wrapper around fit_disc_gamma to fit a discrete gamma distribution to incubation periods derived from exposure and onset dates. Can take into account uncertain dates of exposure.
fit_gamma_incubation_dist( x, date_of_onset, exposure, exposure_end = NULL, nsamples = 1000, ... )
fit_gamma_incubation_dist( x, date_of_onset, exposure, exposure_end = NULL, nsamples = 1000, ... )
x |
the linelist data (data.frame or linelist object) containing at least a column containing the exposure dates and one containing the onset dates. |
date_of_onset |
the name of the column containing the onset dates (bare variable name or in quotes) |
exposure |
the name of the column containing the exposure dates (bare variable name or in quotes) |
exposure_end |
the name of a column containing dates representing the end of the exposure period. This is 'NULL' by default, indicating all exposures are known and in the 'exposure' column. |
nsamples |
The number of samples to draw from the empirical distribution to fit on (dafaults to 1000) |
... |
passed to fit_disc_gamma |
see [fit_disc_gamma()]
Flavio Finger, [email protected]
random_dates <- as.Date("2020-01-01") + sample(0:30, 50, replace = TRUE) x <- data.frame(date_of_onset = random_dates) mkexposures <- function(x) x - round(rgamma(sample.int(5, size = 1), shape = 12, rate = 3)) exposures <- sapply(x$date_of_onset, mkexposures) x$date_exposure <- exposures fit <- fit_gamma_incubation_dist(x, date_of_onset, date_exposure) plot(0:20, fit$distribution$d(0:20), type = "h", lwd = 10, lend = 2, col = "#49D193", xlab = "Days since exposure", ylab = "Probability", main = "Incubation time distribution")
random_dates <- as.Date("2020-01-01") + sample(0:30, 50, replace = TRUE) x <- data.frame(date_of_onset = random_dates) mkexposures <- function(x) x - round(rgamma(sample.int(5, size = 1), shape = 12, rate = 3)) exposures <- sapply(x$date_of_onset, mkexposures) x$date_exposure <- exposures fit <- fit_gamma_incubation_dist(x, date_of_onset, date_exposure) plot(0:20, fit$distribution$d(0:20), type = "h", lwd = 10, lend = 2, col = "#49D193", xlab = "Days since exposure", ylab = "Probability", main = "Incubation time distribution")
These functions permit to use alternate parametrisations for Gamma
distributions, from 'shape and scale' to 'mean (mu) and coefficient of
variation (cv), and back. gamma_shapescale2mucv
does the first
conversion, while gamma_mucv2shapescale
does the second. The function
gamma_log_likelihood
is a shortcut for computing Gamma log-likelihood
with the alternative parametrisation (mean, cv). See 'details' for a guide of
which parametrisation to use.
gamma_shapescale2mucv(shape, scale) gamma_mucv2shapescale(mu, cv) gamma_log_likelihood( x, mu, cv, discrete = TRUE, interval = 1, w = 0, anchor = 0.5 )
gamma_shapescale2mucv(shape, scale) gamma_mucv2shapescale(mu, cv) gamma_log_likelihood( x, mu, cv, discrete = TRUE, interval = 1, w = 0, anchor = 0.5 )
shape |
The shape parameter of the Gamma distribution. |
scale |
The scale parameter of the Gamma distribution. |
mu |
The mean of the Gamma distribution. |
cv |
The coefficient of variation of the Gamma distribution, i.e. the standard deviation divided by the mean. |
x |
A vector of data treated as observations drawn from a Gamma distribution, for which the likelihood is to be computed. |
discrete |
A logical indicating if the distribution should be discretised; TRUE by default. |
interval |
The interval used for discretisation; see
|
w |
The centering of the interval used for discretisation, defaulting to
0; see |
anchor |
The anchor used for discretisation, i.e. starting point of the
discretisation process; defaults to 0; see |
The gamma distribution is described in ?dgamma
is
parametrised using shape and scale (or rate). However, these parameters are
naturally correlated, which make them poor choices whenever trying to fit
data to a Gamma distribution. Their interpretation is also less clear than
the traditional mean and variance. When fitting the data, or reporting
results, it is best to use the alternative parametrisation using the mean
(mu
) and the coefficient of variation (cv
), i.e. the standard
deviation divided by the mean.
A named list containing 'shape' and 'scale', or mean ('mean') and coefficient of variation ('cv').
Code by Anne Cori [email protected], packaging by Thibaut Jombart [email protected]
## set up some parameters mu <- 10 cv <- 1 ## transform into shape scale tmp <- gamma_mucv2shapescale (mu, cv) shape <- tmp$shape scale <- tmp$scale ## recover original parameters when applying the revert function gamma_shapescale2mucv(shape, scale) # compare with mu, cv ## empirical validation: ## check mean / cv of a sample derived using rgamma with ## shape and scale computed from mu and cv gamma_sample <- rgamma(n = 10000, shape = shape, scale = scale) mean(gamma_sample) # compare to mu sd(gamma_sample) / mean(gamma_sample) # compare to cv
## set up some parameters mu <- 10 cv <- 1 ## transform into shape scale tmp <- gamma_mucv2shapescale (mu, cv) shape <- tmp$shape scale <- tmp$scale ## recover original parameters when applying the revert function gamma_shapescale2mucv(shape, scale) # compare with mu, cv ## empirical validation: ## check mean / cv of a sample derived using rgamma with ## shape and scale computed from mu and cv gamma_sample <- rgamma(n = 10000, shape = shape, scale = scale) mean(gamma_sample) # compare to mu sd(gamma_sample) / mean(gamma_sample) # compare to cv
This function uses the scrypt algorithm from libsodium to anonymise data, based on user-indicated data fields. Data fields are concatenated first, then each entry is hashed. The function can either return a full detailed output, or short labels ready to use for 'anonymised data'. Before concatenation (using "_" as a separator) to form labels, inputs are modified using [clean_labels()]
hash_names( ..., size = 6, full = TRUE, hashfun = "secure", salt = NULL, clean_labels = TRUE )
hash_names( ..., size = 6, full = TRUE, hashfun = "secure", salt = NULL, clean_labels = TRUE )
... |
Data fields to be hashed. |
size |
The number of characters retained in the hash. |
full |
A logical indicating if the a full output should be returned as a
|
hashfun |
This defines the hashing function to be used. If you specify "secure" (default), it will use [sodium::scrypt()], which will be secure, but will be slow for large data sets. For fast hashing with no colisions, you can sepecify "fast", and it will use [sodium::sha256()], which is several orders of magnitude faster than [sodium::scrypt()]. You can also specify a hashing function that takes and returns a [raw][base::raw] vector of bytes that can be converted to character with [rawToChar()]. |
salt |
An optional object that can be coerced to a character to be used to 'salt' the hashing algorithm (see details). Ignored if 'NULL'. |
clean_labels |
A logical indicating if labels of variables should be standardized; defaults to 'TRUE' |
The argument 'salt' should be used for salting the algorithm, i.e. adding an extra input to the input fields (the 'salt') to change the resulting hash and prevent identification of individuals via pre-computed hash tables.
It is highly recommend to choose a secret, random salt in order make it harder for an attacker to decode the hash.
Thibaut Jombart [email protected], Dirk Shchumacher [email protected], Zhian N. Kamvar [email protected]
[clean_labels()], used to clean labels prior to hashing
[sodium::hash()] for available hashing functions.
first_name <- c("Jane", "Joe", "Raoul") last_name <- c("Doe", "Smith", "Dupont") age <- c(25, 69, 36) # secure hashing hash_names(first_name, last_name, age, hashfun = "secure") # fast hashing hash_names(first_name, last_name, age, size = 8, full = FALSE, hashfun = "fast") ## salting the hashing (more secure!) hash_names(first_name, last_name) # unsalted - less secure hash_names(first_name, last_name, salt = 123) # salted with an integer hash_names(first_name, last_name, salt = "foobar") # salted with an character ## using a different hash algorithm if you want things to run faster hash_names(first_name, last_name, hashfun = "fast") # use sha256 algorithm
first_name <- c("Jane", "Joe", "Raoul") last_name <- c("Doe", "Smith", "Dupont") age <- c(25, 69, 36) # secure hashing hash_names(first_name, last_name, age, hashfun = "secure") # fast hashing hash_names(first_name, last_name, age, size = 8, full = FALSE, hashfun = "fast") ## salting the hashing (more secure!) hash_names(first_name, last_name) # unsalted - less secure hash_names(first_name, last_name, salt = 123) # salted with an integer hash_names(first_name, last_name, salt = "foobar") # salted with an character ## using a different hash algorithm if you want things to run faster hash_names(first_name, last_name, hashfun = "fast") # use sha256 algorithm
Title Calculate attack rate from basic reproduction number
R02AR(R0, tol = 0.01)
R02AR(R0, tol = 0.01)
R0 |
a value or vector of values representing the basic reproduction number, must be >=0 |
tol |
a single >=0 value giving the tolerance for the calculated attack rate |
AR, the attack rate, calculated using the relationship: R0 = -log(1-AR)/AR
## Calculate the attack rate for a specific value of the reproduction number R02AR(2) # returns the AR for an R0 of 2 ## plot the relationship between R0 and attack rate x <- seq(1.01, 5, 0.01) plot(x, R02AR(x), type = "l", xlab = "R0", ylab = "Attack rate")
## Calculate the attack rate for a specific value of the reproduction number R02AR(2) # returns the AR for an R0 of 2 ## plot the relationship between R0 and attack rate x <- seq(1.01, 5, 0.01) plot(x, R02AR(x), type = "l", xlab = "R0", ylab = "Attack rate")
Title Calculate herd immunity threshold from basic reproduction number
R02herd_immunity_threshold(R0)
R02herd_immunity_threshold(R0)
R0 |
a value or vector of values representing the basic reproduction number, must be >=0 |
The herd immunity threshold, calculated as 1 - 1 / R0
## Calculate the herd immunity threshold for a specific value of the ## reproduction number (here 2) R02herd_immunity_threshold(2) ## plot the relationship between R0 and herd immunity threshold x <- seq(1.01, 15, 0.01) plot(x, R02herd_immunity_threshold(x), type = "l", xlab = "R0", ylab = "Herd immunity threshold")
## Calculate the herd immunity threshold for a specific value of the ## reproduction number (here 2) R02herd_immunity_threshold(2) ## plot the relationship between R0 and herd immunity threshold x <- seq(1.01, 15, 0.01) plot(x, R02herd_immunity_threshold(x), type = "l", xlab = "R0", ylab = "Herd immunity threshold")
The function r2R0
can be used to transform a growth rate into a
reproduction number estimate, given a generation time distribution. This uses
the approach described in Wallinga and Lipsitch (2007, Proc Roy Soc B
274:599–604) for empirical distributions. The function lm2R0_sample
generates a sample of R0 values from a log-linear regression of incidence
data stored in a lm
object.
r2R0(r, w, trunc = 1000) lm2R0_sample(x, w, n = 100, trunc = 1000)
r2R0(r, w, trunc = 1000) lm2R0_sample(x, w, n = 100, trunc = 1000)
r |
A vector of growth rate values. |
w |
The serial interval distribution, either provided as a
|
trunc |
The number of time units (most often, days), used for truncating
|
x |
A |
n |
The number of draws of R0 values, defaulting to 100. |
It is assumed that the growth rate ('r') is measured in the same time unit as the serial interval ('w' is the SI distribution, starting at time 0).
Code by Anne Cori [email protected], packaging by Thibaut Jombart [email protected]
## Ebola estimates of the SI distribution from the first 9 months of ## West-African Ebola oubtreak mu <- 15.3 # days sigma <- 9.3 # days param <- gamma_mucv2shapescale(mu, sigma / mu) if (require(distcrete)) { w <- distcrete("gamma", interval = 1, shape = param$shape, scale = param$scale, w = 0) r2R0(c(-1, -0.001, 0, 0.001, 1), w) ## Use simulated Ebola outbreak and 'incidence' to get a log-linear ## model of daily incidence. if (require(outbreaks) && require(incidence)) { i <- incidence(ebola_sim$linelist$date_of_onset) plot(i) f <- fit(i[1:100]) f plot(i[1:150], fit = f) R0 <- lm2R0_sample(f$model, w) hist(R0, col = "grey", border = "white", main = "Distribution of R0") summary(R0) } }
## Ebola estimates of the SI distribution from the first 9 months of ## West-African Ebola oubtreak mu <- 15.3 # days sigma <- 9.3 # days param <- gamma_mucv2shapescale(mu, sigma / mu) if (require(distcrete)) { w <- distcrete("gamma", interval = 1, shape = param$shape, scale = param$scale, w = 0) r2R0(c(-1, -0.001, 0, 0.001, 1), w) ## Use simulated Ebola outbreak and 'incidence' to get a log-linear ## model of daily incidence. if (require(outbreaks) && require(incidence)) { i <- incidence(ebola_sim$linelist$date_of_onset) plot(i) f <- fit(i[1:100]) f plot(i[1:150], fit = f) R0 <- lm2R0_sample(f$model, w) hist(R0, col = "grey", border = "white", main = "Distribution of R0") summary(R0) } }
This function simulates a simple linelist data including dates of epidemiological events and basic patient information. No underlying epidemiological model is used.
sim_linelist( n = 1, onset_from = as.Date("2020-01-01"), onset_span = 60, report_delay = 7, cfr = 0.1 )
sim_linelist( n = 1, onset_from = as.Date("2020-01-01"), onset_span = 60, report_delay = 7, cfr = 0.1 )
n |
Number of entries to simulate. |
onset_from |
The earliest date of symptom onset which can be generated. |
onset_span |
The time span over which to generate dates of onset. |
report_delay |
The mean delay between onset and reporting, using a Poisson distribution. |
cfr |
The case fatality ratio, i.e. the proportion of patient dying from the infection (used to generate the 'outcome' variable). |
Thibaut Jombart [email protected]
sim_linelist(10)
sim_linelist(10)