---
title: "Overview of the epitrix package"
author: "Thibaut Jombart"
date: "`r Sys.Date()`"
output:
   rmarkdown::html_vignette:
     toc: true
     toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  fig.width=7, 
  fig.height=5, 
  fig.path="figs-overview/"
)
```


*epitrix* implements small helper functions usefull in infectious disease
modelling and epidemics analysis. This vignette provides a quick overview of the
package's features.


# What does it do?

The main features of the package include:

- **`gamma_shapescale2mucv`**: convert shape and scale of a Gamma distribution
    to mean and CV

- **`gamma_mucv2shapescale`**: convert mean and CV of a Gamma distribution to
    shape and scale

- **`gamma_log_likelihood`**: Gamma log-likelihood using mean and CV

- **`r2R0`**: convert growth rate into a reproduction number

- **`lm2R0_sample`**: generates a distribution of R0 from a log-incidence linear
    model

- **`fit_disc_gamma`**: fits a discretised Gamma distribution to data (typically
    useful for describing delays)

- **`hash_names`**: generate unique, anonymised, reproducible labels from
    various data fields (e.g. First name, Last name, Date of birth).


# Fitting a gamma distribution to delay data

In this example, we simulate data which replicate the serial interval (SI),
i.e. the delays between primary and secondary symptom onsets, in Ebola Virus
Disease (EVD). We start by converting previously estimates of the mean and
standard deviation of the SI (WHO Ebola Response Team (2014) NEJM 371:1481–1495)
to the parameters of a Gamma distribution:

```{r generate_data}
library(epitrix)

mu <- 15.3 # mean in days days
sigma <- 9.3 # standard deviation in days
cv <- sigma / mu # coefficient of variation
cv
param <- gamma_mucv2shapescale(mu, cv) # convertion to Gamma parameters
param

```

The *shape* and *scale* are parameters of a Gamma distribution we can use to
generate delays. However, delays are typically reported per days, which implies
a discretisation (from continuous time to discrete numbers). We use the package
[*distcrete*](https://github.com/reconhub/distcrete) to achieve this
discretisation. It generates a list of functions, including one to simulate data
(`$r`), which we use to simulate 500 delays:

```{r si}

si <- distcrete::distcrete("gamma", interval = 1,
               shape = param$shape,
               scale = param$scale, w = 0)
si
set.seed(1)
x <- si$r(500)
head(x, 10)
hist(x, col = "grey", border = "white",
     xlab = "Days between primary and secondary onset",
     main = "Simulated serial intervals")

```

`x` contains simulated data, for illustrative purpose. In practice, one would
use real data from an ongoing outbreaks. Now we use `fit_disc_gamma` to estimate
the parameters of a dicretised Gamma distribution from the data:

```{r fit}

si_fit <- fit_disc_gamma(x)
si_fit

```


# Converting a growth rate (r) to a reproduction number (R0)

The package [*incidence*](https://github.com/reconhub/incidence) can fit a
log-linear model to incidence curves (function `fit`), which produces a growth
rate (r). This growth rate can in turn be translated into a basic reproduction
number (R0) using `r2R0`. We illustrate this using simulated Ebola data from the
[*outbreaks*](https://github.com/reconverse/outbreaks) package, and using the
serial interval from the previous example:

```{r fit_i}

library(outbreaks)
library(incidence)
i <- incidence(ebola_sim$linelist$date_of_onset)
i
f <- fit(i[1:150]) # fit on first 150 days
plot(i[1:200], fit = f, color = "#9fc2fc")

r2R0(f$info$r, si$d(1:100))
r2R0(f$info$r.conf, si$d(1:100))

```

In addition, we can also use the function `lm2R0_sample` to generate samples of
R0 values compatible with a model fit:

```{r sample_R0}

R0_val <- lm2R0_sample(f$model, si$d(1:100), n = 100)
head(R0_val)
hist(R0_val, col = "grey", border = "white")

```


# Standardising labels

If you want to use labels that will work across different computers, independent
of local encoding and operating systems, `clean_labels` will make your life
easier. The function transforms character strings by replacing diacritic symbols
with their closest alphanumeric matches, setting all characters to lower case,
and replacing various separators with a single, consistent one.


For instance:
```{r clean_labels}
x <- " Thîs- is A   wêïrD LäBeL .."
x
clean_labels(x)

variables <- c("Date.of.ONSET ",
               "/  date of hôspitalisation  /",
               "-DäTÈ--OF___DîSCHARGE-",
               "GEndèr/",
               "  Location. ")
variables
clean_labels(variables)

```


If you happen to have informative labels in your data that are not alphanumeric,
you will want to protect them with the `protect` argument:

```{r protect_labels}
vars <- c("Death in Structure  > 4h", "death in Structure < 4h")
clean_labels(vars, protect = "><")
```

If you don't use the `protect = "><"`, the two variables above would appear to
be exactly the same.


# Anonymising data

`hash_names` can be used to generate hashed labels from linelist data. Based on
pre-defined fields, it will generate anonymous labels. This system has the
following desirable features:

- given the same input, the output will always be the same, so this encoding
  system generates labels which can be used by different people and
  organisations

- given different inputs, the output will always be different; even minor
  differences in input will result in entirely different outputs

- given an output, it is very hard to infer the input (it requires hacking
  skills); if security is challenged, the hashing algorithm can be 'salted' to
  strengthen security

```{r, R.options = list(width = 100)}

first_name <- c("Jane", "Joe", "Raoul", "Raoul")
last_name <- c("Doe", "Smith", "Dupont", "Dupond")
age <- c(25, 69, 36, 36)

## detailed output by default
hash_names(first_name, last_name, age)

## short labels for practical use, using a faster (but less secure) algorithm
hash_names(first_name, last_name, age,
           size = 8, full = FALSE, hashfun = sodium::sha256)

## adding a salt for extra security
hash_names(first_name, last_name, age,
           salt = "Keep it secret")

```