---
title: "Overview of the incidence package"
author: "Thibaut Jombart, Zhian N. Kamvar"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: true
toc_depth: 2
vignette: >
%\VignetteIndexEntry{Overview}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width=7,
fig.height=5
)
```
*incidence* implements functions and classes to compute, handle, visualise and model incidences from
dates data. This vignette provides an overview of current features. It largely reproduces the
content of `REAME.md`.
# Installing the package
To install the current stable, CRAN version of the package, type:
```{r install, eval=FALSE}
install.packages("incidence")
```
To benefit from the latest features and bug fixes, install the development, *github* version of the package using:
```{r install2, eval=FALSE}
devtools::install_github("reconhub/incidence")
```
Note that this requires the package *devtools* installed.
# What does it do?
The main functions of the package include:
- **`incidence`**: compute incidence from dates in various formats; any fixed
time interval can be used; the returned object is an instance of the (S3)
class *incidence*.
- **`plot`**: this method (see `?plot.incidence` for details) plots *incidence*
objects, and can also add predictions of the model(s) contained in an
*incidence_fit* object (or a list of such objects).
- **`fit`**: fit one or two exponential models (i.e. linear regression on
log-incidence) to an *incidence* object; two models are calibrated only if a
date is provided to split the time series in two (argument `split`); this is
typically useful to model the two phases of exponential growth, and decrease
of an outbreak; each model returned is an instance of the (S3) class
*incidence_fit*, each of which contains various useful information
(e.g. growth rate *r*, doubling/halving time, predictions and confidence
intervals); results can be plotted using `plot`, or added to an existing
`uncudence` plot using the piping-friendly function `add_incidence_fit`.
- **`fit_optim_split`**: finds the optimal date to split the time series in two,
typically around the peak of the epidemic.
- **`[`**: lower-level subsetting of *incidence* objects, permitting to specify
which dates and groups to retain; uses a syntax similar to matrices,
i.e. `x[i, j]`, where `x` is the *incidence* object, `i` a subset of dates,
and `j` a subset of groups.
- **`subset`**: subset an *incidence* object by specifying a time window.
- **`pool`**: pool incidence from different groups into one global incidence
time series.
- **`cumulate`**: computes cumulative incidence over time from and `incidence`
object.
- **`as.data.frame`**: converts an *incidence* object into a `data.frame`
containing dates and incidence values.
- **`bootstrap`**: generates a bootstrapped *incidence* object by re-sampling,
with replacement, the original dates of events.
- **`find_peak`**: locates the peak time of the epicurve.
- **`estimate_peak`**: uses bootstrap to estimate the peak time (and related
confidence interval) of a partially observed outbreak.
# Worked example: simulated Ebola outbreak
## Loading the data
This example uses the simulated Ebola Virus Disease (EVD) outbreak from the package
[*outbreaks*](https://cran.r-project.org/package=outbreaks). We will compute incidence for various time
steps, calibrate two exponential models around the peak of the epidemic, and analyse the results.
First, we load the data:
```{r, data}
library(outbreaks)
library(ggplot2)
library(incidence)
dat <- ebola_sim$linelist$date_of_onset
class(dat)
head(dat)
```
## Computing and plotting incidence
We compute the daily incidence:
```{r, incid1}
i <- incidence(dat)
i
plot(i)
```
The daily incidence is quite noisy, but we can easily compute other incidence
using larger time intervals:
```{r, interv}
# weekly, starting on Monday (ISO week, default)
i.7 <- incidence(dat, interval = "1 week")
plot(i.7)
# semi-weekly, starting on Saturday
i.14 <- incidence(dat, interval = "2 saturday weeks")
plot(i.14, border = "white")
## monthly
i.month <- incidence(dat, interval = "1 month")
plot(i.month, border = "white")
```
`incidence` can also compute incidence by specified groups using the `groups`
argument. For instance, we can compute incidence by gender:
```{r, gender}
i.7.sex <- incidence(dat, interval = "1 week", groups = ebola_sim$linelist$gender)
i.7.sex
plot(i.7.sex, stack = TRUE, border = "grey")
```
We can do the same for hospitals, using the 'clean' version of the dataset, with
some customization of the legend:
```{r, hosp}
i.7.hosp <- with(ebola_sim_clean$linelist,
incidence(date_of_onset, interval = "week", groups = hospital))
i.7.hosp
head(get_counts(i.7.hosp))
plot(i.7.hosp, stack=TRUE) +
theme(legend.position= "top") +
labs(fill="")
```
## Handling `incidence` objects
`incidence` objects can be manipulated easily. The `[` operator implements
subsetting of dates (first argument) and groups (second argument). For
instance, to keep only the peak of the distribution:
```{r, middle}
i[100:250]
plot(i[100:250])
```
Or to keep every other week:
```{r, stripes}
i.7[c(TRUE,FALSE)]
plot(i.7[c(TRUE,FALSE)])
```
Some temporal subsetting can be even simpler using `subset`, which permits to
retain data within a specified time window:
```{r, tail}
i.tail <- subset(i, from=as.Date("2015-01-01"))
i.tail
plot(i.tail, border="white")
```
Subsetting groups can also matter. For instance, let's try and visualise the
incidence based on onset of symptoms by outcome:
```{r, i7outcome}
i.7.outcome <- incidence(dat, 7, groups=ebola_sim$linelist$outcome)
i.7.outcome
plot(i.7.outcome, stack = TRUE, border = "grey")
```
By default, `incidence` treats missing data (NA) as a separate group (see
argument `na_as_group`). We could disable this to retain only known outcomes,
but alternatively we can simply subset the object to exclude the last (3rd)
group:
```{r, groupsub}
i.7.outcome[,1:2]
plot(i.7.outcome[,1:2], stack = TRUE, border = "grey")
```
Groups can also be collapsed into a single time series using `pool`:
```{r, pool}
i.pooled <- pool(i.7.outcome)
i.pooled
identical(i.7$counts, i.pooled$counts)
```
## Modelling incidence
Incidence data, excluding zeros, can be modelled using log-linear regression of
the form: log(*y*) = *r* x *t* + *b*
where *y* is the incidence, *r* is the growth rate, *t* is the number of days
since a specific point in time (typically the start of the outbreak), and *b*
is the intercept.
Such model can be fitted to any incidence object using `fit`. Of course, a
single log-linear model is not sufficient for modelling our epidemic curve, as
there is clearly an growing and a decreasing phase. As a start, we can
calibrate a model on the first 20 weeks of the epidemic:
```{r, fit1}
plot(i.7[1:20])
early.fit <- fit(i.7[1:20])
early.fit
```
The resulting objects (known as `incidence_fit` objects) can be plotted, in
which case the prediction and its confidence interval is displayed:
```{r}
plot(early.fit)
```
However, a better way to display these predictions is adding them to the
incidence plot using the argument `fit`:
```{r}
plot(i.7[1:20], fit = early.fit)
```
In this case, we would ideally like to fit two models, before and after the
peak of the epidemic. This is possible using the following approach, if you
know what date to use to split the data in two phases:
```{r, fit.both}
fit.both <- fit(i.7, split=as.Date("2014-10-15"))
fit.both
plot(i.7, fit=fit.both)
```
This is much better, but the splitting date is not completely optimal. To look
for the best possible splitting date (i.e. the one maximizing the average fit
of both models), we use:
```{r, optim}
best.fit <- fit_optim_split(i.7)
best.fit
plot(i.7, fit=best.fit$fit)
```
These models are very good approximation of these data, showing a doubling time
of `r round(get_info(best.fit$fit, "doubling"), 1)` days during the first
phase, and a halving time of `r round(get_info(best.fit$fit, "halving"), 1)`
days during the second.
To access these parameters, you can use the `get_info()` function.
The possible parameters are:
- "r", the daily growth rate
- "doubling" the rate of doubling in days (if "r" is positive)
- "halving" the rate of halving in days (if "r" is negative)
- "pred" a data frame of incidence predictions
For "r", "doubling", and "halving", you can also add ".conf" to get the
confidence intervals. Here's how you can get the doubling and halving times of
the above epi curve:
```{r, get_info}
get_info(best.fit$fit, "doubling") # doubling time
get_info(best.fit$fit, "doubling.conf") # confidence interval
get_info(best.fit$fit, "halving")
get_info(best.fit$fit, "halving.conf")
```
Note that `fit` will also take groups into account if incidence has been
computed for several groups:
```{r, optim2}
best.fit2 <- fit_optim_split(i.7.sex)
best.fit2
plot(i.7.sex, fit=best.fit2$fit)
```
Using `get_info()` on this fit object will return all groups together:
```{r, get_info_groups}
get_info(best.fit2$fit, "doubling") # doubling time
get_info(best.fit2$fit, "doubling.conf") # confidence interval
get_info(best.fit2$fit, "halving")
get_info(best.fit2$fit, "halving.conf")
```