Skip to content

Commit 3d53dd5

Browse files
committed
add a number of epiprocess vignettes
1 parent 602cd2c commit 3d53dd5

38 files changed

+46195
-37
lines changed

.Rprofile

-1
This file was deleted.

DESCRIPTION

+1
Original file line numberDiff line numberDiff line change
@@ -15,6 +15,7 @@ Imports:
1515
epipredict (>=0.0.5),
1616
epiprocess,
1717
modeldata,
18+
outbreaks,
1819
ranger,
1920
see,
2021
tidyverse,

_common.R

+4
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,10 @@ knitr::opts_chunk$set(
2020
)
2121

2222
suppressPackageStartupMessages(library(tidyverse))
23+
suppressPackageStartupMessages(library(epiprocess))
24+
suppressPackageStartupMessages(library(epipredict))
25+
suppressPackageStartupMessages(library(epidatr))
26+
suppressPackageStartupMessages(library(epidatasets))
2327

2428
options(
2529
dplyr.print_min = 6,

_freeze/epidf/execute-results/html.json

+14
Large diffs are not rendered by default.

_freeze/epidf/figure-html/unnamed-chunk-11-1.svg

+520
Loading

_freeze/epidf/figure-html/unnamed-chunk-12-1.svg

+572
Loading

_freeze/epidf/figure-html/unnamed-chunk-13-1.svg

+884
Loading

_freeze/epidf/figure-html/unnamed-chunk-14-1.svg

+2,078
Loading

_freeze/epidf/figure-html/unnamed-chunk-15-1.svg

+2,098
Loading

_freeze/growth-rates/execute-results/html.json

+14
Large diffs are not rendered by default.

_freeze/growth-rates/figure-html/unnamed-chunk-10-1.svg

+666
Loading

_freeze/growth-rates/figure-html/unnamed-chunk-10-2.svg

+725
Loading

_freeze/growth-rates/figure-html/unnamed-chunk-11-1.svg

+686
Loading

_freeze/growth-rates/figure-html/unnamed-chunk-11-2.svg

+745
Loading

_freeze/growth-rates/figure-html/unnamed-chunk-4-1.svg

+5,007
Loading

_freeze/growth-rates/figure-html/unnamed-chunk-5-1.svg

+436
Loading

_freeze/growth-rates/figure-html/unnamed-chunk-6-1.svg

+688
Loading

_freeze/growth-rates/figure-html/unnamed-chunk-7-1.svg

+708
Loading

_freeze/growth-rates/figure-html/unnamed-chunk-8-1.svg

+688
Loading

_freeze/growth-rates/figure-html/unnamed-chunk-8-2.svg

+725
Loading

_freeze/growth-rates/figure-html/unnamed-chunk-9-1.svg

+708
Loading

_freeze/growth-rates/figure-html/unnamed-chunk-9-2.svg

+725
Loading

_freeze/slide/execute-results/html.json

+14
Large diffs are not rendered by default.
Loading

_freeze/slide/figure-html/unnamed-chunk-11-1.svg

+3,237
Loading

_freeze/slide/figure-html/unnamed-chunk-12-1.svg

+2,883
Loading

_freeze/slide/figure-html/unnamed-chunk-13-1.svg

+3,417
Loading
Loading

_freeze/slide/figure-html/unnamed-chunk-8-1.svg

+16,922
Loading

_quarto.yml

+13-5
Original file line numberDiff line numberDiff line change
@@ -26,11 +26,19 @@ book:
2626
chapters:
2727
- index.qmd
2828
- why-this-package.qmd
29-
- flatline-forecaster.qmd
30-
- tidymodels-intro.qmd
31-
- tidymodels-regression.qmd
32-
- preprocessing-and-models.qmd
33-
- sliding-forecasters.qmd
29+
- part: "epiprocess"
30+
chapters:
31+
- epiprocess.qmd
32+
- epidf.qmd
33+
- slide.qmd
34+
- growth-rates.qmd
35+
- part: "epipredict"
36+
chapters:
37+
- flatline-forecaster.qmd
38+
- tidymodels-intro.qmd
39+
- tidymodels-regression.qmd
40+
- preprocessing-and-models.qmd
41+
- sliding-forecasters.qmd
3442
- references.qmd
3543

3644
bibliography: [packages.bib, references.bib]

epidf.qmd

+276
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,276 @@
1+
# Getting data into epi_df format
2+
3+
```{r}
4+
#| include: false
5+
source("_common.R")
6+
```
7+
8+
We'll start by showing how to get data into
9+
`epi_df`, which is just
10+
a tibble with a bit of special structure, and is the format assumed by all of
11+
the functions in the `epiprocess` package. An `epi_df` object has (at least) the
12+
following columns:
13+
14+
* `geo_value`: the geographic value associated with each row of measurements.
15+
* `time_value`: the time value associated with each row of measurements.
16+
17+
It can have any number of other columns which can serve as measured variables,
18+
which we also broadly refer to as signal variables. The documentation for
19+
gives more details about this data format.
20+
21+
A data frame or tibble that has `geo_value` and `time_value` columns can be
22+
converted into an `epi_df` object, using the function `as_epi_df()`. As an
23+
example, we'll work with daily cumulative COVID-19 cases from four U.S. states:
24+
CA, FL, NY, and TX, over time span from mid 2020 to early 2022, and we'll use
25+
the [`epidatr`](https://github.com/cmu-delphi/epidatr) package
26+
to fetch this data from the [COVIDcast
27+
API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html).
28+
29+
```{r, message = FALSE}
30+
library(epidatr)
31+
library(epiprocess)
32+
library(withr)
33+
34+
cases <- covidcast(
35+
data_source = "jhu-csse",
36+
signals = "confirmed_cumulative_num",
37+
time_type = "day",
38+
geo_type = "state",
39+
time_values = epirange(20200301, 20220131),
40+
geo_values = "ca,fl,ny,tx"
41+
) %>% fetch()
42+
43+
colnames(cases)
44+
```
45+
46+
As we can see, a data frame returned by `epidatr::covidcast()` has the
47+
columns required for an `epi_df` object (along with many others). We can use
48+
`as_epi_df()`, with specification of some relevant metadata, to bring the data
49+
frame into `epi_df` format.
50+
51+
```{r, message = FALSE}
52+
x <- as_epi_df(cases,
53+
geo_type = "state",
54+
time_type = "day",
55+
as_of = max(cases$issue)) %>%
56+
select(geo_value, time_value, total_cases = value)
57+
58+
class(x)
59+
summary(x)
60+
head(x)
61+
attributes(x)$metadata
62+
```
63+
64+
## Some details on metadata
65+
66+
In general, an `epi_df` object has the following fields in its metadata:
67+
68+
* `geo_type`: the type for the geo values.
69+
* `time_type`: the type for the time values.
70+
* `as_of`: the time value at which the given data were available.
71+
72+
Metadata for an `epi_df` object `x` can be accessed (and altered) via
73+
`attributes(x)$metadata`. The first two fields here, `geo_type` and `time_type`,
74+
are not currently used by any downstream functions in the `epiprocess` package,
75+
and serve only as useful bits of information to convey about the data set at
76+
hand. The last field here, `as_of`, is one of the most unique aspects of an
77+
`epi_df` object.
78+
79+
In brief, we can think of an `epi_df` object as a single snapshot of a data set
80+
that contains the most up-to-date values of some signals of interest, as of the
81+
time specified `as_of`. For example, if `as_of` is January 31, 2022, then the
82+
`epi_df` object has the most up-to-date version of the data available as of
83+
January 31, 2022. The `epiprocess` package also provides a companion data
84+
structure called `epi_archive`, which stores the full version history of a given
85+
data set. See the [archive
86+
vignette](https://cmu-delphi.github.io/epiprocess/articles/archive.html) for
87+
more.
88+
89+
If any of the `geo_type`, `time_type`, or `as_of` arguments are missing in a
90+
call to `as_epi_df()`, then this function will try to infer them from the passed
91+
object. Usually, `geo_type` and `time_type` can be inferred from the `geo_value`
92+
and `time_value` columns, respectively, but inferring the `as_of` field is not
93+
as easy. See the documentation for `as_epi_df()` more details.
94+
95+
```{r}
96+
x <- as_epi_df(cases) %>%
97+
select(geo_value, time_value, total_cases = value)
98+
99+
attributes(x)$metadata
100+
```
101+
102+
## Using additional key columns in `epi_df`
103+
104+
In the following examples we will show how to create an `epi_df` with additional keys.
105+
106+
### Converting a `tsibble` that has county code as an extra key
107+
108+
```{r}
109+
set.seed(12345)
110+
ex1 <- tibble(
111+
geo_value = rep(c("ca", "fl", "pa"), each = 3),
112+
county_code = c("06059", "06061", "06067", "12111", "12113", "12117",
113+
"42101", "42103", "42105"),
114+
time_value = rep(
115+
seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "1 day"),
116+
length.out = 9),
117+
value = rpois(9, 5)
118+
) %>%
119+
as_tsibble(index = time_value, key = c(geo_value, county_code))
120+
121+
ex1 <- as_epi_df(x = ex1, geo_type = "state", time_type = "day", as_of = "2020-06-03")
122+
```
123+
124+
The metadata now includes `county_code` as an extra key.
125+
```{r}
126+
attr(ex1, "metadata")
127+
```
128+
129+
130+
### Dealing with misspecified column names
131+
132+
`epi_df` requires there to be columns `geo_value` and `time_value`, if they do not exist then `as_epi_df()` throws an error.
133+
134+
```{r, error = TRUE}
135+
ex2 <- data.frame(
136+
state = rep(c("ca", "fl", "pa"), each = 3), # misnamed
137+
pol = rep(c("blue", "swing", "swing"), each = 3), # extra key
138+
reported_date = rep(
139+
seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"),
140+
length.out = 9), # misnamed
141+
value = rpois(9, 5)
142+
)
143+
ex2 %>% as_epi_df()
144+
```
145+
146+
The columns should be renamed to match `epi_df` format.
147+
148+
```{r}
149+
ex2 <- ex2 %>%
150+
rename(geo_value = state, time_value = reported_date) %>%
151+
as_epi_df(geo_type = "state",
152+
as_of = "2020-06-03",
153+
additional_metadata = list(other_keys = "pol")
154+
)
155+
156+
attr(ex2, "metadata")
157+
```
158+
159+
160+
### Adding additional keys to an `epi_df` object
161+
162+
In the above examples, all the keys are added to objects prior to conversion to
163+
`epi_df` objects. But this can also be accomplished afterward.
164+
We'll look at an included dataset and filter to a single state for simplicity.
165+
166+
```{r}
167+
ex3 <- jhu_csse_county_level_subset %>%
168+
filter(time_value > "2021-12-01", state_name == "Massachusetts") %>%
169+
slice_tail(n = 6)
170+
171+
attr(ex3, "metadata") # geo_type is county currently
172+
```
173+
174+
Now we add `state` (MA) and `pol` as new columns to the data and as new keys to the metadata. The "state" `geo_type` anticipates lower-case abbreviations, so we'll match that.
175+
176+
```{r}
177+
ex3 <- ex3 %>%
178+
as_tibble() %>% # drop the `epi_df` class before adding additional metadata
179+
mutate(
180+
state = rep(tolower("MA"), 6),
181+
pol = rep(c("blue", "swing", "swing"), each = 2)) %>%
182+
as_epi_df(additional_metadata = list(other_keys = c("state", "pol")))
183+
184+
attr(ex3,"metadata")
185+
```
186+
187+
Note that the two additional keys we added, `state` and `pol`, are specified as a character vector in the `other_keys` component of the `additional_metadata` list. They must be specified in this manner so that downstream actions on the `epi_df`, like model fitting and prediction, can recognize and use these keys.
188+
189+
<!--
190+
Currently `other_keys` metadata in `epi_df` doesn't impact `epi_slide()`, contrary to `other_keys` in `as_epi_archive` which affects how the update data is interpreted.
191+
-->
192+
193+
## Working with `epi_df` objects downstream
194+
195+
Data in `epi_df` format should be easy to work with downstream, since it is a
196+
very standard tabular data format; in the other vignettes, we'll walk through
197+
some basic signal processing tasks using functions provided in the `epiprocess`
198+
package. Of course, we can also write custom code for other downstream uses,
199+
like plotting, which is pretty easy to do `ggplot2`.
200+
201+
```{r, message = FALSE, warning = FALSE}
202+
ggplot(x, aes(x = time_value, y = total_cases, color = geo_value)) +
203+
geom_line() +
204+
scale_color_brewer(palette = "Set1") +
205+
scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
206+
labs(x = "Date", y = "Cumulative COVID-19 cases", color = "State")
207+
```
208+
209+
Finally, we'll examine some data from other packages just to show how
210+
we might get them into `epi_df` format.
211+
The first is data on daily new (not cumulative) SARS
212+
cases in Canada in 2003, from the
213+
[outbreaks](https://github.com/reconverse/outbreaks) package. New cases are
214+
broken into a few categories by provenance.
215+
216+
```{r}
217+
x <- outbreaks::sars_canada_2003 %>%
218+
mutate(geo_value = "ca") %>%
219+
select(geo_value, time_value = date, starts_with("cases")) %>%
220+
as_epi_df(geo_type = "nation")
221+
222+
head(x)
223+
```
224+
225+
```{r}
226+
#| code-fold: true
227+
x <- x %>%
228+
pivot_longer(starts_with("cases"), names_to = "type") %>%
229+
mutate(type = substring(type, 7))
230+
231+
ggplot(x, aes(x = time_value, y = value)) +
232+
geom_col(aes(fill = type), just = 0.5) +
233+
scale_y_continuous(breaks = 0:4*2, expand = expansion(c(0, 0.05))) +
234+
scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
235+
labs(x = "Date", y = "SARS cases in Canada", fill = "Type")
236+
```
237+
238+
This next example examines data on new cases of Ebola in Sierra Leone in 2014 (from the same package).
239+
240+
```{r, message = FALSE}
241+
x <- outbreaks::ebola_sierraleone_2014 %>%
242+
mutate(
243+
cases = ifelse(status == "confirmed", 1, 0),
244+
province = case_when(
245+
district %in% c("Kailahun", "Kenema", "Kono") ~ "Eastern",
246+
district %in% c("Bombali", "Kambia", "Koinadugu",
247+
"Port Loko", "Tonkolili") ~ "Northern",
248+
district %in% c("Bo", "Bonthe", "Moyamba", "Pujehun") ~ "Sourthern",
249+
district %in% c("Western Rural", "Western Urban") ~ "Western")
250+
) %>%
251+
select(geo_value = province, time_value = date_of_onset, cases) %>%
252+
filter(cases == 1) %>%
253+
group_by(geo_value, time_value) %>%
254+
summarise(cases = sum(cases)) %>%
255+
as_epi_df(geo_type = "province")
256+
```
257+
258+
```{r}
259+
#| code-fold: true
260+
#| fig-width: 8
261+
#| fig-height: 6
262+
ggplot(x, aes(x = time_value, y = cases)) +
263+
geom_col(aes(fill = geo_value), show.legend = FALSE) +
264+
facet_wrap(~ geo_value, scales = "free_y") +
265+
scale_x_date(minor_breaks = "month", date_labels = "%b %Y") +
266+
labs(x = "Date", y = "Confirmed cases of Ebola in Sierra Leone")
267+
```
268+
269+
270+
271+
## Attribution {.unnumbered}
272+
273+
This document contains a dataset that is a modified part of the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19) as [republished in the COVIDcast Epidata API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html). This data set is licensed under the terms of the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/) by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020.
274+
275+
[From the COVIDcast Epidata API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html):
276+
These signals are taken directly from the JHU CSSE [COVID-19 GitHub repository](https://github.com/CSSEGISandData/COVID-19) without changes.

epiprocess.qmd

+64
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,64 @@
1+
# Overview
2+
3+
This package introduces a common data structure for epidemiological data sets
4+
measured over space and time, and offers associated utilities to perform basic
5+
signal processing tasks.
6+
7+
## `epi_df`: snapshot of a data set
8+
9+
The first main data structure in the `epiprocess` package is called
10+
[`epi_df`]. This is simply a tibble with a couple of
11+
required columns, `geo_value` and `time_value`. It can have any other number of
12+
columns, which can be seen as measured variables, which we also call signal
13+
variables. In brief, an `epi_df` object represents a snapshot of a data set that
14+
contains the most up-to-date values of the signals variables, as of a given
15+
time.
16+
17+
By convention, functions in the `epiprocess` package that operate on `epi_df`
18+
objects begin with `epi`. For example:
19+
20+
- `epi_slide()`, for iteratively applying a custom computation to a variable in
21+
an `epi_df` object over sliding windows in time;
22+
23+
- `epi_cor()`, for computing lagged correlations between variables in an
24+
`epi_df` object, (allowing for grouping by geo value, time value, or any other
25+
variables).
26+
27+
Functions in the package that operate directly on given variables do not begin
28+
with `epi`. For example:
29+
30+
- `growth_rate()`, for estimating the growth rate of a given signal at given
31+
time values, using various methodologies;
32+
33+
- `detect_outlr()`, for detecting outliers in a given signal over time, using
34+
either built-in or custom methodologies.
35+
36+
## `epi_archive`: full version history of a data set
37+
38+
The second main data structure in the package is called
39+
[`epi_archive`]. This is a special class (R6 format)
40+
wrapped around a data table that stores the archive (version history) of some
41+
signal variables of interest.
42+
43+
By convention, functions in the `epiprocess` package that operate on
44+
`epi_archive` objects begin with `epix` (the "x" is meant to remind you of
45+
"archive"). These are just wrapper functions around the public methods for the
46+
`epi_archive` R6 class. For example:
47+
48+
- `epix_as_of()`, for generating a snapshot in `epi_df` format from the data
49+
archive, which represents the most up-to-date values of the signal variables,
50+
as of the specified version;
51+
52+
- `epix_fill_through_version()`, for filling in some fake version data following
53+
simple rules, for use when downstream methods expect an archive that is more
54+
up-to-date (e.g., if it is a forecasting deadline date and one of our data
55+
sources cannot be accessed to provide the latest versions of its data)
56+
57+
- `epix_merge()`, for merging two data archives with each other, with support
58+
for various approaches to handling when one of the archives is more up-to-date
59+
version-wise than the other;
60+
61+
- `epix_slide()`, for sliding a custom computation to a data archive over local
62+
windows in time, much like `epi_slide` for an `epi_df` object, but with one
63+
key difference: the sliding computation at any given reference time $t$ is
64+
performed only on the **data that would have been available as of $t$**.

0 commit comments

Comments
 (0)