|
| 1 | +# Getting data into epi_df format |
| 2 | + |
| 3 | +```{r} |
| 4 | +#| include: false |
| 5 | +source("_common.R") |
| 6 | +``` |
| 7 | + |
| 8 | +We'll start by showing how to get data into |
| 9 | +`epi_df`, which is just |
| 10 | +a tibble with a bit of special structure, and is the format assumed by all of |
| 11 | +the functions in the `epiprocess` package. An `epi_df` object has (at least) the |
| 12 | +following columns: |
| 13 | + |
| 14 | +* `geo_value`: the geographic value associated with each row of measurements. |
| 15 | +* `time_value`: the time value associated with each row of measurements. |
| 16 | + |
| 17 | +It can have any number of other columns which can serve as measured variables, |
| 18 | +which we also broadly refer to as signal variables. The documentation for |
| 19 | + gives more details about this data format. |
| 20 | + |
| 21 | +A data frame or tibble that has `geo_value` and `time_value` columns can be |
| 22 | +converted into an `epi_df` object, using the function `as_epi_df()`. As an |
| 23 | +example, we'll work with daily cumulative COVID-19 cases from four U.S. states: |
| 24 | +CA, FL, NY, and TX, over time span from mid 2020 to early 2022, and we'll use |
| 25 | +the [`epidatr`](https://github.com/cmu-delphi/epidatr) package |
| 26 | +to fetch this data from the [COVIDcast |
| 27 | +API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html). |
| 28 | + |
| 29 | +```{r, message = FALSE} |
| 30 | +library(epidatr) |
| 31 | +library(epiprocess) |
| 32 | +library(withr) |
| 33 | +
|
| 34 | +cases <- covidcast( |
| 35 | + data_source = "jhu-csse", |
| 36 | + signals = "confirmed_cumulative_num", |
| 37 | + time_type = "day", |
| 38 | + geo_type = "state", |
| 39 | + time_values = epirange(20200301, 20220131), |
| 40 | + geo_values = "ca,fl,ny,tx" |
| 41 | +) %>% fetch() |
| 42 | +
|
| 43 | +colnames(cases) |
| 44 | +``` |
| 45 | + |
| 46 | +As we can see, a data frame returned by `epidatr::covidcast()` has the |
| 47 | +columns required for an `epi_df` object (along with many others). We can use |
| 48 | +`as_epi_df()`, with specification of some relevant metadata, to bring the data |
| 49 | +frame into `epi_df` format. |
| 50 | + |
| 51 | +```{r, message = FALSE} |
| 52 | +x <- as_epi_df(cases, |
| 53 | + geo_type = "state", |
| 54 | + time_type = "day", |
| 55 | + as_of = max(cases$issue)) %>% |
| 56 | + select(geo_value, time_value, total_cases = value) |
| 57 | +
|
| 58 | +class(x) |
| 59 | +summary(x) |
| 60 | +head(x) |
| 61 | +attributes(x)$metadata |
| 62 | +``` |
| 63 | + |
| 64 | +## Some details on metadata |
| 65 | + |
| 66 | +In general, an `epi_df` object has the following fields in its metadata: |
| 67 | + |
| 68 | +* `geo_type`: the type for the geo values. |
| 69 | +* `time_type`: the type for the time values. |
| 70 | +* `as_of`: the time value at which the given data were available. |
| 71 | + |
| 72 | +Metadata for an `epi_df` object `x` can be accessed (and altered) via |
| 73 | +`attributes(x)$metadata`. The first two fields here, `geo_type` and `time_type`, |
| 74 | +are not currently used by any downstream functions in the `epiprocess` package, |
| 75 | +and serve only as useful bits of information to convey about the data set at |
| 76 | +hand. The last field here, `as_of`, is one of the most unique aspects of an |
| 77 | +`epi_df` object. |
| 78 | + |
| 79 | +In brief, we can think of an `epi_df` object as a single snapshot of a data set |
| 80 | +that contains the most up-to-date values of some signals of interest, as of the |
| 81 | +time specified `as_of`. For example, if `as_of` is January 31, 2022, then the |
| 82 | +`epi_df` object has the most up-to-date version of the data available as of |
| 83 | +January 31, 2022. The `epiprocess` package also provides a companion data |
| 84 | +structure called `epi_archive`, which stores the full version history of a given |
| 85 | +data set. See the [archive |
| 86 | +vignette](https://cmu-delphi.github.io/epiprocess/articles/archive.html) for |
| 87 | +more. |
| 88 | + |
| 89 | +If any of the `geo_type`, `time_type`, or `as_of` arguments are missing in a |
| 90 | +call to `as_epi_df()`, then this function will try to infer them from the passed |
| 91 | +object. Usually, `geo_type` and `time_type` can be inferred from the `geo_value` |
| 92 | +and `time_value` columns, respectively, but inferring the `as_of` field is not |
| 93 | +as easy. See the documentation for `as_epi_df()` more details. |
| 94 | + |
| 95 | +```{r} |
| 96 | +x <- as_epi_df(cases) %>% |
| 97 | + select(geo_value, time_value, total_cases = value) |
| 98 | +
|
| 99 | +attributes(x)$metadata |
| 100 | +``` |
| 101 | + |
| 102 | +## Using additional key columns in `epi_df` |
| 103 | + |
| 104 | +In the following examples we will show how to create an `epi_df` with additional keys. |
| 105 | + |
| 106 | +### Converting a `tsibble` that has county code as an extra key |
| 107 | + |
| 108 | +```{r} |
| 109 | +set.seed(12345) |
| 110 | +ex1 <- tibble( |
| 111 | + geo_value = rep(c("ca", "fl", "pa"), each = 3), |
| 112 | + county_code = c("06059", "06061", "06067", "12111", "12113", "12117", |
| 113 | + "42101", "42103", "42105"), |
| 114 | + time_value = rep( |
| 115 | + seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "1 day"), |
| 116 | + length.out = 9), |
| 117 | + value = rpois(9, 5) |
| 118 | +) %>% |
| 119 | + as_tsibble(index = time_value, key = c(geo_value, county_code)) |
| 120 | +
|
| 121 | +ex1 <- as_epi_df(x = ex1, geo_type = "state", time_type = "day", as_of = "2020-06-03") |
| 122 | +``` |
| 123 | + |
| 124 | +The metadata now includes `county_code` as an extra key. |
| 125 | +```{r} |
| 126 | +attr(ex1, "metadata") |
| 127 | +``` |
| 128 | + |
| 129 | + |
| 130 | +### Dealing with misspecified column names |
| 131 | + |
| 132 | +`epi_df` requires there to be columns `geo_value` and `time_value`, if they do not exist then `as_epi_df()` throws an error. |
| 133 | + |
| 134 | +```{r, error = TRUE} |
| 135 | +ex2 <- data.frame( |
| 136 | + state = rep(c("ca", "fl", "pa"), each = 3), # misnamed |
| 137 | + pol = rep(c("blue", "swing", "swing"), each = 3), # extra key |
| 138 | + reported_date = rep( |
| 139 | + seq(as.Date("2020-06-01"), as.Date("2020-06-03"), by = "day"), |
| 140 | + length.out = 9), # misnamed |
| 141 | + value = rpois(9, 5) |
| 142 | +) |
| 143 | +ex2 %>% as_epi_df() |
| 144 | +``` |
| 145 | + |
| 146 | +The columns should be renamed to match `epi_df` format. |
| 147 | + |
| 148 | +```{r} |
| 149 | +ex2 <- ex2 %>% |
| 150 | + rename(geo_value = state, time_value = reported_date) %>% |
| 151 | + as_epi_df(geo_type = "state", |
| 152 | + as_of = "2020-06-03", |
| 153 | + additional_metadata = list(other_keys = "pol") |
| 154 | + ) |
| 155 | +
|
| 156 | +attr(ex2, "metadata") |
| 157 | +``` |
| 158 | + |
| 159 | + |
| 160 | +### Adding additional keys to an `epi_df` object |
| 161 | + |
| 162 | +In the above examples, all the keys are added to objects prior to conversion to |
| 163 | +`epi_df` objects. But this can also be accomplished afterward. |
| 164 | +We'll look at an included dataset and filter to a single state for simplicity. |
| 165 | + |
| 166 | +```{r} |
| 167 | +ex3 <- jhu_csse_county_level_subset %>% |
| 168 | + filter(time_value > "2021-12-01", state_name == "Massachusetts") %>% |
| 169 | + slice_tail(n = 6) |
| 170 | + |
| 171 | +attr(ex3, "metadata") # geo_type is county currently |
| 172 | +``` |
| 173 | + |
| 174 | +Now we add `state` (MA) and `pol` as new columns to the data and as new keys to the metadata. The "state" `geo_type` anticipates lower-case abbreviations, so we'll match that. |
| 175 | + |
| 176 | +```{r} |
| 177 | +ex3 <- ex3 %>% |
| 178 | + as_tibble() %>% # drop the `epi_df` class before adding additional metadata |
| 179 | + mutate( |
| 180 | + state = rep(tolower("MA"), 6), |
| 181 | + pol = rep(c("blue", "swing", "swing"), each = 2)) %>% |
| 182 | + as_epi_df(additional_metadata = list(other_keys = c("state", "pol"))) |
| 183 | +
|
| 184 | +attr(ex3,"metadata") |
| 185 | +``` |
| 186 | + |
| 187 | +Note that the two additional keys we added, `state` and `pol`, are specified as a character vector in the `other_keys` component of the `additional_metadata` list. They must be specified in this manner so that downstream actions on the `epi_df`, like model fitting and prediction, can recognize and use these keys. |
| 188 | + |
| 189 | +<!-- |
| 190 | +Currently `other_keys` metadata in `epi_df` doesn't impact `epi_slide()`, contrary to `other_keys` in `as_epi_archive` which affects how the update data is interpreted. |
| 191 | +--> |
| 192 | + |
| 193 | +## Working with `epi_df` objects downstream |
| 194 | + |
| 195 | +Data in `epi_df` format should be easy to work with downstream, since it is a |
| 196 | +very standard tabular data format; in the other vignettes, we'll walk through |
| 197 | +some basic signal processing tasks using functions provided in the `epiprocess` |
| 198 | +package. Of course, we can also write custom code for other downstream uses, |
| 199 | +like plotting, which is pretty easy to do `ggplot2`. |
| 200 | + |
| 201 | +```{r, message = FALSE, warning = FALSE} |
| 202 | +ggplot(x, aes(x = time_value, y = total_cases, color = geo_value)) + |
| 203 | + geom_line() + |
| 204 | + scale_color_brewer(palette = "Set1") + |
| 205 | + scale_x_date(minor_breaks = "month", date_labels = "%b %Y") + |
| 206 | + labs(x = "Date", y = "Cumulative COVID-19 cases", color = "State") |
| 207 | +``` |
| 208 | + |
| 209 | +Finally, we'll examine some data from other packages just to show how |
| 210 | +we might get them into `epi_df` format. |
| 211 | +The first is data on daily new (not cumulative) SARS |
| 212 | +cases in Canada in 2003, from the |
| 213 | +[outbreaks](https://github.com/reconverse/outbreaks) package. New cases are |
| 214 | +broken into a few categories by provenance. |
| 215 | + |
| 216 | +```{r} |
| 217 | +x <- outbreaks::sars_canada_2003 %>% |
| 218 | + mutate(geo_value = "ca") %>% |
| 219 | + select(geo_value, time_value = date, starts_with("cases")) %>% |
| 220 | + as_epi_df(geo_type = "nation") |
| 221 | +
|
| 222 | +head(x) |
| 223 | +``` |
| 224 | + |
| 225 | +```{r} |
| 226 | +#| code-fold: true |
| 227 | +x <- x %>% |
| 228 | + pivot_longer(starts_with("cases"), names_to = "type") %>% |
| 229 | + mutate(type = substring(type, 7)) |
| 230 | +
|
| 231 | +ggplot(x, aes(x = time_value, y = value)) + |
| 232 | + geom_col(aes(fill = type), just = 0.5) + |
| 233 | + scale_y_continuous(breaks = 0:4*2, expand = expansion(c(0, 0.05))) + |
| 234 | + scale_x_date(minor_breaks = "month", date_labels = "%b %Y") + |
| 235 | + labs(x = "Date", y = "SARS cases in Canada", fill = "Type") |
| 236 | +``` |
| 237 | + |
| 238 | +This next example examines data on new cases of Ebola in Sierra Leone in 2014 (from the same package). |
| 239 | + |
| 240 | +```{r, message = FALSE} |
| 241 | +x <- outbreaks::ebola_sierraleone_2014 %>% |
| 242 | + mutate( |
| 243 | + cases = ifelse(status == "confirmed", 1, 0), |
| 244 | + province = case_when( |
| 245 | + district %in% c("Kailahun", "Kenema", "Kono") ~ "Eastern", |
| 246 | + district %in% c("Bombali", "Kambia", "Koinadugu", |
| 247 | + "Port Loko", "Tonkolili") ~ "Northern", |
| 248 | + district %in% c("Bo", "Bonthe", "Moyamba", "Pujehun") ~ "Sourthern", |
| 249 | + district %in% c("Western Rural", "Western Urban") ~ "Western") |
| 250 | + ) %>% |
| 251 | + select(geo_value = province, time_value = date_of_onset, cases) %>% |
| 252 | + filter(cases == 1) %>% |
| 253 | + group_by(geo_value, time_value) %>% |
| 254 | + summarise(cases = sum(cases)) %>% |
| 255 | + as_epi_df(geo_type = "province") |
| 256 | +``` |
| 257 | + |
| 258 | +```{r} |
| 259 | +#| code-fold: true |
| 260 | +#| fig-width: 8 |
| 261 | +#| fig-height: 6 |
| 262 | +ggplot(x, aes(x = time_value, y = cases)) + |
| 263 | + geom_col(aes(fill = geo_value), show.legend = FALSE) + |
| 264 | + facet_wrap(~ geo_value, scales = "free_y") + |
| 265 | + scale_x_date(minor_breaks = "month", date_labels = "%b %Y") + |
| 266 | + labs(x = "Date", y = "Confirmed cases of Ebola in Sierra Leone") |
| 267 | +``` |
| 268 | + |
| 269 | + |
| 270 | + |
| 271 | +## Attribution {.unnumbered} |
| 272 | + |
| 273 | +This document contains a dataset that is a modified part of the [COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University](https://github.com/CSSEGISandData/COVID-19) as [republished in the COVIDcast Epidata API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html). This data set is licensed under the terms of the [Creative Commons Attribution 4.0 International license](https://creativecommons.org/licenses/by/4.0/) by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020. |
| 274 | + |
| 275 | +[From the COVIDcast Epidata API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html): |
| 276 | + These signals are taken directly from the JHU CSSE [COVID-19 GitHub repository](https://github.com/CSSEGISandData/COVID-19) without changes. |
0 commit comments