+ "markdown": "# Correlate signals over space and time\n\n\nThe `epiprocess` package provides some simple functionality for computing lagged\ncorrelations between two signals, over space or time (or other variables), via\n`epi_cor()`. This function is really just a convenience wrapper over some basic\ncommands: it first performs specified time shifts, then computes correlations,\ngrouped in a specified way. In this vignette, we'll examine correlations between\nstate-level COVID-19 case and death rates, smoothed using 7-day trailing\naverages.\n\n\n\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-2_081a29de33f56cfd09230e962a2fd7a1'}\n\n```{.r .cell-code}\nx <- covid_case_death_rates\n```\n:::\n\n\n## Correlations grouped by time\n\nThe `epi_cor()` function operates on an `epi_df` object, and it requires further\nspecification of the variables to correlate, in its next two arguments (`var1`\nand `var2`).\n\nIn general, we can specify any grouping variable (or combination of variables)\nfor the correlation computations in a call to `epi_cor()`, via the `cor_by`\nargument. This potentially leads to many ways to compute correlations. There are\nalways at least two ways to compute correlations in an `epi_df`: grouping by\ntime value, and by geo value. The former is obtained via `cor_by = time_value`.\n\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-3_17a4851fe6549e0c65934b93f3c0eb7d'}\n\n```{.r .cell-code}\nz1 <- epi_cor(x, case_rate, death_rate, cor_by = \"time_value\")\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-4_79711542aad5bf3e922f62f048ded617'}\n\n```{.r .cell-code code-fold=\"true\"}\nggplot(z1, aes(x = time_value, y = cor)) +\n geom_hline(yintercept = 0) +\n geom_line(color = 4) +\n scale_x_date(minor_breaks = \"month\", date_labels = \"%b %Y\") +\n labs(x = \"Date\", y = \"Correlation\")\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=90%}\n:::\n:::\n\n\nThe above plot addresses the question: \"on any given day, are case and death\nrates linearly associated, across the U.S. states?\". We might be interested in\nbroadening this question, instead asking: \"on any given day, do higher case\nrates tend to associate with higher death rates?\", removing the dependence on a\nlinear relationship. The latter can be addressed using Spearman correlation,\naccomplished by setting `method = \"spearman\"` in the call to `epi_cor()`.\nSpearman correlation is highly robust and invariant to monotone transformations.\n\n## Lagged correlations\n\nWe might also be interested in how case rates associate with death rates in the\n*future*. Using the `dt1` parameter in `epi_cor()`, we can lag case rates back\nany number of days we want, before calculating correlations. Below, we set `dt1\n= -10`. This means that `var1 = case_rate` will be lagged by 10 days, so that\ncase rates on June 1st will be correlated with death rates on June 11th. (It\nmight also help to think of it this way: death rates on a certain day will be\ncorrelated with case rates at an offset of -10 days.)\n\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-5_c9c40c0d1a722107c239e63a3a966415'}\n\n```{.r .cell-code}\nz2 <- epi_cor(x, case_rate, death_rate, cor_by = time_value, dt1 = -10)\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-6_367fad1beaf98263a83ae9d0d36b86bf'}\n\n```{.r .cell-code code-fold=\"true\"}\nz <- rbind(\n z1 %>% mutate(lag = 0),\n z2 %>% mutate(lag = 10)\n) %>%\n mutate(lag = as.factor(lag))\n\nggplot(z, aes(x = time_value, y = cor)) +\n geom_hline(yintercept = 0) +\n geom_line(aes(color = lag)) +\n scale_color_brewer(palette = \"Set1\") +\n scale_x_date(minor_breaks = \"month\", date_labels = \"%b %Y\") +\n labs(x = \"Date\", y = \"Correlation\", col = \"Lag\")\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=90%}\n:::\n:::\n\n\nNote that `epi_cor()` takes an argument `shift_by` that determines the grouping\nto use for the time shifts. The default is `geo_value`, which makes sense in our\nproblem at hand (but in another setting, we may want to group by geo value and\nanother variable---say, age---before time shifting).\n\nWe can see that, generally, lagging the case rates back by 10 days improves the\ncorrelations, confirming case rates are better correlated with death rates 10\ndays from now.\n\n## Correlations grouped by state\n\nThe second option we have is to group by geo value, obtained by setting `cor_by\n= geo_value`. We'll again look at correlations for both 0- and 10-day lagged \ncase rates.\n\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-7_6264ff17fede55338817acb4f6c9cf83'}\n\n```{.r .cell-code}\nz1 <- epi_cor(x, case_rate, death_rate, cor_by = geo_value)\nz2 <- epi_cor(x, case_rate, death_rate, cor_by = geo_value, dt1 = -10)\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-8_b53f34c4fb7b12c015535702b7c4fe2b'}\n\n```{.r .cell-code code-fold=\"true\"}\nz <- rbind(\n z1 %>% mutate(lag = 0),\n z2 %>% mutate(lag = 10)\n) %>%\n mutate(lag = as.factor(lag))\n\nggplot(z, aes(cor)) +\n geom_density(aes(fill = lag, col = lag), alpha = 0.5, bounds = c(-1, 1)) +\n scale_fill_brewer(palette = \"Set1\") +\n scale_color_brewer(palette = \"Set1\") +\n labs(x = \"Correlation\", y = \"Density\", fill = \"Lag\", col = \"Lag\")\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=90%}\n:::\n:::\n\n\nWe can again see that, generally speaking, lagging the case rates back by 10 \ndays improves the correlations.\n\n## More systematic lag analysis\n\nNext we perform a more systematic investigation of the correlations over a broad \nrange of lag values. \n\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-9_a70cecd97ad03567621ffbdfb1f5025b'}\n\n```{.r .cell-code}\nlags <- 0:35\n\nz <- map(\n .x = lags,\n ~ epi_cor(x, case_rate, death_rate, cor_by = geo_value, dt1 = -.x) %>%\n mutate(lag = .x)\n) %>% list_rbind()\n```\n:::\n\n::: {.cell layout-align=\"center\" hash='correlations_cache/html/unnamed-chunk-10_12c7877a106190f5c7d4fdb8683cb6d7'}\n\n```{.r .cell-code code-fold=\"true\"}\nz %>%\n group_by(lag) %>%\n summarize(mean = mean(cor, na.rm = TRUE)) %>%\n ggplot(aes(x = lag, y = mean)) +\n geom_line(color = 4) +\n geom_point(color = 4) +\n labs(x = \"Lag\", y = \"Mean correlation\")\n```\n\n::: {.cell-output-display}\n{fig-align='center' width=90%}\n:::\n:::\n\n\nWe can see pronounced curvature in the average correlation between \ncase and death rates (where the correlations come from grouping by `geo_value`) as\na function of lag. The maximum occurs at a lag of somewhere around 17 days.\n",
0 commit comments