
Consider incremental covidcast metadata update #368


Closed

sgratzl opened this issue Jan 12, 2021 · 4 comments

sgratzl commented Jan 12, 2021

Taking a look at the metadata query:

```sql
MIN(t.`time_value`) AS `min_time`,
MAX(t.`time_value`) AS `max_time`,
COUNT(DISTINCT t.`geo_value`) AS `num_locations`,
MIN(`value`) AS `min_value`,
MAX(`value`) AS `max_value`,
ROUND(AVG(`value`),7) AS `mean_value`,
ROUND(STD(`value`),7) AS `stdev_value`,
MAX(`value_updated_timestamp`) AS `last_update`,
MAX(`issue`) as `max_issue`,
MIN(`lag`) as `min_lag`,
MAX(`lag`) as `max_lag`
```

A lot of these values can easily be computed incrementally (min, max); others take some effort (avg, std, count distinct). There are numerically stable incremental versions of avg/std that we could explore. If we cannot determine the changed rows per run, we could use a multi-stage approach like:

General assumption: it is unlikely that values older than a month will get a new issue.

  1. we compute the metadata for all data that is older than a cutoff, e.g., December 2020, along with the extra information needed for incremental updates (count, list of distinct geo_values, ...)
  2. each day we compute the metadata for all data newer than December 2020
  3. we merge the old cached values with the new ones to produce the final metadata
  4. every (two) weeks we shift the split date to reduce the amount of data we have to recompute during each update

In the end it depends on finding a good method for incrementally (or in batches) computing the mean and the variance/standard deviation.
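A minimal sketch of what steps 1-3 could look like, assuming we cache per-group partial aggregates (the class and field names here are hypothetical, not existing code); the variance merge uses the standard pairwise-combination formula:

```python
import math
from dataclasses import dataclass, field


@dataclass
class PartialStats:
    """Cached partial aggregates for one (source, signal, geo_type) group."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean
    min_value: float = math.inf
    max_value: float = -math.inf
    geo_values: set = field(default_factory=set)  # for COUNT(DISTINCT geo_value)


def merge(a: PartialStats, b: PartialStats) -> PartialStats:
    """Combine the cached 'old' partition with a freshly computed 'new' one
    (pairwise mean/variance combination; assumes a.count + b.count > 0)."""
    n = a.count + b.count
    delta = b.mean - a.mean
    return PartialStats(
        count=n,
        mean=a.mean + delta * b.count / n,
        m2=a.m2 + b.m2 + delta * delta * a.count * b.count / n,
        min_value=min(a.min_value, b.min_value),
        max_value=max(a.max_value, b.max_value),
        geo_values=a.geo_values | b.geo_values,
    )


def stdev(s: PartialStats) -> float:
    # population standard deviation, matching MySQL's STD()
    return math.sqrt(s.m2 / s.count)
```

The other aggregates in the query (min_time/max_time, last_update, max_issue, min_lag/max_lag) are plain MIN/MAX and merge the same way as min_value/max_value, so they are omitted here.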

krivard commented Jan 12, 2021

Yep. I believe @melange396 is working on a draft of this. Brief summary of offline discussion --

assumption: values older than a month are unlikely to change

Inpatient, outpatient, and testing signals have 60-80 days of backfill. Typical change between first and final issue is less than 20%, but individual regions can change by 300% or more.

some things we've thought about in response to this:

  • prototype incremental updates in a separate table so we can easily compare with the exact values
  • run a full exact update once a week (or otherwise periodically) to correct any estimates that have slipped

different definitions of incremental

  • just today's issue
  • two-way split at a date 1-2 months in the past (proposed above)
  • monthly or quarterly bins (makes it easier to support mass backissue uploads or extreme backfill)

things we can ignore

  • number of regions (do we even have users who need this?)

incremental math

mean is easy; stdev is slightly more complicated. here's a closed form for adding a single sample; some work/research needed for adding a whole batch efficiently.
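For reference, a sketch of that single-sample closed form (Welford's online algorithm; names are illustrative, not existing code):

```python
def add_sample(count: int, mean: float, m2: float, x: float):
    """Fold one new sample x into a running (count, mean, M2) triple;
    stdev = sqrt(m2 / count) once all samples are folded in."""
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)  # note: uses the already-updated mean
    return count, mean, m2
```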

krivard commented Jan 12, 2021

[image: closed-form formula for the incremental mean/stdev update]

eujing commented Jan 12, 2021

Just wanted to add that the previous formula is also the parallel algorithm for variance, with some discussion of its numerical stability and the associated paper.
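For reference, the pairwise combination at the core of that parallel algorithm (standard textbook form; the notation is mine, not from the thread): for two batches $A$ and $B$ with counts $n_A, n_B$, means $\bar{x}_A, \bar{x}_B$, and sums of squared deviations $M_{2,A}, M_{2,B}$,

$$
n_{AB} = n_A + n_B, \qquad \delta = \bar{x}_B - \bar{x}_A, \qquad
\bar{x}_{AB} = \bar{x}_A + \delta\,\frac{n_B}{n_{AB}}, \qquad
M_{2,AB} = M_{2,A} + M_{2,B} + \delta^2\,\frac{n_A\,n_B}{n_{AB}}.
$$

This handles whole-batch updates directly, which is the open question from the earlier comment.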

melange396 commented Jan 14, 2021
Duplicate of #289

melange396 marked this as a duplicate of #289 on Jan 14, 2021