
Consider incremental covidcast metadata update #368


Closed

sgratzl opened this issue Jan 12, 2021 · 4 comments

sgratzl commented Jan 12, 2021

Taking a look at the metadata query:

```sql
MIN(t.`time_value`) AS `min_time`,
MAX(t.`time_value`) AS `max_time`,
COUNT(DISTINCT t.`geo_value`) AS `num_locations`,
MIN(`value`) AS `min_value`,
MAX(`value`) AS `max_value`,
ROUND(AVG(`value`),7) AS `mean_value`,
ROUND(STD(`value`),7) AS `stdev_value`,
MAX(`value_updated_timestamp`) AS `last_update`,
MAX(`issue`) as `max_issue`,
MIN(`lag`) as `min_lag`,
MAX(`lag`) as `max_lag`
```

A lot of these values can easily be computed incrementally (min, max); others take some effort (avg, std, count distinct). There are numerically stable incremental versions of avg/std that we could explore. If we cannot determine the changed rows per run, we could use a multi-stage approach like:

General assumption: it is unlikely that values older than a month will get a new issue.

  1. we compute the metadata for all data that is older than a cutoff, e.g., December 2020, along with the extra information needed for incremental updates (count, list of distinct geo_values, ...)
  2. each day we compute the metadata for all data newer than December 2020
  3. we merge the old cached values with the new ones to produce the final metadata
  4. every (two) weeks we shift the split date to reduce the amount of data we have to recompute during each update

In the end it depends on finding a good method for incrementally (or in batches) computing the mean and the variance/standard deviation.
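A minimal sketch of what steps 1-3 could look like, assuming we cache per-group partial aggregates (the class and field names here are hypothetical, not existing code); the variance merge uses the standard pairwise-combination formula:

```python
import math
from dataclasses import dataclass, field


@dataclass
class PartialStats:
    """Cached partial aggregates for one (source, signal, geo_type) group."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean
    min_value: float = math.inf
    max_value: float = -math.inf
    geo_values: set = field(default_factory=set)  # for COUNT(DISTINCT geo_value)


def merge(a: PartialStats, b: PartialStats) -> PartialStats:
    """Combine the cached 'old' partition with a freshly computed 'new' one
    (pairwise mean/variance combination; assumes a.count + b.count > 0)."""
    n = a.count + b.count
    delta = b.mean - a.mean
    return PartialStats(
        count=n,
        mean=a.mean + delta * b.count / n,
        m2=a.m2 + b.m2 + delta * delta * a.count * b.count / n,
        min_value=min(a.min_value, b.min_value),
        max_value=max(a.max_value, b.max_value),
        geo_values=a.geo_values | b.geo_values,
    )


def stdev(s: PartialStats) -> float:
    # population standard deviation, matching MySQL's STD()
    return math.sqrt(s.m2 / s.count)
```

The other aggregates in the query (min_time/max_time, last_update, max_issue, min_lag/max_lag) are plain MIN/MAX and merge the same way as min_value/max_value, so they are omitted here.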

krivard commented Jan 12, 2021

Yep. I believe @melange396 is working on a draft of this. Brief summary of offline discussion --

assumption: values older than a month are unlikely to change

Inpatient, outpatient, and testing signals have 60-80 days of backfill. Typical change between first and final issue is less than 20%, but individual regions can change by 300% or more.

some things we've thought about in response to this:

  • prototype incremental updates in a separate table so we can easily compare with the exact values
  • run a full exact update once a week (or otherwise periodically) to correct any estimates that have slipped

different definitions of incremental

  • just today's issue
  • two-way split at a date 1-2 months in the past (proposed above)
  • monthly or quarterly bins (makes it easier to support mass backissue uploads or extreme backfill)

things we can ignore

  • number of regions (do we even have users who need this?)

incremental math

mean is easy; stdev is slightly more complicated. here's a closed form for adding a single sample; some work/research needed for adding a whole batch efficiently.
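For reference, a sketch of that single-sample closed form (Welford's online algorithm; names are illustrative, not existing code):

```python
def add_sample(count: int, mean: float, m2: float, x: float):
    """Fold one new sample x into a running (count, mean, M2) triple;
    stdev = sqrt(m2 / count) once all samples are folded in."""
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)  # note: uses the already-updated mean
    return count, mean, m2
```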

krivard commented Jan 12, 2021

[image: closed-form formula for the incremental mean/stdev update]

eujing commented Jan 12, 2021

Just wanted to add that the previous formula is also the parallel algorithm for variance, with some discussion of its numerical stability and the associated paper.
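For reference, the pairwise combination at the core of that parallel algorithm (standard textbook form; the notation is mine, not from the thread): for two batches $A$ and $B$ with counts $n_A, n_B$, means $\bar{x}_A, \bar{x}_B$, and sums of squared deviations $M_{2,A}, M_{2,B}$,

$$
n_{AB} = n_A + n_B, \qquad \delta = \bar{x}_B - \bar{x}_A, \qquad
\bar{x}_{AB} = \bar{x}_A + \delta\,\frac{n_B}{n_{AB}}, \qquad
M_{2,AB} = M_{2,A} + M_{2,B} + \delta^2\,\frac{n_A\,n_B}{n_{AB}}.
$$

This handles whole-batch updates directly, which is the open question from the earlier comment.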

melange396 commented Jan 14, 2021
Duplicate of #289

melange396 marked this as a duplicate of #289 on Jan 14, 2021