Consider incremental covidcast meta data update #368
Comments
Yep. I believe @melange396 is working on a draft of this. Brief summary of offline discussion:
assumption: values older than a month are unlikely to change
Inpatient, outpatient, and testing signals have 60-80 days of backfill. The typical change between first and final issue is less than 20%, but individual regions can change by 300% or more.
some things we've thought about in response to this:
different definitions of incremental
things we can ignore
incremental math
mean is easy; stdev is slightly more complicated. here's a closed form for adding a single sample; some work/research is needed for adding a whole batch efficiently.
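For reference, one well-known closed form for a single-sample update is Welford's algorithm. Whether that is the exact formula linked above is an assumption, since the link target is not preserved here. A minimal sketch, with the hypothetical class name `RunningStats` and population (not sample) variance chosen for illustration:

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Running count, mean, and variance via Welford's single-sample update.

    The class and field names are illustrative, not taken from the covidcast code.
    """

    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the current mean

    def add(self, x: float) -> None:
        """Fold one new sample into the running statistics."""
        self.count += 1
        delta = x - self.mean            # deviation from the old mean
        self.mean += delta / self.count  # updated mean
        self.m2 += delta * (x - self.mean)  # uses both old and new mean

    @property
    def variance(self) -> float:
        """Population variance (divide by n); NaN when no samples were added."""
        return self.m2 / self.count if self.count else float("nan")

    @property
    def stdev(self) -> float:
        return math.sqrt(self.variance)


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(value)
print(stats.count, stats.mean, stats.stdev)
```

Only three numbers per signal need to be stored between runs (count, mean, and the sum of squared deviations), so the full history does not have to be rescanned to add a new row.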
Just wanted to add on that the previous formula is also the parallel algorithm for variance, with some discussion about its numerical stability and the associated paper.
Duplicate of #289
taking a look at:
a lot of these values could easily be computed incrementally (min, max), others with some effort (avg, std, count distinct). There are statistically stable incremental versions of avg/std that we could explore. If we cannot compute the changed rows per run, we could use a multi-stage approach like:
general assumption: it is unlikely that values older than a month will get a new issue.
in the end it depends on finding a good method for computing the mean/avg and variance/standard deviation incrementally (or in batches).
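As a rough illustration of the batch case, the pairwise merge from the parallel variance algorithm (Chan et al.) combines two summaries without revisiting the underlying rows. The sketch below is an assumption about how that could look, not the project's implementation: `Summary`, `summarize`, and `merge` are hypothetical names, and it only covers rows that are appended. Rows that are reissued with changed values would need separate handling (min and max in particular cannot be "un-merged").

```python
import math
from typing import NamedTuple, Sequence


class Summary(NamedTuple):
    """Per-batch statistics that can be merged without the raw rows."""

    count: int
    mean: float
    m2: float         # sum of squared deviations from the batch mean
    min_value: float
    max_value: float


def summarize(values: Sequence[float]) -> Summary:
    """Compute the summary of one batch directly from its values."""
    n = len(values)
    if n == 0:
        # Identity element for merge(): contributes nothing to any statistic.
        return Summary(0, 0.0, 0.0, math.inf, -math.inf)
    mean = math.fsum(values) / n
    m2 = math.fsum((v - mean) ** 2 for v in values)
    return Summary(n, mean, m2, min(values), max(values))


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two summaries using the pairwise (parallel) variance update."""
    n = a.count + b.count
    if n == 0:
        return a
    delta = b.mean - a.mean
    mean = a.mean + delta * b.count / n
    m2 = a.m2 + b.m2 + delta * delta * a.count * b.count / n
    return Summary(
        n,
        mean,
        m2,
        min(a.min_value, b.min_value),
        max(a.max_value, b.max_value),
    )


def stdev(s: Summary) -> float:
    """Population standard deviation of a non-empty summary."""
    return math.sqrt(s.m2 / s.count)


old = summarize([2, 4, 4, 4])   # e.g. statistics cached from the previous run
new = summarize([5, 5, 7, 9])   # e.g. rows added since then
combined = merge(old, new)
print(combined, stdev(combined))
```

Under the "older than a month rarely changes" assumption, one possible use is to cache a summary for the stable older range and recompute only the recent window on each run, then merge the two.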