New layer in API to support signal preprocessing and renaming #239
How much of this do we think should happen server-side vs. client-side? In the former, if someone requests a smoothed (or whatever other operation is desired) signal, that gets computed and sent from our end, so we eat the compute cost for every single smoother variant they want to try. In the latter, someone makes a single request for the raw data and then uses a smoother object that takes the raw data as input and outputs a smoothed signal, which means they can play around with whatever settings they like locally, and also means they can use the smoothers on their own custom signals if they have them. We'd probably want to provide that smoothing code across popular languages. I read this as doing it server-side, which seems perfectly reasonable, but just wanted to clarify. I think there's a good argument for the client-side method as well, but I don't have a great sense of what our users expect, so I don't have a strong conviction.
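For concreteness, a minimal sketch of the client-side option, assuming the user has already fetched a raw daily signal as a list of values; the `moving_average` helper and its parameters are illustrative, not part of any existing client library.

```python
from typing import List, Optional


def moving_average(values: List[float], window: int = 7) -> List[Optional[float]]:
    """Smooth a raw daily signal with a trailing moving average.

    The first `window - 1` entries are None because the window is incomplete.
    A user could swap in any other smoother (14-day average, kernel, etc.)
    without another round trip to the server.
    """
    smoothed: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            smoothed.append(None)
        else:
            smoothed.append(sum(values[i + 1 - window:i + 1]) / window)
    return smoothed


# Example: raw daily counts fetched once, smoothed locally with two settings.
raw = [10, 12, 9, 15, 20, 18, 22, 25, 19, 21]
print(moving_average(raw, window=7))
print(moving_average(raw, window=3))
```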
Some signals can't be smoothed client-side -- for example, the symptom survey signals are not reported if there are fewer than 100 observations in a time window. But if, server-side, we group together 7 days and there are ≥100 observations, we can report the smoothed version. The client would not have access to the underlying raw data to do the same. However, that's also a reason why we can't support configurable server-side smoothing. To do server-side smoothing for the symptom survey data, we would need to store (privately) the raw survey responses for every day, then, on the fly, calculate weighted estimates for any region and time period requested by the user and apply the minimum sample size filter. That would require completely rewriting the current indicator code so it can be run on the fly on the server side, instead of once in advance, and would be a major undertaking.
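To make that constraint concrete, here is a rough sketch, with invented data structures, of the kind of server-side pooling being described: daily raw survey totals are combined over a 7-day window, and an estimate is reported only if the pooled sample size clears the 100-observation threshold.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical per-day raw data: date -> (sum of responses, sample size).
DailyTotals = Dict[str, Tuple[float, int]]

MIN_SAMPLE_SIZE = 100  # estimates below this sample size are never reported


def pooled_estimate(days: List[str], totals: DailyTotals) -> Optional[float]:
    """Pool several days of raw survey responses and apply the size filter.

    Any single day may fall below the threshold, but the 7-day pool can
    clear it, which is why this can only be done where the raw data live,
    i.e. on the server.
    """
    total_value = sum(totals[d][0] for d in days if d in totals)
    total_n = sum(totals[d][1] for d in days if d in totals)
    if total_n < MIN_SAMPLE_SIZE:
        return None  # suppressed: too few observations
    return total_value / total_n


# Example: no single day clears the threshold, but the 7-day pool does.
totals = {f"2020-05-0{d}": (30.0 * d, 40) for d in range(1, 8)}
print(pooled_estimate(sorted(totals), totals))  # 7 days * 40 = 280 >= 100
```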
@capnrefsmmat Understood. So we don't have to do the symptom survey data that way, at least not initially. My bigger concern is the 8x duplication of Cases, Deaths and Tests, and overall flexibility, naming and deprecation. We can start by just introducing the indirection layer, defaulting it to the current behavior, and then judiciously move some select processing for some signals to server-side or client-side.
IMO, the more complex the API layer gets, the more useful it would be to switch to a Python-based version; see #178. It would allow reusing modules and would also get better performance by using connection pooling, ...
Summarizing some points from a related conversation with @krivard @chinandrew. We discussed centralizing geocoding calculations further. Currently the indicators duplicate a lot of geocoding work. We can remove this duplication by keeping the data at the FIPS level in the database and doing geocoding at query time or ingestion time. This has the advantage of:
The challenges here are:
Good principles for future decisions:
The API server will switch from PHP to Python soon. That could be a good time to centralize geocoding.
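As an illustration of "store at the FIPS level, geocode at query time", here is a sketch of rolling county-level rows up to states with a FIPS-to-state crosswalk; the crosswalk, row layout, and function are stand-ins, not the actual schema or geocoding utilities.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical crosswalk: county FIPS -> state FIPS. In practice this would
# come from a shared geocoding utility rather than being hard-coded.
FIPS_TO_STATE = {"42003": "42", "42007": "42", "06037": "06"}


def aggregate_to_state(rows: List[Tuple[str, float]]) -> Dict[str, float]:
    """Roll county-level (fips, value) rows up to state totals at query time.

    Storing only the FIPS-level data and doing this on the server removes the
    geocoding work currently duplicated inside each indicator.
    """
    out: Dict[str, float] = defaultdict(float)
    for fips, value in rows:
        state = FIPS_TO_STATE.get(fips)
        if state is not None:
            out[state] += value
    return dict(out)


print(aggregate_to_state([("42003", 120.0), ("42007", 30.0), ("06037", 500.0)]))
```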
This decision does not have to be all-or-none. You could have centralized geocoding calculations, but still allow a particular signal to opt out and do its own calculation, or preferably to rely on the centralized calculation but do some post-processing.
There are several significant problems with the EpiData Database & API:

1. We pre-compute and store many signals in every combination of:
- new vs. cumulative
- counts vs. ratios (= normalized by population)
- daily vs. 7-day average
- (also raw vs. smoothed?)
This increases storage by a factor of 8 (16?) for Cases from EACH source (JHU, USAFacts, hybrid?). Same for deaths. Same for covid tests of all types. For some of the other signals, the multiplier is 4 or 2. As we struggle with the growth of our DB (both more signals and a longer time period), we can't afford this waste.
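The factor of 8 is simply the number of combinations of the three binary choices above (16 if raw vs. smoothed is a fourth independent axis):

```python
from itertools import product

axes = [
    ("new", "cumulative"),
    ("count", "ratio"),
    ("daily", "7day_average"),
]
variants = list(product(*axes))
print(len(variants))      # 8 stored copies per signal per source
print(len(variants) * 2)  # 16 if raw vs. smoothed is also pre-computed
```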
2. Each pre-processing step we do, e.g. smoothing, averaging, and converting cumulative counts into 'new' counts, represents just one choice among multiple reasonable choices. For example, some users may want 14-day averaging for some signals. Some may want to allocate "bumps" in cumulative counts differently than we do, e.g. by distributing them uniformly (or proportionally) over some past period, or by eliminating negative adjustments.
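A rough sketch of what a configurable cumulative-to-new conversion could look like, with a made-up option for spreading reporting "bumps" uniformly over the preceding days and another for clipping negative adjustments; these are illustrative policies, not the ones the pipeline uses today.

```python
from typing import List


def cumulative_to_new(
    cumulative: List[float],
    bump_threshold: float = float("inf"),
    spread_days: int = 7,
    clip_negative: bool = False,
) -> List[float]:
    """Convert a cumulative series into daily 'new' values, with options.

    Daily increases larger than `bump_threshold` (e.g. a reporting backlog
    dumped on one day) are spread uniformly over the preceding `spread_days`.
    Negative adjustments can optionally be dropped. Both policies are
    illustrative: the point is that users may want to choose them.
    """
    new = [cumulative[0]] + [
        cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))
    ]
    out = [0.0] * len(new)
    for i, v in enumerate(new):
        if v > bump_threshold:
            start = max(0, i - spread_days + 1)
            share = v / (i - start + 1)
            for j in range(start, i + 1):
                out[j] += share
        elif v < 0 and clip_negative:
            continue  # drop the negative adjustment entirely
        else:
            out[i] += v
    return out


print(cumulative_to_new([1, 2, 3, 103, 104], bump_threshold=50, spread_days=3))
```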
I think we can solve all of the above by introducing a layer of indirection at the highest level of the API calls. A new API call would take additional preprocessing parameters on top of the existing ones (a purely illustrative sketch appears after the list below).
This will give us the flexibility to:
(1) Support a larger (effectively unlimited) range of preprocessing options, which can be easily extended.
(2) Continue to allow some popular combinations to be pre-computed and stored in the DB as 'pre-compiled' signals, while also allowing other combinations to be created on the fly. We can also cache results for commonly asked queries, representing something between "on the fly" and a permanent signal in the DB. The decision among the three options (pre-compiled, cached, or on-the-fly) can be made either manually or automatically, based on frequency of use.
(3) Enable naming evolution by using this layer to support backward compatibility, with some feedback about deprecation. The current API calls would still work, but would be translated into the new calls, and old signal names would be mapped to new signal names for as long as we want, after which they will return a "please switch to the new name" error message.
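As promised above, a purely hypothetical illustration of what such a call and the legacy-name mapping could look like; every parameter name and value here is invented for the sketch, not a proposed spec.

```python
# Hypothetical request against the indirection layer: the raw signal is stored
# once, and the preprocessing is chosen by parameters instead of being baked
# into eight separate stored signals. All parameter names are illustrative.
params = {
    "source": "jhu-csse",
    "signal": "confirmed",            # raw cumulative counts, stored once
    "geo_type": "county",
    "geo_value": "42003",
    "time_values": "20200401-20200430",
    "cumulative_to_new": True,        # indirection-layer options begin here
    "normalize": "per100k",
    "smoother": "average",
    "smoother_window": 14,
}

# Backward compatibility: a legacy signal name becomes shorthand for one
# particular parameter combination, served from a pre-compiled table, a cache,
# or an on-the-fly computation as the server sees fit.
LEGACY_SIGNALS = {
    "confirmed_7dav_incidence_prop": {
        "signal": "confirmed",
        "cumulative_to_new": True,
        "normalize": "per100k",
        "smoother": "average",
        "smoother_window": 7,
        "deprecation_notice": "please switch to the new name",
    },
}
```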