New layer in API to support signal preprocessing and renaming #239

Closed
RoniRos opened this issue Oct 8, 2020 · 7 comments
Labels
Engineering (Used to filter issues when synching with Asana), project proposal (This is a big project deserving of a full project requirements doc)

Comments

@RoniRos
Member

RoniRos commented Oct 8, 2020

There are several significant problems with the EpiData Database & API:

  1. There is a tremendous amount of signal duplication. For example, Cases from the same source are stored as:
     - new vs. cumulative
     - counts vs. ratios (= normalized by population)
     - daily vs. 7-day average
     - (also raw vs. smoothed?)
     This increases storage by a factor of 8 (16?) for Cases from EACH source (JHU, USAFacts, hybrid?). Same for deaths. Same for covid tests of all types. For some of the other signals, the multiplier is 4 or 2. As we struggle with the growth of our DB (both more signals and a longer time period), we can't afford this waste.

  2. The pre-processing we do, e.g. smoothing, averaging, and converting cumulative counts into 'new' counts, each represents just one choice among multiple reasonable choices. For example, some users may want 14-day averaging for some signals. Some may want to allocate "bumps" in cumulative counts differently than we have, e.g. by distributing them uniformly (or proportionately) over some past period, or by eliminating negative adjustments (see the sketch after this list).

  3. As we add sources and signals, our naming sometimes needs to evolve to remain clear and accurate. Right now we are stuck indefinitely with naming decisions we made long ago. We need a more flexible way to evolve while remaining backward compatible, and a process for gradually deprecating old names.
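For illustration, here is a minimal sketch of one such alternative choice from item 2: redistributing a one-day reporting bump uniformly over a trailing window. The function name, the baseline heuristic, and the toy numbers are all hypothetical, not anything the pipeline currently does.

```python
import numpy as np

def redistribute_bump(new_cases, bump_day, window=14):
    """Spread the excess of a one-day reporting bump uniformly over the past window.

    new_cases: daily incident counts; bump_day: index of the backlog dump.
    The 'normal' level is crudely estimated as the median of the preceding days,
    and only the excess above that level is reallocated, so totals are preserved.
    """
    counts = np.asarray(new_cases, dtype=float).copy()
    start = max(0, bump_day - window + 1)
    baseline = np.median(counts[start:bump_day])
    excess = max(counts[bump_day] - baseline, 0.0)
    counts[bump_day] -= excess
    counts[start:bump_day + 1] += excess / (bump_day - start + 1)  # uniform allocation
    return counts

daily = [50, 48, 55, 52, 47, 51, 400, 49]  # day 6 contains a backlog dump
print(redistribute_bump(daily, bump_day=6, window=7))
```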

I think we can solve all of the above by introducing a layer of indirection at the highest level of the API calls. A new API call will have additional parameters, e.g.:

Source=JHU,  Signal=Cases,  Mode=Cumulative, Window=7day, Smoothing="XYZ".

This will give us the flexibility to:
(1) Support a larger (effectively unlimited) range of preprocessing options, which can be easily extended.
(2) Continue to allow some popular combinations to be pre-computed and stored in the DB as 'pre-compiled' signals, while also allowing other combinations to be created on the fly. We can also cache the results of commonly asked queries, representing something between "on the fly" and a permanent signal in the DB. The decision among the three options (pre-compiled, cached, or on-the-fly) can be made either manually or automatically based on frequency of use.
(3) Enable naming evolution by using this layer to support backward compatibility, with some feedback about deprecation. The current API calls would still work, but would be translated into the new calls, and old signal names would be mapped to new signal names for as long as we want, after which they would return a "please switch to the new name" error message.
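To make the shape of this concrete, here is a minimal sketch of what the indirection layer might look like, assuming a Python API server. Everything here is hypothetical: `query`, `fetch_precompiled`, `compute_on_the_fly`, the deprecated-name map, and the example names are placeholders, not existing Epidata code.

```python
from functools import lru_cache

# Hypothetical map from old signal names to new (source, signal) pairs, kept for
# backward compatibility until we decide to retire them entirely.
DEPRECATED_NAMES = {
    "old-source:old_cases_name": ("JHU", "Cases"),
}

# Combinations popular enough to keep pre-computed ('pre-compiled') in the DB.
PRECOMPILED = {("JHU", "Cases", "Cumulative", "7day", "XYZ")}


def fetch_precompiled(*params):
    """Placeholder: read a stored, pre-compiled signal from the database."""
    return f"precompiled result for {params}"


def compute_on_the_fly(*params):
    """Placeholder: derive the requested variant from the base signal."""
    return f"derived result for {params}"


@lru_cache(maxsize=1024)  # caching sits between 'on the fly' and a permanent signal
def query(source, signal, mode="New", window="1day", smoothing=None):
    key = f"{source}:{signal}".lower()
    if key in DEPRECATED_NAMES:
        source, signal = DEPRECATED_NAMES[key]
        print(f"warning: '{key}' is deprecated; please switch to {source}/{signal}")
    params = (source, signal, mode, window, smoothing)
    if params in PRECOMPILED:
        return fetch_precompiled(*params)
    return compute_on_the_fly(*params)


# Example: query("JHU", "Cases", mode="Cumulative", window="7day", smoothing="XYZ")
```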

@krivard krivard transferred this issue from cmu-delphi/covidcast-indicators Oct 8, 2020
@chinandrew
Contributor

How much of this do we think should happen server-side vs. client-side? In the former, if someone requests a smoothed (or otherwise transformed) signal, it gets computed and sent from our end, so we eat the compute cost for every single smoother variant they want to try. In the latter, someone makes a single request for the raw data and then uses a smoother object that takes the raw data as input and outputs a smoothed signal. That means they can play around with whatever settings they like locally, and it also means they can use the smoothers on their own custom signals if they have them. We'd probably want to provide that smoothing code across popular languages.

I read this as proposing the server-side approach, which seems perfectly reasonable, but I just wanted to clarify. I think there's a good argument for the client-side method as well, but I don't have a great sense of what our users expect, so I don't hold a strong conviction.
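For comparison, a minimal sketch of the client-side route, assuming the raw signal has already been fetched into a pandas data frame with `time_value` and `value` columns; the data here is a toy stand-in, not an actual API response.

```python
import pandas as pd

# Toy stand-in for one raw-data API response (one geo, daily values).
raw = pd.DataFrame({
    "time_value": pd.date_range("2020-10-01", periods=10),
    "value": [5, 7, 6, 40, 8, 9, 7, 6, 10, 11],
})

def smooth_signal(df, window=7):
    """Trailing moving average computed client-side on the raw signal."""
    out = df.sort_values("time_value").copy()
    out["value"] = out["value"].rolling(window, min_periods=1).mean()
    return out

# The same raw response can be re-smoothed with any window, or with the user's
# own smoother, without issuing another API request.
print(smooth_signal(raw, window=7))
print(smooth_signal(raw, window=14))
```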

@capnrefsmmat
Contributor

Some signals can't be smoothed client-side -- for example, the symptom survey signals are not reported if there are fewer than 100 observations in a time window. But if, server-side, we group together 7 days and there are ≥100 observations, we can report the smoothed version. The client would not have access to the underlying raw data to do the same.

However, that's also a reason why we can't support configurable server-side smoothing. To do server-side smoothing for the symptom survey data, we would need to store (privately) the raw survey responses for every day, then, on the fly, calculate weighted estimates for any region and time period requested by the user and apply the minimum sample size filter. That would require completely rewriting the current indicator code so it can be run on the fly on the server side, instead of once in advance, and would be a major undertaking.
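A minimal numeric sketch of the constraint being described, with made-up daily counts and a simple pooled weighted mean; the actual survey weighting is more involved than this.

```python
import pandas as pd

MIN_SAMPLE_SIZE = 100  # estimates based on fewer responses are never reported

# Made-up private daily survey aggregates for one region.
daily = pd.DataFrame({
    "date": pd.date_range("2020-10-01", periods=7),
    "n": [12, 30, 25, 18, 9, 22, 14],              # responses per day
    "mean": [2.1, 1.8, 2.4, 2.0, 1.5, 2.2, 1.9],   # daily estimate
})

# No single day clears the threshold, so no daily value can be released,
# and a client never sees the raw data needed to smooth it themselves.
assert (daily["n"] < MIN_SAMPLE_SIZE).all()

# Pooling the 7-day window server-side does clear the threshold, so the
# smoothed estimate can be reported even though the daily ones cannot.
n_total = daily["n"].sum()                               # 130 responses
pooled_mean = (daily["n"] * daily["mean"]).sum() / n_total
if n_total >= MIN_SAMPLE_SIZE:
    print(f"7-day estimate: {pooled_mean:.2f} (n={n_total})")
```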

@RoniRos
Member Author

RoniRos commented Oct 9, 2020

@capnrefsmmat Understood. So we don't have to handle the symptom survey data that way, at least not initially. My bigger concern is the 8x duplication of Cases, Deaths and Tests, and overall flexibility, naming and deprecation. We can start by just introducing the indirection layer, defaulting it to the current behavior, and then judiciously move selected processing for some signals to the server side or the client side.

@sgratzl
Member

sgratzl commented Oct 9, 2020

imo the more complex the API layer gets, the more useful it would be to switch to a Python-based version; see #178. It would allow us to reuse modules and also get better performance through connection pooling, etc.

@krivard krivard added the project proposal label Oct 13, 2020
@SumitDELPHI SumitDELPHI added the Engineering label Dec 2, 2020
@dshemetov
Contributor

dshemetov commented Feb 1, 2021

Summarizing some points from a related conversation with @krivard @chinandrew. We discussed centralizing geocoding calculations further. Currently the indicators duplicate a lot of geocoding work. We could remove this duplication by keeping the data at the FIPS level in the database and doing the geocoding at query time or ingestion time. The advantages:

  1. saving about 20% storage space (there are approximately 3k FIPS codes, 400 MSAs, 300 HRRs, 53 states, 10 HHS regions, and 1 nation),
  2. centralizing geocoding,
  3. simplifying indicator pipelines,
  4. saving maintenance time, i.e. if there is a geocoding bug, we don't have to reissue values.

The challenges here are:

  1. some indicators have particular geocoding edge cases (quidel, combo cases, JHU, nowcasting), which is code complexity that will be carried over to ingestion/query,
  2. it costs development effort that could be spent elsewhere.

Good principles for future decisions:

  1. each indicator should do what is unique to that source/signal, and if all indicators are doing the same operations, that's something we should centralize somewhere (e.g. all indicators feed a fips file into one big geomapper before ingestion)
  2. if geomapping actually is indicator specific (quidel, nowcasting), then maybe we don't extract it
  3. nonetheless, the decision to centralize has to respond to practical concerns, which differ depending on the implementation details. It was decentralized for speed; now that speed is no longer our number one priority, we have to find a balance between the implementation time to fix it and the maintenance load of leaving it broken/decentralized/inefficient.

The API server will switch from PHP to Python soon. That could be a good time to centralize geocoding.
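As a rough sketch of the query- or ingestion-time approach, here is FIPS-to-state aggregation with a crosswalk table; the crosswalk rows, column names, and numbers are placeholders rather than the real geomapper.

```python
import pandas as pd

# Placeholder crosswalk; the real geomapper would carry FIPS -> MSA/HRR/state/
# HHS/nation mappings (and population weights where needed).
crosswalk = pd.DataFrame({
    "fips": ["06001", "06075", "36061"],
    "state": ["ca", "ca", "ny"],
})

# Values stored only at the FIPS level in the database.
fips_values = pd.DataFrame({
    "fips": ["06001", "06075", "36061"],
    "time_value": ["2020-10-08"] * 3,
    "value": [120, 80, 200],
})

# Derive the state-level signal at query (or ingestion) time instead of storing it.
state_values = (
    fips_values.merge(crosswalk, on="fips")
    .groupby(["state", "time_value"], as_index=False)["value"]
    .sum()
)
print(state_values)
```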

@RoniRos
Member Author

RoniRos commented Feb 2, 2021

This decision does not have to be all-or-none. You could have centralized geocoding calculations, but still allow a particular signal to opt out and do its own calculation, or preferably rely on the centralized calculation but do some post-processing.

@krivard
Contributor

krivard commented Feb 3, 2023

Signal transformations are being tracked in #607 and #608

Moving renames to its own issue and linking back to the discussion here.

@krivard krivard closed this as completed Feb 3, 2023