New layer in API to support signal preprocessing and renaming #239

Closed
RoniRos opened this issue Oct 8, 2020 · 7 comments
Labels
Engineering (Used to filter issues when synching with Asana), project proposal (This is a big project deserving of a full project requirements doc)

Comments

@RoniRos
Member

RoniRos commented Oct 8, 2020

There are several significant problems with the EpiData Database & API:

  1. There is a tremendous amount of signal duplication. For example, Cases from the same source are stored as:
     - new vs. cumulative
     - counts vs. ratios (= normalized by population)
     - daily vs. 7-day average
     - (also raw vs. smoothed?)
     This increases storage by a factor of 8 (16?) for Cases from EACH source (JHU, USAFacts, hybrid?). Same for deaths. Same for covid tests of all types. For some of the other signals, the multiplier is 4 or 2. As we struggle with the growth of our DB (both more signals and a longer time period), we can't afford this waste.

  2. The pre-processing we do, e.g. smoothing, averaging, and converting cumulative counts into 'new' counts, each represents just one choice among multiple reasonable choices. For example, some users may want 14-day averaging for some signals. Some may want to allocate "bumps" in cumulative counts differently than we have, e.g. by distributing them uniformly (or proportionately) over some past period, or by eliminating negative adjustments (see the sketch after this list).

  3. As we add sources and signals, our naming sometimes needs to evolve to remain clear and accurate. Right now we are stuck indefinitely with naming decisions we made long ago. We need a more flexible way to evolve while remaining backward compatible, and a process for gradually deprecating old names.
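For illustration, here is a minimal sketch of one such alternative choice from item 2: redistributing a one-day reporting bump uniformly over a trailing window. The function name, the baseline heuristic, and the toy numbers are all hypothetical, not anything the pipeline currently does.

```python
import numpy as np

def redistribute_bump(new_cases, bump_day, window=14):
    """Spread the excess of a one-day reporting bump uniformly over the past window.

    new_cases: daily incident counts; bump_day: index of the backlog dump.
    The 'normal' level is crudely estimated as the median of the preceding days,
    and only the excess above that level is reallocated, so totals are preserved.
    """
    counts = np.asarray(new_cases, dtype=float).copy()
    start = max(0, bump_day - window + 1)
    baseline = np.median(counts[start:bump_day])
    excess = max(counts[bump_day] - baseline, 0.0)
    counts[bump_day] -= excess
    counts[start:bump_day + 1] += excess / (bump_day - start + 1)  # uniform allocation
    return counts

daily = [50, 48, 55, 52, 47, 51, 400, 49]  # day 6 contains a backlog dump
print(redistribute_bump(daily, bump_day=6, window=7))
```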

I think we can solve all of the above by introducing a layer of indirection at the highest level of the API calls. A new API call will have additional parameters, e.g.:

Source=JHU,  Signal=Cases,  Mode=Cumulative, Window=7day, Smoothing="XYZ".

This will give us the flexibility to:
(1) Support a larger (effectively unlimited) range of preprocessing options, which can be easily extended.
(2) Continue to allow some popular combinations to be pre-computed and stored in the DB as 'pre-compiled' signals, while also allowing other combinations to be created on the fly. We can also cache the results of commonly asked queries, representing something between "on the fly" and a permanent signal in the DB. The decision among the three options (pre-compiled, cached, or on-the-fly) can be made either manually or automatically based on frequency of use.
(3) Enable naming evolution by using this layer to support backward compatibility, with some feedback about deprecation. The current API calls would still work, but would be translated into the new calls, and old signal names would be mapped to new signal names for as long as we want, after which they would return a "please switch to the new name" error message.
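To make the shape of this concrete, here is a minimal sketch of what the indirection layer might look like, assuming a Python API server. Everything here is hypothetical: `query`, `fetch_precompiled`, `compute_on_the_fly`, the deprecated-name map, and the example names are placeholders, not existing Epidata code.

```python
from functools import lru_cache

# Hypothetical map from old signal names to new (source, signal) pairs, kept for
# backward compatibility until we decide to retire them entirely.
DEPRECATED_NAMES = {
    "old-source:old_cases_name": ("JHU", "Cases"),
}

# Combinations popular enough to keep pre-computed ('pre-compiled') in the DB.
PRECOMPILED = {("JHU", "Cases", "Cumulative", "7day", "XYZ")}


def fetch_precompiled(*params):
    """Placeholder: read a stored, pre-compiled signal from the database."""
    return f"precompiled result for {params}"


def compute_on_the_fly(*params):
    """Placeholder: derive the requested variant from the base signal."""
    return f"derived result for {params}"


@lru_cache(maxsize=1024)  # caching sits between 'on the fly' and a permanent signal
def query(source, signal, mode="New", window="1day", smoothing=None):
    key = f"{source}:{signal}".lower()
    if key in DEPRECATED_NAMES:
        source, signal = DEPRECATED_NAMES[key]
        print(f"warning: '{key}' is deprecated; please switch to {source}/{signal}")
    params = (source, signal, mode, window, smoothing)
    if params in PRECOMPILED:
        return fetch_precompiled(*params)
    return compute_on_the_fly(*params)


# Example: query("JHU", "Cases", mode="Cumulative", window="7day", smoothing="XYZ")
```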

@krivard krivard transferred this issue from cmu-delphi/covidcast-indicators Oct 8, 2020
@chinandrew
Contributor

How much of this do we think should happen server-side vs. client-side? In the former, if someone requests a smoothed (or otherwise transformed) signal, it gets computed and sent from our end, so we eat the compute cost for every single smoother variant they want to try. In the latter, someone makes a single request for the raw data and then uses a smoother object that takes the raw data as input and outputs a smoothed signal. That means they can play around with whatever settings they like locally, and it also means they can use the smoothers on their own custom signals if they have them. We'd probably want to provide that smoothing code across popular languages.

I read this as proposing the server-side approach, which seems perfectly reasonable, but I just wanted to clarify. I think there's a good argument for the client-side method as well, but I don't have a great sense of what our users expect, so I don't hold a strong conviction.
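For comparison, a minimal sketch of the client-side route, assuming the raw signal has already been fetched into a pandas data frame with `time_value` and `value` columns; the data here is a toy stand-in, not an actual API response.

```python
import pandas as pd

# Toy stand-in for one raw-data API response (one geo, daily values).
raw = pd.DataFrame({
    "time_value": pd.date_range("2020-10-01", periods=10),
    "value": [5, 7, 6, 40, 8, 9, 7, 6, 10, 11],
})

def smooth_signal(df, window=7):
    """Trailing moving average computed client-side on the raw signal."""
    out = df.sort_values("time_value").copy()
    out["value"] = out["value"].rolling(window, min_periods=1).mean()
    return out

# The same raw response can be re-smoothed with any window, or with the user's
# own smoother, without issuing another API request.
print(smooth_signal(raw, window=7))
print(smooth_signal(raw, window=14))
```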

@capnrefsmmat
Contributor

Some signals can't be smoothed client-side -- for example, the symptom survey signals are not reported if there are fewer than 100 observations in a time window. But if, server-side, we group together 7 days and there are ≥100 observations, we can report the smoothed version. The client would not have access to the underlying raw data to do the same.

However, that's also a reason why we can't support configurable server-side smoothing. To do server-side smoothing for the symptom survey data, we would need to store (privately) the raw survey responses for every day, then, on the fly, calculate weighted estimates for any region and time period requested by the user and apply the minimum sample size filter. That would require completely rewriting the current indicator code so it can be run on the fly on the server side, instead of once in advance, and would be a major undertaking.
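A minimal numeric sketch of the constraint being described, with made-up daily counts and a simple pooled weighted mean; the actual survey weighting is more involved than this.

```python
import pandas as pd

MIN_SAMPLE_SIZE = 100  # estimates based on fewer responses are never reported

# Made-up private daily survey aggregates for one region.
daily = pd.DataFrame({
    "date": pd.date_range("2020-10-01", periods=7),
    "n": [12, 30, 25, 18, 9, 22, 14],              # responses per day
    "mean": [2.1, 1.8, 2.4, 2.0, 1.5, 2.2, 1.9],   # daily estimate
})

# No single day clears the threshold, so no daily value can be released,
# and a client never sees the raw data needed to smooth it themselves.
assert (daily["n"] < MIN_SAMPLE_SIZE).all()

# Pooling the 7-day window server-side does clear the threshold, so the
# smoothed estimate can be reported even though the daily ones cannot.
n_total = daily["n"].sum()                               # 130 responses
pooled_mean = (daily["n"] * daily["mean"]).sum() / n_total
if n_total >= MIN_SAMPLE_SIZE:
    print(f"7-day estimate: {pooled_mean:.2f} (n={n_total})")
```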

@RoniRos
Member Author

RoniRos commented Oct 9, 2020

@capnrefsmmat Understood. So we don't have to handle the symptom survey data that way, at least not initially. My bigger concern is the 8x duplication of Cases, Deaths and Tests, and overall flexibility, naming and deprecation. We can start by just introducing the indirection layer, defaulting it to the current behavior, and then judiciously move selected processing for some signals to the server side or the client side.

@sgratzl
Member

sgratzl commented Oct 9, 2020

imo the more complex the API layer gets, the more useful it would be to switch to a Python-based version; see #178. It would allow us to reuse modules and also get better performance through connection pooling, etc.

@krivard krivard added the project proposal label Oct 13, 2020
@SumitDELPHI SumitDELPHI added the Engineering label Dec 2, 2020
@dshemetov
Contributor

dshemetov commented Feb 1, 2021

Summarizing some points from a related conversation with @krivard @chinandrew. We discussed centralizing geocoding calculations further. Currently the indicators duplicate a lot of geocoding work. We could remove this duplication by keeping the data at the FIPS level in the database and doing the geocoding at query time or ingestion time. The advantages:

  1. saving about 20% storage space (there are approximately 3k FIPS codes, 400 MSAs, 300 HRRs, 53 states, 10 HHS regions, and 1 nation),
  2. centralizing geocoding,
  3. simplifying indicator pipelines,
  4. saving maintenance time, i.e. if there is a geocoding bug, we don't have to reissue values.

The challenges here are:

  1. some indicators have particular geocoding edge cases (quidel, combo cases, JHU, nowcasting), which is code complexity that will be carried over to ingestion/query,
  2. it costs development effort that could be spent elsewhere.

Good principles for future decisions:

  1. each indicator should do what is unique to that source/signal, and if all indicators are doing the same operations, that's something we should centralize somewhere (e.g. all indicators feed a fips file into one big geomapper before ingestion)
  2. if geomapping actually is indicator specific (quidel, nowcasting), then maybe we don't extract it
  3. nonetheless, the decision to centralize has to respond to practical concerns, which differ depending on the implementation details. It was decentralized for speed; now that speed is no longer our number one priority, we have to find a balance between the implementation time to fix it and the maintenance load of leaving it broken/decentralized/inefficient.

The API server will switch from PHP to Python soon. That could be a good time to centralize geocoding.
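As a rough sketch of the query- or ingestion-time approach, here is FIPS-to-state aggregation with a crosswalk table; the crosswalk rows, column names, and numbers are placeholders rather than the real geomapper.

```python
import pandas as pd

# Placeholder crosswalk; the real geomapper would carry FIPS -> MSA/HRR/state/
# HHS/nation mappings (and population weights where needed).
crosswalk = pd.DataFrame({
    "fips": ["06001", "06075", "36061"],
    "state": ["ca", "ca", "ny"],
})

# Values stored only at the FIPS level in the database.
fips_values = pd.DataFrame({
    "fips": ["06001", "06075", "36061"],
    "time_value": ["2020-10-08"] * 3,
    "value": [120, 80, 200],
})

# Derive the state-level signal at query (or ingestion) time instead of storing it.
state_values = (
    fips_values.merge(crosswalk, on="fips")
    .groupby(["state", "time_value"], as_index=False)["value"]
    .sum()
)
print(state_values)
```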

@RoniRos
Member Author

RoniRos commented Feb 2, 2021

This decision does not have to be all-or-none. You could have centralized geocoding calculations, but still allow a particular signal to opt out and do its own calculation, or preferably rely on the centralized calculation but do some post-processing.

@krivard
Contributor

krivard commented Feb 3, 2023

Signal transformations are being tracked in #607 and #608

Moving renames to its own issue and linking back to the discussion here.

@krivard krivard closed this as completed Feb 3, 2023