Add 7-day average versions of SafeGraph signals #271

Closed
krivard opened this issue Sep 11, 2020 · 4 comments · Fixed by #309
Labels: API addition, New signals, Engineering (used to filter issues when syncing with Asana), good first issue

Comments

@krivard
Contributor

krivard commented Sep 11, 2020

Larry says forecasting has been using something similar, and they're extremely useful.

@sgsmob
Contributor

sgsmob commented Oct 8, 2020

How are this and #273 supposed to be inserted? It looks to me like

  • Each input file corresponds to a single date.
  • Each output file corresponds to a single input.
  • The files are all processed independently in parallel.

Am I missing something, or does this require a completely new paradigm in which many or all input files are read simultaneously to pull out the data from the last n days?

@krivard
Contributor Author

krivard commented Oct 8, 2020

The desired behavior is for each output file of a 7-day average signal to correspond to 7 single-date input files.

SafeGraph is somewhat unusual in using a parallel architecture, which does indeed make this a bit more complicated.

Two possible approaches:

  • rewrite delphi_safegraph.process.process to take a list of files as input, and use the max timestamp as the output date. In run, split the calls to process() into two batches: 7-day-average signals that pass in 7 files, and unfiltered signals that pass in 1 file.
  • implement 7-day averages as post-processing on the outputs of the unfiltered signals. This signal doesn't appear to do any low-sample-size censoring, so we won't lose any data by doing that.
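The second approach could look roughly like the sketch below. It is only an illustration, not the indicator's actual code: `seven_day_average` and the `{date: {geo_id: value}}` layout are hypothetical, and it assumes that days or geos missing from a window are skipped rather than counted as zero.

```python
from datetime import date, timedelta


def seven_day_average(daily, window=7):
    """Average each geo's values over the `window` days ending on each date.

    `daily` maps date -> {geo_id: value}, one entry per unfiltered daily output.
    Assumption: days or geos missing from a window are skipped, not treated as 0.
    """
    averaged = {}
    for end in sorted(daily):
        sums, counts = {}, {}
        for offset in range(window):
            day = end - timedelta(days=offset)
            for geo, value in daily.get(day, {}).items():
                sums[geo] = sums.get(geo, 0.0) + value
                counts[geo] = counts.get(geo, 0) + 1
        # The output date is the last (max) date in the window.
        averaged[end] = {geo: sums[geo] / counts[geo] for geo in sums}
    return averaged


# Hypothetical data: one geo whose value on day i is i, for 8 consecutive days.
start = date(2020, 10, 1)
daily = {start + timedelta(days=i): {"42003": float(i)} for i in range(8)}
result = seven_day_average(daily)
```

Because the averaging only reads the unfiltered daily values, the existing one-file-per-date parallel processing would not need to change under this approach.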

@sgsmob
Contributor

sgsmob commented Oct 9, 2020

What are the naming conventions for the files in this directory? And is there a 1:1 correspondence between dates and input files?

@krivard
Contributor Author

krivard commented Oct 9, 2020

The format for output files (which go into the receiving directory) is documented here.

The naming convention for input files (which are pulled from s3://sg-c19-response/social-distancing/v2/ and saved into the {raw_data_dir}/social-distancing/ directory) is apparently undocumented -- we should fix that; it would go into DETAILS.md or a line comment here. Digging it out of the server, I get:

{raw_data_dir}/social-distancing/{YYYY}/{MM}/{DD}/{YYYY}-{MM}-{DD}-social-distancing.csv.gz
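Given that pattern, the seven input files behind one 7-day-average output could be located as in the sketch below. The helper names `input_path` and `window_paths` are hypothetical, and the sketch assumes exactly one input file per calendar date.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath


def input_path(raw_data_dir, day):
    """Build the input file path for one date, following the pattern above."""
    return (
        PurePosixPath(raw_data_dir)
        / "social-distancing"
        / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
        / f"{day:%Y-%m-%d}-social-distancing.csv.gz"
    )


def window_paths(raw_data_dir, end, window=7):
    """List the paths for the `window` days ending on `end`, oldest first."""
    return [
        input_path(raw_data_dir, end - timedelta(days=offset))
        for offset in range(window - 1, -1, -1)
    ]


# Example: the 7 files that would feed the output dated 2020-10-03.
paths = window_paths("raw", date(2020, 10, 3))
```

Date arithmetic with `timedelta` handles windows that cross a month boundary, so the `{YYYY}/{MM}/{DD}` directory components stay consistent with the file name.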

@SumitDELPHI SumitDELPHI added the Engineering Used to filter issues when synching with Asana label Dec 6, 2020
@krivard krivard closed this as completed Dec 7, 2020