NOTE: Edit this file on GitHub! https://github.com/cmu-delphi/delphi-epidata/tree/master/src/acquisition/covidcast/data_dir_readme.md
This is the location where covidcast CSVs are deposited and stored.
WARNING: All data herein is publicly surfaced through the Epidata API.
There are three important directories:
`receiving/`
: put your new CSVs under a subdirectory here!

`archive/successful/`
: storage for successfully uploaded CSVs. Automation will compress and move your CSVs here.

`archive/failed/`
: storage for broken or failed CSVs. Automation will move your uncompressed CSVs here for you to debug; delete them when you're finished. Consult the automation logs to determine the reason for the failure.
Within each of the above, there are a number of subdirectories, generally one for each covidcast data source. It's important to place your CSVs in the appropriate subdirectory as the name of the data source is extracted from the path of each file.
Any files that cannot be loaded due to invalid naming will be moved to the special directory `archive/failed/unknown/`, since in that case the data source name (i.e. the name of the subdirectory) is not assumed to be reliable.
Automation will periodically (currently at :50 every hour) check for new files nested under `receiving/`. When it finds some, it will upload them to the epidata database and then archive the source CSVs as described above. The API will immediately begin serving the new data.
Criteria for a valid filename:

- Name format is `YYYYMMDD_{geo}_{signal}.csv` for dailies and `weekly_YYYYWW_{geo}_{signal}.csv` for weeklies
- Minimum year: 2019
- Maximum year: 2030
- `{geo}` must be one of: county, hrr, msa, dma, state, nation
- `{signal}` must match the regex `/\w+/`
If a filename is invalid, it does not get ingested, and will be listed in the failed archive.
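The filename rules above can be sketched as a quick pre-upload check. This is a hypothetical helper for illustration only; the actual ingestion code may use a different pattern.

```python
import re

# Hypothetical sketch of the filename rules above; not the real
# ingestion code. Dailies use YYYYMMDD, weeklies use weekly_YYYYWW.
FILENAME_RE = re.compile(
    r"^(weekly_)?"                          # optional weekly prefix
    r"(\d{4})"                              # year
    r"(\d{2})(\d{2})?"                      # MMDD for dailies, WW for weeklies
    r"_(county|hrr|msa|dma|state|nation)"   # geo
    r"_(\w+)\.csv$"                         # signal
)

def is_valid_filename(name: str) -> bool:
    m = FILENAME_RE.match(name)
    if not m:
        return False
    weekly, year = m.group(1), int(m.group(2))
    # weeklies carry only a 2-digit week; dailies need the full MMDD
    if weekly and m.group(4) is not None:
        return False
    if not weekly and m.group(4) is None:
        return False
    return 2019 <= year <= 2030
```

Running the check locally before depositing files can save a trip through the failed archive.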
Criteria for a valid file:

- Required columns in the header: `geo_id`, `val`, `se`, `sample_size`
- Additional columns are permitted but will be ignored
- `geo_id` for `hrr`, `msa`, and `dma` must be interpretable as a `/[0-9]+/` string (ints and floats are allowed, but use them with caution, as formatting can change the digits)
- `geo_id` for `county` must have length 5 and sort between '01000' and '80000'
- `geo_id` for `hrr` must sort between '001' and '500'
- `geo_id` for `msa` must have length 5 and sort between '10000' and '99999'
- `geo_id` for `dma` must sort between '450' and '950'
- `geo_id` for `state` and `nation` must have length 2 and sort between 'aa' and 'zz'
- `val` must be a real number (i.e. not nan, inf, empty, na, or None)
- `se` may be nan; if it is a number, it must be nonnegative
- `sample_size` may be nan; if it is a number, it must be at least 5
If a file has invalid headers, it does not get ingested, and will be listed in the failed archive.
If a row has invalid data, the row is skipped, but the rest of the file is ingested. The file will be listed in the failed archive.
If a filename is valid and all data in the file is valid, it will be listed in the successful archive.
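The per-row rules above can be sketched roughly as follows. These helpers are hypothetical illustrations of the criteria, not the real validation logic in the acquisition code.

```python
import math

# Hypothetical sketch of the geo_id rules above; the real ingestion
# logic lives in the covidcast acquisition code and may differ.
def is_valid_geo_id(geo_type: str, geo_id: str) -> bool:
    if geo_type in ("hrr", "msa", "dma") and not geo_id.isdigit():
        return False  # must be interpretable as a /[0-9]+/ string
    if geo_type == "county":
        return len(geo_id) == 5 and "01000" <= geo_id <= "80000"
    if geo_type == "hrr":
        return "001" <= geo_id <= "500"
    if geo_type == "msa":
        return len(geo_id) == 5 and "10000" <= geo_id <= "99999"
    if geo_type == "dma":
        return "450" <= geo_id <= "950"
    if geo_type in ("state", "nation"):
        return len(geo_id) == 2 and "aa" <= geo_id <= "zz"
    return False

def is_valid_row(geo_type, geo_id, val, se, sample_size) -> bool:
    if not is_valid_geo_id(geo_type, geo_id):
        return False
    if val is None or math.isnan(val) or math.isinf(val):
        return False  # val must be a real number
    if not math.isnan(se) and se < 0:
        return False  # se: nan, or a nonnegative number
    if not math.isnan(sample_size) and sample_size < 5:
        return False  # sample_size: nan, or at least 5
    return True
```

Note that the range checks are string comparisons ("sort between"), matching the criteria above, so leading zeros matter.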
To add a new source, create a directory in `receiving/` with permissions that exactly match the other directories there. The easiest way to do this is to `cp -a` a directory from an existing source that happens to be empty. You do not need to create a directory in `archive/successful` or `archive/failed`; one will be created automatically as soon as it is needed. Do not modify the permissions in the archive directories, as a permissions error will crash the data ingestion job.
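A minimal sketch of the step above (the directory names here are placeholders; substitute an existing empty source and your new source's name):

```shell
# The mkdir only makes this sketch self-contained; in practice the
# existing source directory is already there.
mkdir -p receiving/existing-empty-source

# Copy it with `cp -a` so the new directory inherits identical
# permissions and ownership, then it is ready to receive CSVs.
cp -a receiving/existing-empty-source receiving/new-source
```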
The reason for each failed file is logged by Automation in the ~automation directory. Off-cycle ingestions can be scheduled using the Automation web console.