JHU generating bogus ".0000" geo id #254

Closed
krivard opened this issue Aug 28, 2020 · 6 comments

@krivard
Contributor

krivard commented Aug 28, 2020

JHU is generating a geo id of ".0000", which is not a valid geo id. It seems to affect some signals but not others. For example, from the ingestion log:

handling  /common/covidcast/receiving/jhu-csse/20200827_county_confirmed_incidence_num.csv
confirmed_incidence_num False
 invalid value for Pandas(geo_id='.0000', val='1287.0', se=nan, sample_size=nan) (geo_id)
exception while inserting rows: 'NoneType' object has no attribute 'geo_value'
archiving as failed - jhu-csse
handling  /common/covidcast/receiving/jhu-csse/20200827_county_confirmed_incidence_prop.csv
confirmed_incidence_prop False
archiving as successful
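
The AttributeError in the log is consistent with a validation step that returns None for an unacceptable row while the caller dereferences the result unconditionally. A minimal sketch of that failure mode, with all names hypothetical (this is not the actual acquisition code):

import re
from collections import namedtuple

CsvRow = namedtuple("CsvRow", ["geo_value", "value"])

def validate_row(geo_id, val):
    # Accept only 5-digit county FIPS codes; return None otherwise.
    if re.fullmatch(r"\d{5}", geo_id) is None:
        return None
    return CsvRow(geo_value=geo_id, value=float(val))

row = validate_row(".0000", "1287.0")
print(row.geo_value)  # AttributeError: 'NoneType' object has no attribute 'geo_value'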

& the CSV files in question:

# this one is bad
$ head archive/failed/jhu-csse/20200827_county_confirmed_7dav_incidence_num.csv 
geo_id,val,se,sample_size
.0000,544.4285714285714,NA,NA
01001,6.285714285714286,NA,NA
01003,34.57142857142857,NA,NA
01005,-0.7142857142857143,NA,NA
01007,3.0,NA,NA

# but this one is fine
$ zcat archive/successful/jhu-csse/20200827_county_confirmed_7dav_incidence_prop.csv.gz | head
geo_id,val,se,sample_size
01001,11.250808651871852,NA,NA
01003,15.486632220642273,NA,NA
01005,-2.8934850291084593,NA,NA
01007,13.39644547646691,NA,NA
01009,16.55211941242447,NA,NA
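
A quick way to flag rows like the bad one above is to test each geo_id against the five-digit FIPS pattern. A minimal pandas sketch (pandas >= 1.1 for str.fullmatch; path taken from the transcript above):

import pandas as pd

df = pd.read_csv(
    "archive/failed/jhu-csse/20200827_county_confirmed_7dav_incidence_num.csv",
    dtype={"geo_id": str},
)
# Keep rows whose geo_id is not exactly five digits, e.g. ".0000"
bad = df[~df["geo_id"].str.fullmatch(r"\d{5}")]
print(bad)
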
@krivard krivard added the bug Something isn't working label Aug 28, 2020
@krivard
Contributor Author

krivard commented Aug 29, 2020

We are also seeing a FIPS code of 80001, which I can't find in the geo coding materials:

handling  /common/covidcast/receiving/jhu-csse/20200827_county_deaths_incidence_num.csv
deaths_incidence_num False
 invalid value for Pandas(geo_id='80001', val='0.0', se=nan, sample_size=nan) (geo_id)
exception while inserting rows: 'NoneType' object has no attribute 'geo_value'
archiving as failed - jhu-csse
destination exists, will overwrite (/common/covidcast/archive/failed/jhu-csse/20200827_county_deaths_incidence_num.csv)

It may be associated with this line from the JHU csv:

84080001,US,USA,840,80001.0,Out of AL,Alabama,US,0.0,0.0,"Out of AL, Alabama, US",[...]

...which JHU seems to have completely made up?

@dshemetov
Contributor

dshemetov commented Sep 2, 2020

The 840800XX codes are listed as "Out of [State]" in their UID lookup. My impression from the way they report Puerto Rico is that these are reserved for values they can't pin down to a particular FIPS. For most FIPS codes, they are either blank or zero. Update: looking through the confirmed US cases time series, the following states use that field frequently (see the sketch after this list):

  • Georgia 84080013
  • Illinois 84080017
  • Michigan 84080026
  • Tennessee 84080047
  • Utah (this is mentioned in UID lookup docs)
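
If those rows should be excluded rather than mapped, one option is to drop the "Out of [State]" block before converting UIDs to FIPS. A hedged sketch, assuming an integer UID column and assuming the block spans 84080001-84080056 (one code per state FIPS, which is an inference, not documented here):

import pandas as pd

def drop_out_of_state(df: pd.DataFrame) -> pd.DataFrame:
    # Remove JHU "Out of [State]" rows, whose UIDs fall in the 840800XX block.
    uid = df["UID"].astype("int64")
    return df[~uid.between(84080001, 84080056)]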

@dshemetov
Contributor

@krivard hm, with regard to the '.0000' geo id: since this doesn't appear in the JHU time series, it must be generated somewhere in our code. I would like to see a set difference between the geo_ids of 20200827_county_confirmed_incidence_num.csv and 20200827_county_confirmed_incidence_prop.csv.gz, to see if any are missing. Also, it is strange that the value associated with .0000, 1287.0, does not show up anywhere in the associated time series.
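
For reference, that comparison can be done with a short pandas sketch (paths assumed; read_csv decompresses the .gz transparently):

import pandas as pd

num = pd.read_csv("20200827_county_confirmed_incidence_num.csv", dtype={"geo_id": str})
prop = pd.read_csv("20200827_county_confirmed_incidence_prop.csv.gz", dtype={"geo_id": str})
print(set(num["geo_id"]) - set(prop["geo_id"]))  # geo ids only in _num
print(set(prop["geo_id"]) - set(num["geo_id"]))  # geo ids only in _prop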

@krivard
Contributor Author

krivard commented Sep 3, 2020

Unfortunately, I don't seem to have kept copies of the bad files. I've modified the cron job I'm using for interim repairs to make backups; hopefully I'll have that for you tomorrow morning PDT.

We wouldn't expect 1287.0 to show up in the CSSE time series, because those files are cumulative and 1287.0 was an incidence figure.

@krivard
Contributor Author

krivard commented Sep 4, 2020

Here's a bundle of 20200903_county_confirmed_7dav_incidence_*; _num includes the bad .0000 region and _prop does not. I used comm to compare the geo id sets and generate the .counties files:

https://delphi.midas.cs.cmu.edu/~krivard/jhu-.0000.tgz

$ wc -l *.counties
 3143 both.counties
  139 only-num.counties
    0 only-prop.counties

@dshemetov
Contributor

Thanks Katie! I found the source of the bug. It's in this line. A string slice str[-2:] is meant to select the last two characters of a UID like 84072001, but the UID is still a float at that point, so str[-2:] picks up just ".0".

This should already be fixed in my refactor #217, because I handle those UIDs manually elsewhere.
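
A minimal reproduction of the float-vs-string behavior described above (values illustrative):

import pandas as pd

uid = pd.Series([84072001], dtype=float)
print(uid.astype(str).str[-2:])              # ".0" -- the float renders as "84072001.0"
print(uid.astype(int).astype(str).str[-2:])  # "01" -- cast to int before slicing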

krivard added a commit that referenced this issue Sep 18, 2020
* added the functions zip_to_state_code, zip_to_state_id (and the convert_* versions), zip_to_msa and convert_zip_to_msa
* added two functions, add_geocode and replace_geocode, meant to consolidate the logic in the utility and reduce the code size by a factor of 5. These functions work alongside the remaining deprecated functions and are meant to replace e.g. zip_to_msa(df, ...) with replace_geocode(df, "zip", "msa", ...).
* renamed functions that referred to fips or county interchangeably to consistently use fips, e.g. zip_to_county to zip_to_fips
* enforced the string type on all geocodes, with zero padding as necessary
* renamed instances of stcode to state_code for clarity
* removed non-JHU UID functions for JHU conversion
* updated tests to match

Bugfixes:

* Removed .0000, 9xxx in output mappings - fixes most of #254
* Puerto Rico deaths should now be reported - fixes #179
* Generally fixes #215
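
As an illustration of the consolidated API the commit describes, a hedged usage sketch (the import path, the GeoMapper wrapper, and the date column expectation are assumptions, not taken from the commit message):

import pandas as pd
from delphi_utils.geomap import GeoMapper  # assumed location of the new utility

df = pd.DataFrame({
    "zip": ["90210", "15213"],
    "date": pd.to_datetime(["2020-09-18", "2020-09-18"]),
    "val": [1.0, 2.0],
})
gmpr = GeoMapper()
# Consolidated form of the deprecated zip_to_msa(df, ...), per the commit message
msa_df = gmpr.replace_geocode(df, "zip", "msa")
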
@krivard krivard closed this as completed Oct 7, 2020