Skip to content

Extend geocode utility to actually support the state to state mappings #310

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Oct 26, 2020

Conversation

dshemetov
Copy link
Contributor

Add support for the state_x -> state_y where x,y are in {code, id, name} mappings

- state_x -> state_y where x,y are in {code, id, name}
# state_name -> state_id
new_data = gmpr.add_geocode(self.zip_data, "zip", "state_name")
new_data2 = gmpr.add_geocode(new_data, "state_name", "state_id")
assert new_data2.shape == (12, 6)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should there also be an assert statement for new_data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, new_data is just the test data for state_name to state_id, but it starts in zip form

@krivard krivard requested a review from chinandrew October 14, 2020 19:10
* update replace_geocode documentation to be clear about data columns
* add test cases for renaming columns in replace_geocode
* fix the state to state conversion dropped columns issue
@chinandrew
Copy link
Contributor

not specifically related to this PR, I just missed this in the previous one: date_cols is undocumented in the replace_geocode() docstring

@dshemetov
Copy link
Contributor Author

Ahhh, thanks, might as well take care of that here.

new_data = gmpr.add_population_column(self.zip_data, "zip")
assert new_data["population"].sum() == 274902
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to keep both of these checks?

Also, if it's 5x5 output it may be simpler juts to do a direct df comparison on values

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an old test I don't see anymore. Pull again?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just asking why it was deleted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. I decided to move away from tests based on data-derived population counts. I figured the tests should catch whether the underlying logic or arithmetic breaks, not whether the data file changed. Am open to reasons for keeping though.

Comment on lines 423 to 437
# hrr -> nation
with pytest.raises(ValueError):
new_data = gmpr.replace_geocode(self.zip_data, "zip", "hrr")
new_data2 = gmpr.replace_geocode(new_data, "hrr", "nation")

# hrr -> nation
with pytest.raises(ValueError):
new_data = gmpr.replace_geocode(self.zip_data, "zip", "hrr")
new_data2 = gmpr.replace_geocode(new_data, "hrr", "nation")

# hrr -> nation
with pytest.raises(ValueError):
new_data = gmpr.replace_geocode(self.zip_data, "zip", "hrr")
new_data2 = gmpr.replace_geocode(new_data, "hrr", "nation")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicated tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this is a new one 😄

@@ -411,6 +420,23 @@ def test_add_geocode(self):
)
)

# hrr -> nation
with pytest.raises(ValueError):
new_data = gmpr.replace_geocode(self.zip_data, "zip", "hrr")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure if the first line valueerrors but the second one doesnt, it still passes, so may want to split this into two with pytest raises.... statements

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that have been caught in the zip -> hrr test some lines before that though?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe im understanding this test wrong. I read it as testing that both

            new_data = gmpr.replace_geocode(self.zip_data, "zip", "hrr")

and

            new_data2 = gmpr.replace_geocode(new_data, "hrr", "nation")

raise valueerrors. Is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's testing the second one, since we should not be mapping hrr -> nation (it's an incomplete mapping, so it's unsupported).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhh got it, didn't see the second line calls new_data. Thanks.

Copy link
Contributor

@chinandrew chinandrew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

new_code="state_id",
dropna=False)
# To use the original column name, reassign original column and drop new one
hosp_df[APIConfig.STATE_COL] = hosp_df["state_id"].str.upper()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chinandrew btw, this may be cdc_covidnet specific, but the state abbreviation in other indicators (like JHU) is assumed to be lower case. Are we sure this is what we want?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason cdc covidnet was uppercase, so i kept it consistent. if it's something we can change, would definitely recommend we standardize, but haven't looked into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krivard thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slacked Katie on this while discussing a similar issue, apparently downstream ingestion will standardize everything and accepts lower or uppercase, so this can be removed and we can just move to lowercase for the indicator code.

@krivard krivard merged commit c9fea0a into main Oct 26, 2020
@krivard krivard deleted the geoutil_state_extension branch October 29, 2020 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants