Description
The "Update COVID Hosp: state timeseries dataset" acquisition job has been failing since apr 4 2024; first failure in cronicle: https://cronicle-prod-01.delphi.cmu.edu/#JobDetails?id=jluljl1fnlb
Error message from Cronicle:
{"url":"https://healthdata.gov/api/views/qqte-vkut/rows.csv","event":"fetching metadata","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:02.164180Z"}
{"newer_than":"Timestamp('2024-04-02 00:00:00')","older_than":"datetime.date(2024, 4, 5)","event":"looking up issues in date range","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.430783Z"}
{"newer_than":"2024-04-02 00:00:00","older_than":"2024-04-05","count":2,"event":"issues selected","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.434811Z"}
{"event":"already collected revision: https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-03T12-06-56.csv","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.443171Z"}
{"event":"including dataset revision: https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-04T07-01-48.csv","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.451599Z"}
{"url":"https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-04T07-01-48.csv","event":"fetching dataset","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:12.451892Z"}
{"event":"2 issues checked w/ 2 revisions, resulting in 1 datasets.","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:16.100503Z"}
{"count":80201,"event":"updating values","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"info","pid":92518,"timestamp":"2024-04-04T18:00:16.193757Z"}
{"publ_date":20240404,"in_lines":[0,5000],"index":4999,"values":["PR",20210430,0.138728323699422,70,692,96,0.66804979253112,71,723,483,0,71,0,0,71,0,7,70,"POINT (-66.468953431 18.222606498)",2,70,1,70,0.052334784329244,70,10108,529,10285,71,6310,71,529,70,0.613514827418571,71,10285,6310,null,null,null,0.085144052792532,70,6213,529,25,0,71,1,71,3,71,6,71,6,71,4,71,2,71,3,71,71,0,71,18,0,70,0,70,0,70,3,70,3,70,2,70,4,70,3,70,70,3,70,4,70,0,70,12,70,0,70,0,70,null,null,null,483,71,96,70,67,71,479,70,451,71,25,70,50,70,25,70,723,71],"exception":"DataError(1406, \"1406 (22001): Data too long for column 'geocoded_state' at row 1\", '22001')","event":"error on insert","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"error","pid":92518,"timestamp":"2024-04-04T18:00:18.773557Z"}
{"scope":"sys","event":"Top-level exception occurred","logger":"delphi.epidata.acquisition.covid_hosp.state_timeseries.database","level":"error","pid":92518,"timestamp":"2024-04-04T18:00:18.773974Z","exception":"Traceback (most recent call last):\n File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/connection_cext.py\", line 487, in cmd_query\n self._cmysql.query(query,\n_mysql_connector.MySQLInterfaceError: Data too long for column 'geocoded_state' at row 1\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/runpy.py\", line 193, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/runpy.py\", line 86, in _run_code\n exec(code, run_globals)\n File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/state_timeseries/update.py\", line 42, in <module>\n Utils.launch_if_main(Update.run, __name__)\n File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/utils.py\", line 38, in launch_if_main\n entrypoint()\n File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/state_timeseries/update.py\", line 38, in run\n return Utils.update_dataset(Database, network)\n File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/utils.py\", line 262, in update_dataset\n db.insert_dataset(issue_int, dataset, logger=logger)\n File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/database.py\", line 231, in insert_dataset\n raise e\n File \"/home/automation/driver/delphi/epidata/acquisition/covid_hosp/common/database.py\", line 225, in insert_dataset\n cursor.executemany(sql, many_values)\n File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/cursor_cext.py\", line 357, in executemany\n self.execute(operation, params)\n File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/cursor_cext.py\", line 264, in execute\n result = self._cnx.cmd_query(stmt, raw=self._raw,\n File \"/home/automation/.pyenv/versions/3.8.2/lib/python3.8/site-packages/mysql/connector/connection_cext.py\", line 491, in cmd_query\n raise errors.get_mysql_exception(exc.errno, msg=exc.msg,\nmysql.connector.errors.DataError: 1406 (22001): Data too long for column 'geocoded_state' at row 1"}
The offending CSV file is https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-04T07-01-48.csv
In our table covid_hosp_state_timeseries, the column geocoded_state is of type VARCHAR(32).
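The WKT value in the failing PR row is 34 characters, two more than the column can hold, which a one-line check confirms:

# The geocoded_state value from the failing row; VARCHAR(32) caps at 32.
value = "POINT (-66.468953431 18.222606498)"
print(len(value))  # 34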
The most recently successfully imported CSV has nulls/empties in the geocoded_state column, as does the CSV issued immediately after the offending file, as does the most recently issued CSV. Only the one CSV appears to contain these long "POINT (..." values. I downloaded the four CSV files mentioned and ran the following commands to demonstrate:
$ ls -laF g62h-syeh_2024-04-*
-rw-r--r-- 1 haff staff 37787646 Apr 22 13:52 g62h-syeh_2024-04-03T12-06-56.csv
-rw-r--r-- 1 haff staff 40484359 Apr 22 13:53 g62h-syeh_2024-04-04T07-01-48.csv
-rw-r--r-- 1 haff staff 37961499 Apr 22 14:07 g62h-syeh_2024-04-11T15-31-38.csv
-rw-r--r-- 1 haff staff 38136294 Apr 22 14:06 g62h-syeh_2024-04-19T12-06-57.csv
$ grep -l POINT g62h-syeh_2024-04-*
g62h-syeh_2024-04-04T07-01-48.csv
$ grep "PR,2021/04/30" g62h-syeh_2024-04-03T12-06-56.csv
PR,2021/04/30,0,0,71,0,0,71,2,70,10285,71,6310,71,529,70,25,71,18,70,0,70,12,70,483,71,96,70,67,71,479,70,451,71,50,70,25,70,723,71,0.613514827418571,71,6310,10285,0.085144052792532,70,529,6213,0.052334784329244,70,529,10108,0.138728323699422,70,96,692,0.66804979253112,71,483,723,,0,71,1,71,3,71,6,71,6,71,4,71,2,71,3,71,0,71,0,70,0,70,0,70,3,70,3,70,2,70,4,70,3,70,3,70,7,70,,,,,,,1,70,4,70,0,70,0,70,25,13,70,70,62,8,196,8,,0,,0,,0,,0,,0,12,8,29,8
$ grep "PR,2021/04/30" g62h-syeh_2024-04-04T07-01-48.csv
PR,2021/04/30,0,0,71,0,0,71,2,70,10285,71,6310,71,529,70,25,71,18,70,0,70,12,70,483,71,96,70,67,71,479,70,451,71,50,70,25,70,723,71,0.613514827418571,71,6310,10285,0.085144052792532,70,529,6213,0.052334784329244,70,529,10108,0.138728323699422,70,96,692,0.66804979253112,71,483,723,POINT (-66.468953431 18.222606498),0,71,1,71,3,71,6,71,6,71,4,71,2,71,3,71,0,71,0,70,0,70,0,70,3,70,3,70,2,70,4,70,3,70,3,70,7,70,,,,,,,1,70,4,70,0,70,0,70,25,13,70,70,62,8,196,8,,0,,0,,0,,0,,0,12,8,29,8
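For a more thorough check than grep, a short scan of the downloaded files works too. A sketch, assuming the CSV header names the column geocoded_state as in the HHS schema:

import glob
import pandas as pd

# Count geocoded_state values in each file that would overflow VARCHAR(32).
for path in sorted(glob.glob("g62h-syeh_2024-04-*.csv")):
    col = pd.read_csv(path, usecols=["geocoded_state"], dtype=str)["geocoded_state"]
    over = (col.dropna().str.len() > 32).sum()
    print(f"{path}: {over} over-long values")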
Attempted to skip the offending file, similar to the solution in #1369:
INSERT INTO covid_hosp_meta (dataset_name, publication_date, revision_timestamp, hhs_dataset_id, acquisition_datetime, metadata_json)
VALUES ('covid_hosp_state_timeseries', 20240404, 'https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-04T07-01-48.csv', 'g62h-syeh',
'2024-04-21 00:00:00', '{"error":"semi-malformed data file, includes geo point encodings too long for our data column: geocoded_state"}');
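This works because the updater appears to match each candidate revision URL against revision_timestamp values already recorded in covid_hosp_meta (see the "already collected revision" log line above), so a placeholder row makes it skip the download. A sketch of that presumed check, with hypothetical names rather than the actual Delphi code:

# Hypothetical illustration of the dedup check implied by the logs.
def revisions_to_acquire(candidate_urls, collected_urls):
    """Keep only revisions with no matching row in covid_hosp_meta."""
    collected = set(collected_urls)
    return [url for url in candidate_urls if url not in collected]

candidates = ["https://us-dhhs-aa.s3.us-east-2.amazonaws.com/g62h-syeh_2024-04-04T07-01-48.csv"]
print(revisions_to_acquire(candidates, candidates))  # [] after the INSERT above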
After inserting that row into the covid_hosp_meta table, I re-ran the Cronicle job and am now awaiting the results.