load_table_from_dataframe produces incorrect results when used in list of dict #781

Closed
Lmmejia11 opened this issue Jul 19, 2021 · 5 comments · Fixed by #787
Assignees: plamut
Labels:
  • api: bigquery - Issues related to the googleapis/python-bigquery API.
  • priority: p1 - Important issue which blocks shipping the next release. Will be fixed prior to next release.
  • type: bug - Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@Lmmejia11

Environment details

  • OS type and version: Debian 10 (dataproc image 2.0-debian10)
  • Python version: python --version: Python 3.8.10
  • pip version: pip --version: pip 21.1.2
  • google-cloud-bigquery version: pip show google-cloud-bigquery: google-cloud-bigquery==2.6.2, pyarrow==2.0.0

Steps to reproduce

  1. Create a large dataframe (1000 rows) with a column containing a list (of length at least 6) of identically structured dictionaries.
  2. Create a BigQuery client and use load_table_from_dataframe to create a table in BigQuery.
  3. Check the resulting table in BigQuery. Structs appear to swap values with other entries in the same list (e.g. a row should contain [STRUCT('w0' AS name, 0.1 AS value), STRUCT('h1' AS name, 1.2 AS value)] but instead contains [STRUCT('h1' AS name, 0.1 AS value), STRUCT('w0' AS name, 1.2 AS value)]). The real problem is not the ordering but that the integrity of each struct is lost (e.g. 'w0' should be paired with 0.1, not 1.2).

Code example

import numpy as np
import pandas as pd
from google.cloud import bigquery

# Create a dataframe with a single column whose values are lists of dictionaries.
# Each dict has the structure {"name": str, "value": float}; the name is a letter
# plus an index, and the values grow by roughly an order of magnitude per index.
data = [
    [[{'name': 'whyist'[i] + str(i), 'value': np.random.random() * 10**i} for i in range(6)]]
    for n in range(1000)
]
df = pd.DataFrame(data, columns=['vals'])

# Load the dataframe into BigQuery.
project = 'myproject'
bq_client = bigquery.Client(project=project)
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = 'WRITE_TRUNCATE'
bq_client.load_table_from_dataframe(
    dataframe=df,
    destination='tmp.test_bug',
    job_config=job_config,
).result()  # wait for the load job to finish

Checking in BigQuery: at least for this example, the 'value' attribute is transcribed in the correct order (the first item has the smallest value and it increases). The 'name' values, however, look as if they were sampled with repetition allowed: all rows of the table show the same 'name' values in the same order, and that order can change if the code is re-executed.
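
To make the corruption easier to spot, the uploaded rows can be read back and compared against the generating pattern. A minimal sketch, assuming the bq_client and the tmp.test_bug table from the example above:

# Read a few uploaded rows back and print the name/value pairs.
# With an affected pyarrow version the pairing no longer matches the
# pattern produced by the code above ('w0' with the smallest value,
# 'h1' with the next one, and so on).
rows = bq_client.query(
    "SELECT vals FROM `myproject.tmp.test_bug` LIMIT 5"
).to_dataframe()

for _, row in rows.iterrows():
    print([(item['name'], round(item['value'], 3)) for item in row['vals']])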

product-auto-label bot added the api: bigquery label on Jul 19, 2021
plamut self-assigned this on Jul 20, 2021

plamut commented Jul 20, 2021

@Lmmejia11 Thanks for the runnable example; it appears the pyarrow version is the culprit. This is the result I got with pyarrow==4.0.0, which seemed fine to me (please confirm):

[Screenshot from 2021-07-20 17-33-41]

This was the case both with the latest BigQuery client (v2.22.0) and with v2.6.2.

However, when I downgraded pyarrow to v2.0.0, I started to see the names and values scrambled:

[Screenshot from 2021-07-20 17-41-52]

Would it be possible to upgrade pyarrow in your system?

I will nevertheless try to find out why this happens, because pyarrow is only pinned to >= 1.0.0.


Edit: Interesting, this seems to start happening when the number of dataframe rows is 513 or more, while with 512 rows or fewer it works just fine. On the other hand, pyarrow==4.0.0 works even with a large number of rows (at least up to 1 million; I didn't test with more).

Edit 2: After additional tests it actually seems that pyarrow==1.0.x is also OK to use; only the 2.0.0 release causes problems.
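
For what it's worth, the problem can likely be reproduced without BigQuery at all, since load_table_from_dataframe serializes the dataframe to a Parquet file via pyarrow before uploading it. A minimal local round-trip sketch (using the df from the report above; the file path and row index are arbitrary, and whether the exact corruption shows up may depend on how the file is written):

import pyarrow as pa
import pyarrow.parquet as pq

# Convert the dataframe to an Arrow table, write it to a Parquet file, and
# read it back, mimicking the serialization step the client performs.
arrow_table = pa.Table.from_pandas(df)
pq.write_table(arrow_table, "/tmp/test_bug.parquet")
roundtrip = pq.read_table("/tmp/test_bug.parquet").to_pandas()

# Compare a row beyond the 512-row boundary; with pyarrow==2.0.0 the
# name/value pairs may already be scrambled at this point.
print(df["vals"].iloc[600])
print(roundtrip["vals"].iloc[600])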

plamut added the priority: p2 and type: bug labels on Jul 20, 2021

plamut commented Jul 20, 2021

Since we want to keep the supported dependency ranges wide (pyarrow can be tricky to upgrade), it appears the most sensible thing would be to detect the pyarrow version at runtime in load_table_from_dataframe() and issue a warning when more than 512 rows are being uploaded.
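
Roughly, the check could look something like this (a hypothetical sketch only; the helper name, the affected-version test, and the 512-row threshold are illustrative and not necessarily what the actual fix will do):

import warnings

import pyarrow
from packaging import version

_BAD_PYARROW_RELEASE = (2, 0)   # only the 2.0.x line has shown the corruption so far
_MAX_SAFE_ROWS = 512            # scrambling was observed above this row count

def _warn_on_risky_pyarrow(dataframe):
    """Warn if the installed pyarrow is known to scramble list-of-struct columns."""
    installed = version.parse(pyarrow.__version__)
    if installed.release[:2] == _BAD_PYARROW_RELEASE and len(dataframe) > _MAX_SAFE_ROWS:
        warnings.warn(
            "pyarrow 2.0.x can corrupt repeated STRUCT (list of dict) columns "
            "when serializing dataframes with more than 512 rows; please "
            "upgrade pyarrow to avoid silent data corruption.",
            RuntimeWarning,
        )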

Also increasing priority, as possible silent data corruption is Bad™.

plamut added the priority: p1 label and removed the priority: p2 label on Jul 20, 2021
@Lmmejia11 (Author)

Thanks! Thankfully, I can upgrade pyarrow.

@Lmmejia11 (Author)

As you said, I noticed the bug appears when there are many rows, but it can also depend on the length of the lists. If I remember correctly, 1000 rows with lists of length 5 also worked fine. I don't know whether it could break with fewer than 512 rows if the lists are longer or the values heavier; it might be linked to the overall data size.


plamut commented Jul 20, 2021

Thanks, I'll keep this in mind. It might actually be better not to try to narrow down the conditions under which the bug can occur, but instead to always issue a warning whenever an affected pyarrow version is detected.
