load_table_from_dataframe produces incorrect results when used in list of dict #781

Closed
Lmmejia11 opened this issue Jul 19, 2021 · 5 comments · Fixed by #787
Assignees: plamut
Labels:
  • api: bigquery - Issues related to the googleapis/python-bigquery API.
  • priority: p1 - Important issue which blocks shipping the next release. Will be fixed prior to next release.
  • type: bug - Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@Lmmejia11

Environment details

  • OS type and version: Debian 10 (dataproc image 2.0-debian10)
  • Python version: python --version: Python 3.8.10
  • pip version: pip --version: pip 21.1.2
  • google-cloud-bigquery version: pip show google-cloud-bigquery: google-cloud-bigquery==2.6.2, pyarrow==2.0.0

Steps to reproduce

  1. Create a large dataframe (1000 rows) with a column containing a list (of length at least 6) of identically structured dictionaries.
  2. Create a BigQuery client and use load_table_from_dataframe to create a table in BigQuery.
  3. Check the resulting table in BigQuery. Structs appear to swap values with other entries in the same list (e.g. a row should contain [STRUCT('w0' AS name, 0.1 AS value), STRUCT('h1' AS name, 1.2 AS value)] but instead contains [STRUCT('h1' AS name, 0.1 AS value), STRUCT('w0' AS name, 1.2 AS value)]). The real problem is not the ordering but that the integrity of each struct is lost (e.g. 'w0' should be paired with 0.1, not 1.2).

Code example

import numpy as np
import pandas as pd
from google.cloud import bigquery

# Create a dataframe with a single column whose values are lists of dictionaries.
# Each dict has the structure {"name": str, "value": float}; the name is a letter
# plus an index, and the values grow by roughly an order of magnitude per index.
data = [
    [[{'name': 'whyist'[i] + str(i), 'value': np.random.random() * 10**i} for i in range(6)]]
    for n in range(1000)
]
df = pd.DataFrame(data, columns=['vals'])

# Load the dataframe into BigQuery.
project = 'myproject'
bq_client = bigquery.Client(project=project)
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = 'WRITE_TRUNCATE'
bq_client.load_table_from_dataframe(
    dataframe=df,
    destination='tmp.test_bug',
    job_config=job_config,
).result()  # wait for the load job to finish

Checking in BigQuery: at least for this example, the 'value' attribute is transcribed in the correct order (the first item has the smallest value and it increases). The 'name' values, however, look as if they were sampled with repetition allowed: all rows of the table show the same 'name' values in the same order, and that order can change if the code is re-executed.
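
To make the corruption easier to spot, the uploaded rows can be read back and compared against the generating pattern. A minimal sketch, assuming the bq_client and the tmp.test_bug table from the example above:

# Read a few uploaded rows back and print the name/value pairs.
# With an affected pyarrow version the pairing no longer matches the
# pattern produced by the code above ('w0' with the smallest value,
# 'h1' with the next one, and so on).
rows = bq_client.query(
    "SELECT vals FROM `myproject.tmp.test_bug` LIMIT 5"
).to_dataframe()

for _, row in rows.iterrows():
    print([(item['name'], round(item['value'], 3)) for item in row['vals']])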

product-auto-label bot added the api: bigquery label on Jul 19, 2021
plamut self-assigned this on Jul 20, 2021

plamut commented Jul 20, 2021

@Lmmejia11 Thanks for the runnable example; it appears the pyarrow version is the culprit. This is the result I got with pyarrow==4.0.0, which seemed fine to me (please confirm):

[Screenshot from 2021-07-20 17-33-41]

This was the case both with the latest BigQuery client (v2.22.0) and with v2.6.2.

However, when I downgraded pyarrow to v2.0.0, I started to see the names and values scrambled:

[Screenshot from 2021-07-20 17-41-52]

Would it be possible to upgrade pyarrow in your system?

I will nevertheless try to find out why this happens, because pyarrow is only pinned to >= 1.0.0.


Edit: Interesting, this seems to start happening when the number of dataframe rows is 513 or more, while with 512 rows or fewer it works just fine. On the other hand, pyarrow==4.0.0 works even with a large number of rows (at least up to 1 million; I didn't test with more).

Edit 2: After additional tests it actually seems that pyarrow==1.0.x is also OK to use; only the 2.0.0 release causes problems.
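
For what it's worth, the problem can likely be reproduced without BigQuery at all, since load_table_from_dataframe serializes the dataframe to a Parquet file via pyarrow before uploading it. A minimal local round-trip sketch (using the df from the report above; the file path and row index are arbitrary, and whether the exact corruption shows up may depend on how the file is written):

import pyarrow as pa
import pyarrow.parquet as pq

# Convert the dataframe to an Arrow table, write it to a Parquet file, and
# read it back, mimicking the serialization step the client performs.
arrow_table = pa.Table.from_pandas(df)
pq.write_table(arrow_table, "/tmp/test_bug.parquet")
roundtrip = pq.read_table("/tmp/test_bug.parquet").to_pandas()

# Compare a row beyond the 512-row boundary; with pyarrow==2.0.0 the
# name/value pairs may already be scrambled at this point.
print(df["vals"].iloc[600])
print(roundtrip["vals"].iloc[600])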

plamut added the priority: p2 and type: bug labels on Jul 20, 2021

plamut commented Jul 20, 2021

Since we want to keep the supported dependency ranges wide (pyarrow can be tricky to upgrade), it appears the most sensible thing would be to detect the pyarrow version at runtime in load_table_from_dataframe() and issue a warning when more than 512 rows are being uploaded.
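
Roughly, the check could look something like this (a hypothetical sketch only; the helper name, the affected-version test, and the 512-row threshold are illustrative and not necessarily what the actual fix will do):

import warnings

import pyarrow
from packaging import version

_BAD_PYARROW_RELEASE = (2, 0)   # only the 2.0.x line has shown the corruption so far
_MAX_SAFE_ROWS = 512            # scrambling was observed above this row count

def _warn_on_risky_pyarrow(dataframe):
    """Warn if the installed pyarrow is known to scramble list-of-struct columns."""
    installed = version.parse(pyarrow.__version__)
    if installed.release[:2] == _BAD_PYARROW_RELEASE and len(dataframe) > _MAX_SAFE_ROWS:
        warnings.warn(
            "pyarrow 2.0.x can corrupt repeated STRUCT (list of dict) columns "
            "when serializing dataframes with more than 512 rows; please "
            "upgrade pyarrow to avoid silent data corruption.",
            RuntimeWarning,
        )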

Also increasing priority, as possible silent data corruption is Bad™.

plamut added the priority: p1 label and removed the priority: p2 label on Jul 20, 2021
@Lmmejia11 (Author)

Thanks! Thankfully, I can upgrade pyarrow.

@Lmmejia11 (Author)

As you said, I noticed the bug appears when there are many rows, but it can also depend on the length of the lists. If I remember correctly, 1000 rows with lists of length 5 also worked fine. I don't know whether it could break with fewer than 512 rows if the lists are longer or the values heavier; it might be linked to the overall data size.


plamut commented Jul 20, 2021

Thanks, I'll keep this in mind. It might actually be better not to try to narrow down the conditions under which the bug can occur, but instead to always issue a warning whenever an affected pyarrow version is detected.
