BigQuery: Upload STRUCT / RECORD fields from load_table_from_dataframe #21
Comments
https://jira.apache.org/jira/browse/ARROW-2587 is still open. I'm not sure what we are able to do until that is fixed.
Writing nested structs will be fixed in the Arrow 0.17.0 release (sometime in the next few weeks).
- Plumbs through engine version
- Makes engine version settable via environment variable
- Adds unit test coverage

Should also unblock: googleapis/python-bigquery#21

CC @wesm

Closes #6751 from emkornfield/add_flag_to_python

Authored-by: Micah Kornfield <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
I can confirm that the error from the issue description is not reproducible anymore with the latest pyarrow release. I was able to successfully load the following data (into a new table, that is):

```python
schema = [
    bigquery.SchemaField(
        "bar",
        "STRUCT",
        fields=[
            bigquery.SchemaField("aaa", "INTEGER", mode="REQUIRED"),
            bigquery.SchemaField("bbb", "INTEGER", mode="REQUIRED"),
        ],
        mode="REQUIRED",
    ),
]

dict_series = [
    {"aaa": 1, "bbb": 2}, {"aaa": 3, "bbb": 4}, {"aaa": 5, "bbb": 6}
]
df = pd.DataFrame(data={"bar": dict_series}, columns=["bar"])

job_config = bigquery.LoadJobConfig(schema=schema)
client.load_table_from_dataframe(
    df, "my.table.reference", job_config=job_config
).result()
```

This resulted in the following table and schema on the backend:
[Screenshots of the resulting table data and schema]
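For completeness, here is a small sketch (not part of the original comment) showing one way to verify the resulting schema from the client side; it assumes the same `client` and table reference as in the example above:

```python
# Sketch: fetch the table and print its schema to confirm the STRUCT field
# and its nested subfields made it to the backend.
table = client.get_table("my.table.reference")
for field in table.schema:
    print(field.name, field.field_type, field.mode)
    for subfield in field.fields:
        print("   ", subfield.name, subfield.field_type, subfield.mode)
```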
Hello, I am very interested in this feature and I would love to compile the latest pyarrow from source.
@MainHanzo Fortunately, compiling from source is not needed, as the fix is already available in a released pyarrow version. If you still want to compile the project on your own, check the pyarrow docs (I didn't manage to compile it for Python 3.7 on my machine, though, but succeeded with Python 3.6).
Should this be fixed now? I'm using pyarrow 0.17.1 and google-cloud-bigquery 1.25.0.
@jack-tee AFAIK it should be fixed in pyarrow. The error you are seeing is raised by the client library itself. There is a PR that will remove this error-raising part, but it's not merged yet (it's actually on hold, because it only works in Python 3 and we are dropping Python 2 support in the near future anyway). How urgently do you need the fix? If feasible, you can temporarily comment out the linked code block manually until the fix actually gets released.
Thanks for clarifying @plamut. When I saw that I could load struct fields into a table that didn't exist, but not into an existing table, I thought perhaps that path had been missed, but your explanation makes sense. I can work around it for now. Thanks :)
Is your feature request related to a problem? Please describe.
If you have a pandas Series containing dictionaries, ideally this could be uploaded to BigQuery as a STRUCT / RECORD column. Currently this fails with a "file does not exist" error, because the arrow write_table call fails with "ArrowInvalid: Nested column branch had multiple children".
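For reference, a minimal sketch (hypothetical, not code from the issue) of the Arrow-level failure described above, assuming pyarrow's Parquet writer is used as the intermediate format:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A pandas object column holding dictionaries, converted to an Arrow STRUCT.
df = pd.DataFrame({"bar": [{"aaa": 1, "bbb": 2}, {"aaa": 3, "bbb": 4}]})
arrow_schema = pa.schema(
    [("bar", pa.struct([("aaa", pa.int64()), ("bbb", pa.int64())]))]
)
table = pa.Table.from_pandas(df, schema=arrow_schema, preserve_index=False)

# Before the ARROW-2587 fix (pyarrow < 0.17.0), this write failed with:
#   ArrowInvalid: Nested column branch had multiple children
pq.write_table(table, "/tmp/struct_column.parquet")
```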
Describe the solution you'd like
Upload of a RECORD column succeeds. This will require a fix to https://jira.apache.org/jira/browse/ARROW-2587.
Describe alternatives you've considered
Change the intermediate file format to JSON or some other format. This isn't ideal, since most other formats are row-oriented, whereas pandas DataFrames are column-oriented.
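As an illustration of the row-oriented alternative, here is a hedged sketch using the client's `load_table_from_json` method (the method exists on the BigQuery client, but the snippet is only an assumption about how the workaround could look, reusing the `df`, `job_config`, and table reference from the example above):

```python
# Sketch: bypass the columnar Parquet path by serializing the DataFrame
# to row-oriented JSON records and loading those instead.
rows = df.to_dict(orient="records")
client.load_table_from_json(
    rows, "my.table.reference", job_config=job_config
).result()
```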