BigQuery: Upload STRUCT / RECORD fields from load_table_from_dataframe #21
Comments
https://jira.apache.org/jira/browse/ARROW-2587 is still open. I'm not sure what we are able to do until that is fixed.
Writing nested structs will be fixed in the Arrow 0.17.0 release (sometime in the next few weeks).
- Plumbs through engine version
- Makes engine version settable via environment variable
- Adds unit test coverage

Should also unblock: googleapis/python-bigquery#21

CC @wesm

Closes #6751 from emkornfield/add_flag_to_python

Authored-by: Micah Kornfield <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
I can confirm that the error from the issue description is not reproducible anymore with the latest pyarrow release. I was able to successfully load the following data (into a new table, that is):

```python
schema = [
    bigquery.SchemaField(
        "bar",
        "STRUCT",
        fields=[
            bigquery.SchemaField("aaa", "INTEGER", mode="REQUIRED"),
            bigquery.SchemaField("bbb", "INTEGER", mode="REQUIRED"),
        ],
        mode="REQUIRED",
    ),
]

dict_series = [
    {"aaa": 1, "bbb": 2}, {"aaa": 3, "bbb": 4}, {"aaa": 5, "bbb": 6}
]
df = pd.DataFrame(data={"bar": dict_series}, columns=["bar"])

job_config = bigquery.LoadJobConfig(schema=schema)
client.load_table_from_dataframe(
    df, "my.table.reference", job_config=job_config
).result()
```

This resulted in the following table and schema on the backend:
[Screenshots of the resulting table data and schema]
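For completeness, here is a small sketch (not part of the original comment) showing one way to verify the resulting schema from the client side; it assumes the same `client` and table reference as in the example above:

```python
# Sketch: fetch the table and print its schema to confirm the STRUCT field
# and its nested subfields made it to the backend.
table = client.get_table("my.table.reference")
for field in table.schema:
    print(field.name, field.field_type, field.mode)
    for subfield in field.fields:
        print("   ", subfield.name, subfield.field_type, subfield.mode)
```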
Hello, I am very interested in this feature and I would love to compile the latest pyarrow from source.
@MainHanzo Fortunately, compiling from source is not needed, as the fix is already available in a released pyarrow version. If you still want to compile the project on your own, check the pyarrow docs (I didn't manage to compile it for Python 3.7 on my machine, though, but succeeded with Python 3.6).
Should this be fixed now? I'm using pyarrow 0.17.1 and google-cloud-bigquery 1.25.0.
@jack-tee AFAIK it should be fixed in pyarrow. The error you are seeing is raised by the client library itself. There is a PR that will remove this error-raising part, but it's not merged yet (it's actually on hold, because it only works in Python 3 and we are dropping Python 2 support in the near future anyway). How urgently do you need the fix? If feasible, you can temporarily comment out the linked code block manually until the fix actually gets released.
Thanks for clarifying @plamut. When I saw that I could load struct fields into a table that didn't exist, but not into an existing table, I thought perhaps that path had been missed, but your explanation makes sense. I can work around it for now. Thanks :)
Is your feature request related to a problem? Please describe.
If you have a pandas Series containing dictionaries, ideally this could be uploaded to BigQuery as a STRUCT / RECORD column. Currently this fails with a "file does not exist" error, because the arrow write_table call fails with "ArrowInvalid: Nested column branch had multiple children".
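For reference, a minimal sketch (hypothetical, not code from the issue) of the Arrow-level failure described above, assuming pyarrow's Parquet writer is used as the intermediate format:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# A pandas object column holding dictionaries, converted to an Arrow STRUCT.
df = pd.DataFrame({"bar": [{"aaa": 1, "bbb": 2}, {"aaa": 3, "bbb": 4}]})
arrow_schema = pa.schema(
    [("bar", pa.struct([("aaa", pa.int64()), ("bbb", pa.int64())]))]
)
table = pa.Table.from_pandas(df, schema=arrow_schema, preserve_index=False)

# Before the ARROW-2587 fix (pyarrow < 0.17.0), this write failed with:
#   ArrowInvalid: Nested column branch had multiple children
pq.write_table(table, "/tmp/struct_column.parquet")
```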
Describe the solution you'd like
Upload of a RECORD column succeeds. This will require a fix to https://jira.apache.org/jira/browse/ARROW-2587.
Describe alternatives you've considered
Change the intermediate file format to JSON or some other format. This isn't ideal, since most other formats are row-oriented, whereas pandas DataFrames are column-oriented.
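As an illustration of the row-oriented alternative, here is a hedged sketch using the client's `load_table_from_json` method (the method exists on the BigQuery client, but the snippet is only an assumption about how the workaround could look, reusing the `df`, `job_config`, and table reference from the example above):

```python
# Sketch: bypass the columnar Parquet path by serializing the DataFrame
# to row-oriented JSON records and loading those instead.
rows = df.to_dict(orient="records")
client.load_table_from_json(
    rows, "my.table.reference", job_config=job_config
).result()
```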