Description
When loading a pandas DataFrame that contains a NaN in a REQUIRED column into BigQuery, the upload succeeds, but the resulting table does not match the DataFrame. The values in the column containing the NaN end up out of order, and the user receives no error or warning.
Environment details
- OS type and version:
- Python version: Python 3.11.4
- pip version: pip 23.2.1
- google-cloud-bigquery version:
Name: google-cloud-bigquery
Version: 3.12.0
Summary: Google BigQuery API client library
Home-page: https://github.com/googleapis/python-bigquery
Author: Google LLC
Author-email: [email protected]
License: Apache 2.0
Requires: google-api-core, google-cloud-core, google-resumable-media, grpcio, grpcio, packaging, proto-plus, protobuf, python-dateutil, requests
Required-by: google-cloud-aiplatform, pandas-gbq
Steps to reproduce
- Run the code below
Code example
from google.cloud import bigquery
import pandas as pd
import numpy as np

# The second row has a NaN in "phash", which the schema below marks as REQUIRED.
df = pd.DataFrame(
    [["hello", "string"], ["hello2", np.nan], ["hello3", "valid"], ["hello4", "valid2"]],
    columns=["image_uri", "phash"],
)

client = bigquery.Client(project="project")
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("image_uri", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("phash", "STRING", mode="REQUIRED"),
    ]
)

# The load job completes without raising an error.
job = client.load_table_from_dataframe(
    df, "foo.foo_bar", job_config=job_config
)
job.result()

# Read the table back to compare it with the original DataFrame.
df_read = pd.read_gbq("foo.foo_bar", project_id="project")
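A possible client-side workaround, sketched here under the assumption that the caller knows which columns are REQUIRED, is to check those columns for nulls before calling `load_table_from_dataframe`. The helper name `nulls_in_required_columns` is hypothetical and is not part of google-cloud-bigquery; it uses only pandas, so it runs without BigQuery credentials.

```python
import numpy as np
import pandas as pd


def nulls_in_required_columns(df, required_columns):
    """Hypothetical helper: return {column: null_count} for every
    required column that contains at least one NaN/None value."""
    counts = {}
    for column in required_columns:
        null_count = int(df[column].isna().sum())
        if null_count > 0:
            counts[column] = null_count
    return counts


# Same DataFrame as in the reproduction above.
df = pd.DataFrame(
    [["hello", "string"], ["hello2", np.nan], ["hello3", "valid"], ["hello4", "valid2"]],
    columns=["image_uri", "phash"],
)

# Column names taken from the REQUIRED fields of the schema above.
problems = nulls_in_required_columns(df, ["image_uri", "phash"])
if problems:
    # Skip the upload instead of risking a silently incorrect table.
    print(f"Not uploading; nulls found in REQUIRED columns: {problems}")
```

This only guards against the situation on the caller's side; it does not change how the library handles NaN values in REQUIRED columns.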