pd.read_gbq broken with 1.26.0 and pyarrow #177


Closed
inglesp opened this issue Jul 22, 2020 · 7 comments · Fixed by #181
Comments


inglesp commented Jul 22, 2020

If pyarrow is installed, then with pandas-gbq==0.13.2 and google-cloud-bigquery==1.26.0, calling pd.read_gbq raises an exception inside this library.

>>> pd.read_gbq("SELECT 1", project_id="ebmdatalab")
/home/inglesp/.pyenv/versions/openp/lib/python3.5/site-packages/google/auth/_default.py:69: UserWarning: Your application has authenticated using end user credentials from Google Cloud SDK without a quota project. You might receive a "quota exceeded" or "API not enabled" error. We recommend you rerun `gcloud auth application-default login` and make sure a quota project is added. Or you can use service accounts instead. For more information about service accounts, see https://cloud.google.com/docs/authentication/
  warnings.warn(_CLOUD_SDK_CREDENTIALS_WARNING)
/home/inglesp/.pyenv/versions/openp/lib/python3.5/site-packages/google/cloud/bigquery/client.py:407: UserWarning: Cannot create BigQuery Storage client, the dependency google-cloud-bigquery-storage is not installed.
  "Cannot create BigQuery Storage client, the dependency "
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.23rows/s]
Traceback (most recent call last):
  File "/home/inglesp/.pyenv/versions/3.5.9/lib/python3.5/code.py", line 91, in runcode
    exec(code, self.locals)
  File "<console>", line 1, in <module>
  File "/home/inglesp/.pyenv/versions/openp/lib/python3.5/site-packages/pandas/io/gbq.py", line 176, in read_gbq
    **kwargs
  File "/home/inglesp/.pyenv/versions/openp/lib/python3.5/site-packages/pandas_gbq/gbq.py", line 967, in read_gbq
    progress_bar_type=progress_bar_type,
  File "/home/inglesp/.pyenv/versions/openp/lib/python3.5/site-packages/pandas_gbq/gbq.py", line 532, in run_query
    progress_bar_type=progress_bar_type,
  File "/home/inglesp/.pyenv/versions/openp/lib/python3.5/site-packages/pandas_gbq/gbq.py", line 562, in _download_results
    progress_bar_type=progress_bar_type,
  File "/home/inglesp/.pyenv/versions/openp/lib/python3.5/site-packages/google/cloud/bigquery/table.py", line 1727, in to_dataframe
    create_bqstorage_client=create_bqstorage_client,
  File "/home/inglesp/.pyenv/versions/openp/lib/python3.5/site-packages/google/cloud/bigquery/table.py", line 1561, in to_arrow
    bqstorage_client.transport.channel.close()
AttributeError: 'NoneType' object has no attribute 'transport'

If pyarrow is not installed, there is no exception. The same code works with google-cloud-bigquery==1.25.0, so I'm raising the issue against this library rather than pydata/pandas-gbq or apache/arrow.

Here are details of the various versions used to reproduce this.

$ python --version
Python 3.8.2
$ cat requirements.in 
google-cloud-bigquery
pandas-gbq
pyarrow
$ pip freeze
cachetools==4.1.1
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
google-api-core==1.22.0
google-auth==1.19.2
google-auth-oauthlib==0.4.1
google-cloud-bigquery==1.26.0
google-cloud-core==1.3.0
google-resumable-media==0.5.1
googleapis-common-protos==1.52.0
idna==2.10
numpy==1.19.1
oauthlib==3.1.0
pandas==1.0.5
pandas-gbq==0.13.2
pip-tools==5.2.1
protobuf==3.12.2
pyarrow==0.17.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pydata-google-auth==1.1.0
python-dateutil==2.8.1
pytz==2020.1
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
six==1.15.0
urllib3==1.25.9
@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery API. label Jul 22, 2020
@plamut plamut added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Jul 22, 2020

plamut commented Jul 22, 2020

I think I see what the issue is. The to_arrow() method tries to create the BQ Storage client, which fails due to the missing optional dependency. However, the owns_bqstorage_client flag is still set to True even though the BQ Storage client could not be constructed, and later the method tries to close the transport on it, which fails because the client is None.
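The failure mode can be sketched without any BigQuery dependencies. This is a hypothetical reproduction (the function and parameter names are invented for illustration) that mirrors the ownership-flag logic, shown with the corrected ordering:

```python
# Minimal sketch of the client-ownership logic described above.
# `client_factory` stands in for the internal factory that returns None
# when the optional google-cloud-bigquery-storage dependency is missing.

def download_rows(bqstorage_client=None, create_bqstorage_client=True,
                  client_factory=lambda: None):
    owns_bqstorage_client = False
    if not bqstorage_client and create_bqstorage_client:
        bqstorage_client = client_factory()
        # In 1.26.0 the flag was set to True *before* the factory call,
        # so the finally block below dereferenced a None client.
        # Setting it from the actual result avoids that:
        owns_bqstorage_client = bqstorage_client is not None
    try:
        return "rows"  # stand-in for the actual row download
    finally:
        if owns_bqstorage_client:
            # Only reached when a real client was actually created.
            bqstorage_client.transport.channel.close()

download_rows()  # completes without AttributeError when no client exists
```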

If feasible, one possible quick workaround would be to install the google-cloud-bigquery-storage dependency (that client is faster anyway).
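For reference, the suggested workaround is a one-line install (package name as published on PyPI):

```shell
pip install google-cloud-bigquery-storage
```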

@plamut plamut self-assigned this Jul 22, 2020

inglesp commented Jul 22, 2020

If feasible, one possible quick workaround would be to install the google-cloud-bigquery-storage dependency (that client is faster anyway).

Are there any downsides to using google-cloud-bigquery-storage?


plamut commented Jul 22, 2020

@inglesp On a technical level probably not, apart from a few extra dependencies. It's also much faster, especially for large datasets. On the downside, the BQ Storage API is billable (check the link in the Pricing section at the end), which can affect the business side of a project.

@shollyman Are there perhaps any other factors that should be taken into account?

alangenfeld added a commit to dagster-io/dagster that referenced this issue Jul 22, 2020
Summary: googleapis/python-bigquery#177

Test Plan: bk

Reviewers: nate, dgibson, max, sashank, schrockn

Reviewed By: schrockn

Differential Revision: https://dagster.phacility.com/D3968
bartaelterman commented

The release notes of 1.26.0 say:

use BigQuery Storage client by default

Shouldn't it be added to the dependencies in setup.py then?


plamut commented Jul 22, 2020

@bartaelterman It is listed, but as an optional dependency (the BQ Storage client only recently reached a stable version). The release notes should probably have emphasized "if the BQ Storage client is available", though.

shollyman commented

@shollyman Are there perhaps any other factors that should be taken into account?

Differences in billing are really the major consideration if this integration is the main use case. The Storage API has additional features (projection/filtering/snapshot control), but they're for use cases where you desire custom processing of the managed storage and don't want the BigQuery query engine to do any of the work.

API enablement is mirrored/shared, so no differences there.


plamut commented Jul 23, 2020

FWIW, if somebody wants the fix before the next release, the following patch can be applied to the installed client, in file google/cloud/bigquery/table.py:

@@ -1534,8 +1534,8 @@ class RowIterator(HTTPIterator):
 
         owns_bqstorage_client = False
         if not bqstorage_client and create_bqstorage_client:
-            owns_bqstorage_client = True
             bqstorage_client = self.client._create_bqstorage_client()
+            owns_bqstorage_client = bqstorage_client is not None
 
         try:
             progress_bar = self._get_progress_bar(progress_bar_type)

inglesp added a commit to bennettoxford/openprescribing that referenced this issue Jul 27, 2020
This is faster than using the REST API or Avro (via PyArrow), and should
cost pennies a month.  See googleapis/python-bigquery#177.