BUG: Add support to replace partitions in date-partitioned tables #47
Conversation
lgtm. add an entry to the changelog.
pandas_gbq/tests/test_gbq.py
Outdated
@@ -1093,6 +1096,79 @@ def test_upload_data_if_table_exists_replace(self):
                              project_id=_get_project_id(),
                              private_key=_get_private_key_path())
        assert result['num_rows'][0] == 5

    def test_upload_data_if_table_exists_replace_dpt_partition(self):
        test_dpt_suffix = "20170101"
can you add the issue number here as a comment
pandas_gbq/tests/test_gbq.py
Outdated
                   _get_project_id(), if_exists='replace',
                   private_key=_get_private_key_path())

        sleep(30)
would be nice to remove these things :> (maybe with a poll decorator?)
Yup, we would need to refactor some other tests as well to get rid of the sleep calls in between.
cc @parthea
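For reference, a minimal sketch of the kind of poll decorator being suggested (a hypothetical helper, not part of this PR): it retries the wrapped check until it passes or a timeout expires, rather than sleeping for a fixed 30 seconds.

# Hypothetical poll decorator: retry the wrapped check until it passes
# or the timeout expires, instead of a fixed sleep(30).
import functools
import time

def poll(timeout=60, interval=5):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.time() + timeout
            while True:
                try:
                    return func(*args, **kwargs)
                except AssertionError:
                    if time.time() >= deadline:
                        raise  # give up once the deadline passes
                    time.sleep(interval)
        return wrapper
    return decorator

A test's read-back assertion could then live in a @poll()-decorated helper instead of being preceded by sleep(30).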
pandas_gbq/tests/test_gbq.py
Outdated
        assert result1['num_rows'][0] == 15
Please remove the extra blank lines here.
pandas_gbq/tests/test_gbq.py
Outdated
@@ -1117,7 +1193,7 @@ def test_google_upload_errors_should_raise_exception(self):
        with tm.assertRaises(gbq.StreamingInsertError):
            gbq.to_gbq(bad_df, self.destination_table + test_id,
                       _get_project_id(), private_key=_get_private_key_path())
Please try to avoid making spacing modifications in future PRs. It is ok for this PR.
Codecov Report

@@            Coverage Diff             @@
##           master      #47       +/-   ##
===========================================
- Coverage   73.61%   27.58%   -46.04%
===========================================
  Files           4        4
  Lines        1554     1606      +52
===========================================
- Hits         1144      443     -701
- Misses        410     1163     +753

Continue to review the full report at Codecov.
Force-pushed from d40d63c to 82f9b1f.
can you rebase
@jreback done.
docs/source/changelog.rst
Outdated
@@ -8,6 +8,7 @@ Changelog
- The dataframe passed to ``.to_gbq(...., if_exists='append')`` needs to contain only a subset of the fields in the BigQuery schema. (:issue:`24`)
- Use the `google-auth <https://google-auth.readthedocs.io/en/latest/>`__ library for authentication because oauth2client is deprecated. (:issue:`39`)
- ``read_gbq`` now has a ``auth_local_webserver`` boolean argument for controlling whether to use web server or console flow when getting user credentials. Replaces `--noauth_local_webserver` command line argument (:issue:`35`)
- Add support to replace partitions in date-partitioned tables (:issue:`47`)
can you be slightly more verbose and include the link: date-partitioned tables. any addition to the main docs about this? (or enhancements to the doc-string)?
can you fixup the linting failures?
@jreback done, coverage still lagging behind the target unfortunately...
doc comment. otherwise lgtm. ping on green.
docs/source/changelog.rst
Outdated
@@ -8,6 +8,7 @@ Changelog
- The dataframe passed to ``.to_gbq(...., if_exists='append')`` needs to contain only a subset of the fields in the BigQuery schema. (:issue:`24`)
- Use the `google-auth <https://google-auth.readthedocs.io/en/latest/>`__ library for authentication because oauth2client is deprecated. (:issue:`39`)
- ``read_gbq`` now has a ``auth_local_webserver`` boolean argument for controlling whether to use web server or console flow when getting user credentials. Replaces `--noauth_local_webserver` command line argument (:issue:`35`)
- Add support to replace partitions in `date-partitioned tables <https://cloud.google.com/bigquery/docs/partitioned-tables>`__. Partition must be specified with a partition decorator separator (``$``). (:issue:`47`)
can you add something in the docs about this (maybe the link above is enough) or a simple code-example.
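Something along these lines might do as a simple code-example for the docs (a rough sketch assuming the ``$`` partition-decorator syntax added in this PR; the dataset, table, and project names are hypothetical):

# Overwrite only the 2017-01-01 partition of an existing
# date-partitioned table (hypothetical names).
import pandas as pd
from pandas_gbq import gbq

df = pd.DataFrame({'value': [1, 2, 3]})
gbq.to_gbq(df, 'my_dataset.my_table$20170101', project_id='my-project',
           if_exists='replace')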
I just have a couple minor observations. There are 3 unit tests failing on Travis for this PR.
============== 3 failed, 81 passed, 12 skipped in 497.45 seconds ===============
Please see the test results here:
https://travis-ci.org/parthea/pandas-gbq/builds/250715765
Please see the following link for steps on running the integration tests.
https://pandas-gbq.readthedocs.io/en/latest/contributing.html#running-google-bigquery-integration-tests
pandas_gbq/tests/test_gbq.py
Outdated
                   .format(dpt_partition),
                   project_id=_get_project_id(),
                   private_key=_get_private_key_path())
        assert result0['num_rows'][0] == 5
I noticed that the table dpt_partition doesn't exist if I run this unit test on its own. My personal preference is that unit tests are not dependent on prior tests.
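One way to remove that dependency would be to create the main date-partitioned table up front in the test's setup. A rough sketch, not part of the PR, using the BigQuery v2 API that pandas-gbq wraps (the project and dataset names are hypothetical, and the schema is trimmed for brevity):

# Hypothetical setup sketch: create the main date-partitioned table
# first so the test no longer depends on earlier tests.
from googleapiclient.discovery import build

service = build('bigquery', 'v2')  # assumes application default credentials
body = {
    'tableReference': {
        'projectId': 'my-project',       # hypothetical
        'datasetId': 'pandas_gbq_test',  # hypothetical
        'tableId': 'dpt_test',
    },
    'schema': {'fields': [{'name': 'bools', 'type': 'BOOLEAN'}]},  # trimmed
    'timePartitioning': {'type': 'DAY'},
}
service.tables().insert(projectId='my-project', datasetId='pandas_gbq_test',
                        body=body).execute()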
pandas_gbq/tests/test_gbq.py
Outdated
        dpt_partition = self.destination_dpt + '$' + test_dpt_suffix

        gbq.to_gbq(df, dpt_partition, _get_project_id(),
                   chunksize=10000, private_key=_get_private_key_path())
I receive the following error when running this unit test locally. Can we add code to create the required table first?
tony@tonypc:~/pydata-pandas-gbq/pandas_gbq/tests$ pytest test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_data_if_table_exists_replace_dpt_partition -v -r s . --maxfail=1
============================= test session starts ==============================
platform linux2 -- Python 2.7.12, pytest-3.0.6, py-1.4.32, pluggy-0.4.0 -- /home/tony/miniconda2/bin/python
cachedir: ../../.cache
rootdir: /home/tony/pydata-pandas-gbq, inifile:
plugins: cov-2.4.0
collected 130 items
test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_data_if_table_exists_replace_dpt_partition FAILED
=================================== FAILURES ===================================
TestToGBQIntegrationWithServiceAccountKeyPath.test_upload_data_if_table_exists_replace_dpt_partition
self = <pandas_gbq.tests.test_gbq.TestToGBQIntegrationWithServiceAccountKeyPath object at 0x7fcf16807050>
def test_upload_data_if_table_exists_replace_dpt_partition(self):
# Issue #47; tests that 'replace' is done by the subsequent call
test_dpt_suffix = "20170101"
test_size = 10
df = make_mixed_dataframe_v2(test_size)
df_different_schema = tm.makeMixedDataFrame()
dpt_partition = self.destination_dpt + '$' + test_dpt_suffix
gbq.to_gbq(df, dpt_partition, _get_project_id(),
> chunksize=10000, private_key=_get_private_key_path())
test_gbq.py:1072:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dataframe = bools flts ints strs times
0 False 1.135...:55.158551-07:00
9 True -0.716134 2 8 2017-07-06 03:42:55.158551-07:00
destination_table = 'pandas_gbq_527291.dpt_test$20170101'
project_id = 'pandas-140401', chunksize = 10000, verbose = True, reauth = False
if_exists = 'fail', private_key = '/home/tony/Desktop/pandas.json'
auth_local_webserver = False
def to_gbq(dataframe, destination_table, project_id, chunksize=10000,
verbose=True, reauth=False, if_exists='fail', private_key=None,
auth_local_webserver=False):
"""Write a DataFrame to a Google BigQuery table.
The main method a user calls to export pandas DataFrame contents to
Google BigQuery table.
Google BigQuery API Client Library v2 for Python is used.
Documentation is available `here
<https://developers.google.com/api-client-library/python/apis/bigquery/v2>`__
Authentication to the Google BigQuery service is via OAuth 2.0.
- If "private_key" is not provided:
By default "application default credentials" are used.
If default application credentials are not found or are restrictive,
user account credentials are used. In this case, you will be asked to
grant permissions for product name 'pandas GBQ'.
- If "private_key" is provided:
Service account credentials will be used to authenticate.
Parameters
----------
dataframe : DataFrame
DataFrame to be written
destination_table : string
Name of table to be written, in the form 'dataset.tablename'
project_id : str
Google BigQuery Account project ID.
chunksize : int (default 10000)
Number of rows to be inserted in each chunk from the dataframe.
verbose : boolean (default True)
Show percentage complete
reauth : boolean (default False)
Force Google BigQuery to reauthenticate the user. This is useful
if multiple accounts are used.
if_exists : {'fail', 'replace', 'append'}, default 'fail'
'fail': If table exists, do nothing.
'replace': If table exists, drop it, recreate it, and insert data.
'append': If table exists and the dataframe schema is a subset of
the destination table schema, insert data. Create destination table
if does not exist.
private_key : str (optional)
Service account private key in JSON format. Can be file path
or string contents. This is useful for remote server
authentication (eg. jupyter iPython notebook on remote host)
auth_local_webserver : boolean, default False
Use the [local webserver flow] instead of the [console flow] when
getting user credentials.
.. [local webserver flow]
http://google-auth-oauthlib.readthedocs.io/en/latest/reference/google_auth_oauthlib.flow.html#google_auth_oauthlib.flow.InstalledAppFlow.run_local_server
.. [console flow]
http://google-auth-oauthlib.readthedocs.io/en/latest/reference/google_auth_oauthlib.flow.html#google_auth_oauthlib.flow.InstalledAppFlow.run_console
.. versionadded:: 0.2.0
"""
_test_google_api_imports()
if if_exists not in ('fail', 'replace', 'append'):
raise ValueError("'{0}' is not valid for if_exists".format(if_exists))
if '.' not in destination_table:
raise NotFoundException(
"Invalid Table Name. Should be of the form 'datasetId.tableId' ")
connector = GbqConnector(
project_id, reauth=reauth, verbose=verbose, private_key=private_key,
auth_local_webserver=auth_local_webserver)
dataset_id, table_id = destination_table.rsplit('.', 1)
table = _Table(project_id, dataset_id, reauth=reauth,
private_key=private_key)
table_schema = _generate_bq_schema(dataframe)
# If table exists, check if_exists parameter
if table.exists(table_id):
if if_exists == 'fail':
raise TableCreationError("Could not create the table because it "
"already exists. "
"Change the if_exists parameter to "
"append or replace data.")
else:
delay = 0
if not connector.verify_schema(dataset_id, table_id, table_schema):
if if_exists == 'append' \
or table.partition_decorator in table_id:
raise InvalidSchema("Please verify that the structure "
"and data types in the DataFrame "
"match the schema of the destination "
"table.")
elif if_exists == 'replace':
table._print('The existing table has a different schema. '
'Please wait 2 minutes. See Google BigQuery '
'issue #191')
delay = 120
if if_exists == 'replace':
table.delete(table_id)
if table.partition_decorator not in table_id:
table.create(table_id, table_schema)
sleep(delay)
else:
if table.partition_decorator in table_id:
> raise TableCreationError("Cannot create a partition without the "
"main table.")
E TableCreationError: Cannot create a partition without the main table.
../gbq.py:1057: TableCreationError
!!!!!!!!!!!!!!!!!!!! Interrupted: stopping after 1 failures !!!!!!!!!!!!!!!!!!!!
=========================== 1 failed in 5.56 seconds ===========================
pandas_gbq/gbq.py
Outdated
        else:
            if table.partition_decorator in table_id:
                raise TableCreationError("Cannot create a partition without the "
                                         "main table.")
Can we add a unit test that checks that TableCreationError is raised when the main table doesn't exist?
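A rough sketch of such a test, following the conventions of the surrounding tests (the table suffix is hypothetical and assumed not to exist; this uses pytest.raises and assumes pytest is imported, while the file elsewhere uses tm.assertRaises):

    def test_upload_data_dpt_partition_without_main_table(self):
        # Hypothetical sketch: writing to a partition decorator on a
        # table that does not exist should raise TableCreationError.
        df = make_mixed_dataframe_v2(10)
        dpt_partition = self.destination_table + "no_main_table" + '$20170101'
        with pytest.raises(gbq.TableCreationError):
            gbq.to_gbq(df, dpt_partition, _get_project_id(),
                       private_key=_get_private_key_path())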
@@ -1032,17 +1032,30 @@ def to_gbq(dataframe, destination_table, project_id, chunksize=10000,
                                     "already exists. "
                                     "Change the if_exists parameter to "
                                     "append or replace data.")
        elif if_exists == 'replace':
            connector.delete_and_recreate_table(
The function delete_and_recreate_table is now unused in the pandas-gbq code, so the actual function can be removed as well.
pandas_gbq/gbq.py
Outdated
"schema of the destination table.") | ||
else: | ||
delay = 0 | ||
if not connector.verify_schema(dataset_id, table_id, table_schema): |
This should be if not connector.schema_is_subset(dataset_id, table_id, table_schema):
See https://travis-ci.org/parthea/pandas-gbq/jobs/250715768. The following unit test is failing:
pandas_gbq/tests/test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_subset_columns_if_table_exists_append
@jreback sorry for the delay. I set up a dev environment on my machine, and I was able to successfully run the tests of the PR's feature. I've committed an updated branch, but the build errors with ...
If the behaviour of 'append', 'replace' and 'fail' refers to the existence of a partition, I believe this patch is incorrect.
Closes #43