BUG: Add support to replace partitions in date-partitioned tables #47

Closed
ghost wants to merge 1 commit into master from add-dpt-support

Conversation


@ghost ghost commented Jun 3, 2017

Closes #43

Contributor

@jreback jreback left a comment


lgtm. add an entry to the changelog.

@@ -1093,6 +1096,79 @@ def test_upload_data_if_table_exists_replace(self):
                           project_id=_get_project_id(),
                           private_key=_get_private_key_path())
        assert result['num_rows'][0] == 5

    def test_upload_data_if_table_exists_replace_dpt_partition(self):
        test_dpt_suffix = "20170101"
Contributor


can you add the issue number here as a comment

                   _get_project_id(), if_exists='replace',
                   private_key=_get_private_key_path())

        sleep(30)
Contributor


would be nice to remove these things :> (maybe with a poll decorator?)

Author


Yup, we'd also need to refactor some other tests to get rid of the sleep calls in between.
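
For reference, a minimal sketch of the poll-decorator idea floated above; the decorator name and its timeout/interval parameters are hypothetical, not existing pandas-gbq test helpers.

import functools
import time

def poll(timeout=30, interval=2):
    # Hypothetical helper: retry the wrapped assertion until it passes or
    # the timeout expires, instead of a fixed sleep(30) before the check.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            deadline = time.time() + timeout
            while True:
                try:
                    return func(*args, **kwargs)
                except AssertionError:
                    if time.time() >= deadline:
                        raise
                    time.sleep(interval)
        return wrapper
    return decorator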

@jreback
Contributor

jreback commented Jun 4, 2017

cc @parthea

@jreback jreback added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label Jun 4, 2017
@jreback jreback added this to the 0.2.0 milestone Jun 4, 2017
        assert result1['num_rows'][0] == 15



Contributor

@parthea parthea Jun 11, 2017


Please remove the extra blank lines here.

@@ -1117,7 +1193,7 @@ def test_google_upload_errors_should_raise_exception(self):
        with tm.assertRaises(gbq.StreamingInsertError):
            gbq.to_gbq(bad_df, self.destination_table + test_id,
                       _get_project_id(), private_key=_get_private_key_path())

Contributor

@parthea parthea Jun 11, 2017


Please try to avoid making spacing modifications in future PRs. It is ok for this PR.

@ghost ghost force-pushed the add-dpt-support branch from 6c6f715 to 1227127 Compare June 11, 2017 13:29
@codecov-io

codecov-io commented Jun 11, 2017

Codecov Report

Merging #47 into master will decrease coverage by 46.03%.
The diff coverage is 11.76%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master      #47       +/-   ##
===========================================
- Coverage   73.61%   27.58%   -46.04%     
===========================================
  Files           4        4               
  Lines        1554     1606       +52     
===========================================
- Hits         1144      443      -701     
- Misses        410     1163      +753
Impacted Files Coverage Δ
pandas_gbq/tests/test_gbq.py 26.75% <11.76%> (-56.25%) ⬇️
pandas_gbq/gbq.py 19.33% <11.76%> (-56.23%) ⬇️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 79c9067...96c2777. Read the comment docs.

@ghost ghost force-pushed the add-dpt-support branch 3 times, most recently from d40d63c to 82f9b1f Compare June 11, 2017 14:49
@parthea parthea changed the title Add support to replace partitions in date-partitioned tables BUG: Add support to replace partitions in date-partitioned tables Jun 13, 2017
@jreback
Contributor

jreback commented Jul 1, 2017

can you rebase

@ghost ghost force-pushed the add-dpt-support branch from 82f9b1f to b8b631e Compare July 3, 2017 16:47
ghost pushed a commit to mremes/pandas-gbq that referenced this pull request Jul 3, 2017
@ghost
Author

ghost commented Jul 3, 2017

@jreback done. This PR's branch is now rebased on top of pydata/master.

@ghost ghost force-pushed the add-dpt-support branch from b8b631e to 5a337ff Compare July 3, 2017 18:46
ghost pushed a commit to mremes/pandas-gbq that referenced this pull request Jul 3, 2017
@@ -8,6 +8,7 @@ Changelog
- The dataframe passed to ```.to_gbq(...., if_exists='append')``` needs to contain only a subset of the fields in the BigQuery schema. (:issue:`24`)
- Use the `google-auth <https://google-auth.readthedocs.io/en/latest/>`__ library for authentication because oauth2client is deprecated. (:issue:`39`)
- ``read_gbq`` now has a ``auth_local_webserver`` boolean argument for controlling whether to use web server or console flow when getting user credentials. Replaces `--noauth_local_webserver` command line argument (:issue:`35`)
- Add support to replace partitions in date-partitioned tables (:issue:`47`)
Contributor


can you be slightly more verbose and include the link: date-partitioned tables. any addition to the main docs about this? (or enhancements to the doc-string)?

@ghost ghost force-pushed the add-dpt-support branch from 5a337ff to cf59a52 Compare July 5, 2017 12:08
ghost pushed a commit to mremes/pandas-gbq that referenced this pull request Jul 5, 2017
@ghost
Author

ghost commented Jul 5, 2017

@jreback it's green now. Let's merge this before #46, as the conflicts are resolved.

@jreback
Contributor

jreback commented Jul 5, 2017

can you fixup the linting failures?

@ghost ghost force-pushed the add-dpt-support branch from cf59a52 to c7b22da Compare July 6, 2017 07:38
ghost pushed a commit to mremes/pandas-gbq that referenced this pull request Jul 6, 2017
@ghost
Author

ghost commented Jul 6, 2017

@jreback done, coverage still lagging behind the target unfortunately...

Contributor

@jreback jreback left a comment


doc comment. otherwise lgtm. ping on green.

@@ -8,6 +8,7 @@ Changelog
- The dataframe passed to ```.to_gbq(...., if_exists='append')``` needs to contain only a subset of the fields in the BigQuery schema. (:issue:`24`)
- Use the `google-auth <https://google-auth.readthedocs.io/en/latest/>`__ library for authentication because oauth2client is deprecated. (:issue:`39`)
- ``read_gbq`` now has a ``auth_local_webserver`` boolean argument for controlling whether to use web server or console flow when getting user credentials. Replaces `--noauth_local_webserver` command line argument (:issue:`35`)
- Add support to replace partitions in `date-partitioned tables <https://cloud.google.com/bigquery/docs/partitioned-tables>`__. Partition must be specified with a partition decorator separator (``$``). (:issue:`47`)
Contributor


can you add something in the docs about this (maybe the link above is enough) or a simple code-example.
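
For illustration, a minimal usage sketch of what this changelog entry describes; the dataset, table, and project names below are placeholders, and the main table must already exist as a date-partitioned table.

import pandas as pd
from pandas_gbq import gbq

df = pd.DataFrame({'a': [1, 2, 3]})

# The '$YYYYMMDD' partition decorator targets a single partition;
# with if_exists='replace', only that partition's contents are replaced.
gbq.to_gbq(df, 'my_dataset.my_table$20170101', 'my-project-id',
           if_exists='replace')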

parthea pushed a commit to parthea/pandas-gbq that referenced this pull request Jul 6, 2017
Contributor

@parthea parthea left a comment


I just have a couple of minor observations. There are 3 unit tests failing on Travis for this PR.

============== 3 failed, 81 passed, 12 skipped in 497.45 seconds ===============

Please see the test results here:
https://travis-ci.org/parthea/pandas-gbq/builds/250715765

Please see the following link for steps on running the integration tests.
https://pandas-gbq.readthedocs.io/en/latest/contributing.html#running-google-bigquery-integration-tests

                                 .format(dpt_partition),
                                 project_id=_get_project_id(),
                                 private_key=_get_private_key_path())
        assert result0['num_rows'][0] == 5
Contributor


I noticed that the table dpt_partition doesn't exist if I run this unit test on its own. My personal preference is that unit tests not depend on prior tests.
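
One way to make the test self-contained would be to create the main table up front, roughly as in this sketch; _Table and _generate_bq_schema are the internal helpers to_gbq itself uses (see the traceback below), though whether a table created this way is actually date-partitioned is an assumption here.

# Hedged sketch: create the main table before uploading to a partition.
# Whether _Table.create produces a date-partitioned table is an assumption;
# it may need an explicit timePartitioning option in the table resource.
dataset_id, table_id = self.destination_dpt.rsplit('.', 1)
table = gbq._Table(_get_project_id(), dataset_id,
                   private_key=_get_private_key_path())
if not table.exists(table_id):
    table.create(table_id, gbq._generate_bq_schema(df))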

        dpt_partition = self.destination_dpt + '$' + test_dpt_suffix

        gbq.to_gbq(df, dpt_partition, _get_project_id(),
                   chunksize=10000, private_key=_get_private_key_path())
Contributor


I receive the following error when running this unit test locally. Can we add code to create the required table first?

tony@tonypc:~/pydata-pandas-gbq/pandas_gbq/tests$ pytest test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_data_if_table_exists_replace_dpt_partition -v -r s . --maxfail=1
============================= test session starts ==============================
platform linux2 -- Python 2.7.12, pytest-3.0.6, py-1.4.32, pluggy-0.4.0 -- /home/tony/miniconda2/bin/python
cachedir: ../../.cache
rootdir: /home/tony/pydata-pandas-gbq, inifile: 
plugins: cov-2.4.0
collected 130 items 

test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_data_if_table_exists_replace_dpt_partition FAILED

=================================== FAILURES ===================================
 TestToGBQIntegrationWithServiceAccountKeyPath.test_upload_data_if_table_exists_replace_dpt_partition 

self = <pandas_gbq.tests.test_gbq.TestToGBQIntegrationWithServiceAccountKeyPath object at 0x7fcf16807050>

    def test_upload_data_if_table_exists_replace_dpt_partition(self):
        # Issue #47; tests that 'replace' is done by the subsequent call
        test_dpt_suffix = "20170101"
        test_size = 10
        df = make_mixed_dataframe_v2(test_size)
        df_different_schema = tm.makeMixedDataFrame()
    
        dpt_partition = self.destination_dpt + '$' + test_dpt_suffix
    
        gbq.to_gbq(df, dpt_partition, _get_project_id(),
>                  chunksize=10000, private_key=_get_private_key_path())

test_gbq.py:1072: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

dataframe =    bools      flts  ints strs                            times
0  False  1.135...:55.158551-07:00
9   True -0.716134     2    8 2017-07-06 03:42:55.158551-07:00
destination_table = 'pandas_gbq_527291.dpt_test$20170101'
project_id = 'pandas-140401', chunksize = 10000, verbose = True, reauth = False
if_exists = 'fail', private_key = '/home/tony/Desktop/pandas.json'
auth_local_webserver = False

    def to_gbq(dataframe, destination_table, project_id, chunksize=10000,
               verbose=True, reauth=False, if_exists='fail', private_key=None,
               auth_local_webserver=False):
        """Write a DataFrame to a Google BigQuery table.
    
        The main method a user calls to export pandas DataFrame contents to
        Google BigQuery table.
    
        Google BigQuery API Client Library v2 for Python is used.
        Documentation is available `here
        <https://developers.google.com/api-client-library/python/apis/bigquery/v2>`__
    
        Authentication to the Google BigQuery service is via OAuth 2.0.
    
        - If "private_key" is not provided:
    
          By default "application default credentials" are used.
    
          If default application credentials are not found or are restrictive,
          user account credentials are used. In this case, you will be asked to
          grant permissions for product name 'pandas GBQ'.
    
        - If "private_key" is provided:
    
          Service account credentials will be used to authenticate.
    
        Parameters
        ----------
        dataframe : DataFrame
            DataFrame to be written
        destination_table : string
            Name of table to be written, in the form 'dataset.tablename'
        project_id : str
            Google BigQuery Account project ID.
        chunksize : int (default 10000)
            Number of rows to be inserted in each chunk from the dataframe.
        verbose : boolean (default True)
            Show percentage complete
        reauth : boolean (default False)
            Force Google BigQuery to reauthenticate the user. This is useful
            if multiple accounts are used.
        if_exists : {'fail', 'replace', 'append'}, default 'fail'
            'fail': If table exists, do nothing.
            'replace': If table exists, drop it, recreate it, and insert data.
            'append': If table exists and the dataframe schema is a subset of
            the destination table schema, insert data. Create destination table
            if does not exist.
        private_key : str (optional)
            Service account private key in JSON format. Can be file path
            or string contents. This is useful for remote server
            authentication (eg. jupyter iPython notebook on remote host)
        auth_local_webserver : boolean, default False
            Use the [local webserver flow] instead of the [console flow] when
            getting user credentials.
    
            .. [local webserver flow]
                http://google-auth-oauthlib.readthedocs.io/en/latest/reference/google_auth_oauthlib.flow.html#google_auth_oauthlib.flow.InstalledAppFlow.run_local_server
            .. [console flow]
                http://google-auth-oauthlib.readthedocs.io/en/latest/reference/google_auth_oauthlib.flow.html#google_auth_oauthlib.flow.InstalledAppFlow.run_console
            .. versionadded:: 0.2.0
        """
    
        _test_google_api_imports()
    
        if if_exists not in ('fail', 'replace', 'append'):
            raise ValueError("'{0}' is not valid for if_exists".format(if_exists))
    
        if '.' not in destination_table:
            raise NotFoundException(
                "Invalid Table Name. Should be of the form 'datasetId.tableId' ")
    
        connector = GbqConnector(
            project_id, reauth=reauth, verbose=verbose, private_key=private_key,
            auth_local_webserver=auth_local_webserver)
        dataset_id, table_id = destination_table.rsplit('.', 1)
    
        table = _Table(project_id, dataset_id, reauth=reauth,
                       private_key=private_key)
    
        table_schema = _generate_bq_schema(dataframe)
    
        # If table exists, check if_exists parameter
        if table.exists(table_id):
            if if_exists == 'fail':
                raise TableCreationError("Could not create the table because it "
                                         "already exists. "
                                         "Change the if_exists parameter to "
                                         "append or replace data.")
            else:
                delay = 0
                if not connector.verify_schema(dataset_id, table_id, table_schema):
                    if if_exists == 'append' \
                            or table.partition_decorator in table_id:
                        raise InvalidSchema("Please verify that the structure "
                                            "and data types in the DataFrame "
                                            "match the schema of the destination "
                                            "table.")
                    elif if_exists == 'replace':
                        table._print('The existing table has a different schema. '
                                     'Please wait 2 minutes. See Google BigQuery '
                                     'issue #191')
                        delay = 120
                if if_exists == 'replace':
                    table.delete(table_id)
                    if table.partition_decorator not in table_id:
                        table.create(table_id, table_schema)
                        sleep(delay)
    
        else:
            if table.partition_decorator in table_id:
>               raise TableCreationError("Cannot create a partition without the "
                                         "main table.")
E               TableCreationError: Cannot create a partition without the main table.

../gbq.py:1057: TableCreationError
!!!!!!!!!!!!!!!!!!!! Interrupted: stopping after 1 failures !!!!!!!!!!!!!!!!!!!!
=========================== 1 failed in 5.56 seconds ===========================

    else:
        if table.partition_decorator in table_id:
            raise TableCreationError("Cannot create a partition without the "
                                     "main table.")
Contributor


Can we add a unit test that checks that TableCreationError is raised when the main table doesn't exist?
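
Something along these lines might do (a sketch only; the table suffix is a placeholder and the test assumes no such table exists in the test dataset):

    def test_upload_data_dpt_partition_without_main_table(self):
        # Hypothetical test: uploading to a partition of a non-existent
        # main table should raise TableCreationError.
        df = make_mixed_dataframe_v2(10)
        with tm.assertRaises(gbq.TableCreationError):
            gbq.to_gbq(df, self.destination_table + "dpt_missing$20170101",
                       _get_project_id(),
                       private_key=_get_private_key_path())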

@@ -1032,17 +1032,30 @@ def to_gbq(dataframe, destination_table, project_id, chunksize=10000,
                                         "already exists. "
                                         "Change the if_exists parameter to "
                                         "append or replace data.")
            elif if_exists == 'replace':
                connector.delete_and_recreate_table(
Contributor


The function delete_and_recreate_table is now unused in the pandas-gbq code, so the function itself can be removed as well.

"schema of the destination table.")
else:
delay = 0
if not connector.verify_schema(dataset_id, table_id, table_schema):
Contributor


This should be if not connector.schema_is_subset(dataset_id, table_id, table_schema):

See https://travis-ci.org/parthea/pandas-gbq/jobs/250715768:
The following unit test is failing:

pandas_gbq/tests/test_gbq.py::TestToGBQIntegrationWithServiceAccountKeyPath::test_upload_subset_columns_if_table_exists_append

@jreback jreback modified the milestones: 0.3.0, 0.2.0 Jul 22, 2017
@ghost ghost force-pushed the add-dpt-support branch 8 times, most recently from 9111364 to 62f1501 Compare August 6, 2017 17:25
ghost pushed a commit to mremes/pandas-gbq that referenced this pull request Aug 6, 2017
@ghost
Author

ghost commented Aug 7, 2017

@jreback sorry for the delay. I set up a dev environment on my machine and was able to successfully run the tests for this PR's feature. I've committed an updated branch, but the build fails with an Invalid JWT Signature. message. Is this something related to the service key used by CI?

@ghost ghost force-pushed the add-dpt-support branch from 62f1501 to 96c2777 Compare August 7, 2017 06:40
@paoloburelli

If the behaviour of 'append', 'replace' and 'fail' is meant to refer to the existence of a partition, I believe this patch is incorrect.
The existence check is performed on the table rather than on the partition, so writing to an empty partitioned table fails unless 'replace' or 'append' is specified.
Furthermore, the replace test is incorrect, as it does not verify that the content of a specific partition is replaced.
If such a test were performed correctly, it would fail: due to a delay in the streaming buffer, deleting a partition and streaming into it results in the data appearing only after several minutes, and easily up to an hour later.
I addressed all these issues in #85 with correct tests for all the write policies.

@jreback jreback modified the milestones: 0.3.0, 0.4.0 Nov 25, 2017
mremes pushed a commit to mremes/pandas-gbq that referenced this pull request Feb 19, 2018
@mremes
Contributor

mremes commented Feb 19, 2018

@jreback @parthea This can be closed; #124 covers this now.
