
Commit f32edcb

Author: tworec

BUG: fix read_gbq lost numeric precision

Fixes:
- lost precision for longs above 2^53
- lost precision for floats above 10k

Parent: ee374ee

File tree

4 files changed: +276 -53 lines

doc/source/io.rst

+46 -14

@@ -38,7 +38,7 @@ object.
 * :ref:`read_json<io.json_reader>`
 * :ref:`read_msgpack<io.msgpack>` (experimental)
 * :ref:`read_html<io.read_html>`
-* :ref:`read_gbq<io.bigquery_reader>` (experimental)
+* :ref:`read_gbq<io.bigquery>` (experimental)
 * :ref:`read_stata<io.stata_reader>`
 * :ref:`read_sas<io.sas_reader>`
 * :ref:`read_clipboard<io.clipboard>`

@@ -53,7 +53,7 @@ The corresponding ``writer`` functions are object methods that are accessed like
 * :ref:`to_json<io.json_writer>`
 * :ref:`to_msgpack<io.msgpack>` (experimental)
 * :ref:`to_html<io.html>`
-* :ref:`to_gbq<io.bigquery_writer>` (experimental)
+* :ref:`to_gbq<io.bigquery>` (experimental)
 * :ref:`to_stata<io.stata_writer>`
 * :ref:`to_clipboard<io.clipboard>`
 * :ref:`to_pickle<io.pickle>`

@@ -4429,16 +4429,11 @@ DataFrame with a shape and data types derived from the source table.
 Additionally, DataFrames can be inserted into new BigQuery tables or appended
 to existing tables.

-You will need to install some additional dependencies:
-
-- Google's `python-gflags <https://github.com/google/python-gflags/>`__
-- `httplib2 <http://pypi.python.org/pypi/httplib2>`__
-- `google-api-python-client <http://github.com/google/google-api-python-client>`__
-
 .. warning::

    To use this module, you will need a valid BigQuery account. Refer to the
-   `BigQuery Documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__ for details on the service itself.
+   `BigQuery Documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__
+   for details on the service itself.

 The key functions are:

@@ -4452,7 +4447,43 @@ The key functions are:

 .. currentmodule:: pandas

-.. _io.bigquery_reader:
+
+Supported Data Types
+++++++++++++++++++++
+
+pandas supports these `BigQuery data types <https://cloud.google.com/bigquery/data-types>`__:
+``STRING``, ``INTEGER`` (64 bit), ``FLOAT`` (64 bit), ``BOOLEAN`` and
+``TIMESTAMP`` (microsecond precision). The data types ``BYTES`` and ``RECORD``
+are not supported.
+
+Integer and boolean ``NA`` handling
++++++++++++++++++++++++++++++++++++
+
+.. versionadded:: 0.19
+
+Since all columns in BigQuery queries are nullable, and NumPy lacks ``NA``
+support for integer and boolean types, this module will store ``INTEGER`` or
+``BOOLEAN`` columns with at least one ``NULL`` value as ``dtype=object``.
+Otherwise those columns will be stored as ``dtype=int64`` or ``dtype=bool``,
+respectively.
+
+This is the opposite of the default pandas behaviour, which promotes integer
+types to float in order to store NAs. See the :ref:`gotchas<gotchas.intna>`
+for a detailed explanation.
+
+While that trade-off works well in most cases, it breaks down for values
+greater than 2**53. Such values in BigQuery can represent identifiers, and
+unnoticed precision loss on an identifier is what we want to avoid.
+
+Dependencies
+++++++++++++
+
+This module requires these additional dependencies:
+
+- `httplib2 <http://pypi.python.org/pypi/httplib2>`__
+- `google-api-python-client <http://github.com/google/google-api-python-client>`__
+- `oauth2client <https://github.com/google/oauth2client>`__
+

 .. _io.bigquery_authentication:

@@ -4467,7 +4498,7 @@ Is possible to authenticate with either user account credentials or service acco
 Authenticating with user account credentials is as simple as following the prompts in a browser window
 which will be automatically opened for you. You will be authenticated to the specified
 ``BigQuery`` account using the product name ``pandas GBQ``. It is only possible on local host.
-The remote authentication using user account credentials is not currently supported in Pandas.
+The remote authentication using user account credentials is not currently supported in pandas.
 Additional information on the authentication mechanism can be found
 `here <https://developers.google.com/identity/protocols/OAuth2#clientside/>`__.

@@ -4476,8 +4507,6 @@ is particularly useful when working on remote servers (eg. jupyter iPython noteb
 Additional information on service accounts can be found
 `here <https://developers.google.com/identity/protocols/OAuth2#serviceaccount>`__.

-You will need to install an additional dependency: `oauth2client <https://github.com/google/oauth2client>`__.
-
 Authentication via ``application default credentials`` is also possible. This is only valid
 if the parameter ``private_key`` is not provided. This method also requires that
 the credentials can be fetched from the environment the code is running in.

@@ -4497,6 +4526,7 @@ Additional information on
 A private key can be obtained from the Google developers console by clicking
 `here <https://console.developers.google.com/permissions/serviceaccounts>`__. Use JSON key type.

+.. _io.bigquery_reader:

 Querying
 ''''''''

@@ -4540,7 +4570,6 @@ destination DataFrame as well as a preferred column order as follows:

 .. _io.bigquery_writer:

-
 Writing DataFrames
 ''''''''''''''''''

@@ -4630,6 +4659,8 @@ For example:
 often as the service seems to be changing and evolving. BiqQuery is best for analyzing large
 sets of data quickly, but it is not a direct replacement for a transactional database.

+.. _io.bigquery_create_tables:
+
 Creating BigQuery Tables
 ''''''''''''''''''''''''

@@ -4659,6 +4690,7 @@ produce the dictionary representation schema of the specified pandas DataFrame.
 the new table with a different name. Refer to
 `Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.

+
 .. _io.stata:

 Stata Format
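
The trade-off documented in the new "Integer and boolean ``NA`` handling" section above is easy to demonstrate. A minimal sketch (illustrative values only): pandas' default promotion of a nullable integer column to float64 silently rounds values above 2**53, while dtype=object keeps them exact.

    import pandas as pd

    big_id = 2 ** 53 + 1  # 9007199254740993; float64 cannot represent this exactly

    # Default pandas behaviour: an integer column containing an NA is
    # promoted to float64, silently rounding the large identifier.
    promoted = pd.Series([big_id, None])
    print(promoted.dtype)              # float64
    print(int(promoted[0]) == big_id)  # False -- precision was lost

    # Storing the column as dtype=object keeps exact Python ints, which is
    # what read_gbq now does for nullable INTEGER columns.
    preserved = pd.Series([big_id, None], dtype=object)
    print(preserved[0] == big_id)      # True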

doc/source/whatsnew/v0.19.1.txt

+87

@@ -0,0 +1,87 @@
+.. _whatsnew_0191:
+
+v0.19.1 (????, 2016)
+--------------------
+
+This is a minor bug-fix release from 0.19.0 and includes a large number of
+bug fixes along with several new features, enhancements, and performance improvements.
+We recommend that all users upgrade to this version.
+
+Highlights include:
+
+
+Check the :ref:`API Changes <whatsnew_0191.api_breaking>` and :ref:`deprecations <whatsnew_0191.deprecations>` before updating.
+
+.. contents:: What's new in v0.19.1
+    :local:
+    :backlinks: none
+
+.. _whatsnew_0191.enhancements:
+
+New features
+~~~~~~~~~~~~
+
+
+
+
+.. _whatsnew_0191.enhancements.other:
+
+Other enhancements
+^^^^^^^^^^^^^^^^^^
+
+
+
+
+.. _whatsnew_0191.api_breaking:
+
+Backwards incompatible API changes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. _whatsnew_0191.api:
+
+
+
+
+Other API Changes
+^^^^^^^^^^^^^^^^^
+
+.. _whatsnew_0191.deprecations:
+
+Deprecations
+^^^^^^^^^^^^
+
+
+
+
+.. _whatsnew_0191.prior_deprecations:
+
+Removal of prior version deprecations/changes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+
+
+.. _whatsnew_0191.performance:
+
+Performance Improvements
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+
+
+
+.. _whatsnew_0191.bug_fixes:
+
+Bug Fixes
+~~~~~~~~~
+
+- The :func:`pandas.io.gbq.read_gbq` method now stores ``INTEGER`` columns as ``dtype=object`` if they contain ``NULL`` values. Otherwise they are stored as ``int64``. This prevents precision loss for integers greater than 2**53. Furthermore, ``FLOAT`` columns with values above 10**4 are no longer cast to ``int64``, which also caused precision loss (:issue:`14064`).
+
+*OR*
+
+- The :func:`read_gbq` method now stores ``INTEGER`` columns as ``dtype=int64`` by default. If such a column contains ``NULL`` values, it is promoted to ``float``, or to ``object`` if it also contains values greater than ``2**53``, which prevents precision loss. Furthermore, ``FLOAT`` columns with values above 10**4 are no longer cast to ``int64``, which also caused precision loss (:issue:`14064`).

pandas/io/gbq.py

+13 -11

@@ -586,18 +586,14 @@ def _parse_data(schema, rows):
     # see:
     # http://pandas.pydata.org/pandas-docs/dev/missing_data.html
     # #missing-data-casting-rules-and-indexing
-    dtype_map = {'INTEGER': np.dtype(float),
-                 'FLOAT': np.dtype(float),
-                 # This seems to be buggy without nanosecond indicator
+    dtype_map = {'FLOAT': np.dtype(float),
                  'TIMESTAMP': 'M8[ns]'}

     fields = schema['fields']
     col_types = [field['type'] for field in fields]
     col_names = [str(field['name']) for field in fields]
     col_dtypes = [dtype_map.get(field['type'], object) for field in fields]
-    page_array = np.zeros((len(rows),),
-                          dtype=lzip(col_names, col_dtypes))
-
+    page_array = np.zeros((len(rows),), dtype=lzip(col_names, col_dtypes))
     for row_num, raw_row in enumerate(rows):
         entries = raw_row.get('f', [])
         for col_num, field_type in enumerate(col_types):
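
With ``'INTEGER'`` dropped from ``dtype_map``, integer columns now fall through to the ``object`` default, so the structured ``page_array`` can hold exact Python ints (``lzip`` from ``pandas.compat`` is just ``list(zip(...))``). A minimal sketch of that mechanism, with a made-up two-column schema:

    import numpy as np

    # Hypothetical columns: an INTEGER column (falls back to object dtype)
    # and a FLOAT column (float64), as _parse_data would derive them.
    col_names = ['id', 'score']
    col_dtypes = [object, np.dtype(float)]
    rows = [(2 ** 53 + 1, 0.5), (7, 1.5)]

    # One structured-array field per result column.
    page_array = np.zeros((len(rows),), dtype=list(zip(col_names, col_dtypes)))
    for row_num, row in enumerate(rows):
        for col_num, value in enumerate(row):
            page_array[row_num][col_num] = value

    print(page_array['id'][0])  # 9007199254740993, stored exactly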
@@ -611,7 +607,9 @@ def _parse_data(schema, rows):
 def _parse_entry(field_value, field_type):
     if field_value is None or field_value == 'null':
         return None
-    if field_type == 'INTEGER' or field_type == 'FLOAT':
+    if field_type == 'INTEGER':
+        return int(field_value)
+    elif field_type == 'FLOAT':
         return float(field_value)
     elif field_type == 'TIMESTAMP':
         timestamp = datetime.utcfromtimestamp(float(field_value))
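
The split ``INTEGER``/``FLOAT`` branches are the heart of the fix: the BigQuery API returns cell values as strings, and routing integers through ``float()`` rounded anything above 2**53. A quick illustration:

    raw = '9007199254740993'  # 2**53 + 1, as the API would return it

    print(int(float(raw)))   # 9007199254740992 -- old path, off by one
    print(int(raw))          # 9007199254740993 -- new path, exact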
@@ -728,10 +726,14 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,
                 'Column order does not match this DataFrame.'
             )

-    # Downcast floats to integers and objects to booleans
-    # if there are no NaN's. This is presently due to a
-    # limitation of numpy in handling missing data.
-    final_df._data = final_df._data.downcast(dtypes='infer')
+    # cast BOOLEAN and INTEGER columns from object to bool/int
+    # if they don't have any nulls
+    type_map = {'BOOLEAN': bool, 'INTEGER': int}
+    for field in schema['fields']:
+        if field['type'] in type_map and \
+                final_df[field['name']].notnull().all():
+            final_df[field['name']] = \
+                final_df[field['name']].astype(type_map[field['type']])

     connector.print_elapsed_seconds(
         'Total time taken',
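
The new loop replaces the blanket ``final_df._data.downcast(dtypes='infer')`` with schema-driven casts. A self-contained sketch of the same logic on a stand-in query result (schema and column names invented for illustration):

    import pandas as pd

    # BOOLEAN and INTEGER columns arrive from the parsing step as dtype=object.
    schema = {'fields': [{'name': 'flag', 'type': 'BOOLEAN'},
                         {'name': 'n', 'type': 'INTEGER'},
                         {'name': 'maybe_n', 'type': 'INTEGER'}]}
    final_df = pd.DataFrame({'flag': pd.Series([True, False], dtype=object),
                             'n': pd.Series([1, 2], dtype=object),
                             'maybe_n': pd.Series([1, None], dtype=object)})

    # Cast back to bool/int only when a column has no NULLs; nullable
    # columns stay dtype=object so large ints survive intact.
    type_map = {'BOOLEAN': bool, 'INTEGER': int}
    for field in schema['fields']:
        name = field['name']
        if field['type'] in type_map and final_df[name].notnull().all():
            final_df[name] = final_df[name].astype(type_map[field['type']])

    print(final_df.dtypes)  # flag -> bool, n -> int64 (platform int), maybe_n -> object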
