
Commit f32edcb

Author: tworec

BUG: fix read_gbq lost numeric precision

Fixes:
- lost precision for longs above 2^53
- lost precision for floats above 10k

Parent: ee374ee

File tree

4 files changed: +276 -53 lines

doc/source/io.rst

+46 -14

@@ -38,7 +38,7 @@ object.
 * :ref:`read_json<io.json_reader>`
 * :ref:`read_msgpack<io.msgpack>` (experimental)
 * :ref:`read_html<io.read_html>`
-* :ref:`read_gbq<io.bigquery_reader>` (experimental)
+* :ref:`read_gbq<io.bigquery>` (experimental)
 * :ref:`read_stata<io.stata_reader>`
 * :ref:`read_sas<io.sas_reader>`
 * :ref:`read_clipboard<io.clipboard>`

@@ -53,7 +53,7 @@ The corresponding ``writer`` functions are object methods that are accessed like
 * :ref:`to_json<io.json_writer>`
 * :ref:`to_msgpack<io.msgpack>` (experimental)
 * :ref:`to_html<io.html>`
-* :ref:`to_gbq<io.bigquery_writer>` (experimental)
+* :ref:`to_gbq<io.bigquery>` (experimental)
 * :ref:`to_stata<io.stata_writer>`
 * :ref:`to_clipboard<io.clipboard>`
 * :ref:`to_pickle<io.pickle>`

@@ -4429,16 +4429,11 @@ DataFrame with a shape and data types derived from the source table.
 Additionally, DataFrames can be inserted into new BigQuery tables or appended
 to existing tables.

-You will need to install some additional dependencies:
-
-- Google's `python-gflags <https://github.com/google/python-gflags/>`__
-- `httplib2 <http://pypi.python.org/pypi/httplib2>`__
-- `google-api-python-client <http://github.com/google/google-api-python-client>`__
-
 .. warning::

    To use this module, you will need a valid BigQuery account. Refer to the
-   `BigQuery Documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__ for details on the service itself.
+   `BigQuery Documentation <https://cloud.google.com/bigquery/what-is-bigquery>`__
+   for details on the service itself.

 The key functions are:

@@ -4452,7 +4447,43 @@ The key functions are:

 .. currentmodule:: pandas

-.. _io.bigquery_reader:
+
+Supported Data Types
+++++++++++++++++++++
+
+pandas supports these `BigQuery data types <https://cloud.google.com/bigquery/data-types>`__:
+``STRING``, ``INTEGER`` (64 bit), ``FLOAT`` (64 bit), ``BOOLEAN`` and
+``TIMESTAMP`` (microsecond precision). The data types ``BYTES`` and ``RECORD``
+are not supported.
+
+Integer and boolean ``NA`` handling
++++++++++++++++++++++++++++++++++++
+
+.. versionadded:: 0.19
+
+Since all columns in BigQuery queries are nullable, and NumPy lacks ``NA``
+support for integer and boolean types, this module will store ``INTEGER`` or
+``BOOLEAN`` columns with at least one ``NULL`` value as ``dtype=object``.
+Otherwise those columns will be stored as ``dtype=int64`` or ``dtype=bool``,
+respectively.
+
+This is the opposite of the default pandas behaviour, which promotes integer
+types to float in order to store NAs. See the :ref:`gotchas<gotchas.intna>`
+for a detailed explanation.
+
+While that trade-off works well in most cases, it breaks down for values
+greater than 2**53. Such values in BigQuery can represent identifiers, and
+unnoticed precision loss on an identifier is what we want to avoid.
+
+Dependencies
+++++++++++++
+
+This module requires these additional dependencies:
+
+- `httplib2 <http://pypi.python.org/pypi/httplib2>`__
+- `google-api-python-client <http://github.com/google/google-api-python-client>`__
+- `oauth2client <https://github.com/google/oauth2client>`__
+

 .. _io.bigquery_authentication:

@@ -4467,7 +4498,7 @@ Is possible to authenticate with either user account credentials or service acco
 Authenticating with user account credentials is as simple as following the prompts in a browser window
 which will be automatically opened for you. You will be authenticated to the specified
 ``BigQuery`` account using the product name ``pandas GBQ``. It is only possible on local host.
-The remote authentication using user account credentials is not currently supported in Pandas.
+The remote authentication using user account credentials is not currently supported in pandas.
 Additional information on the authentication mechanism can be found
 `here <https://developers.google.com/identity/protocols/OAuth2#clientside/>`__.

@@ -4476,8 +4507,6 @@ is particularly useful when working on remote servers (eg. jupyter iPython noteb
 Additional information on service accounts can be found
 `here <https://developers.google.com/identity/protocols/OAuth2#serviceaccount>`__.

-You will need to install an additional dependency: `oauth2client <https://github.com/google/oauth2client>`__.
-
 Authentication via ``application default credentials`` is also possible. This is only valid
 if the parameter ``private_key`` is not provided. This method also requires that
 the credentials can be fetched from the environment the code is running in.

@@ -4497,6 +4526,7 @@ Additional information on
 A private key can be obtained from the Google developers console by clicking
 `here <https://console.developers.google.com/permissions/serviceaccounts>`__. Use JSON key type.

+.. _io.bigquery_reader:

 Querying
 ''''''''

@@ -4540,7 +4570,6 @@ destination DataFrame as well as a preferred column order as follows:

 .. _io.bigquery_writer:

-
 Writing DataFrames
 ''''''''''''''''''

@@ -4630,6 +4659,8 @@ For example:
 often as the service seems to be changing and evolving. BiqQuery is best for analyzing large
 sets of data quickly, but it is not a direct replacement for a transactional database.

+.. _io.bigquery_create_tables:
+
 Creating BigQuery Tables
 ''''''''''''''''''''''''

@@ -4659,6 +4690,7 @@ produce the dictionary representation schema of the specified pandas DataFrame.
 the new table with a different name. Refer to
 `Google BigQuery issue 191 <https://code.google.com/p/google-bigquery/issues/detail?id=191>`__.

+
 .. _io.stata:

 Stata Format
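
The trade-off documented in the new "Integer and boolean ``NA`` handling" section above is easy to demonstrate. A minimal sketch (illustrative values only): pandas' default promotion of a nullable integer column to float64 silently rounds values above 2**53, while dtype=object keeps them exact.

    import pandas as pd

    big_id = 2 ** 53 + 1  # 9007199254740993; float64 cannot represent this exactly

    # Default pandas behaviour: an integer column containing an NA is
    # promoted to float64, silently rounding the large identifier.
    promoted = pd.Series([big_id, None])
    print(promoted.dtype)              # float64
    print(int(promoted[0]) == big_id)  # False -- precision was lost

    # Storing the column as dtype=object keeps exact Python ints, which is
    # what read_gbq now does for nullable INTEGER columns.
    preserved = pd.Series([big_id, None], dtype=object)
    print(preserved[0] == big_id)      # True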

doc/source/whatsnew/v0.19.1.txt

+87

@@ -0,0 +1,87 @@
+.. _whatsnew_0191:
+
+v0.19.1 (????, 2016)
+--------------------
+
+This is a minor bug-fix release from 0.19.0 and includes a large number of
+bug fixes along with several new features, enhancements, and performance improvements.
+We recommend that all users upgrade to this version.
+
+Highlights include:
+
+
+Check the :ref:`API Changes <whatsnew_0191.api_breaking>` and :ref:`deprecations <whatsnew_0191.deprecations>` before updating.
+
+.. contents:: What's new in v0.19.1
+    :local:
+    :backlinks: none
+
+.. _whatsnew_0191.enhancements:
+
+New features
+~~~~~~~~~~~~
+
+
+
+
+.. _whatsnew_0191.enhancements.other:
+
+Other enhancements
+^^^^^^^^^^^^^^^^^^
+
+
+
+
+.. _whatsnew_0191.api_breaking:
+
+Backwards incompatible API changes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. _whatsnew_0191.api:
+
+
+
+
+Other API Changes
+^^^^^^^^^^^^^^^^^
+
+.. _whatsnew_0191.deprecations:
+
+Deprecations
+^^^^^^^^^^^^
+
+
+
+
+.. _whatsnew_0191.prior_deprecations:
+
+Removal of prior version deprecations/changes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+
+
+.. _whatsnew_0191.performance:
+
+Performance Improvements
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+
+
+
+.. _whatsnew_0191.bug_fixes:
+
+Bug Fixes
+~~~~~~~~~
+
+- The :func:`pandas.io.gbq.read_gbq` method now stores ``INTEGER`` columns as ``dtype=object`` if they contain ``NULL`` values. Otherwise they are stored as ``int64``. This prevents precision loss for integers greater than 2**53. Furthermore, ``FLOAT`` columns with values above 10**4 are no longer cast to ``int64``, which also caused precision loss (:issue:`14064`).
+
+*OR*
+
+- The :func:`read_gbq` method now stores ``INTEGER`` columns as ``dtype=int64`` by default. If such a column contains ``NULL`` values, it is promoted to ``float``, or to ``object`` if it also contains values greater than ``2**53``, which prevents precision loss. Furthermore, ``FLOAT`` columns with values above 10**4 are no longer cast to ``int64``, which also caused precision loss (:issue:`14064`).

pandas/io/gbq.py

+13 -11

@@ -586,18 +586,14 @@ def _parse_data(schema, rows):
     # see:
     # http://pandas.pydata.org/pandas-docs/dev/missing_data.html
     # #missing-data-casting-rules-and-indexing
-    dtype_map = {'INTEGER': np.dtype(float),
-                 'FLOAT': np.dtype(float),
-                 # This seems to be buggy without nanosecond indicator
+    dtype_map = {'FLOAT': np.dtype(float),
                  'TIMESTAMP': 'M8[ns]'}

     fields = schema['fields']
     col_types = [field['type'] for field in fields]
     col_names = [str(field['name']) for field in fields]
     col_dtypes = [dtype_map.get(field['type'], object) for field in fields]
-    page_array = np.zeros((len(rows),),
-                          dtype=lzip(col_names, col_dtypes))
-
+    page_array = np.zeros((len(rows),), dtype=lzip(col_names, col_dtypes))
     for row_num, raw_row in enumerate(rows):
         entries = raw_row.get('f', [])
         for col_num, field_type in enumerate(col_types):
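
With ``'INTEGER'`` dropped from ``dtype_map``, integer columns now fall through to the ``object`` default, so the structured ``page_array`` can hold exact Python ints (``lzip`` from ``pandas.compat`` is just ``list(zip(...))``). A minimal sketch of that mechanism, with a made-up two-column schema:

    import numpy as np

    # Hypothetical columns: an INTEGER column (falls back to object dtype)
    # and a FLOAT column (float64), as _parse_data would derive them.
    col_names = ['id', 'score']
    col_dtypes = [object, np.dtype(float)]
    rows = [(2 ** 53 + 1, 0.5), (7, 1.5)]

    # One structured-array field per result column.
    page_array = np.zeros((len(rows),), dtype=list(zip(col_names, col_dtypes)))
    for row_num, row in enumerate(rows):
        for col_num, value in enumerate(row):
            page_array[row_num][col_num] = value

    print(page_array['id'][0])  # 9007199254740993, stored exactly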
@@ -611,7 +607,9 @@ def _parse_data(schema, rows):
 def _parse_entry(field_value, field_type):
     if field_value is None or field_value == 'null':
         return None
-    if field_type == 'INTEGER' or field_type == 'FLOAT':
+    if field_type == 'INTEGER':
+        return int(field_value)
+    elif field_type == 'FLOAT':
         return float(field_value)
     elif field_type == 'TIMESTAMP':
         timestamp = datetime.utcfromtimestamp(float(field_value))
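
The split ``INTEGER``/``FLOAT`` branches are the heart of the fix: the BigQuery API returns cell values as strings, and routing integers through ``float()`` rounded anything above 2**53. A quick illustration:

    raw = '9007199254740993'  # 2**53 + 1, as the API would return it

    print(int(float(raw)))   # 9007199254740992 -- old path, off by one
    print(int(raw))          # 9007199254740993 -- new path, exact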
@@ -728,10 +726,14 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,
                 'Column order does not match this DataFrame.'
             )

-    # Downcast floats to integers and objects to booleans
-    # if there are no NaN's. This is presently due to a
-    # limitation of numpy in handling missing data.
-    final_df._data = final_df._data.downcast(dtypes='infer')
+    # cast BOOLEAN and INTEGER columns from object to bool/int
+    # if they don't have any nulls
+    type_map = {'BOOLEAN': bool, 'INTEGER': int}
+    for field in schema['fields']:
+        if field['type'] in type_map and \
+                final_df[field['name']].notnull().all():
+            final_df[field['name']] = \
+                final_df[field['name']].astype(type_map[field['type']])

     connector.print_elapsed_seconds(
         'Total time taken',
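
The new loop replaces the blanket ``final_df._data.downcast(dtypes='infer')`` with schema-driven casts. A self-contained sketch of the same logic on a stand-in query result (schema and column names invented for illustration):

    import pandas as pd

    # BOOLEAN and INTEGER columns arrive from the parsing step as dtype=object.
    schema = {'fields': [{'name': 'flag', 'type': 'BOOLEAN'},
                         {'name': 'n', 'type': 'INTEGER'},
                         {'name': 'maybe_n', 'type': 'INTEGER'}]}
    final_df = pd.DataFrame({'flag': pd.Series([True, False], dtype=object),
                             'n': pd.Series([1, 2], dtype=object),
                             'maybe_n': pd.Series([1, None], dtype=object)})

    # Cast back to bool/int only when a column has no NULLs; nullable
    # columns stay dtype=object so large ints survive intact.
    type_map = {'BOOLEAN': bool, 'INTEGER': int}
    for field in schema['fields']:
        name = field['name']
        if field['type'] in type_map and final_df[name].notnull().all():
            final_df[name] = final_df[name].astype(type_map[field['type']])

    print(final_df.dtypes)  # flag -> bool, n -> int64 (platform int), maybe_n -> object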
