Commit 779b6f0

Merge branch 'master' into bug_delta_time
2 parents 9577408 + 5b88d2f commit 779b6f0

66 files changed: +779 -396 lines

Diff for: .github/PULL_REQUEST_TEMPLATE.md (+1 -1)

@@ -1,4 +1,4 @@
 - [ ] closes #xxxx
 - [ ] tests added / passed
-- [ ] passes ``git diff upstream/master --name-only -- '*.py' | flake8 --diff``
+- [ ] passes ``git diff upstream/master --name-only -- '*.py' | flake8 --diff`` (On Windows, ``git diff upstream/master -u -- "*.py" | flake8 --diff`` might work as an alternative.)
 - [ ] whatsnew entry

Diff for: asv_bench/benchmarks/hdfstore_bench.py (+9)

@@ -90,6 +90,15 @@ def time_query_store_table(self):
         stop = self.df2.index[15000]
         self.store.select('table', where="index > start and index < stop")
 
+    def time_store_repr(self):
+        repr(self.store)
+
+    def time_store_str(self):
+        str(self.store)
+
+    def time_store_info(self):
+        self.store.info()
+
 
 class HDF5Panel(object):
     goal_time = 0.2
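
The three benchmarks added above time ``HDFStore``'s string-rendering paths
(``repr``, ``str``, and the ``HDFStore.info()`` method this commit also
documents in api.rst). A minimal sketch of the calls being timed, using a
hypothetical throwaway store in place of the benchmark's ``self.store``
fixture::

    import numpy as np
    import pandas as pd

    # Hypothetical stand-in for the benchmark fixture; requires PyTables.
    store = pd.HDFStore('demo.h5', mode='w')
    store.put('df', pd.DataFrame(np.random.randn(10, 3), columns=list('abc')))

    repr(store)   # exercised by time_store_repr
    str(store)    # exercised by time_store_str
    store.info()  # exercised by time_store_info; detailed per-key summary

    store.close()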

Diff for: ci/requirements-2.7.build (+1 -1)

@@ -2,5 +2,5 @@ python=2.7*
 python-dateutil=2.4.1
 pytz=2013b
 nomkl
-numpy=1.12*
+numpy
 cython=0.23

Diff for: ci/requirements-2.7.sh (+1 -1)

@@ -4,4 +4,4 @@ source activate pandas
 
 echo "install 27"
 
-conda install -n pandas -c conda-forge feather-format
+conda install -n pandas -c conda-forge feather-format jemalloc=4.4.0

Diff for: ci/requirements-2.7_BUILD_TEST.sh (+1 -1)

@@ -4,4 +4,4 @@ source activate pandas
 
 echo "install 27 BUILD_TEST"
 
-conda install -n pandas -c conda-forge pyarrow dask
+conda install -n pandas -c conda-forge pyarrow dask jemalloc=4.4.0

Diff for: ci/requirements-3.5.sh (+1 -1)

@@ -4,4 +4,4 @@ source activate pandas
 
 echo "install 35"
 
-conda install -n pandas -c conda-forge feather-format
+conda install -n pandas -c conda-forge feather-format jemalloc=4.4.0

Diff for: ci/requirements-3.6.build (+1 -1)

@@ -2,5 +2,5 @@ python=3.6*
 python-dateutil
 pytz
 nomkl
-numpy=1.12*
+numpy
 cython

Diff for: ci/requirements-3.6.run (+1)

@@ -14,6 +14,7 @@ html5lib
 jinja2
 sqlalchemy
 pymysql
+jemalloc=4.4.0
 feather-format
 # psycopg2 (not avail on defaults ATM)
 beautifulsoup4

Diff for: ci/requirements-3.6_DOC.run (+1 -1)

@@ -1,7 +1,7 @@
 ipython
 ipykernel
 ipywidgets
-sphinx
+sphinx=1.5*
 nbconvert
 nbformat
 notebook

Diff for: ci/requirements-3.6_DOC.sh (+1 -1)

@@ -6,6 +6,6 @@ echo "[install DOC_BUILD deps]"
 
 pip install pandas-gbq
 
-conda install -n pandas -c conda-forge feather-format nbsphinx pandoc
+conda install -n pandas -c conda-forge feather-format nbsphinx pandoc jemalloc=4.4.0
 
 conda install -n pandas -c r r rpy2 --yes

Diff for: ci/requirements-3.6_NUMPY_DEV.build (-1)

@@ -1,4 +1,3 @@
 python=3.6*
-python-dateutil
 pytz
 cython

Diff for: ci/requirements-3.6_NUMPY_DEV.build.sh (+3)

@@ -11,4 +11,7 @@ pip uninstall numpy -y
 PRE_WHEELS="https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com"
 pip install --pre --upgrade --timeout=60 -f $PRE_WHEELS numpy scipy
 
+# install dateutil from master
+pip install -U git+git://github.com/dateutil/dateutil.git
+
 true

Diff for: doc/source/api.rst (+1)

@@ -99,6 +99,7 @@ HDFStore: PyTables (HDF5)
    HDFStore.append
    HDFStore.get
    HDFStore.select
+   HDFStore.info
 
 Feather
 ~~~~~~~

Diff for: doc/source/contributing.rst (+6)

@@ -525,6 +525,12 @@ run this slightly modified command::
 
     git diff master --name-only -- '*.py' | grep 'pandas/' | xargs flake8
 
+Note that on Windows, ``grep``, ``xargs``, and other tools are likely
+unavailable. However, this has been shown to work on smaller commits in the
+standard Windows command line::
+
+    git diff master -u -- "*.py" | flake8 --diff
+
 Backwards Compatibility
 ~~~~~~~~~~~~~~~~~~~~~~~

Diff for: doc/source/ecosystem.rst (+11)

@@ -239,3 +239,14 @@ pandas own ``read_csv`` for CSV IO and leverages many existing packages such as
 PyTables, h5py, and pymongo to move data between non pandas formats. Its graph
 based approach is also extensible by end users for custom formats that may be
 too specific for the core of odo.
+
+.. _ecosystem.data_validation:
+
+Data validation
+---------------
+
+`Engarde <http://engarde.readthedocs.io/en/latest/>`__
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Engarde is a lightweight library used to explicitly state your assumptions about your datasets
+and check that they're *actually* true.
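
For context, engarde expresses those dataset assumptions as decorators on
DataFrame-returning functions. A minimal sketch in the style of engarde's
documentation (the input data and transformation are made up)::

    import pandas as pd
    import engarde.decorators as ed

    @ed.none_missing()   # assert: the returned frame contains no NaNs
    @ed.unique_index()   # assert: the returned frame's index is unique
    def clean(df):
        # made-up transformation for illustration
        return df.dropna()

    clean(pd.DataFrame({'price': [1.0, 2.5, None]}))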

Diff for: doc/source/groupby.rst (+2 -2)

@@ -1200,14 +1200,14 @@ Regroup columns of a DataFrame according to their sum, and sum the aggregated on
     df
     df.groupby(df.sum(), axis=1).sum()
 
-.. _groupby.multicolumn_factorization
+.. _groupby.multicolumn_factorization:
 
 Multi-column factorization
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 By using ``.ngroup()``, we can extract information about the groups in
 a way similar to :func:`factorize` (as described further in the
-:ref:`reshaping API <reshaping.factorization>`) but which applies
+:ref:`reshaping API <reshaping.factorize>`) but which applies
 naturally to multiple columns of mixed type and different
 sources. This can be useful as an intermediate categorical-like step
 in processing, when the relationships between the group rows are more
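
The ``.ngroup()`` method documented in this hunk assigns each row the number of
its group, in order of first appearance, over any number of grouping columns.
A small illustration with made-up data::

    import pandas as pd

    df = pd.DataFrame({'A': list('aabba'), 'B': [1, 1, 2, 2, 3]})

    # One integer label per row; rows in the same (A, B) group share a label,
    # much like factorize() but over multiple columns of mixed type.
    df.groupby(['A', 'B']).ngroup()  # -> 0, 0, 1, 1, 2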

Diff for: doc/source/io.rst (+92 -56)

@@ -137,8 +137,10 @@ usecols : array-like or callable, default ``None``
 
   Using this parameter results in much faster parsing time and lower memory usage.
 as_recarray : boolean, default ``False``
-  DEPRECATED: this argument will be removed in a future version. Please call
-  ``pd.read_csv(...).to_records()`` instead.
+
+  .. deprecated:: 0.18.2
+
+     Please call ``pd.read_csv(...).to_records()`` instead.
 
   Return a NumPy recarray instead of a DataFrame after parsing the data. If
   set to ``True``, this option takes precedence over the ``squeeze`` parameter.

@@ -191,7 +193,11 @@ skiprows : list-like or integer, default ``None``
 skipfooter : int, default ``0``
   Number of lines at bottom of file to skip (unsupported with engine='c').
 skip_footer : int, default ``0``
-  DEPRECATED: use the ``skipfooter`` parameter instead, as they are identical
+
+  .. deprecated:: 0.19.0
+
+     Use the ``skipfooter`` parameter instead, as they are identical
+
 nrows : int, default ``None``
   Number of rows of file to read. Useful for reading pieces of large files.
 low_memory : boolean, default ``True``

@@ -202,16 +208,25 @@ low_memory : boolean, default ``True``
   use the ``chunksize`` or ``iterator`` parameter to return the data in chunks.
   (Only valid with C parser)
 buffer_lines : int, default None
-  DEPRECATED: this argument will be removed in a future version because its
-  value is not respected by the parser
+
+  .. deprecated:: 0.19.0
+
+     Argument removed because its value is not respected by the parser
+
 compact_ints : boolean, default False
-  DEPRECATED: this argument will be removed in a future version
+
+  .. deprecated:: 0.19.0
+
+     Argument moved to ``pd.to_numeric``
 
   If ``compact_ints`` is ``True``, then for any column that is of integer dtype, the
   parser will attempt to cast it as the smallest integer ``dtype`` possible, either
   signed or unsigned depending on the specification from the ``use_unsigned`` parameter.
 use_unsigned : boolean, default False
-  DEPRECATED: this argument will be removed in a future version
+
+  .. deprecated:: 0.18.2
+
+     Argument moved to ``pd.to_numeric``
 
   If integer columns are being compacted (i.e. ``compact_ints=True``), specify whether
   the column should be compacted to the smallest signed or unsigned integer dtype.

@@ -225,9 +240,9 @@ NA and Missing Data Handling
 
 na_values : scalar, str, list-like, or dict, default ``None``
   Additional strings to recognize as NA/NaN. If dict passed, specific per-column
-  NA values. By default the following values are interpreted as NaN:
-  ``'-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA',
-  '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''``.
+  NA values. See :ref:`na values const <io.navaluesconst>` below
+  for a list of the values interpreted as NaN by default.
+
 keep_default_na : boolean, default ``True``
   If na_values are specified and keep_default_na is ``False`` the default NaN
   values are overridden, otherwise they're appended to.

@@ -712,6 +727,16 @@ index column inference and discard the last column, pass ``index_col=False``:
     pd.read_csv(StringIO(data))
     pd.read_csv(StringIO(data), index_col=False)
 
+If a subset of data is being parsed using the ``usecols`` option, the
+``index_col`` specification is based on that subset, not the original data.
+
+.. ipython:: python
+
+   data = 'a,b,c\n4,apple,bat,\n8,orange,cow,'
+   print(data)
+   pd.read_csv(StringIO(data), usecols=['b', 'c'])
+   pd.read_csv(StringIO(data), usecols=['b', 'c'], index_col=0)
+
 .. _io.parse_dates:
 
 Date Handling

@@ -1020,10 +1045,11 @@ the corresponding equivalent values will also imply a missing value (in this cas
 ``[5.0,5]`` are recognized as ``NaN``.
 
 To completely override the default values that are recognized as missing, specify ``keep_default_na=False``.
-The default ``NaN`` recognized values are ``['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A','N/A', 'NA',
-'#NA', 'NULL', 'NaN', '-NaN', 'nan', '-nan']``. Although a 0-length string
-``''`` is not included in the default ``NaN`` values list, it is still treated
-as a missing value.
+
+.. _io.navaluesconst:
+
+The default ``NaN`` recognized values are ``['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A',
+'n/a', 'NA', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', '']``.
 
 .. code-block:: python

@@ -3396,7 +3422,7 @@ Fixed Format
 This was prior to 0.13.0 the ``Storer`` format.
 
 The examples above show storing using ``put``, which write the HDF5 to ``PyTables`` in a fixed array format, called
-the ``fixed`` format. These types of stores are are **not** appendable once written (though you can simply
+the ``fixed`` format. These types of stores are **not** appendable once written (though you can simply
 remove them and rewrite). Nor are they **queryable**; they must be
 retrieved in their entirety. They also do not support dataframes with non-unique column names.
 The ``fixed`` format stores offer very fast writing and slightly faster reading than ``table`` stores.

@@ -4056,26 +4082,64 @@ Compression
 +++++++++++
 
 ``PyTables`` allows the stored data to be compressed. This applies to
-all kinds of stores, not just tables.
+all kinds of stores, not just tables. Two parameters are used to
+control compression: ``complevel`` and ``complib``.
+
+``complevel`` specifies if and how hard data is to be compressed.
+``complevel=0`` and ``complevel=None`` disables
+compression and ``0<complevel<10`` enables compression.
+
+``complib`` specifies which compression library to use. If nothing is
+specified the default library ``zlib`` is used. A
+compression library usually optimizes for either good
+compression rates or speed and the results will depend on
+the type of data. Which type of
+compression to choose depends on your specific needs and
+data. The list of supported compression libraries:
+
+- `zlib <http://zlib.net/>`_: The default compression library. A classic in terms of compression, achieves good compression rates but is somewhat slow.
+- `lzo <http://www.oberhumer.com/opensource/lzo/>`_: Fast compression and decompression.
+- `bzip2 <http://bzip.org/>`_: Good compression rates.
+- `blosc <http://www.blosc.org/>`_: Fast compression and decompression.
+
+.. versionadded:: 0.20.2
+
+   Support for alternative blosc compressors:
+
+   - `blosc:blosclz <http://www.blosc.org/>`_ This is the
+     default compressor for ``blosc``
+   - `blosc:lz4 <https://fastcompression.blogspot.dk/p/lz4.html>`_:
+     A compact, very popular and fast compressor.
+   - `blosc:lz4hc <https://fastcompression.blogspot.dk/p/lz4.html>`_:
+     A tweaked version of LZ4, produces better
+     compression ratios at the expense of speed.
+   - `blosc:snappy <https://google.github.io/snappy/>`_:
+     A popular compressor used in many places.
+   - `blosc:zlib <http://zlib.net/>`_: A classic;
+     somewhat slower than the previous ones, but
+     achieving better compression ratios.
+   - `blosc:zstd <https://facebook.github.io/zstd/>`_: An
+     extremely well balanced codec; it provides the best
+     compression ratios among the others above, and at
+     reasonably fast speed.
+
+If ``complib`` is defined as something other than the
+listed libraries a ``ValueError`` exception is issued.
 
-- Pass ``complevel=int`` for a compression level (1-9, with 0 being no
-  compression, and the default)
-- Pass ``complib=lib`` where lib is any of ``zlib, bzip2, lzo, blosc`` for
-  whichever compression library you prefer.
+.. note::
 
-  ``HDFStore`` will use the file based compression scheme if no overriding
-  ``complib`` or ``complevel`` options are provided. ``blosc`` offers very
-  fast compression, and is my most used. Note that ``lzo`` and ``bzip2``
-  may not be installed (by Python) by default.
+   If the library specified with the ``complib`` option is missing on your platform,
+   compression defaults to ``zlib`` without further ado.
 
-Compression for all objects within the file
+Enable compression for all objects within the file:
 
 .. code-block:: python
 
-   store_compressed = pd.HDFStore('store_compressed.h5', complevel=9, complib='blosc')
+   store_compressed = pd.HDFStore('store_compressed.h5', complevel=9, complib='blosc:blosclz')
 
-Or on-the-fly compression (this only applies to tables). You can turn
-off file compression for a specific table by passing ``complevel=0``
+Or on-the-fly compression (this only applies to tables) in stores where compression is not enabled:
 
 .. code-block:: python

@@ -4410,34 +4474,6 @@ Performance
 `Here <http://stackoverflow.com/questions/14355151/how-to-make-pandas-hdfstore-put-operation-faster/14370190#14370190>`__
 for more information and some solutions.
 
-Experimental
-''''''''''''
-
-HDFStore supports ``Panel4D`` storage.
-
-.. ipython:: python
-   :okwarning:
-
-   wp = pd.Panel(randn(2, 5, 4), items=['Item1', 'Item2'],
-                 major_axis=pd.date_range('1/1/2000', periods=5),
-                 minor_axis=['A', 'B', 'C', 'D'])
-   p4d = pd.Panel4D({ 'l1' : wp })
-   p4d
-   store.append('p4d', p4d)
-   store
-
-These, by default, index the three axes ``items, major_axis,
-minor_axis``. On an ``AppendableTable`` it is possible to setup with the
-first append a different indexing scheme, depending on how you want to
-store your data. Pass the ``axes`` keyword with a list of dimensions
-(currently must by exactly 1 less than the total dimensions of the
-object). This cannot be changed after table creation.
-
-.. ipython:: python
-   :okwarning:
-
-   store.append('p4d2', p4d, axes=['labels', 'major_axis', 'minor_axis'])
-   store.select('p4d2', where='labels=l1 and items=Item1 and minor_axis=A')
 
 .. ipython:: python
    :suppress:
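
Summarizing the compression hunks above: a ``complevel`` between 1 and 9
enables compression, ``complib`` picks the codec (``zlib`` by default), and the
``blosc:*`` variants need pandas 0.20.2+. A short sketch of both spellings,
with made-up file names::

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(100, 3), columns=list('abc'))

    # Whole-file compression, as in the updated io.rst example.
    store = pd.HDFStore('compressed.h5', complevel=9, complib='blosc:blosclz')
    store.put('df', df, format='table')
    store.close()

    # Or per-call compression when writing a table directly.
    df.to_hdf('also_compressed.h5', 'df', format='table',
              complevel=5, complib='zlib')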
