
Commit cf19528

barronh authored and shoyer committed
Added PNC backend to xarray (#1905)
* Added PNC backend to xarray. PNC is used for GEOS-Chem, CAMx, CMAQ and other atmospheric data formats that have their own file formats and meta-data conventions. It can provide a CF-compliant netCDF-like interface.
* Added whats-new documentation
* Updating pnc_ to remove DunderArrayMixin dependency
* Adding basic tests for pnc. Right now, pnc is simply being tested as a reader for NetCDF3 files.
* Updating for flake8 compliance
* flake does not like unused e
* Updating pnc to PseudoNetCDF
* Remove outer except
* Added open and updated init, based on shoyer review
* Updated indexing and test fix. Indexing supports #1899.
* Added PseudoNetCDF to doc/io.rst
* Changing test subtype
* Changing test subtype, removing pdb
* pnc test case requires netcdf3only. For now, pnc is only supporting the classic data model.
* Adding backend_kwargs default as dict. This ensures **mapping is possible.
* Upgrading tests to CFEncodedDataTest. Some tests are bypassed: PseudoNetCDF string treatment is not currently compatible with xarray. This will be addressed soon.
* Not currently supporting autoclose. I do not fully understand the use case, so I have not implemented these tests.
* Minor updates for flake8
* Explicit skipping, using pytest.mark.skip to skip unsupported tests
* Removing trailing whitespace from pytest skip
* Adding pip support
* Addressing comments
* Bypassing pickle, mask/scale, and object. These tests cause errors that do not affect desired backend performance.
* Added uamiv test. PseudoNetCDF reads other formats; this adds a test of uamiv to the standard tests for a backend and skips mask/scale, object, and boolean tests.
* Adding support for autoclose: ensure open must be called before accessing variable data
* Adding backend_kwargs to all backends. Most backends currently take no keywords, so an empty dictionary is appropriate.
* Small tweaks to PNC backend
* Remove warning and update whats-new
* Separating install and io pnc doc and updating whats-new
* Fixing line length in test
* Tests now use non-netcdf files
* Removing unknown meta-data netcdf support
* flake8 cleanup
* Using python 2 and 3 compat testing
* Disabling mask_and_scale by default; prevents inadvertent double scaling in PNC formats
* Consistent with PseudoNetCDF 3.0.0; updates in 3.0.1 will fix close in uamiv
* Updating readers and line length
* Adding open_mfdataset test: testing by opening the same file twice and stacking it
* Using conda version of PseudoNetCDF
* Removing xfail for netcdf. Mask and scale with PseudoNetCDF and NetCDF4 is not supported, but not prevented.
* Moving pseudonetcdf to v0.15
* Updating what's new
* Fixing open_dataarray CF options: mask_and_scale is None (diagnosed by open_dataset) and decode_cf should be True
1 parent 4106b94 commit cf19528

11 files changed (+440 −17 lines)

ci/requirements-py36.yml

+1
@@ -20,6 +20,7 @@ dependencies:
   - rasterio
   - bottleneck
   - zarr
+  - pseudonetcdf>=3.0.1
   - pip:
     - coveralls
     - pytest-cov

doc/installing.rst

+5 −2

@@ -28,6 +28,9 @@ For netCDF and IO
 - `cftime <https://unidata.github.io/cftime>`__: recommended if you
   want to encode/decode datetimes for non-standard calendars or dates before
   year 1678 or after year 2262.
+- `PseudoNetCDF <http://github.com/barronh/pseudonetcdf/>`__: recommended
+  for accessing CAMx, GEOS-Chem (bpch), NOAA ARL files, ICARTT files
+  (ffi1001) and many others.

 For accelerating xarray
 ~~~~~~~~~~~~~~~~~~~~~~~
@@ -65,9 +68,9 @@ with its recommended dependencies using the conda command line tool::

 .. _conda: http://conda.io/

-We recommend using the community maintained `conda-forge <https://conda-forge.github.io/>`__ channel if you need difficult\-to\-build dependencies such as cartopy or pynio::
+We recommend using the community maintained `conda-forge <https://conda-forge.github.io/>`__ channel if you need difficult\-to\-build dependencies such as cartopy, pynio or PseudoNetCDF::

-    $ conda install -c conda-forge xarray cartopy pynio
+    $ conda install -c conda-forge xarray cartopy pynio pseudonetcdf

 New releases may also appear in conda-forge before being updated in the default
 channel.

doc/io.rst

+22 −1

@@ -650,7 +650,26 @@ We recommend installing PyNIO via conda::

 .. _PyNIO: https://www.pyngl.ucar.edu/Nio.shtml

-.. _combining multiple files:
+.. _io.PseudoNetCDF:
+
+Formats supported by PseudoNetCDF
+---------------------------------
+
+xarray can also read CAMx, BPCH, ARL PACKED BIT, and many other file
+formats supported by PseudoNetCDF_, if PseudoNetCDF is installed.
+PseudoNetCDF can also provide Climate Forecasting Conventions for
+CMAQ files. In addition, PseudoNetCDF can automatically register custom
+readers that subclass PseudoNetCDF.PseudoNetCDFFile. PseudoNetCDF can
+identify readers heuristically, or the format can be specified via a key
+in `backend_kwargs`.
+
+To use PseudoNetCDF to read such files, supply
+``engine='pseudonetcdf'`` to :py:func:`~xarray.open_dataset`.
+
+Add ``backend_kwargs={'format': '<format name>'}`` where `<format name>`
+options are listed on the PseudoNetCDF page.
+
+.. _PseudoNetCDF: http://github.com/barronh/PseudoNetCDF


 Formats supported by Pandas
@@ -662,6 +681,8 @@ exporting your objects to pandas and using its broad range of `IO tools`_.
 .. _IO tools: http://pandas.pydata.org/pandas-docs/stable/io.html


+.. _combining multiple files:
+

 Combining multiple files
 ------------------------
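A minimal usage sketch of the options documented above (the file name is illustrative, and 'uamiv' is one PseudoNetCDF format name; any valid name from the PseudoNetCDF documentation is passed the same way):

import xarray as xr

# Let PseudoNetCDF identify the reader heuristically:
ds = xr.open_dataset('camx_output.uamiv', engine='pseudonetcdf')

# Or name the format explicitly through backend_kwargs:
ds = xr.open_dataset('camx_output.uamiv', engine='pseudonetcdf',
                     backend_kwargs={'format': 'uamiv'})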

doc/whats-new.rst

+4
@@ -41,6 +41,10 @@ Enhancements
   dask<0.17.4. (related to :issue:`2203`)
   By `Keisuke Fujii <https://github.com/fujiisoup>`_.

+- Added a PseudoNetCDF backend for many atmospheric data formats including
+  GEOS-Chem, CAMx, NOAA ARL packed bit and many others.
+  By `Barron Henderson <https://github.com/barronh>`_.
+
 - :py:meth:`~DataArray.cumsum` and :py:meth:`~DataArray.cumprod` now support
   aggregation over multiple dimensions at the same time. This is the default
   behavior when dimensions are not specified (previously this raised an error).

xarray/backends/__init__.py

+2
@@ -10,6 +10,7 @@
 from .pynio_ import NioDataStore
 from .scipy_ import ScipyDataStore
 from .h5netcdf_ import H5NetCDFStore
+from .pseudonetcdf_ import PseudoNetCDFDataStore
 from .zarr import ZarrStore

 __all__ = [
@@ -21,4 +22,5 @@
     'ScipyDataStore',
     'H5NetCDFStore',
     'ZarrStore',
+    'PseudoNetCDFDataStore',
 ]
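With the import and ``__all__`` entry above, the new store is re-exported at the package level; a trivial smoke test:

# Works once xarray/backends/pseudonetcdf_.py exists (see below).
from xarray.backends import PseudoNetCDFDataStore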

xarray/backends/api.py

+42 −13

@@ -152,9 +152,10 @@ def _finalize_store(write, store):


 def open_dataset(filename_or_obj, group=None, decode_cf=True,
-                 mask_and_scale=True, decode_times=True, autoclose=False,
+                 mask_and_scale=None, decode_times=True, autoclose=False,
                  concat_characters=True, decode_coords=True, engine=None,
-                 chunks=None, lock=None, cache=None, drop_variables=None):
+                 chunks=None, lock=None, cache=None, drop_variables=None,
+                 backend_kwargs=None):
     """Load and decode a dataset from a file or file-like object.

     Parameters
@@ -178,7 +179,8 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
         taken from variable attributes (if they exist). If the `_FillValue` or
         `missing_value` attribute contains multiple values a warning will be
         issued and all array values matching one of the multiple values will
-        be replaced by NA.
+        be replaced by NA. mask_and_scale defaults to True except for the
+        pseudonetcdf backend.
     decode_times : bool, optional
         If True, decode times encoded in the standard NetCDF datetime format
         into datetime objects. Otherwise, leave them encoded as numbers.
@@ -194,7 +196,7 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
     decode_coords : bool, optional
         If True, decode the 'coordinates' attribute to identify coordinates in
         the resulting dataset.
-    engine : {'netcdf4', 'scipy', 'pydap', 'h5netcdf', 'pynio'}, optional
+    engine : {'netcdf4', 'scipy', 'pydap', 'h5netcdf', 'pynio', 'pseudonetcdf'}, optional
         Engine to use when reading files. If not provided, the default engine
         is chosen based on available dependencies, with a preference for
         'netcdf4'.
@@ -219,6 +221,10 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
         A variable or list of variables to exclude from being parsed from the
         dataset. This may be useful to drop variables with problems or
         inconsistent values.
+    backend_kwargs: dictionary, optional
+        A dictionary of keyword arguments to pass on to the backend. This
+        may be useful when backend options would improve performance or
+        allow user control of dataset processing.

     Returns
     -------
@@ -229,6 +235,10 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
     --------
     open_mfdataset
     """
+
+    if mask_and_scale is None:
+        mask_and_scale = not engine == 'pseudonetcdf'
+
     if not decode_cf:
         mask_and_scale = False
         decode_times = False
@@ -238,6 +248,9 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
     if cache is None:
         cache = chunks is None

+    if backend_kwargs is None:
+        backend_kwargs = {}
+
     def maybe_decode_store(store, lock=False):
         ds = conventions.decode_cf(
             store, mask_and_scale=mask_and_scale, decode_times=decode_times,
@@ -303,18 +316,26 @@ def maybe_decode_store(store, lock=False):
         if engine == 'netcdf4':
             store = backends.NetCDF4DataStore.open(filename_or_obj,
                                                    group=group,
-                                                   autoclose=autoclose)
+                                                   autoclose=autoclose,
+                                                   **backend_kwargs)
         elif engine == 'scipy':
             store = backends.ScipyDataStore(filename_or_obj,
-                                            autoclose=autoclose)
+                                            autoclose=autoclose,
+                                            **backend_kwargs)
         elif engine == 'pydap':
-            store = backends.PydapDataStore.open(filename_or_obj)
+            store = backends.PydapDataStore.open(filename_or_obj,
+                                                 **backend_kwargs)
         elif engine == 'h5netcdf':
             store = backends.H5NetCDFStore(filename_or_obj, group=group,
-                                           autoclose=autoclose)
+                                           autoclose=autoclose,
+                                           **backend_kwargs)
         elif engine == 'pynio':
             store = backends.NioDataStore(filename_or_obj,
-                                          autoclose=autoclose)
+                                          autoclose=autoclose,
+                                          **backend_kwargs)
+        elif engine == 'pseudonetcdf':
+            store = backends.PseudoNetCDFDataStore.open(
+                filename_or_obj, autoclose=autoclose, **backend_kwargs)
         else:
             raise ValueError('unrecognized engine for open_dataset: %r'
                              % engine)
@@ -334,9 +355,10 @@ def maybe_decode_store(store, lock=False):


 def open_dataarray(filename_or_obj, group=None, decode_cf=True,
-                   mask_and_scale=True, decode_times=True, autoclose=False,
+                   mask_and_scale=None, decode_times=True, autoclose=False,
                    concat_characters=True, decode_coords=True, engine=None,
-                   chunks=None, lock=None, cache=None, drop_variables=None):
+                   chunks=None, lock=None, cache=None, drop_variables=None,
+                   backend_kwargs=None):
     """Open a DataArray from a netCDF file containing a single data variable.

     This is designed to read netCDF files with only one data variable. If
@@ -363,7 +385,8 @@ def open_dataarray(filename_or_obj, group=None, decode_cf=True,
         taken from variable attributes (if they exist). If the `_FillValue` or
         `missing_value` attribute contains multiple values a warning will be
         issued and all array values matching one of the multiple values will
-        be replaced by NA.
+        be replaced by NA. mask_and_scale defaults to True except for the
+        pseudonetcdf backend.
     decode_times : bool, optional
         If True, decode times encoded in the standard NetCDF datetime format
         into datetime objects. Otherwise, leave them encoded as numbers.
@@ -403,6 +426,10 @@ def open_dataarray(filename_or_obj, group=None, decode_cf=True,
         A variable or list of variables to exclude from being parsed from the
         dataset. This may be useful to drop variables with problems or
         inconsistent values.
+    backend_kwargs: dictionary, optional
+        A dictionary of keyword arguments to pass on to the backend. This
+        may be useful when backend options would improve performance or
+        allow user control of dataset processing.

     Notes
     -----
@@ -417,13 +444,15 @@ def open_dataarray(filename_or_obj, group=None, decode_cf=True,
     --------
     open_dataset
     """
+
     dataset = open_dataset(filename_or_obj, group=group, decode_cf=decode_cf,
                            mask_and_scale=mask_and_scale,
                            decode_times=decode_times, autoclose=autoclose,
                            concat_characters=concat_characters,
                            decode_coords=decode_coords, engine=engine,
                            chunks=chunks, lock=lock, cache=cache,
-                           drop_variables=drop_variables)
+                           drop_variables=drop_variables,
+                           backend_kwargs=backend_kwargs)

     if len(dataset.data_vars) != 1:
         raise ValueError('Given file dataset contains more than one data '
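The ``mask_and_scale=None`` default deserves a closer look: ``open_dataset`` now resolves it per engine, so PseudoNetCDF files, which already apply their own scaling, are not scaled a second time. A standalone sketch of that resolution logic (``_resolve_mask_and_scale`` is a hypothetical name, not part of the diff):

def _resolve_mask_and_scale(mask_and_scale, engine, decode_cf=True):
    # None means "True unless the engine is pseudonetcdf";
    # decode_cf=False switches masking/scaling off entirely.
    if mask_and_scale is None:
        mask_and_scale = engine != 'pseudonetcdf'
    if not decode_cf:
        mask_and_scale = False
    return mask_and_scale

assert _resolve_mask_and_scale(None, 'netcdf4') is True
assert _resolve_mask_and_scale(None, 'pseudonetcdf') is False
assert _resolve_mask_and_scale(True, 'netcdf4', decode_cf=False) is False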

xarray/backends/pseudonetcdf_.py

+101
@@ -0,0 +1,101 @@
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import functools
+
+import numpy as np
+
+from .. import Variable
+from ..core.pycompat import OrderedDict
+from ..core.utils import (FrozenOrderedDict, Frozen)
+from ..core import indexing
+
+from .common import AbstractDataStore, DataStorePickleMixin, BackendArray
+
+
+class PncArrayWrapper(BackendArray):
+
+    def __init__(self, variable_name, datastore):
+        self.datastore = datastore
+        self.variable_name = variable_name
+        array = self.get_array()
+        self.shape = array.shape
+        self.dtype = np.dtype(array.dtype)
+
+    def get_array(self):
+        self.datastore.assert_open()
+        return self.datastore.ds.variables[self.variable_name]
+
+    def __getitem__(self, key):
+        key, np_inds = indexing.decompose_indexer(
+            key, self.shape, indexing.IndexingSupport.OUTER_1VECTOR)
+
+        with self.datastore.ensure_open(autoclose=True):
+            array = self.get_array()[key.tuple]  # index backend array
+
+        if len(np_inds.tuple) > 0:
+            # index the loaded np.ndarray
+            array = indexing.NumpyIndexingAdapter(array)[np_inds]
+        return array
+
+
+class PseudoNetCDFDataStore(AbstractDataStore, DataStorePickleMixin):
+    """Store for accessing datasets via PseudoNetCDF
+    """
+    @classmethod
+    def open(cls, filename, format=None, writer=None,
+             autoclose=False, **format_kwds):
+        from PseudoNetCDF import pncopen
+        opener = functools.partial(pncopen, filename, **format_kwds)
+        ds = opener()
+        mode = format_kwds.get('mode', 'r')
+        return cls(ds, mode=mode, writer=writer, opener=opener,
+                   autoclose=autoclose)
+
+    def __init__(self, pnc_dataset, mode='r', writer=None, opener=None,
+                 autoclose=False):
+
+        if autoclose and opener is None:
+            raise ValueError('autoclose requires an opener')
+
+        self._ds = pnc_dataset
+        self._autoclose = autoclose
+        self._isopen = True
+        self._opener = opener
+        self._mode = mode
+        super(PseudoNetCDFDataStore, self).__init__()
+
+    def open_store_variable(self, name, var):
+        with self.ensure_open(autoclose=False):
+            data = indexing.LazilyOuterIndexedArray(
+                PncArrayWrapper(name, self)
+            )
+        attrs = OrderedDict((k, getattr(var, k)) for k in var.ncattrs())
+        return Variable(var.dimensions, data, attrs)
+
+    def get_variables(self):
+        with self.ensure_open(autoclose=False):
+            return FrozenOrderedDict((k, self.open_store_variable(k, v))
+                                     for k, v in self.ds.variables.items())
+
+    def get_attrs(self):
+        with self.ensure_open(autoclose=True):
+            return Frozen(dict([(k, getattr(self.ds, k))
+                                for k in self.ds.ncattrs()]))
+
+    def get_dimensions(self):
+        with self.ensure_open(autoclose=True):
+            return Frozen(self.ds.dimensions)
+
+    def get_encoding(self):
+        encoding = {}
+        encoding['unlimited_dims'] = set(
+            [k for k in self.ds.dimensions
+             if self.ds.dimensions[k].isunlimited()])
+        return encoding
+
+    def close(self):
+        if self._isopen:
+            self.ds.close()
+        self._isopen = False
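For orientation, here is how the store might be exercised directly, outside of ``open_dataset`` (a sketch; the file name is illustrative, and with no extra keywords ``pncopen`` is assumed to detect the format heuristically):

from xarray.backends import PseudoNetCDFDataStore

# Keywords in **format_kwds are forwarded to pncopen via the
# functools.partial opener built in open().
store = PseudoNetCDFDataStore.open('example.ict', autoclose=True)
try:
    print(store.get_dimensions())
    print(list(store.get_variables()))
finally:
    store.close()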

xarray/tests/__init__.py

+1
@@ -68,6 +68,7 @@ def _importorskip(modname, minversion=None):
 has_netCDF4, requires_netCDF4 = _importorskip('netCDF4')
 has_h5netcdf, requires_h5netcdf = _importorskip('h5netcdf')
 has_pynio, requires_pynio = _importorskip('Nio')
+has_pseudonetcdf, requires_pseudonetcdf = _importorskip('PseudoNetCDF')
 has_cftime, requires_cftime = _importorskip('cftime')
 has_dask, requires_dask = _importorskip('dask')
 has_bottleneck, requires_bottleneck = _importorskip('bottleneck')
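``_importorskip`` returns both a boolean flag and a pytest skip decorator, so tests can be gated on the optional dependency; a hypothetical sketch:

from xarray.tests import requires_pseudonetcdf

@requires_pseudonetcdf
def test_pncopen_importable():
    # Skipped automatically when PseudoNetCDF is not installed.
    from PseudoNetCDF import pncopen
    assert callable(pncopen)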

xarray/tests/data/example.ict

+31
@@ -0,0 +1,31 @@
+27, 1001
+Henderson, Barron
+U.S. EPA
+Example file with artificial data
+JUST_A_TEST
+1, 1
+2018, 04, 27, 2018, 04, 27
+0
+Start_UTC
+7
+1, 1, 1, 1, 1
+-9999, -9999, -9999, -9999, -9999
+lat, degrees_north
+lon, degrees_east
+elev, meters
+TEST_ppbv, ppbv
+TESTM_ppbv, ppbv
+0
+8
+ULOD_FLAG: -7777
+ULOD_VALUE: N/A
+LLOD_FLAG: -8888
+LLOD_VALUE: N/A, N/A, N/A, N/A, 0.025
+OTHER_COMMENTS: www-air.larc.nasa.gov/missions/etc/IcarttDataFormat.htm
+REVISION: R0
+R0: No comments for this revision.
+Start_UTC, lat, lon, elev, TEST_ppbv, TESTM_ppbv
+43200, 41.00000, -71.00000, 5, 1.2345, 2.220
+46800, 42.00000, -72.00000, 15, 2.3456, -9999
+50400, 42.00000, -73.00000, 20, 3.4567, -7777
+50400, 42.00000, -74.00000, 25, 4.5678, -8888
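For context, this fixture follows the ICARTT ffi1001 layout: the first line gives the header line count (27) and format index (1001), and -9999, -7777, and -8888 encode missing, above-ULOD, and below-LLOD values respectively. A sketch of reading it through the new backend (the 'ffi1001' format key follows the installing.rst notes above and is an assumption here):

import xarray as xr

ds = xr.open_dataset('xarray/tests/data/example.ict',
                     engine='pseudonetcdf',
                     backend_kwargs={'format': 'ffi1001'})
# mask_and_scale defaults to False for this backend, so the
# -9999 flags arrive unmasked in TESTM_ppbv.
print(ds['TESTM_ppbv'].values)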

xarray/tests/data/example.uamiv

608 Bytes
Binary file not shown.
