
Commit cf19528

barronh authored and shoyer committed
Added PNC backend to xarray (#1905)
* Added PNC backend to xarray. PNC is used for GEOS-Chem, CAMx, CMAQ and other atmospheric data formats that have their own file formats and meta-data conventions. It can provide a CF-compliant netCDF-like interface.
* Added whats-new documentation
* Updating pnc_ to remove DunderArrayMixin dependency
* Adding basic tests for pnc. Right now, pnc is simply being tested as a reader for NetCDF3 files.
* Updating for flake8 compliance
* flake does not like unused e
* Updating pnc to PseudoNetCDF
* Remove outer except
* Added open and updated init, based on shoyer review
* Updated indexing and test fix. Indexing supports #1899.
* Added PseudoNetCDF to doc/io.rst
* Changing test subtype
* Changing test subtype, removing pdb
* pnc test case requires netcdf3only. For now, pnc is only supporting the classic data model.
* Adding backend_kwargs default as dict. This ensures **mapping is possible.
* Upgrading tests to CFEncodedDataTest. Some tests are bypassed: PseudoNetCDF string treatment is not currently compatible with xarray. This will be addressed soon.
* Not currently supporting autoclose. I do not fully understand the use case, so I have not implemented these tests.
* Minor updates for flake8
* Explicit skipping, using pytest.mark.skip to skip unsupported tests
* Removing trailing whitespace from pytest skip
* Adding pip support
* Addressing comments
* Bypassing pickle, mask/scale, and object. These tests cause errors that do not affect desired backend performance.
* Added uamiv test. PseudoNetCDF reads other formats; this adds a test of uamiv to the standard tests for a backend and skips mask/scale, object, and boolean tests.
* Adding support for autoclose: ensure open must be called before accessing variable data
* Adding backend_kwargs to all backends. Most backends currently take no keywords, so an empty dictionary is appropriate.
* Small tweaks to PNC backend
* Remove warning and update whats-new
* Separating install and io pnc doc and updating whats-new
* Fixing line length in test
* Tests now use non-netcdf files
* Removing unknown meta-data netcdf support
* flake8 cleanup
* Using python 2 and 3 compat testing
* Disabling mask_and_scale by default; prevents inadvertent double scaling in PNC formats
* Consistent with PseudoNetCDF 3.0.0; updates in 3.0.1 will fix close in uamiv
* Updating readers and line length
* Adding open_mfdataset test: testing by opening the same file twice and stacking it
* Using conda version of PseudoNetCDF
* Removing xfail for netcdf. Mask and scale with PseudoNetCDF and NetCDF4 is not supported, but not prevented.
* Moving pseudonetcdf to v0.15
* Updating what's new
* Fixing open_dataarray CF options: mask_and_scale is None (diagnosed by open_dataset) and decode_cf should be True
1 parent 4106b94 commit cf19528

11 files changed (+440 −17 lines)

ci/requirements-py36.yml

+1
@@ -20,6 +20,7 @@ dependencies:
   - rasterio
   - bottleneck
   - zarr
+  - pseudonetcdf>=3.0.1
   - pip:
     - coveralls
     - pytest-cov

doc/installing.rst

+5 −2

@@ -28,6 +28,9 @@ For netCDF and IO
 - `cftime <https://unidata.github.io/cftime>`__: recommended if you
   want to encode/decode datetimes for non-standard calendars or dates before
   year 1678 or after year 2262.
+- `PseudoNetCDF <http://github.com/barronh/pseudonetcdf/>`__: recommended
+  for accessing CAMx, GEOS-Chem (bpch), NOAA ARL files, ICARTT files
+  (ffi1001) and many others.

 For accelerating xarray
 ~~~~~~~~~~~~~~~~~~~~~~~
@@ -65,9 +68,9 @@ with its recommended dependencies using the conda command line tool::

 .. _conda: http://conda.io/

-We recommend using the community maintained `conda-forge <https://conda-forge.github.io/>`__ channel if you need difficult\-to\-build dependencies such as cartopy or pynio::
+We recommend using the community maintained `conda-forge <https://conda-forge.github.io/>`__ channel if you need difficult\-to\-build dependencies such as cartopy, pynio or PseudoNetCDF::

-    $ conda install -c conda-forge xarray cartopy pynio
+    $ conda install -c conda-forge xarray cartopy pynio pseudonetcdf

 New releases may also appear in conda-forge before being updated in the default
 channel.

doc/io.rst

+22 −1

@@ -650,7 +650,26 @@ We recommend installing PyNIO via conda::

 .. _PyNIO: https://www.pyngl.ucar.edu/Nio.shtml

-.. _combining multiple files:
+.. _io.PseudoNetCDF:
+
+Formats supported by PseudoNetCDF
+---------------------------------
+
+xarray can also read CAMx, BPCH, ARL PACKED BIT, and many other file
+formats supported by PseudoNetCDF_, if PseudoNetCDF is installed.
+PseudoNetCDF can also provide Climate Forecasting Conventions for
+CMAQ files. In addition, PseudoNetCDF can automatically register custom
+readers that subclass PseudoNetCDF.PseudoNetCDFFile. PseudoNetCDF can
+identify readers heuristically, or the format can be specified via a key
+in `backend_kwargs`.
+
+To use PseudoNetCDF to read such files, supply
+``engine='pseudonetcdf'`` to :py:func:`~xarray.open_dataset`.
+
+Add ``backend_kwargs={'format': '<format name>'}`` where `<format name>`
+options are listed on the PseudoNetCDF page.
+
+.. _PseudoNetCDF: http://github.com/barronh/PseudoNetCDF


 Formats supported by Pandas
@@ -662,6 +681,8 @@ exporting your objects to pandas and using its broad range of `IO tools`_.
 .. _IO tools: http://pandas.pydata.org/pandas-docs/stable/io.html


+.. _combining multiple files:
+

 Combining multiple files
 ------------------------
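A minimal usage sketch of the options documented above (the file name is illustrative, and 'uamiv' is one PseudoNetCDF format name; any valid name from the PseudoNetCDF documentation is passed the same way):

import xarray as xr

# Let PseudoNetCDF identify the reader heuristically:
ds = xr.open_dataset('camx_output.uamiv', engine='pseudonetcdf')

# Or name the format explicitly through backend_kwargs:
ds = xr.open_dataset('camx_output.uamiv', engine='pseudonetcdf',
                     backend_kwargs={'format': 'uamiv'})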

doc/whats-new.rst

+4
@@ -41,6 +41,10 @@ Enhancements
   dask<0.17.4. (related to :issue:`2203`)
   By `Keisuke Fujii <https://github.com/fujiisoup>`_.

+- Added a PseudoNetCDF backend for many atmospheric data formats including
+  GEOS-Chem, CAMx, NOAA ARL packed bit and many others.
+  By `Barron Henderson <https://github.com/barronh>`_.
+
 - :py:meth:`~DataArray.cumsum` and :py:meth:`~DataArray.cumprod` now support
   aggregation over multiple dimensions at the same time. This is the default
   behavior when dimensions are not specified (previously this raised an error).

xarray/backends/__init__.py

+2
@@ -10,6 +10,7 @@
 from .pynio_ import NioDataStore
 from .scipy_ import ScipyDataStore
 from .h5netcdf_ import H5NetCDFStore
+from .pseudonetcdf_ import PseudoNetCDFDataStore
 from .zarr import ZarrStore

 __all__ = [
@@ -21,4 +22,5 @@
     'ScipyDataStore',
     'H5NetCDFStore',
     'ZarrStore',
+    'PseudoNetCDFDataStore',
 ]
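With the import and ``__all__`` entry above, the new store is re-exported at the package level; a trivial smoke test:

# Works once xarray/backends/pseudonetcdf_.py exists (see below).
from xarray.backends import PseudoNetCDFDataStore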

xarray/backends/api.py

+42 −13

@@ -152,9 +152,10 @@ def _finalize_store(write, store):


 def open_dataset(filename_or_obj, group=None, decode_cf=True,
-                 mask_and_scale=True, decode_times=True, autoclose=False,
+                 mask_and_scale=None, decode_times=True, autoclose=False,
                  concat_characters=True, decode_coords=True, engine=None,
-                 chunks=None, lock=None, cache=None, drop_variables=None):
+                 chunks=None, lock=None, cache=None, drop_variables=None,
+                 backend_kwargs=None):
     """Load and decode a dataset from a file or file-like object.

     Parameters
@@ -178,7 +179,8 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
         taken from variable attributes (if they exist). If the `_FillValue` or
         `missing_value` attribute contains multiple values a warning will be
         issued and all array values matching one of the multiple values will
-        be replaced by NA.
+        be replaced by NA. mask_and_scale defaults to True except for the
+        pseudonetcdf backend.
     decode_times : bool, optional
         If True, decode times encoded in the standard NetCDF datetime format
         into datetime objects. Otherwise, leave them encoded as numbers.
@@ -194,7 +196,7 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
     decode_coords : bool, optional
         If True, decode the 'coordinates' attribute to identify coordinates in
         the resulting dataset.
-    engine : {'netcdf4', 'scipy', 'pydap', 'h5netcdf', 'pynio'}, optional
+    engine : {'netcdf4', 'scipy', 'pydap', 'h5netcdf', 'pynio', 'pseudonetcdf'}, optional
         Engine to use when reading files. If not provided, the default engine
         is chosen based on available dependencies, with a preference for
         'netcdf4'.
@@ -219,6 +221,10 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
         A variable or list of variables to exclude from being parsed from the
         dataset. This may be useful to drop variables with problems or
         inconsistent values.
+    backend_kwargs: dictionary, optional
+        A dictionary of keyword arguments to pass on to the backend. This
+        may be useful when backend options would improve performance or
+        allow user control of dataset processing.

     Returns
     -------
@@ -229,6 +235,10 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
     --------
     open_mfdataset
     """
+
+    if mask_and_scale is None:
+        mask_and_scale = not engine == 'pseudonetcdf'
+
     if not decode_cf:
         mask_and_scale = False
         decode_times = False
@@ -238,6 +248,9 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
     if cache is None:
         cache = chunks is None

+    if backend_kwargs is None:
+        backend_kwargs = {}
+
     def maybe_decode_store(store, lock=False):
         ds = conventions.decode_cf(
             store, mask_and_scale=mask_and_scale, decode_times=decode_times,
@@ -303,18 +316,26 @@ def maybe_decode_store(store, lock=False):
         if engine == 'netcdf4':
             store = backends.NetCDF4DataStore.open(filename_or_obj,
                                                    group=group,
-                                                   autoclose=autoclose)
+                                                   autoclose=autoclose,
+                                                   **backend_kwargs)
         elif engine == 'scipy':
             store = backends.ScipyDataStore(filename_or_obj,
-                                            autoclose=autoclose)
+                                            autoclose=autoclose,
+                                            **backend_kwargs)
         elif engine == 'pydap':
-            store = backends.PydapDataStore.open(filename_or_obj)
+            store = backends.PydapDataStore.open(filename_or_obj,
+                                                 **backend_kwargs)
         elif engine == 'h5netcdf':
             store = backends.H5NetCDFStore(filename_or_obj, group=group,
-                                           autoclose=autoclose)
+                                           autoclose=autoclose,
+                                           **backend_kwargs)
         elif engine == 'pynio':
             store = backends.NioDataStore(filename_or_obj,
-                                          autoclose=autoclose)
+                                          autoclose=autoclose,
+                                          **backend_kwargs)
+        elif engine == 'pseudonetcdf':
+            store = backends.PseudoNetCDFDataStore.open(
+                filename_or_obj, autoclose=autoclose, **backend_kwargs)
         else:
             raise ValueError('unrecognized engine for open_dataset: %r'
                              % engine)
@@ -334,9 +355,10 @@ def maybe_decode_store(store, lock=False):


 def open_dataarray(filename_or_obj, group=None, decode_cf=True,
-                   mask_and_scale=True, decode_times=True, autoclose=False,
+                   mask_and_scale=None, decode_times=True, autoclose=False,
                    concat_characters=True, decode_coords=True, engine=None,
-                   chunks=None, lock=None, cache=None, drop_variables=None):
+                   chunks=None, lock=None, cache=None, drop_variables=None,
+                   backend_kwargs=None):
     """Open a DataArray from a netCDF file containing a single data variable.

     This is designed to read netCDF files with only one data variable. If
@@ -363,7 +385,8 @@ def open_dataarray(filename_or_obj, group=None, decode_cf=True,
         taken from variable attributes (if they exist). If the `_FillValue` or
         `missing_value` attribute contains multiple values a warning will be
         issued and all array values matching one of the multiple values will
-        be replaced by NA.
+        be replaced by NA. mask_and_scale defaults to True except for the
+        pseudonetcdf backend.
     decode_times : bool, optional
         If True, decode times encoded in the standard NetCDF datetime format
         into datetime objects. Otherwise, leave them encoded as numbers.
@@ -403,6 +426,10 @@ def open_dataarray(filename_or_obj, group=None, decode_cf=True,
         A variable or list of variables to exclude from being parsed from the
         dataset. This may be useful to drop variables with problems or
         inconsistent values.
+    backend_kwargs: dictionary, optional
+        A dictionary of keyword arguments to pass on to the backend. This
+        may be useful when backend options would improve performance or
+        allow user control of dataset processing.

     Notes
     -----
@@ -417,13 +444,15 @@ def open_dataarray(filename_or_obj, group=None, decode_cf=True,
     --------
     open_dataset
     """
+
     dataset = open_dataset(filename_or_obj, group=group, decode_cf=decode_cf,
                            mask_and_scale=mask_and_scale,
                            decode_times=decode_times, autoclose=autoclose,
                            concat_characters=concat_characters,
                            decode_coords=decode_coords, engine=engine,
                            chunks=chunks, lock=lock, cache=cache,
-                           drop_variables=drop_variables)
+                           drop_variables=drop_variables,
+                           backend_kwargs=backend_kwargs)

     if len(dataset.data_vars) != 1:
         raise ValueError('Given file dataset contains more than one data '
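The ``mask_and_scale=None`` default deserves a closer look: ``open_dataset`` now resolves it per engine, so PseudoNetCDF files, which already apply their own scaling, are not scaled a second time. A standalone sketch of that resolution logic (``_resolve_mask_and_scale`` is a hypothetical name, not part of the diff):

def _resolve_mask_and_scale(mask_and_scale, engine, decode_cf=True):
    # None means "True unless the engine is pseudonetcdf";
    # decode_cf=False switches masking/scaling off entirely.
    if mask_and_scale is None:
        mask_and_scale = engine != 'pseudonetcdf'
    if not decode_cf:
        mask_and_scale = False
    return mask_and_scale

assert _resolve_mask_and_scale(None, 'netcdf4') is True
assert _resolve_mask_and_scale(None, 'pseudonetcdf') is False
assert _resolve_mask_and_scale(True, 'netcdf4', decode_cf=False) is False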

xarray/backends/pseudonetcdf_.py

+101
@@ -0,0 +1,101 @@
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import functools
+
+import numpy as np
+
+from .. import Variable
+from ..core.pycompat import OrderedDict
+from ..core.utils import (FrozenOrderedDict, Frozen)
+from ..core import indexing
+
+from .common import AbstractDataStore, DataStorePickleMixin, BackendArray
+
+
+class PncArrayWrapper(BackendArray):
+
+    def __init__(self, variable_name, datastore):
+        self.datastore = datastore
+        self.variable_name = variable_name
+        array = self.get_array()
+        self.shape = array.shape
+        self.dtype = np.dtype(array.dtype)
+
+    def get_array(self):
+        self.datastore.assert_open()
+        return self.datastore.ds.variables[self.variable_name]
+
+    def __getitem__(self, key):
+        key, np_inds = indexing.decompose_indexer(
+            key, self.shape, indexing.IndexingSupport.OUTER_1VECTOR)
+
+        with self.datastore.ensure_open(autoclose=True):
+            array = self.get_array()[key.tuple]  # index backend array
+
+        if len(np_inds.tuple) > 0:
+            # index the loaded np.ndarray
+            array = indexing.NumpyIndexingAdapter(array)[np_inds]
+        return array
+
+
+class PseudoNetCDFDataStore(AbstractDataStore, DataStorePickleMixin):
+    """Store for accessing datasets via PseudoNetCDF
+    """
+    @classmethod
+    def open(cls, filename, format=None, writer=None,
+             autoclose=False, **format_kwds):
+        from PseudoNetCDF import pncopen
+        opener = functools.partial(pncopen, filename, **format_kwds)
+        ds = opener()
+        mode = format_kwds.get('mode', 'r')
+        return cls(ds, mode=mode, writer=writer, opener=opener,
+                   autoclose=autoclose)
+
+    def __init__(self, pnc_dataset, mode='r', writer=None, opener=None,
+                 autoclose=False):
+
+        if autoclose and opener is None:
+            raise ValueError('autoclose requires an opener')
+
+        self._ds = pnc_dataset
+        self._autoclose = autoclose
+        self._isopen = True
+        self._opener = opener
+        self._mode = mode
+        super(PseudoNetCDFDataStore, self).__init__()
+
+    def open_store_variable(self, name, var):
+        with self.ensure_open(autoclose=False):
+            data = indexing.LazilyOuterIndexedArray(
+                PncArrayWrapper(name, self)
+            )
+        attrs = OrderedDict((k, getattr(var, k)) for k in var.ncattrs())
+        return Variable(var.dimensions, data, attrs)
+
+    def get_variables(self):
+        with self.ensure_open(autoclose=False):
+            return FrozenOrderedDict((k, self.open_store_variable(k, v))
+                                     for k, v in self.ds.variables.items())
+
+    def get_attrs(self):
+        with self.ensure_open(autoclose=True):
+            return Frozen(dict([(k, getattr(self.ds, k))
+                                for k in self.ds.ncattrs()]))
+
+    def get_dimensions(self):
+        with self.ensure_open(autoclose=True):
+            return Frozen(self.ds.dimensions)
+
+    def get_encoding(self):
+        encoding = {}
+        encoding['unlimited_dims'] = set(
+            [k for k in self.ds.dimensions
+             if self.ds.dimensions[k].isunlimited()])
+        return encoding
+
+    def close(self):
+        if self._isopen:
+            self.ds.close()
+        self._isopen = False
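For orientation, here is how the store might be exercised directly, outside of ``open_dataset`` (a sketch; the file name is illustrative, and with no extra keywords ``pncopen`` is assumed to detect the format heuristically):

from xarray.backends import PseudoNetCDFDataStore

# Keywords in **format_kwds are forwarded to pncopen via the
# functools.partial opener built in open().
store = PseudoNetCDFDataStore.open('example.ict', autoclose=True)
try:
    print(store.get_dimensions())
    print(list(store.get_variables()))
finally:
    store.close()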

xarray/tests/__init__.py

+1
@@ -68,6 +68,7 @@ def _importorskip(modname, minversion=None):
 has_netCDF4, requires_netCDF4 = _importorskip('netCDF4')
 has_h5netcdf, requires_h5netcdf = _importorskip('h5netcdf')
 has_pynio, requires_pynio = _importorskip('Nio')
+has_pseudonetcdf, requires_pseudonetcdf = _importorskip('PseudoNetCDF')
 has_cftime, requires_cftime = _importorskip('cftime')
 has_dask, requires_dask = _importorskip('dask')
 has_bottleneck, requires_bottleneck = _importorskip('bottleneck')
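``_importorskip`` returns both a boolean flag and a pytest skip decorator, so tests can be gated on the optional dependency; a hypothetical sketch:

from xarray.tests import requires_pseudonetcdf

@requires_pseudonetcdf
def test_pncopen_importable():
    # Skipped automatically when PseudoNetCDF is not installed.
    from PseudoNetCDF import pncopen
    assert callable(pncopen)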

xarray/tests/data/example.ict

+31
@@ -0,0 +1,31 @@
+27, 1001
+Henderson, Barron
+U.S. EPA
+Example file with artificial data
+JUST_A_TEST
+1, 1
+2018, 04, 27, 2018, 04, 27
+0
+Start_UTC
+7
+1, 1, 1, 1, 1
+-9999, -9999, -9999, -9999, -9999
+lat, degrees_north
+lon, degrees_east
+elev, meters
+TEST_ppbv, ppbv
+TESTM_ppbv, ppbv
+0
+8
+ULOD_FLAG: -7777
+ULOD_VALUE: N/A
+LLOD_FLAG: -8888
+LLOD_VALUE: N/A, N/A, N/A, N/A, 0.025
+OTHER_COMMENTS: www-air.larc.nasa.gov/missions/etc/IcarttDataFormat.htm
+REVISION: R0
+R0: No comments for this revision.
+Start_UTC, lat, lon, elev, TEST_ppbv, TESTM_ppbv
+43200, 41.00000, -71.00000, 5, 1.2345, 2.220
+46800, 42.00000, -72.00000, 15, 2.3456, -9999
+50400, 42.00000, -73.00000, 20, 3.4567, -7777
+50400, 42.00000, -74.00000, 25, 4.5678, -8888
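For context, this fixture follows the ICARTT ffi1001 layout: the first line gives the header line count (27) and format index (1001), and -9999, -7777, and -8888 encode missing, above-ULOD, and below-LLOD values respectively. A sketch of reading it through the new backend (the 'ffi1001' format key follows the installing.rst notes above and is an assumption here):

import xarray as xr

ds = xr.open_dataset('xarray/tests/data/example.ict',
                     engine='pseudonetcdf',
                     backend_kwargs={'format': 'ffi1001'})
# mask_and_scale defaults to False for this backend, so the
# -9999 flags arrive unmasked in TESTM_ppbv.
print(ds['TESTM_ppbv'].values)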

xarray/tests/data/example.uamiv

608 Bytes
Binary file not shown.
