Added PNC backend to xarray #1905
Conversation
PNC is used for GEOS-Chem, CAMx, CMAQ and other atmospheric data formats including NOAA arl packed bit and HYSPLIT related files. The goal is to provide access to many file types that have their own file formats and meta-data conventions. It can provide a CF compliant netCDF-like interface.
P.S. I just noticed my edits to whats-new.rst were not pushed until after my pull request.
First of all -- this is very cool, thanks for putting this together! In the long term, I think we would prefer to move more specialized backends out of xarray proper. But at the current time, it's difficult to do this easily, and the backend interface itself is not entirely stable. So it probably makes sense to add this directly into xarray for now. To merge this into xarray, I'm afraid that an automated test suite of some sort is non-negotiable. Are there example datasets that you can create on the fly from Python?
Right now, pnc is simply being tested as a reader for NetCDF3 files
First, I too quickly tried to fast forward, and clearly some test was failing. I have updated my code to pass all tests with py.test. Before closing this request and opening another, I want to make sure I am clear on the extent of what I should add. I can create binary data from within Python and then read it, but all those tests are in my software package. Duplicating that seems like a bad idea. I have added a NetCDF3Only testcase to test_backends.py and it passes. That doesn't stress the multi-format capabilities of PNC, but, as I've said, all the numerical assertions for the other formats are in my system's test cases. Is the NetCDF3Only test sufficient in this case? Further, below are some simple applications that download my test data for CAMx and GEOS-Chem and plot it. Thanks for the input.
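For illustration, a minimal sketch of such an application, assuming a local CAMx uamiv file at the hypothetical path `outputs/test.uamiv` with an `O3` variable and (time, layer, row, col) dimension order:

```python
import matplotlib.pyplot as plt
from PseudoNetCDF import pncopen

# open a CAMx uamiv file via PNC's netCDF-like interface
camxf = pncopen('outputs/test.uamiv', format='uamiv')
o3 = camxf.variables['O3']

# plot the surface layer at the first time step
plt.pcolormesh(o3[0, 0])
plt.colorbar(label=getattr(o3, 'units', ''))
plt.show()
```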
First, don't close this PR. You can just keep adding commits to your branch. In terms of test coverage, let's start by detailing what functionality we expect the PseudoNetCDF backend to have.
Finally, two things that would really help grease the skids here would be:
Neither of these last two is strictly required, but both are satisfied by all the other xarray-supported backends.
Right. The goal isn't to duplicate your tests, but to provide a meaningful integration test for the xarray backend. Can we read data from PNC into xarray and do everything an xarray user would want to do with it? Testing your API with a netCDF3 file would probably be enough, assuming you have good test coverage internally. We already have a somewhat complete test suite for netCDF data that you could probably hook into, but for reference, the sorts of issues that tend to come up include:
I couldn't find a
My variable objects present a pure numpy array, so they follow numpy indexing precisely with one exception. If the files are actually netCDF4, they have the same limitations as the netCDF4.Variable object.
I have not tested separate processes. In many cases, I use numpy memmap. So that will be the limitation.
Same as numpy, but also has support for the netCDF4 style.
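For concreteness, here is the difference between the two styles in plain numpy (netCDF4 variables index each dimension independently, which `np.ix_` emulates):

```python
import numpy as np

a = np.arange(16).reshape(4, 4)

# numpy-style: paired integer arrays are pointwise indexers
print(a[[0, 2], [1, 3]])          # -> [ 1 11]

# netCDF4-style "orthogonal" indexing: each dimension independently
print(a[np.ix_([0, 2], [1, 3])])  # -> [[ 1  3]
                                  #     [ 9 11]]
```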
I use relative dates following netCDF time conventions. Within my software, there are special functions for translation, but I have seen xarray treat this separately.
I added TravisCI, but haven't looked at Appveyor.
I added a ci/requirements-py36-netcdf4-dev.yml as a part of my TravisCI integration. I am also working on a recipe (like https://github.com/conda-forge/xarray-feedstock).
xarray/backends/pnc_.py (Outdated)

```python
def __init__(self, filename, mode='r', autoclose=False):
    from PseudoNetCDF import pncopen
    try:
        opener = functools.partial(pncopen, filename,
```
For new backend classes, we're trying to move to a style where there is a dedicated classmethod `open` that handles opening a file from scratch, and the backend constructor accepts an already open file object. Take a look at NetCDF4Backend for an example.
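As a rough sketch of that pattern (illustrative names only, not xarray's actual classes):

```python
class ExampleDataStore:
    """Schematic: the constructor takes an already open file object."""

    def __init__(self, pnc_file):
        self.ds = pnc_file

    @classmethod
    def open(cls, filename, **format_kwds):
        # opening from a path lives in the classmethod, not __init__
        from PseudoNetCDF import pncopen
        return cls(pncopen(filename, **format_kwds))
```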
This would require changes to many of my classes that accept paths. Is this a precondition?
Can you elaborate a little bit -- how do you normally call `pncopen`? I assume it accepts some set of required and optional arguments?
For the usual case, users would write:

```python
# this would call pncopen(filename) internally
ds = xr.open_dataset(path, engine='pseudonetcdf')
```

For advanced cases, users would write something like:

```python
# this would call pncopen(filename, mode='r', format='netcdf') internally
store = xr.backends.PseudoNetCDFBackend.open(filename, mode='r', format='netcdf')
ds = xr.open_dataset(store)
```
A normal call looks like this:

```python
myfile = pncopen(path, format=fmt, **kwds)
```

The only required argument is path. The format keyword is a named format and is highly recommended. If format is omitted, pncopen will cycle through all the formats it knows, using the isMine class method to find a class that can open it. The other keywords are optional from pncopen's perspective, but may be required if one is opening file formats that are not fully self-describing.
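Conceptually, that format-cycling fallback amounts to something like the following sketch (not PNC's actual implementation; the helper name is hypothetical):

```python
def guess_reader(path, reader_classes):
    # try each registered reader's isMine classmethod in turn
    for cls in reader_classes:
        if cls.isMine(path):
            return cls
    raise IOError('no reader recognized %s' % path)
```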
I had not realized that you can provide the `open` interface and the resulting store to open_dataset. I was not planning to provide support for these because I assumed your init interfaces need to be identical. This seems like a nice alternative. I'll make an adjustment here.
We should probably add an optional `backend_kwargs` argument to `open_dataset` to make this a little easier.
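With such a keyword, the advanced case above could collapse into a single call; a sketch of the proposal, assuming the dict is forwarded to the store's open method:

```python
import xarray as xr

# keywords in backend_kwargs would be passed through to the store's
# open method (and from there to pncopen)
ds = xr.open_dataset('example.nc', engine='pseudonetcdf',
                     backend_kwargs={'format': 'netcdf'})
```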
xarray/backends/pnc_.py
Outdated
opener = functools.partial(pncopen, filename, | ||
mode=mode, format='netcdf') | ||
self.ds = opener() | ||
except Exception: |
Why do you need all these fallbacks? This logic is usually best avoided -- it's better to require users to be explicit. You should pick some default behavior for `open` (which will be used by `open_dataset`) and require explicitly opening the backend object if that does not suffice for some use-case.
I can remove the first except wrap, but the second is present because not all constructors accept the mode keyword. Individual constructors are wrapped by pncopen.
Can you simply leave out the `mode` argument in that case? There's no particular reason xarray needs to include it, as long as the default behavior supports reads.
I can leave out mode, but that will prevent the formats that do support mode from using it. I could add a `**kwds` option to `open`, much like pncopen. The default behavior would, as you suggest, leave out mode.
xarray/backends/api.py (Outdated)

```
@@ -295,6 +295,9 @@ def maybe_decode_store(store, lock=False):
     elif engine == 'pynio':
         store = backends.NioDataStore(filename_or_obj,
                                       autoclose=autoclose)
     elif engine == 'pnc':
```
I would slightly rather use the full name "pseudonetcdf" than an abbreviation like "pnc". We don't use abbreviations for any of the other backend names (to be fair, yes, they are also shorter).
Done.
OK, we will need to surface this information in some way for xarray -- maybe as an attribute of some sort on the PseudoNetCDF side? The good news is that we already have handling for indexing like NumPy arrays and netCDF4 variables, but we need to know which behavior to expect to make all of the indexing operations work efficiently. It's also safe to say for now that only basic indexing is supported, but that will result in sub-optimal indexing behavior (slower and more memory intensive than necessary). Note that we are currently in the process of refactoring how we handle indexers in backends, so you'll probably need to update things after #1899 is merged.
What is the status of this? I would like to start using this ASAP, but cannot convince my team to use it until it is an official backend of xarray.
@bbakernoaa - It may be a bit. I'm …
I only meant to comment, not close and comment.
xarray/backends/pnc_.py (Outdated)

```python
def __getitem__(self, key):
    key = indexing.unwrap_explicit_indexer(
        key, target=self, allow=indexing.BasicIndexer)
```
In #1899, we improved the indexing logic for backend arrays to support flexible indexing for all the backends.
Can you update here, similar to the following?

```python
key, np_inds = indexing.decompose_indexer(
    key, self.shape, indexing.IndexingSupport.OUTER_1VECTOR)
array = self.get_array()[key.tuple]  # index backend array
if len(np_inds.tuple) > 0:
    # index the loaded np.ndarray
    array = indexing.NumpyIndexingAdapter(array)[np_inds]
return array
```
For example, see xarray/backends/h5netcdf_.py, lines 18 to 32 in 54468e1:

```python
def __getitem__(self, key):
    key, np_inds = indexing.decompose_indexer(
        key, self.shape, indexing.IndexingSupport.OUTER_1VECTOR)
    # h5py requires using lists for fancy indexing:
    # https://github.com/h5py/h5py/issues/992
    key = tuple(list(k) if isinstance(k, np.ndarray) else k for k in
                key.tuple)
    with self.datastore.ensure_open(autoclose=True):
        array = self.get_array()[key]
    if len(np_inds.tuple) > 0:
        array = indexing.NumpyIndexingAdapter(array)[np_inds]
    return array
```
Here, this

```python
key, np_inds = indexing.decompose_indexer(
    key, self.shape, indexing.IndexingSupport.OUTER_1VECTOR)
```

decomposes the given indexers into those for the backend array (`key`) and those for the loaded `np.ndarray` (`np_inds`). The third argument (`OUTER_1VECTOR`, `OUTER`, `VECTORIZED`, `BASIC`) indicates what kind of indexing the backend supports.
xarray/tests/test_backends.py (Outdated)

```python
'attrs': {
    'units': 'Start_UTC',
    'standard_name': 'Start_UTC',
    **stdattr
```

E999 SyntaxError: invalid syntax
prevents inadvertent double scaling in PNC formats
The challenge here is that if a specific format uses similar functionality, it has already been applied and may or may not use the CF keywords. So it should be disabled by default.
I was worried this would be hard to implement, but it was actually easier. So that is what I did.
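A toy calculation showing the double-scaling hazard being avoided (illustrative numbers): if a format has already applied the packing attributes on read but they remain visible to xarray, CF decoding would apply them a second time.

```python
scale_factor = 0.01
add_offset = 0.0
packed = 1234                              # integer as stored on disk

once = packed * scale_factor + add_offset  # 12.34 -- what the reader returns
twice = once * scale_factor + add_offset   # 0.1234 -- wrong, scaled again
```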
Updates in 3.0.1 will fix close in uamiv.
xarray/backends/pseudonetcdf_.py
Outdated
from PseudoNetCDF._getreader import getreader, getreaderdict | ||
readerdict = getreaderdict() | ||
reader = getreader(filename, format=format, **format_kwds) | ||
_genreaders = tuple([readerdict[rn] for rn in _genericncf if rn in readerdict]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
E501 line too long (87 > 79 characters)
xarray/backends/pseudonetcdf_.py (Outdated)

```python
_genericncf = ('Dataset', 'netcdf', 'ncf', 'nc')
_genreaders = tuple([readerdict[rn] for rn in _genericncf if rn in readerdict])
```

F821 undefined name 'readerdict'
xarray/backends/pseudonetcdf_.py (Outdated)

```python
def open(cls, filename, format=None, writer=None,
         autoclose=False, **format_kwds):
    from PseudoNetCDF._getreader import getreader, getreaderdict
    readerdict = getreaderdict()
```

F841 local variable 'readerdict' is assigned to but never used
xarray/backends/pseudonetcdf_.py (Outdated)

```python
def open(cls, filename, format=None, writer=None,
         autoclose=False, **format_kwds):
    from PseudoNetCDF._getreader import getreader, getreaderdict
    _genreaders = tuple([readerdict[rn] for rn in _genericncf
```

F821 undefined name 'readerdict'
xarray/backends/pseudonetcdf_.py (Outdated)

```python
         autoclose=False, **format_kwds):
from PseudoNetCDF._getreader import getreader, getreaderdict
_genreaders = tuple([readerdict[rn] for rn in _genericncf
                     if rn in readerdict])
```

F821 undefined name 'readerdict'
xarray/backends/pseudonetcdf_.py (Outdated)

```python
from PseudoNetCDF._getreader import getreader, getreaderdict
_genreaders = tuple([readerdict[rn] for rn in _genericncf
                     if rn in readerdict])
readerdict = getreaderdict()
```

F841 local variable 'readerdict' is assigned to but never used
@barronh clarifying question for you: does PNC support some sort of "lazy loading" of data, where it is only loaded into NumPy arrays when accessed? Or does it eagerly load data into NumPy arrays? (sorry if you already answered this somewhere above!)
Depends on the format and expected size of data in that format. Some formats support lazy loading; others load immediately into memory. Still others use numpy memmaps, so virtual memory is used immediately.
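For reference, the memmap behavior in plain numpy (a generic illustration, not PNC internals):

```python
import numpy as np

# create a 4 MiB file-backed array; the OS pages data in on access,
# so address space is consumed immediately but physical RAM is not
mm = np.memmap('example.dat', dtype='float32', mode='w+',
               shape=(1024, 1024))
mm[0, :10] = np.arange(10)  # touching a slice faults in only that region
mm.flush()
```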
```
@@ -222,6 +228,10 @@ def open_dataset(filename_or_obj, group=None, decode_cf=True,
     --------
     open_mfdataset
     """

     if mask_and_scale is None:
         mask_and_scale = not engine == 'pseudonetcdf'
```
Actually, let's do this for `decode_cf` instead of `mask_and_scale`. I don't think the other netCDF-specific decoding, like concatenating characters, is relevant to PseudoNetCDF either.
Also, let's be sure to change this for `open_mfdataset` (and `open_dataarray`) as well.
Next push:

- I switched from `mask_and_scale` to `decode_cf`.
- `open_mfdataset` uses a pass-through of keywords.
- `open_dataarray` has been updated.

FYI - This leads to an odd outcome -- if I want `decode_times`, I have to explicitly enable `decode_cf`, then disable all the others individually, and enable `decode_times`.
Oh... I didn't realize you do want to decode times. In that case, I guess disabling only mask_and_scale is a reasonable choice.
xarray/backends/pseudonetcdf_.py (Outdated)

```python
@classmethod
def open(cls, filename, format=None, writer=None,
         autoclose=False, **format_kwds):
    from PseudoNetCDF._getreader import getreader, getreaderdict
```
Importing from private APIs (with the underscore prefix) makes me nervous. In general we try to avoid doing this because private APIs can change at any time, though of course that isn't true here because you're the PNC maintainer. I'm also nervous about including constants like `_genericncf = ('Dataset', 'netcdf', 'ncf', 'nc')` in xarray, because that feels like logic that should live in libraries like PNC instead.

So maybe we should stick with the standard open method that you were using earlier. In particular, if we change the default to `decode_cf=False` for PNC in `open_dataset()`, then I don't think it's essential to check the format.
I think it will be confusing for users to have an indirect use of netCDF4 via PseudoNetCDF that responds differently than the typical backend. This code is designed to address that xarray-specific issue. Typical behavior in PNC is to mask_and_scale via netCDF4's default functionality, so with `decode_cf` a netCDF file would still be decoded. This is confusing to me, let alone a user. For this reason, I think it is right to exclude formats that already have specifically optimized backends.

I agree completely about the private API. The standard pncopen, however, does not currently support the ability to enable a subset of readers. I will add this as a feature request to PseudoNetCDF and make a subsequent update there. As soon as I update pncopen, I would make the update here.

The `_genericncf` constant is not logic that should live in PNC. It is logic designed specifically to address an xarray issue. In this case, xarray has a feature that should pre-empt PNC's typical capabilities. The alternative is to update PNC, update the conda installation, and update this code again.
Are you thinking of someone explicitly calling `xarray.decode_cf()` on a dataset loaded from PNC? I agree that would give unexpected results, but I don't see that being done very often.

Another way to make all of this safe (and which wouldn't require disabling `mask_and_scale` in `open_dataset()`) would be to explicitly drop the `_FillValue`, `missing_value`, `add_offset`, and `scale_factor` attributes from variables loaded with PNC, moving them to `Variable.encoding` instead.
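A sketch of that alternative (an illustrative helper, not part of this PR):

```python
# CF packing attributes that xarray's decode_cf would act on
PACKING_ATTRS = ('_FillValue', 'missing_value', 'add_offset', 'scale_factor')

def move_packing_attrs_to_encoding(variable):
    """Stash packing attributes in encoding so a later decode_cf
    cannot re-apply them (variable is an xarray.Variable)."""
    for name in PACKING_ATTRS:
        if name in variable.attrs:
            variable.encoding[name] = variable.attrs.pop(name)
```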
Imagine that you have a netCDF file. If you open it with PNC, it uses netCDF4 as the library to open it. In xarray, however, you might get different processing. If you write it out, it might have different properties. This seems undesirable at best. That is why I prefer removing the netCDF functionality.

In PNC, the goal of using netCDF4 is to make sure all PNC functionality is available for any type of file. In PNC, some plotting and processing tools are available (similar to xarray). I think plotting and processing in xarray is often better than in PNC. The goal of adding PNC to xarray is to bring xarray the formats that PNC can read. Since xarray already has mechanisms for supporting netCDF, I think bringing this functionality is unnecessary.
My general feeling is that we should make it easy to do the right thing, but not put up artificial barriers to stop expert users. I think this is broadly consistent with most Python projects, including the Python language itself. Even if it's strange, somebody might have reasons for trying PNC/netCDF4 together, and as long as it will not give wrong results there is no reason to disable it.

That said, my real objections here are all the private imports from PNC and the four different aliases we use to check for netCDF files. If you want to put a public-API version of this in PNC before merging this PR, that would be OK with me.
Okay. I removed the type checking and private imports. I updated the testing.

I get one failure that I cannot identify as related to PNC: `test_open_mfdataset_manyfiles` fails with:

```
E OSError: [Errno 24] Too many open files: '/var/folders/g2/hwlpd21j4vl8hvldb4tyrz400000gn/T/tmp0bteny75'
```

I'm going to run it through the build to see if it is just my setup.
It didn't fail on the automated test, so all passing [x] and no private imports [x].
xarray/backends/pseudonetcdf_.py
Outdated
if rn in readerdict]) | ||
reader = getreader(filename, format=format, **format_kwds) | ||
if isinstance(reader, _genreaders): | ||
raise ValueError(('In xarray, PseudoNetCDF should not be used ' + |
I think we can avoid this check/error message if we default to `decode_cf=False` for PNC in `open_dataset()`.
see previous comment.
```python
    setattr(pncf, pk, pv)

pnc.pncwrite(pncf, path, **save_kwargs)
```
Would it make sense to add a test for `open_mfdataset` with PNC, if that's something we want to support?
I think so. I'll add that to my next push.

Testing by opening the same file twice and stacking it.
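A sketch of what such a test could look like (hypothetical path; assumes the time dimension is named TSTEP, as in CAMx/IOAPI-style files):

```python
import xarray as xr

path = 'example.uamiv'  # hypothetical test file

# open the same file twice and stack along the time dimension
with xr.open_dataset(path, engine='pseudonetcdf') as single, \
        xr.open_mfdataset([path, path], engine='pseudonetcdf',
                          concat_dim='TSTEP') as stacked:
    assert stacked.dims['TSTEP'] == 2 * single.dims['TSTEP']
```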
The two failures were not PNC:

Neither seems related to PNC at all... I'm pulling master and remerging to see if that helps...
Mask and scale with PseudoNetCDF and NetCDF4 is not supported, but not prevented.
mask_and_scale is None (diagnosed by open_dataset) and decode_cf should be True
thanks for sticking with this @barronh !
@shoyer - Thanks for all the help and guidance. I learned a lot. The development version of PNC is now flake8 compliant, I now use pytest to organize my unit tests, I have TravisCI implemented, and I have a conda-forge recipe. I'm grateful for the interactions.