Intake, catalogs, and datatree #134
@pbranson thanks for your ideas about integration of datatree with the intake ecosystem; this is definitely something I'm really interested in, and a potential use case I had in mind when originally creating this package.
I think this makes sense. Datatree is almost like an in-memory catalog of datasets.
Yep. There are probably lots of cool possibilities. My priority would be to build datatree in such a way that other packages can easily understand the model and experiment with interfacing in ways they think are sensible.
I think this poor performance could stem from a bunch of different problems, and I'm not sure datatree actually solves any of the dask-side issues. Datatree just makes it easier to express the complex operation which behaves poorly when run via dask. cc @rabernat, who has also pointed out the correspondence between datatree and intake catalogs to me before.
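To make the "in-memory catalog of datasets" analogy above concrete, here is a minimal sketch (the dataset names and contents are invented for illustration) that assembles unrelated datasets into one `DataTree` addressable by path, much like entries in a catalog:

```python
import numpy as np
import xarray as xr
from datatree import DataTree

# Two unrelated datasets, e.g. from different instruments
waves = xr.Dataset({"hs": ("time", np.random.rand(24))})
winds = xr.Dataset({"u10": ("time", np.random.rand(24))})

# Assemble them into a single tree, keyed by path like catalog entries
dt = DataTree.from_dict({
    "/observations/waves": waves,
    "/observations/winds": winds,
})

# Leaves can then be looked up by path, analogous to catalog lookup
print(dt["observations/waves"].ds)
```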
@TomNicholas Thanks for breaking this out of #97! I should have guessed that this would have been part of your discussions! I just scanned back over the issues prompting the creation of datatree.
The dask-side challenges could entirely be due to details of my naïve usage! :-)
The ability to open a set of intake catalogs as a datatree is one possibility. Separately, it's also been suggested that we might want to write a plugin for intake proper.
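As a hedged sketch of what opening an intake catalog as a datatree could look like: `catalog_to_datatree` below is a hypothetical helper, not an existing API, and it assumes every catalog entry is an intake-xarray source whose `.to_dask()` returns an `xarray.Dataset`:

```python
import intake
from datatree import DataTree

def catalog_to_datatree(cat):
    """Hypothetical helper: map each catalog entry to a node in a DataTree.

    Assumes all entries are intake-xarray sources (i.e. .to_dask()
    yields an xarray.Dataset); nested sub-catalogs are not handled.
    """
    datasets = {name: cat[name].to_dask() for name in list(cat)}
    return DataTree.from_dict(datasets)

# Usage sketch (the catalog URL is a placeholder):
# cat = intake.open_catalog("https://example.com/catalog.yaml")
# dt = catalog_to_datatree(cat)
```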
Closed and moved to pydata/xarray#9438
Thanks @TomNicholas and sorry for creating issue noise. I guess I got a bit carried away with these comments in the readme:
I was thinking that maybe the datatree abstraction could be a more formalised and ultimately 'xarray native' approach to the problems that have been tackled by e.g. intake-esm and intake-thredds. Leaves in the tree could be compositions over netCDF files, which may be aggregated via JSON indexes. I guess I was thinking that some sort of formalism over a nested data structure could help with dask computational graph composition. I have run into issues where the scheduler gets overloaded, or just takes forever to start, for calculations across large datasets composed with e.g. open_mfdataset.
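For concreteness, here is a sketch of the per-leaf composition pattern described above (the file paths and region names are invented): each leaf is built with its own `open_mfdataset` call, so each carries a smaller dask graph rather than one monolithic graph over every file. Whether this actually relieves scheduler pressure is exactly the open question here.

```python
import xarray as xr
from datatree import DataTree

# Hypothetical layout: each region's files are aggregated separately,
# keeping one modest dask graph per leaf instead of a single
# monolithic graph spanning all files.
regions = {
    f"/model/{name}": xr.open_mfdataset(f"data/{name}/*.nc", parallel=True)
    for name in ["north", "south"]
}
dt = DataTree.from_dict(regions)
```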
I wonder if @andersy005, @mdurant or @rsignell have any experience or thoughts about whether it makes sense to build an interface between this library and intake?
Originally posted by @pbranson in #97 (comment)