Skip to content

Consider renaming DataTree.ds and/or the data argument in DataTree.__init__ #9458

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
shoyer opened this issue Sep 8, 2024 · 8 comments
Closed
Labels
API design topic-DataTree Related to the implementation of a DataTree class

Comments

@shoyer
Copy link
Member

shoyer commented Sep 8, 2024

What is your issue?

It is not inherently clear to that DataTree(data=...) and DataTree.ds refer to the same thing, or that it refers specifically to the dataset associated with the node. We also do not use the abbreviation .ds anywhere else in Xarray.

Instead, what about calling this argument node? This makes the data model of DataTree much more transparent, it's just a node and children, with optional name and parent.

Example usage:

tree = DataTree(node=Dataset(...))
tree.node = Dataset(...)

CC @flamingbear @TomNicholas @eni-awowale @owenlittlejohns

@shoyer shoyer added needs triage Issue that has not been reviewed by xarray team member topic-DataTree Related to the implementation of a DataTree class API design and removed needs triage Issue that has not been reviewed by xarray team member labels Sep 8, 2024
@TomNicholas
Copy link
Member

TomNicholas commented Sep 9, 2024

It is not inherently clear to that DataTree(data=...) and DataTree.ds refer to the same thing, or that it refers specifically to the dataset associated with the node.

That's a fair criticism.

We also do not use the abbreviation .ds anywhere else in Xarray.

We don't in the API, but we do everywhere in the examples, e.g. ds = open_dataset(...). I agree that the current situation of .ds and .to_dataset() is not ideal though.

This makes the data model of DataTree much more transparent, it's just a node and children, with optional name and parent.

A node and zero or more children, to be really pedantic. As the type of children is not Optional.

Instead, what about calling this argument node?

This is fairly neat idea. The downsides are that "node" doesn't really imply data/dataset-like to me, and that currently we have a lot of explanation (and probably variable names / comments in code too) that uses "node" to mean "DataTree object at a specific location in the full tree", not "dataset data stored at at particular point in the tree".

Essentially if a tree "node" is no longer the thing that stores data, but is the data itself, we need a new word for what used to be called the "node". (In computer science lingo I think we would call each DataTree object a "vertex".)

In computer science terms a DataTree is roughly a "rooted (ordered) labeled tree whose vertices each contain a Dataset, with alignment enforced between Datasets".

@TomNicholas
Copy link
Member

TomNicholas commented Sep 9, 2024

I think I'm way too deep in the weeds here, and would appreciate perspective from others in @pydata/xarray on what they think the most intuitive names for these parts of the tree would be. Now, before we release in main, is the only time we can ever change this really.

@shoyer
Copy link
Member Author

shoyer commented Sep 10, 2024

Did we ever write the doc on DataTree terminology? That could be a good place to sort this all out.

@TomNicholas
Copy link
Member

There is a short, outdated terminology page on the xarray-contrib/datatree docs, but I think that's it. Being able to make separate PRs for these docs additions before releasing I think is a reason to make a dedicated (temporary) branch for the datatree docs (see #9033 (comment)).

@dcherian
Copy link
Contributor

How about simply .dataset? DataTree(dataset=...) seems pretty readable to me.

PS: tree.this struck me as nice but DataTree(this=...) is nonsensical haha.

@TomNicholas
Copy link
Member

How about simply .dataset?

That makes the intent quite clear, but

  1. It's a lot longer than .ds (or .node), and we expect this method to be called a lot,

  2. We also have .to_dataset(), which is different from .ds in that the former returns a mutable copy of the data and the latter returns an immutable view.

tree.this

👎 to this lol

@shoyer
Copy link
Member Author

shoyer commented Sep 10, 2024

Thanks @dcherian for the suggestions.

In the DataTree meeting today, we agreed to switch both names to .dataset. We'll leave the .ds property around for now as a (soft deprecated) alias.

@TomNicholas
Copy link
Member

Closed by #9476

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API design topic-DataTree Related to the implementation of a DataTree class
Projects
None yet
Development

No branches or pull requests

3 participants