Get number of rows and columns #20
Comments
Some notes from the call:
|
This seems to ignore wesm/dataframe-protocol#1. There are several people there that argue In general, I'd prefer not to rehash anything that is covered by that |
From the discussions in https://github.com/wesm/dataframe-protocol/pull/1/files#r469164058 and in the Aug 20 meeting, I think there is agreement to use a function to return the number of rows in a dataframe. There is probably less agreement, but still interest, in doing the same for the number of columns. I think this is the abstract implementation resulting from the discussions (feel free to propose different names for the methods):

    import abc

    class dataframe(abc.ABC):
        def __len__(self):
            # Raise (not return) so that len(df) fails with a clear message.
            raise NotImplementedError('Use the methods `.count_rows()` and `.count_columns()` to obtain the length of a dataframe.')

        @abc.abstractmethod
        def count_rows(self):
            """Return the number of rows in the dataframe."""

        @abc.abstractmethod
        def count_columns(self):
            """Return the number of columns in the dataframe."""

I guess implementing If there are no objections (or better method name proposals), I'll be opening a PR soon. |
What's the return type of these methods, and does it support lazy evaluation? |
This is a very good question. I guess the simple option is to return a Python
Since I think this will apply not only to these two methods, but most methods returning scalars, probably worth opening a new issue to discuss. |
Would it be desirable to use the exact same types we use in the array API? Or would this generate too deep a dependency which would force the data frame implementations to also implement, or depend on, an array one? |
I think there is an intersection where it may be desirable that they are the same (or as compatible as possible). But I think several types only make sense for dataframe, such as category, string, datetime. And not sure if for dataframes we want bfloat. So, I don't think the list should be the same. |
Coupling with the array API seems undesirable in general. One could consider a 0-D dataframe in analogy to arrays, but that may be a thing that not every library wants to implement. Maybe this should allow returning either a regular Python int or something that ducktypes properly as one? Note that this will come up for the array API as well, e.g. |
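As a rough illustration of "something that ducktypes properly as an int", the sketch below defers the computation until the value is actually needed. The name `LazyCount` and the callback-based design are hypothetical; the only standard pieces are Python's `__index__` and `__int__` hooks.

```python
import operator


class LazyCount:
    """Hypothetical lazy scalar: computes on first use, then caches the result."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable returning the count
        self._value = None

    def _force(self):
        if self._value is None:
            self._value = int(self._compute())
        return self._value

    def __index__(self):
        # Lets the object be used wherever Python requires an integer,
        # e.g. range(n) or operator.index(n).
        return self._force()

    def __int__(self):
        return self._force()

    def __eq__(self, other):
        return self._force() == other

    def __hash__(self):
        return hash(self._force())


calls = []


def expensive_row_count():
    # Stand-in for a computation a lazy backend would rather postpone.
    calls.append(1)
    return 3


n = LazyCount(expensive_row_count)
```

Nothing is computed when `n` is created; the callback runs once, the first time `n` is used as an integer (for example in `range(n)` or `int(n)`).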
I would personally still prefer the names (but +1 on having them as functions and not having |
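Thanks for the feedback Joris, num_rows() and num_columns() sound good too. I have the feeling that count_rows() and count_columns() will make it easier to remember that they are functions and not properties, so that would be my preference. But maybe that's just in my mind, I'm happy with both. Let's see if we can get more opinions, and then we make a decision. |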
Either is fine by me.
…On Tue, Aug 25, 2020 at 5:25 AM Marc Garcia ***@***.***> wrote:
Thanks for the feedback Joris, num_rows() and num_columns() sound good
too. I have the feeling that count_rows() and count_columns() will make it
easier to remember that they are functions and not properties, so that
would be my preference. But maybe that's just in my mind, I'm happy with
both.
Let's see if we can get more opinions, and then we make a decision.
|
Whether they must compute something will be implementation-dependent, right? Do we have ways to signal that programmatically? Or will it be up to the documentation? |
The amount of computation will certainly be implementation-dependent in this approach. Some applications may not want to trigger computation; in those cases, no result or an estimate might be better. In this spec, it would probably be good to have certain guarantees on runtime for some of the APIs, but that should be clear from the API, not the documentation. I think that can be a different discussion, so as not to hijack the original purpose of this thread. |
This was done, we indeed went with |
This issue is to discuss how to obtain the size of a dataframe. I'll show with an example, and base it on the pandas API.
Given a dataframe:
I think the Pythonic and simpler way to get the number of rows and columns is to just use Python's len, which is what pandas does:
I guess an alternative could be to use df.num_rows and df.num_columns, but IMHO it doesn't add much value, and just makes the API more complex.
One thing to note is that pandas mostly implements the dict API for a dataframe (as if it was a dictionary of lists, like in the example data). But when returning the number of rows with len(df), this is inconsistent with the dict API, which would return the number of columns (keys). So, with the proposed API, len(data) != len(df). I think being fully consistent with the dict API would be misleading, but it is worth considering.
Then, pandas offers some extra properties:
I guess the reason for the first two is that pandas originally implemented Panel, a three-dimensional data structure, and ndim and shape made sense with it. But I don't think they add much value now.
I don't think size is that commonly used (will check once we have the data from analyzing pandas usage), and it's trivial for users to implement, so I wouldn't add it to the API.
Proposal
len(df) returning the number of rows
len(df.columns) returning the number of columns
And nothing else regarding the shape of a dataframe.
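The dict-versus-dataframe mismatch described above can be seen with plain Python. This is only a sketch: the contents of `data` are made up for illustration, since the original example is not shown here, and it uses no pandas calls.

```python
# Hypothetical example data: a dict of equal-length lists (column name -> values).
data = {"col1": [1, 2, 3], "col2": [4.0, 5.0, 6.0]}

# dict semantics: len() counts keys, i.e. columns.
num_columns = len(data)

# The row count is the length of any one column.
num_rows = len(next(iter(data.values())))

# The extra shape-like values are trivial to derive from the two counts.
shape = (num_rows, num_columns)
size = num_rows * num_columns

print(num_rows, num_columns, shape, size)
```

Under the proposal, len(df) for the equivalent dataframe would give the row count, while len(data) gives the column count, which is the inconsistency noted in the issue text.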