Get number of rows and columns #20

Closed · datapythonista opened this issue Jun 29, 2020 · 14 comments

@datapythonista commented Jun 29, 2020

This issue is to discuss how to obtain the size of a dataframe. I'll illustrate with an example, based on the pandas API.

Given a dataframe:

import pandas

data = {'col1': [1, 2, 3, 4],
        'col2': [5, 6, 7, 8]}

df = pandas.DataFrame(data)

I think the simplest and most Pythonic way to get the number of rows and columns is to just use Python's len, which is what pandas does:

>>> len(df)  # number of rows
4
>>> len(df.columns)  # number of columns
2

I guess an alternative could be to use df.num_rows and df.num_columns, but IMHO it doesn't add much value and just makes the API more complex.

One thing to note is that pandas mostly implements the dict API for a dataframe (as if it were a dictionary of lists, like the example data). But returning the number of rows from len(df) is inconsistent with the dict API, which would return the number of columns (keys). So, with the proposed API, len(data) != len(df). I think being fully consistent with the dict API would be misleading, but it's worth considering.
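To make the inconsistency concrete, here is what len returns for the dict and for the dataframe built from it (actual pandas behavior, using the example data above):

>>> len(data)  # number of keys, i.e. columns
2
>>> len(df)  # number of rows
4
>>> len(df.columns)  # matches len(data)
2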

Then, pandas offers some extra properties:

df.ndim == 2

df.shape == (len(df), len(df.columns))

df.size == len(df) * len(df.columns)
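For the example dataframe above (4 rows, 2 columns), these evaluate to:

>>> df.ndim
2
>>> df.shape
(4, 2)
>>> df.size
8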

I guess the reason for the first two is that pandas originally implemented Panel, a three-dimensional data structure, and ndim and shape made sense with it. But I don't think they add much value now.

I don't think size is that commonly used (we'll check once we have the pandas usage analysis data), and it's trivial for users to implement themselves, so I wouldn't add it to the API.

Proposal

  • len(df) returning the number of rows
  • len(df.columns) returning the number of columns

And nothing else regarding the shape of a dataframe.

@datapythonista

Some notes from the call:

  • size is a convenient way to check if a dataframe is empty (zero rows or columns)
  • ndim and shape would allow using a dataframe in simple code where an ndarray is expected. Do we want that?
  • What happens in a lazy implementation when calling len(df)? For example, in a distributed structure, do we even want to support returning the length, which can be unknown and too expensive to compute?
  • CPython limits the value returned by __len__() to a Py_ssize_t (a signed 64-bit integer on most platforms), which may not be enough for distributed systems. Would it be better to use a property to avoid this limitation? (A sketch of the limitation follows below.)
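A minimal sketch of that limitation (the OverflowError is actual CPython behavior; the property-based workaround is an illustration, not an agreed API):

import sys

class huge_dataframe:
    def __len__(self):
        # CPython converts the result of __len__ to a Py_ssize_t,
        # so anything past sys.maxsize overflows
        return sys.maxsize + 1

try:
    len(huge_dataframe())
except OverflowError as exc:
    print(exc)  # cannot fit 'int' into an index-sized integer

class huge_dataframe_property:
    @property
    def num_rows(self):
        # A property has no such limit: Python ints are arbitrary precision
        return sys.maxsize + 1

huge_dataframe_property().num_rows  # fine, returns 9223372036854775808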

@rgommers commented Aug 4, 2020

This seems to ignore wesm/dataframe-protocol#1. There are several people there who argue len(...) is ambiguous, and I'd agree with them.

In general, I'd prefer not to rehash anything that is covered by that __dataframe__ prototype in this repo, unless it builds on it. One thing that was suggested there was to write a requirements doc, which is kind of what we're doing here. So that is what we should focus on for basic data access/interchange topics imho.

@datapythonista

From the discussions in https://github.com/wesm/dataframe-protocol/pull/1/files#r469164058 and in the Aug 20 meeting, I think there is agreement to use a function to return the number of rows in a dataframe. There is probably less agreement, but also interest, in doing the same for the number of columns.

I think this is the abstract implementation resulting from the discussions (feel free to propose different names for the methods):

import abc

class dataframe(abc.ABC):
    def __len__(self):
        # Raise (not return) the exception so that len(df) fails explicitly
        raise NotImplementedError('Use the methods `.count_rows()` and `.count_columns()` to obtain the length of a dataframe.')

    @abc.abstractmethod
    def count_rows(self):
        """Return the number of rows in the dataframe."""

    @abc.abstractmethod
    def count_columns(self):
        """Return the number of columns in the dataframe."""

I guess implementing shape doesn't make sense based on the discussions (by using methods instead of properties or __len__, we are explicit that getting the number of rows and columns can be an expensive operation). That means leaving out ndim makes even more sense, since its only goal was NumPy compatibility, which we don't aim to achieve. And I don't think we want to implement size either, even if it's worth considering having a way to know whether a dataframe is empty, which was its main known use case.

If there are no objections (or better method name proposals), I'll be opening a PR soon.

@markusweimer

What's the return type of these methods, and does it support lazy evaluation?

@datapythonista

> What's the return type of these methods, and does it support lazy evaluation?

This is a very good question. I guess the simple option is to return a Python int. But if we want to allow lazy evaluation, or avoid the conversion to a Python object, we've got two options:

  • Use a 0-dimensional array, as in the array API
  • Implement a scalar type/class that takes care of these options

Since I think this will apply not only to these two methods but to most methods returning scalars, it's probably worth opening a new issue to discuss. A rough sketch of the second option follows.
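As an illustration of the second option (a hypothetical LazyScalar class; the name and design are made up for this sketch), a scalar could defer computation and still duck-type as a Python int via __index__:

class LazyScalar:
    def __init__(self, compute):
        self._compute = compute  # zero-argument callable producing the value
        self._value = None

    def _materialize(self):
        # Compute the value once, on first use
        if self._value is None:
            self._value = self._compute()
        return self._value

    def __int__(self):
        return int(self._materialize())

    def __index__(self):
        # Allows use anywhere an int is required (range, slicing, ...)
        return int(self._materialize())

n = LazyScalar(lambda: 4)  # e.g. a deferred row count
list(range(n))             # [0, 1, 2, 3], computed only when needed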

@markusweimer

Would it be desirable to use the exact same types we use in the array API? Or would this create too deep a dependency, forcing dataframe implementations to also implement, or depend on, an array one?

@datapythonista

> Would it be desirable to use the exact same types we use in the array API?

I think there is an intersection where it may be desirable for them to be the same (or as compatible as possible). But several types only make sense for dataframes, such as category, string, and datetime. And I'm not sure we want bfloat for dataframes. So I don't think the list should be the same.

@rgommers

Coupling with the array API seems undesirable in general. One could consider a 0-D dataframe in analogy to arrays, but that may be a thing that not every library wants to implement. Maybe this should allow returning either a regular Python int or something that ducktypes properly as one?

Note that this will come up for the array API as well, e.g. .shape will give a Python tuple in NumPy and a custom Size object in PyTorch. That kind of variability in return types will be hard to get rid of completely without imposing a large implementation burden.

@jorisvandenbossche

I would personally still prefer the names num_rows() and num_columns(), as I find those somewhat more logical names (I think the fact that they are functions, plus a clear docstring, should be enough to warn that they can be costly).

(but +1 on having them as functions and not having __len__, to be clear, so it's only name bikeshedding ;))

@datapythonista commented Aug 25, 2020

Thanks for the feedback, Joris; num_rows() and num_columns() sound good too. I have the feeling that count_rows() and count_columns() will make it easier to remember that they are functions and not properties, so that would be my preference. But maybe that's just in my mind; I'm happy with either.

Let's see if we can get more opinions, and then we'll make a decision.

@TomAugspurger

TomAugspurger commented Aug 25, 2020 via email

@markusweimer

> Thanks for the feedback, Joris; num_rows() and num_columns() sound good too. I have the feeling that count_rows() and count_columns() will make it easier to remember that they are functions and not properties

Whether they must compute something will be implementation-dependent, right? Do we have ways to signal that programmatically? Or will it be up to the documentation?

@devin-petersohn

The amount of computation will be implementation-dependent in this approach, certainly.

Some applications may not want to trigger computation; no result, or an estimate, might be better in those cases. In this spec, it would probably be good to have certain runtime guarantees for some of the APIs, but they should be clear from the API itself, not the documentation. num_rows(estimate=True), for example, could be required to run in O(1), while num_rows() by default has no runtime guarantees (see the sketch after this comment).

I think that can be a different discussion, so as not to hijack the original purpose of this thread.
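A minimal sketch of that idea (hypothetical class and parameter names; nothing here is part of an agreed spec):

class distributed_dataframe:
    def __init__(self, partitions, row_count_hint=None):
        self._partitions = partitions  # e.g. per-worker chunks of rows
        self._hint = row_count_hint    # cached metadata-level approximation

    def num_rows(self, estimate=False):
        if estimate:
            # Required to be O(1): return the cached hint, possibly inexact
            return self._hint
        # No runtime guarantee: may trigger a full, expensive computation
        return sum(len(part) for part in self._partitions)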

@rgommers

This was done; we indeed went with num_rows and num_columns.
