Get number of rows and columns #20

Closed · datapythonista opened this issue Jun 29, 2020 · 14 comments

@datapythonista commented Jun 29, 2020

This issue is to discuss how to obtain the size of a dataframe. I'll illustrate with an example, based on the pandas API.

Given a dataframe:

import pandas

data = {'col1': [1, 2, 3, 4],
        'col2': [5, 6, 7, 8]}

df = pandas.DataFrame(data)

I think the simplest and most Pythonic way to get the number of rows and columns is to just use Python's len, which is what pandas does:

>>> len(df)  # number of rows
4
>>> len(df.columns)  # number of columns
2

I guess an alternative could be to use df.num_rows and df.num_columns, but IMHO it doesn't add much value and just makes the API more complex.

One thing to note is that pandas mostly implements the dict API for a dataframe (as if it were a dictionary of lists, like the example data). But returning the number of rows from len(df) is inconsistent with the dict API, which would return the number of columns (keys). So, with the proposed API, len(data) != len(df). I think being fully consistent with the dict API would be misleading, but it's worth considering.
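To make the inconsistency concrete, here is what len returns for the dict and for the dataframe built from it (actual pandas behavior, using the example data above):

>>> len(data)  # number of keys, i.e. columns
2
>>> len(df)  # number of rows
4
>>> len(df.columns)  # matches len(data)
2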

Then, pandas offers some extra properties:

df.ndim == 2

df.shape == (len(df), len(df.columns))

df.size == len(df) * len(df.columns)
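For the example dataframe above (4 rows, 2 columns), these evaluate to:

>>> df.ndim
2
>>> df.shape
(4, 2)
>>> df.size
8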

I guess the reason for the first two is that pandas originally implemented Panel, a three-dimensional data structure, and ndim and shape made sense with it. But I don't think they add much value now.

I don't think size is that commonly used (we'll check once we have the pandas usage analysis data), and it's trivial for users to implement themselves, so I wouldn't add it to the API.

Proposal

  • len(df) returning the number of rows
  • len(df.columns) returning the number of columns

And nothing else regarding the shape of a dataframe.

@datapythonista

Some notes from the call:

  • size is a convenient way to check if a dataframe is empty (zero rows or columns)
  • ndim and shape would allow using a dataframe in simple code where an ndarray is expected. Do we want that?
  • What happens in a lazy implementation when calling len(df)? For example, in a distributed structure, do we even want to support returning the length, which can be unknown and too expensive to compute?
  • CPython limits the value returned by __len__() to a Py_ssize_t (a signed 64-bit integer on most platforms), which may not be enough for distributed systems. Would it be better to use a property to avoid this limitation? (A sketch of the limitation follows below.)
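A minimal sketch of that limitation (the OverflowError is actual CPython behavior; the property-based workaround is an illustration, not an agreed API):

import sys

class huge_dataframe:
    def __len__(self):
        # CPython converts the result of __len__ to a Py_ssize_t,
        # so anything past sys.maxsize overflows
        return sys.maxsize + 1

try:
    len(huge_dataframe())
except OverflowError as exc:
    print(exc)  # cannot fit 'int' into an index-sized integer

class huge_dataframe_property:
    @property
    def num_rows(self):
        # A property has no such limit: Python ints are arbitrary precision
        return sys.maxsize + 1

huge_dataframe_property().num_rows  # fine, returns 9223372036854775808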

@rgommers commented Aug 4, 2020

This seems to ignore wesm/dataframe-protocol#1. There are several people there who argue len(...) is ambiguous, and I'd agree with them.

In general, I'd prefer not to rehash anything that is covered by that __dataframe__ prototype in this repo, unless it builds on it. One thing that was suggested there was to write a requirements doc, which is kind of what we're doing here. So that is what we should focus on for basic data access/interchange topics imho.

@datapythonista

From the discussions in https://github.com/wesm/dataframe-protocol/pull/1/files#r469164058 and in the Aug 20 meeting, I think there is agreement to use a function to return the number of rows in a dataframe. There is probably less agreement, but also interest, in doing the same for the number of columns.

I think this is the abstract implementation resulting from the discussions (feel free to propose different names for the methods):

import abc

class dataframe(abc.ABC):
    def __len__(self):
        # Raise (not return) the exception so that len(df) fails explicitly
        raise NotImplementedError('Use the methods `.count_rows()` and `.count_columns()` to obtain the length of a dataframe.')

    @abc.abstractmethod
    def count_rows(self):
        """Return the number of rows in the dataframe."""

    @abc.abstractmethod
    def count_columns(self):
        """Return the number of columns in the dataframe."""

I guess implementing shape doesn't make sense based on the discussions (by using methods instead of properties or __len__, we are explicit that getting the number of rows and columns can be an expensive operation). That means leaving out ndim makes even more sense, since its only goal was NumPy compatibility, which we don't aim to achieve. And I don't think we want to implement size either, even if it's worth considering having a way to know whether a dataframe is empty, which was its main known use case.

If there are no objections (or better method name proposals), I'll be opening a PR soon.

@markusweimer

What's the return type of these methods, and does it support lazy evaluation?

@datapythonista

> What's the return type of these methods, and does it support lazy evaluation?

This is a very good question. I guess the simple option is to return a Python int. But if we want to allow lazy evaluation, or avoid the conversion to a Python object, we've got two options:

  • Use a 0-dimensional array, as in the array API
  • Implement a scalar type/class that takes care of these options

Since I think this will apply not only to these two methods but to most methods returning scalars, it's probably worth opening a new issue to discuss. A rough sketch of the second option follows.
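As an illustration of the second option (a hypothetical LazyScalar class; the name and design are made up for this sketch), a scalar could defer computation and still duck-type as a Python int via __index__:

class LazyScalar:
    def __init__(self, compute):
        self._compute = compute  # zero-argument callable producing the value
        self._value = None

    def _materialize(self):
        # Compute the value once, on first use
        if self._value is None:
            self._value = self._compute()
        return self._value

    def __int__(self):
        return int(self._materialize())

    def __index__(self):
        # Allows use anywhere an int is required (range, slicing, ...)
        return int(self._materialize())

n = LazyScalar(lambda: 4)  # e.g. a deferred row count
list(range(n))             # [0, 1, 2, 3], computed only when needed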

@markusweimer

Would it be desirable to use the exact same types we use in the array API? Or would this create too deep a dependency, forcing dataframe implementations to also implement, or depend on, an array one?

@datapythonista

> Would it be desirable to use the exact same types we use in the array API?

I think there is an intersection where it may be desirable for them to be the same (or as compatible as possible). But several types only make sense for dataframes, such as category, string, and datetime. And I'm not sure we want bfloat for dataframes. So I don't think the list should be the same.

@rgommers

Coupling with the array API seems undesirable in general. One could consider a 0-D dataframe in analogy to arrays, but that may be a thing that not every library wants to implement. Maybe this should allow returning either a regular Python int or something that ducktypes properly as one?

Note that this will come up for the array API as well, e.g. .shape will give a Python tuple in NumPy and a custom Size object in PyTorch. That kind of variability in return types will be hard to get rid of completely without imposing a large implementation burden.

@jorisvandenbossche

I would personally still prefer the names num_rows() and num_columns(), as I find those somewhat more logical names (I think the fact that they are functions, plus a clear docstring, should be enough to warn that they can be costly).

(but +1 on having them as functions and not having __len__, to be clear, so it's only name bikeshedding ;))

@datapythonista commented Aug 25, 2020

Thanks for the feedback, Joris; num_rows() and num_columns() sound good too. I have the feeling that count_rows() and count_columns() will make it easier to remember that they are functions and not properties, so that would be my preference. But maybe that's just in my mind; I'm happy with either.

Let's see if we can get more opinions, and then we'll make a decision.

@TomAugspurger

TomAugspurger commented Aug 25, 2020 via email

@markusweimer

> Thanks for the feedback, Joris; num_rows() and num_columns() sound good too. I have the feeling that count_rows() and count_columns() will make it easier to remember that they are functions and not properties

Whether they must compute something will be implementation-dependent, right? Do we have ways to signal that programmatically? Or will it be up to the documentation?

@devin-petersohn

The amount of computation will be implementation-dependent in this approach, certainly.

Some applications may not want to trigger computation; no result, or an estimate, might be better in those cases. In this spec, it would probably be good to have certain runtime guarantees for some of the APIs, but they should be clear from the API itself, not the documentation. num_rows(estimate=True), for example, could be required to run in O(1), while num_rows() by default has no runtime guarantees (see the sketch after this comment).

I think that can be a different discussion, so as not to hijack the original purpose of this thread.
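A minimal sketch of that idea (hypothetical class and parameter names; nothing here is part of an agreed spec):

class distributed_dataframe:
    def __init__(self, partitions, row_count_hint=None):
        self._partitions = partitions  # e.g. per-worker chunks of rows
        self._hint = row_count_hint    # cached metadata-level approximation

    def num_rows(self, estimate=False):
        if estimate:
            # Required to be O(1): return the cached hint, possibly inexact
            return self._hint
        # No runtime guarantee: may trigger a full, expensive computation
        return sum(len(part) for part in self._partitions)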

@rgommers

This was done; we indeed went with num_rows and num_columns.
