
"levels" of the Standard #201


Closed
MarcoGorelli opened this issue Jul 13, 2023 · 13 comments

Comments

@MarcoGorelli
Contributor

I've tried to think about how to resolve our differences in #194 (comment), and I think such disagreements are bound to keep coming up.

I'd like to suggest a way around this: that there be 2 levels of the standard:

  • level 1: core functionality (what we have already), can be used to do heavy lifting
  • level 2: in addition to level 1, there are also methods which don't necessarily guarantee high performance (such as GroupBy.__iter__, to_json/to_pylist, or maybe even .to_array_object)

Implementations could then choose to provide level1-compliance, or level1 and level2 compliance. For example, we may get to:

  • cudf: level 1
  • modin: level 1
  • pandas: level 1, level 2
  • polars: level 1, level 2

Then, libraries which consume dataframes could declare the level of compliance they require. For example:

  • scikit-learn: works with any level1-compliant dataframe
  • feature-engine: works with any level1-compliant dataframe
  • plotly: works with any level2-compliant dataframe
  • altair: works with any level2-compliant dataframe

Thoughts?

Plotting was meant to be one of the concrete use cases, and I'd be disappointed if we had to give it up just because it doesn't do heavy lifting
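For illustration only, the consumer-side check could look something like this. The attribute name `__dataframe_standard_level__` and both helper functions are invented here for the sake of the sketch; nothing like this is part of the actual Standard:

```python
def supported_level(df) -> int:
    """Return the standard level the dataframe claims, defaulting to 0."""
    return getattr(df, "__dataframe_standard_level__", 0)


def require_level(df, level: int) -> None:
    """Raise if the dataframe does not declare at least `level` compliance."""
    if supported_level(df) < level:
        raise TypeError(
            f"need a level-{level}-compliant dataframe, "
            f"got one declaring level {supported_level(df)}"
        )


class FakeLevel1Frame:
    """Stands in for a dataframe from a level-1-only library."""
    __dataframe_standard_level__ = 1


require_level(FakeLevel1Frame(), 1)    # fine
# require_level(FakeLevel1Frame(), 2)  # would raise TypeError
```

A consumer like scikit-learn would then call `require_level(df, 1)` at its entry points, while a plotting library might require level 2.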

@kkraus14
Collaborator

Thank you for writing this up and pushing this forward Marco. Something along the lines of this sounds like a good idea to me and can hopefully give developers clear guidance and a good UX.

What are your thoughts on moving between levels in an application / workflow? Similar to the example I gave where you have heavy lifting up front to drastically reduce the size of the dataframe that can be done with "level 1" APIs, and then some tail logic that requires "level 2" APIs. Maybe we'd want an API that allows moving from "level 1" to "level 2" in a standardized and library agnostic way of some sort?

I don't know how this would work without forcing every "level 1" library to have a "level 2" library as a forced dependency, but maybe that isn't really a problem in practice?

@MarcoGorelli
Contributor Author

thanks, glad you're on board with the general idea! 🙌

Maybe we'd want an API that allows moving from "level 1" to "level 2" in a standardized and library agnostic way of some sort?

hmm yeah sounds like a good idea, will think about it

@MarcoGorelli
Contributor Author

Do we have an example of a library which would do heavy lifting first, and then some level-2 tail logic?

Perhaps for now, we could just say: if you want to use a level1 dataframe library with a package which requires level2-compliance, then you should convert it to a level2 library first?

@kkraus14
Collaborator

Do we have an example of a library which would do heavy lifting first, and then some level-2 tail logic?

A couple that come to mind are Datashader (https://github.com/holoviz/datashader) and VegaFusion (https://github.com/hex-inc/vegafusion) which both do things like histogramming and binning (presumably via aggregations) in order to collect all of the points into something much smaller which can then be visualized.

You could imagine them wanting to do all of the histogramming / binning work using a level-1 library for the performance and then once the result set is significantly reduced they could move to the level-2 library for the increased flexibility.

@kkraus14
Collaborator

Perhaps for now, we could just say: if you want to use a level1 dataframe library with a package which requires level2-compliance, then you should convert it to a level2 library first?

I would be -1 on that because it will strongly encourage libraries to keep anti-patterns in their codebase and just throw a "level-2 requirement" in. In general, I imagine all of the "level-2" functions to be things that are desired for flexibility at the cost of performance, and they should only be used when there are no other options, as opposed to being used for convenience. To this point, I would implement it in a way that makes "level-2" explicitly opt-in.

I.e., imagine some function pseudocode like this:

def my_func(input_df: DataFrame) -> List:
    a_level_1 = input_df.level_1_function_a(...)
    b_level_1 = input_df.level_1_function_b(...)
    c_level_1 = a_level_1.join(b_level_1)

    # A library like pandas or Polars could return self, which would be "free".
    # For cuDF or other libraries, this may trigger a copy / conversion as needed;
    # this function should be assumed to be "expensive" and should only be used
    # when needed.
    c_level_2 = c_level_1.to_dataframe_level(2)

    # Trying to do this on c_level_1 is not guaranteed to work
    return c_level_2.to_pylist()

We would really want, and should encourage, people to do those first few operations using their input library, and then go to a "level-2" library only if / when needed. Libraries which already support "level-2" functionality could just as happily be used as inputs, and the only downside from a developer perspective is a single function call to change the level from 1 to 2, which could happily return self if the input dataframe already supports "level-2".
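The opt-in conversion sketched above can be made concrete with a tiny self-contained example. The class names and conversion logic here are invented for illustration; a real level-1-only library would convert to whichever level-2 library it designates:

```python
class Level2Frame:
    """Stands in for a library implementing the full (level-2) API."""
    def __init__(self, rows):
        self._rows = rows

    def to_dataframe_level(self, level):
        # Already level 2, so this is "free".
        return self

    def to_pylist(self):
        # A level-2-only convenience method.
        return list(self._rows)


class Level1Frame:
    """Stands in for a level-1-only library (no to_pylist)."""
    def __init__(self, rows):
        self._rows = rows

    def to_dataframe_level(self, level):
        if level == 1:
            return self
        # Potentially expensive copy / conversion into the level-2 library.
        return Level2Frame(self._rows)


df = Level1Frame([1, 2, 3])
out = df.to_dataframe_level(2).to_pylist()  # [1, 2, 3]
```

The key property is that `to_dataframe_level(2)` is free (returns self) for a library that already supports level 2, and an explicit, visible cost everywhere else.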

@MarcoGorelli
Contributor Author

sure sounds good

so I think we need:

  • a way to go from level_1 to level_2 (for example, cudf could define that polars is their level_2 library - just for the sake of example), like namespace.to_level_2_dataframe
  • a way to get back to level_1 from level_2 (so, once you've converted your cudf to polars, you may need to get back to cudf), like namespace.from_level_2_dataframe

I don't think there should be a universal way to go from level2 to level1, right? A given library might be the "level2" of many different level1 libraries

@kkraus14
Collaborator

Agreed on not needing a universal way to go from level 2 to level 1. That's why we have the interchange protocol if someone really wanted to do it for some reason. Are there any implementations we know about where using the interchange protocol would be inefficient (i.e. Spark / Dask / Modin or other distributed implementations) where they'd be level 2 and want to "migrate" to a level 1 implementation?

I also don't think we should have the ability to go back from level 2 to level 1 in the case someone did the "migration". It would generally encourage bouncing back and forth which isn't what we're trying to enable here.

@MarcoGorelli
Contributor Author

Is there a standardised way of constructing a dataframe using the interchange protocol? I'm only aware of pandas.api.interchange.from_dataframe, which is pandas-specific and which converts to numpy dtypes

I presume for cudf you'd prefer to interchange to something arrow-backed like polars?

@rgommers
Member

I like this level 1 vs. level 2 idea too, and agree with the main points made so far.

One question I have is about this bit:

    # A library like pandas or Polars could return self, which would be "free".
    # For cuDF or other libraries, this may trigger a copy / conversion as needed;
    # this function should be assumed to be "expensive" and should only be used
    # when needed.
    c_level_2 = c_level_1.to_dataframe_level(2)

if you return self here, this actually means that the level 2 APIs are present on the c_level_1 object. It's not clear to me whether they can actually be called without the .to_dataframe_level(2) call. If so, it's hard to keep them apart. Maybe that .to_dataframe_level(2) call should set a dataframe attribute like ._allow_calling_potentially_slow_convenience_apis to avoid this?

Another implementation trick could be to use __getattr__ to hide the level 2 methods if .to_dataframe_level(2) has not been called.
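A minimal sketch of that `__getattr__` trick, with all names invented for illustration: the level-2 method exists as a private implementation, and `__getattr__` only exposes it once the user has opted in via `to_dataframe_level(2)`:

```python
class Frame:
    """Toy frame whose level-2 conveniences are hidden until opted in."""
    def __init__(self, rows, level=1):
        self._rows = rows
        self._level = level

    def to_dataframe_level(self, level):
        if level <= self._level:
            return self  # already at (or above) the requested level: "free"
        return Frame(self._rows, level=level)

    def _to_pylist(self):
        return list(self._rows)

    def __getattr__(self, name):
        # Only called for attributes not found normally, so this gates
        # access to the level-2 convenience API.
        if name == "to_pylist":
            if self._level >= 2:
                return self._to_pylist
            raise AttributeError(
                "to_pylist is a level-2 API; call .to_dataframe_level(2) first"
            )
        raise AttributeError(name)


df = Frame([1, 2, 3])
hasattr(df, "to_pylist")              # False before opting in
df.to_dataframe_level(2).to_pylist()  # [1, 2, 3]
```

One nice side effect is that `hasattr` / `dir`-style introspection then reflects the opted-in level, rather than needing a separate private flag.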

@rgommers
Member

rgommers commented Jul 20, 2023

Is there a standardised way of constructing a dataframe using the interchange protocol? I'm only aware of pandas.api.interchange.from_dataframe, which is pandas-specific and which converts to numpy dtypes

The from_dataframe constructor function is recommended in https://data-apis.org/dataframe-protocol/latest/purpose_and_scope.html and https://data-apis.org/blog/dataframe_protocol_rfc/, but it wasn't actually required. IIRC @kkraus14 preferred to hold off and not put that in the protocol spec itself as a requirement (rationale: it's a lonely one-off function in a non-standard namespace), but put it in the full API namespace once we got to that point. Which I think is about now. The signature is already settled I think, it should match what is implemented in pandas & co. So how about we add it now?

EDIT: xref #42
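As a toy illustration of such a namespace-level `from_dataframe` constructor: it accepts any object implementing `__dataframe__` and builds a native frame from it. The interchange object here is simplified to a plain dict of columns; the real protocol is considerably richer (chunks, buffers, dtypes):

```python
class OtherLibFrame:
    """A 'foreign' dataframe exposing the interchange entry point."""
    def __init__(self, data):
        self._data = data

    def __dataframe__(self, allow_copy=True):
        # Simplified stand-in for a real interchange object.
        return {name: list(vals) for name, vals in self._data.items()}


class NativeFrame:
    """The consuming library's own dataframe type."""
    def __init__(self, data):
        self.data = data


def from_dataframe(df) -> NativeFrame:
    """Namespace-level constructor, as proposed for the full standard API."""
    if not hasattr(df, "__dataframe__"):
        raise TypeError("expected an object implementing __dataframe__")
    return NativeFrame(df.__dataframe__())


foreign = OtherLibFrame({"a": [1, 2], "b": [3, 4]})
native = from_dataframe(foreign)  # native.data == {"a": [1, 2], "b": [3, 4]}
```

The pandas implementation (`pandas.api.interchange.from_dataframe`) follows this shape, and the proposal above is to require the same signature in each library's standard namespace.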

@MarcoGorelli
Contributor Author

You could imagine them wanting to do all of the histogramming / binning work using a level-1 library for the performance and then once the result set is significantly reduced they could move to the level-2 library for the increased flexibility.

Time for some pings, sorry: @mwaskom and @nicolaskruchten. Both plotly.express and seaborn make histograms, which may benefit from acceleration.

It's possible to rewrite most of plotly / seaborn using the DataFrame Standard, but to get all the way there, you'd still need to convert / interchange back to pandas (or polars, depending on some discussions) for operations like iterating over groups and applying UDFs.

The crucial implication of doing the above would be:

  • either you lose support for all functionality which uses the pandas Index (e.g. wide plots where the index is displayed on the x-axis)
  • or you have to duplicate all dataframe code paths to handle both the pandas API and the Standard API

Curious to hear your thoughts here if possible - thanks 🙏

@nicolaskruchten

So Plotly Express is a bit of an odd animal: it does very little aggregation in Python (even for histograms and boxplots, where it obviously should for large dataframes), so it doesn't really benefit from using higher-performance dataframe implementations than pandas, for which it is hardcoded today. On the other hand, it heavily leverages possibly-idiosyncratic pandas behaviours around indexes, columns, extracting ndarrays of dates without messing with timezones, etc., and it was very time-consuming getting all that to work, so I'm not sure what the appetite would be to port this to the standard API at all (given the lack of performance benefits), never mind maintaining both.

@MarcoGorelli
Contributor Author

Thanks Nicolas for explaining!

That makes sense, and is along the lines of what I was expecting. The interchange protocol should already be enough for plotly express; there shouldn't be much incremental gain from the Standard.

Feel free to unsubscribe from the issue 😄


As a group, it may be wise to drop plotting as a target, which would make these "levels" kind of unnecessary.

Gonna pivot my energy towards scikit-learn, which seems like a much more realistic target
