+++
date = "2021-08-24"
author = "Ralf Gommers"
title = "Towards dataframe interoperability"
tags = ["APIs", "standard", "consortium", "dataframes", "community"]
categories = ["Consortium", "Standardization"]
description = "An RFC for a dataframe interchange protocol"
draft = false
weight = 40
+++

The PyData ecosystem has a large number of dataframe libraries today, each
with its own strengths and weaknesses. Pandas is the most popular, but other
libraries offer significant capabilities beyond what it provides: impressive
performance gains for Vaex (CPU) and cuDF (GPU), distributed dataframes for
Modin and Dask, and Spark as an execution engine for Koalas. For downstream
library authors, it would be powerful to be able to work with all of these
libraries. Right now that is quite difficult, so in practice most library
authors choose to focus only on Pandas.

The first step towards improving this situation is a "data interchange
protocol", which allows converting one type of dataframe into another,
inspecting a dataframe for basic properties ("how many columns does it
have?", "what are the column names?", "what is the dtype of a given
column?"), and converting only subsets of it.

Today we are happy to release a Request for Comments (RFC), containing both a
design document with the purpose, scope and requirements for such a dataframe
interchange protocol, and a prototype design:
[documentation](https://data-apis.org/dataframe-protocol/latest/index.html),
[repository](https://github.com/data-apis/dataframe-api).

35+
We note that an interchange protocol is not a completely new idea: for arrays
36+
we have had such protocols for a long time, e.g., `__array_interface__`, the
37+
buffer protocol (PEP 3118), `__cuda_array_interface__` and DLPack. The
38+
conversation about a dataframe interchange protocol was started by Gael
39+
Varoquaux last year in [this Discourse
40+
thread](https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267).
41+
In response Wes McKinney sketched up an initial prototype
42+
[here](https://github.com/wesm/dataframe-protocol/pull/1). There were a lot
43+
of good ideas in that initial conversation and prototype, however it was
44+
clear that it was a complex enough topic that a more thorough approach
45+
including collecting requirements and use cases from a large set of
46+
stakeholders was needed. The RFC we're announcing in this blog post is the
47+
result of taking that approach, and hopefully will be the starting point for
48+
implementations in all Python dataframe libraries.
49+
50+
_We want to emphasize that this is not a full dataframe API; the only
attribute added to the dataframe class/object of a library will be
`__dataframe__`. It is aimed at library authors, not at end users._

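To make this concrete, here is a sketch of the kind of inspection a
downstream library could do through nothing but the `__dataframe__`
attribute. The method and attribute names (`num_columns`, `column_names`,
`get_column_by_name`, `dtype`) follow the current prototype and may still
change as the RFC evolves:

```python
def inspect_dataframe(df):
    """Print basic properties of any dataframe implementing `__dataframe__`.

    A sketch only; the interchange-object methods used here follow the
    current prototype and are not final.
    """
    dfx = df.__dataframe__()  # library-agnostic interchange object
    print("number of columns:", dfx.num_columns())
    print("column names:", list(dfx.column_names()))
    for name in dfx.column_names():
        column = dfx.get_column_by_name(name)
        print(f"dtype of column {name!r}:", column.dtype)
```
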

## What is a "dataframe" anyway?

Defining what a dataframe _is_ turns out to be a surprisingly difficult
exercise. For example, can column names be integers, or only strings, and
must they be unique? Are row labels required, optional, or not a thing?
Should there be any restriction on how data is stored inside a dataframe?
Does it have other properties, like row-column symmetry, or support for
certain operations?

For the purposes of data interchange, we need to describe a dataframe both
conceptually and in terms of its data representation in memory, so that
another library can interpret that data. Furthermore, we want to impose as
few extra constraints as possible. Here is our working definition: _A
dataframe is an ordered collection of columns, which are conceptually 1-D
arrays with a dtype and missing data support. A column has a name, which is a
unique string. A dataframe or a column may be "chunked", meaning its data is
not contiguous in memory._

![Conceptual model of a dataframe, with columns (possibly containing missing data), and chunks](/images/dataframe_conceptual_model.png)

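As a purely illustrative picture of that working definition (not the protocol
API itself), the structure can be summarized with hypothetical `Column` and
`DataFrame` containers:

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str       # unique string identifying the column
    dtype: str      # e.g. "int64", "float32", "string"
    chunks: list = field(default_factory=list)    # contiguous 1-D blocks of data
    validity: list = field(default_factory=list)  # missing-data info, one entry per chunk

@dataclass
class DataFrame:
    columns: list = field(default_factory=list)   # ordered collection of Column objects
```
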
For more on the conceptual model, and on the requirements that a dataframe
protocol must fulfill, see [this design
document](https://data-apis.org/dataframe-protocol/latest/design_requirements.html).


## Key design choices

Given the goals and requirements we had for the protocol, there were still a
number of design choices to make. The single most important one is: does the
protocol offer a description of how data is laid out in memory, or does it
offer a way (or multiple ways) of exporting data in a given format, e.g. a
column as an Apache Arrow array or a NumPy array?

The choice we made in [the current
prototype](https://github.com/data-apis/dataframe-api/tree/main/protocol) is
to not assume a particular implementation, but instead describe memory down
to the level of buffers (contiguous, 1-D blocks of memory). At that buffer
level, we can make the connection between this dataframe protocol and the
[array API standard via `__dlpack__`](https://data-apis.org/array-api/latest/design_topics/data_interchange.html).

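To illustrate what "describing memory down to the buffer level" means in
practice, here is a rough sketch of how a consumer might walk from a column
to its underlying data buffer and hand that buffer to an array library. The
names used (`get_column_by_name`, `get_buffers`, the `"data"` key) follow the
current prototype and should be read as a sketch, not the final API:

```python
import numpy as np

def column_to_numpy(df, name):
    """Rough sketch: view one column's data buffer as a NumPy array.

    Assumes CPU data, a single chunk, and no missing values; a real
    implementation must also handle the validity and offsets buffers,
    multiple chunks, and non-CPU devices.
    """
    column = df.__dataframe__().get_column_by_name(name)
    buffers = column.get_buffers()          # dict with "data", "validity", "offsets"
    data_buffer, data_dtype = buffers["data"]  # (buffer, dtype) pair
    # The buffer exposes `__dlpack__`, the same zero-copy interchange
    # mechanism the array API standard uses.
    return np.from_dlpack(data_buffer)
```
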

### Similarity (and synergy?) with the Arrow C Data Interface

When looking at the requirements and native in-memory formats of all
prominent dataframe libraries, we found that the Arrow C Data Interface is
pretty close to meeting all the requirements. So a natural question is: can
we use that interface, and standardize a Python API on top of it?

There are a couple of things in the current Arrow C Data Interface that do
not quite match everyone's needs. Most importantly, the Arrow C Data
Interface does not have device support (e.g., GPUs). Other issues (or wishes)
are:

- The "deleter", which releases memory when it is no longer needed, lives at
  the column level in Arrow. Multiple people expressed the desire for more
  granular control; it seems more natural and performant to have the deleter
  at the buffer level.
- Allowing a column to have its data split over different devices, e.g. part
  of the data lives on the CPU and part on the GPU (a necessity if the data
  doesn't fit in GPU memory).
- Arrow stores masks for null/missing values as bit masks. NumPy doesn't have
  bit masks, and boolean masks are normally one byte per value (see the short
  sketch after this list). This is a smaller issue though, because it can be
  solved via a convention such as using a regular `int8` column with a
  certain name.

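To illustrate the difference between the two mask representations, here is a
small NumPy sketch (not part of the protocol itself) showing how an
Arrow-style validity bitmap relates to a one-byte-per-value boolean mask:

```python
import numpy as np

# Boolean mask as NumPy normally stores it: one byte per value.
valid = np.array([True, True, False, True, False], dtype=np.bool_)
print(valid.nbytes)   # 5 bytes for 5 values

# Arrow-style validity bitmap: one *bit* per value, packed into bytes.
bitmap = np.packbits(valid, bitorder="little")
print(bitmap.nbytes)  # 1 byte for 5 values

# Unpacking recovers the one-byte-per-value mask.
recovered = np.unpackbits(bitmap, count=len(valid), bitorder="little").astype(bool)
assert (recovered == valid).all()
```
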
Compared to the similarities between the two protocols, the differences are
relatively minor. A lot of work has already gone into the Arrow C Data
Interface, so we are interested in exploring whether we can contribute the
identified improvements back to Apache Arrow. That would potentially let us
support, for example, an `__arrow_column__` attribute at the column level in
Python, which would save dataframe libraries that already use Apache Arrow a
significant amount of implementation work.


### A standard dataframe creation function

In analogy to the array API standard, we are also proposing a single new
function, `from_dataframe`, for dataframe libraries to add to their top-level
namespace. This function knows how to construct a library-native dataframe
instance from any other dataframe object. Here is an example for Modin:

```python
import modin.pandas as pd


def somefunc(df, *args, **kwargs):
    """
    Do something interesting with dataframe `df`.

    Parameters
    ----------
    df : dataframe instance
        Can be a Modin dataframe, or any other kind of dataframe
        supporting the `__dataframe__` protocol.
    """
    df_modin = pd.from_dataframe(df)
    # From now on, use the Modin dataframe internally


def somefunc2(df, col1, col2):
    """
    Do something interesting with two columns from dataframe `df`.

    Parameters
    ----------
    df : dataframe instance
        Can be a Modin dataframe, or any other kind of dataframe
        supporting the `__dataframe__` protocol.
    col1 : str
        Name of column 1
    col2 : str
        Name of column 2
    """
    # This will extract just the two columns we need from `df`, and put them
    # in a Modin dataframe. This is much more efficient than converting the
    # (potentially very large) complete dataframe.
    df_modin = pd.from_dataframe(df, cols=[col1, col2])
```

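For downstream code, calling such a function is then independent of which
library produced the dataframe. A minimal usage sketch, assuming the
producing library (pandas here, purely as an example) implements
`__dataframe__` and using the `somefunc2` defined above:

```python
import pandas

df = pandas.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0], "c": ["x", "y", "z"]})

# `somefunc2` does not need to know that `df` is a pandas dataframe; it only
# relies on `df` exposing the `__dataframe__` protocol.
somefunc2(df, "a", "b")
```
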

## Next steps

This protocol is not completely done. We are releasing it now in order to get
feedback from a wider range of stakeholders. We are interested in hearing
about everything from potential use cases we missed or should describe
better, to whether the API feels natural, to low-level
performance/implementation concerns or ideas for improvement.

Today we are releasing one prototype implementation, for Pandas. Most of that
prototype can be reused for implementations in other libraries. What we'd
really like to see next is: can this be used in downstream libraries like
scikit-learn or Seaborn? Right now those accept Pandas dataframes; letting
them work with other types of dataframes is potentially quite valuable. We
would like to see that happen before finalizing the API and semantics of this
protocol.


## What about a full dataframe API?

At the end of last year we released a full
[array API standard](https://data-apis.github.io/array-api/latest/).
So what about a full dataframe API?

Our initial intent was to apply the methodology we used for constructing the
array API standard, and the lessons we learned doing so, to dataframes. We
found that to be quite challenging, however, for two reasons:

1. It turns out that dataframe library authors and end users have quite
   different API design needs, much more so than for arrays. Library authors
   need clear semantics, no surprises or performance cliffs, and explicit
   APIs. End users seem to want more "magic", where API calls can be chained
   and basically "do the right thing".
2. For array libraries we used _API synthesis_, and based design decisions
   partly on data about how often current APIs are used. This worked because
   maintainers and end users are largely happy with the state of APIs for
   n-dimensional arrays. Those have an almost 25-year long history, so that's
   not surprising. Dataframes are much younger - Pandas was created in 2009
   and reached version 1.0 only last year - and much more is still in flux
   there. Hence freezing the current state of dataframe APIs via
   standardization did not seem like a good idea.

215+
focus on library authors as an audience, and based on the introduction of the
216+
interchange protocol see if we can identify next pieces that are useful. And
217+
then organically grow the size of the API, while being careful to not
218+
standardize APIs that dataframe library maintainers are not completely
219+
satisfied with.
220+