+++
date = "2021-08-24"
author = "Ralf Gommers"
title = "Towards dataframe interoperability"
tags = ["APIs", "standard", "consortium", "dataframes", "community"]
categories = ["Consortium", "Standardization"]
description = "An RFC for a dataframe interchange protocol"
draft = false
weight = 40
+++


In the PyData ecosystem we have a large number of dataframe libraries today,
each with their own strengths and weaknesses. Pandas is the most popular one,
but other libraries offer significant capabilities beyond what it provides -
impressive performance gains for Vaex (CPU) and cuDF (GPU), distributed
dataframes for Modin and Dask, and Spark as an execution engine for Koalas.
For downstream library authors it would be powerful to be able to work with
all of these libraries. Right now that is quite difficult, so in practice
most library authors choose to focus only on Pandas.

The first step to improving this situation is a "data interchange protocol",
which allows converting one type of dataframe into another, inspecting a
dataframe for basic properties ("how many columns does it have?", "what are
the column names?", "what is the dtype of a given column?"), and converting
only a subset of it.

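To make this concrete, here is a minimal sketch of what that inspection could
look like from the consumer side. The method names follow the current
prototype design, `df` stands for any dataframe object implementing the
protocol, and the `"age"` column is a hypothetical example.

```python
# `df` can come from any library implementing the protocol; calling
# `__dataframe__()` returns a library-agnostic interchange object.
dfx = df.__dataframe__()

print(dfx.num_columns())   # "how many columns does it have?"
print(dfx.column_names())  # "what are the column names?"

# Inspect a single column without converting the rest of the data.
col = dfx.get_column_by_name("age")
print(col.dtype)           # "what is the dtype of a given column?"
```
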
We are happy to release a Request for Comments (RFC) today, containing both a
design document with the purpose, scope and requirements for such a dataframe
interchange protocol, and a prototype design:
[documentation](https://data-apis.org/dataframe-protocol/latest/index.html),
[repository](https://github.com/data-apis/dataframe-api).

We note that an interchange protocol is not a completely new idea: for arrays
we have had such protocols for a long time, e.g., `__array_interface__`, the
buffer protocol (PEP 3118), `__cuda_array_interface__`, and DLPack. The
conversation about a dataframe interchange protocol was started by Gael
Varoquaux last year in [this Discourse
thread](https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267).
In response, Wes McKinney sketched out an initial prototype
[here](https://github.com/wesm/dataframe-protocol/pull/1). There were a lot
of good ideas in that initial conversation and prototype. However, it became
clear that the topic was complex enough to require a more thorough approach,
one that collects requirements and use cases from a large set of
stakeholders. The RFC we're announcing in this blog post is the result of
taking that approach, and will hopefully be the starting point for
implementations in all Python dataframe libraries.

_We want to emphasize that this is not a full dataframe API; the only
attribute added to the dataframe class/object of a library will be
`__dataframe__`. It is aimed at library authors, not at end users._


## What is a "dataframe" anyway?

Defining what a dataframe _is_ turns out to be a surprisingly difficult
exercise. For example, can column names be integers, or only strings, and
must they be unique? Are row labels required, optional, or not a thing?
Should there be any restriction on how data is stored inside a dataframe?
Does it have other properties, like row-column symmetry, or support for
certain operations?

For the purposes of data interchange, we need to describe a dataframe both
conceptually and in terms of its data representation in memory, so that
another library can interpret that data. Furthermore, we want to impose as
few extra constraints as possible. Here is our working definition: _A
dataframe is an ordered collection of columns, which are conceptually 1-D
arrays with a dtype and missing data support. A column has a name, which is a
unique string. A dataframe or a column may be "chunked", meaning its data is
not contiguous in memory._

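The "chunked" part of this definition matters for consumers: a column's data
cannot be assumed to be one contiguous block. As a rough sketch of how the
current prototype exposes this (method names follow the prototype; `df` is
any dataframe object implementing the protocol):

```python
dfx = df.__dataframe__()

# A chunked dataframe hands out its data one contiguous chunk at a time;
# each chunk is itself a smaller dataframe-like interchange object.
for chunk in dfx.get_chunks():
    for name in chunk.column_names():
        col = chunk.get_column_by_name(name)
        # ... process this contiguous piece of the column
```
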
For more on the conceptual model, and on the requirements that a dataframe
protocol must fulfill, see [this design
document](https://data-apis.org/dataframe-protocol/latest/design_requirements.html).


## Key design choices

Given the goals and requirements we had for the protocol, there were still a
number of design choices to make. The single most important one is: does the
protocol offer a description of how data is laid out in memory, or does it
offer a way (or multiple ways) of exporting data in a given format, e.g., a
column as an Apache Arrow array or a NumPy array?

The choice we made in [the current
prototype](https://github.com/data-apis/dataframe-api/tree/main/protocol) is
to not assume a particular implementation, but to describe memory down to the
level of buffers (i.e., contiguous, 1-D blocks of memory). At that buffer
level, we can make the connection between this dataframe protocol and the
[array API standard via `__dlpack__`](https://data-apis.org/array-api/latest/design_topics/data_interchange.html).

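To make that concrete, here is a sketch of how a consumer could walk from a
dataframe down to the raw memory of a single column. This is illustrative
only: the `get_buffers` structure follows the current prototype,
`np.from_dlpack` requires NumPy >= 1.22, and whether a given producer
supports `__dlpack__` on its buffers may vary.

```python
import numpy as np

# `df` is any dataframe object implementing `__dataframe__`;
# "age" is a hypothetical column name.
col = df.__dataframe__().get_column_by_name("age")

# In the prototype, a column describes its memory as a dict of buffers:
# "data" (the values), "validity" (the missing-value mask), and "offsets"
# (only present for variable-size types such as strings).
buffers = col.get_buffers()
data_buffer, data_dtype = buffers["data"]

# Each buffer is a contiguous, 1-D block of memory; if the producer
# supports `__dlpack__`, it can be wrapped as an array without copying.
values = np.from_dlpack(data_buffer)
```
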

### Similarity (and synergy?) with the Arrow C Data Interface

When looking at the requirements and native in-memory formats of all
prominent dataframe libraries, we found that the Arrow C Data Interface is
pretty close to meeting all the requirements. So a natural question is: can
we use that interface, and standardize a Python API on top of it?

There are a couple of things in the current Arrow C Data Interface that
didn't quite match everyone's needs. Most importantly, the Arrow C Data
Interface does not have device support (e.g., GPUs). Other issues (or wishes)
are:

- The "deleter", which releases memory when it's no longer needed, lives at
  the column level in Arrow. Multiple people expressed the desire for more
  granular control. It seems more natural and performant to have the deleter
  at the buffer level.
- Allowing a column to have its data split over different devices, e.g., part
  of the data lives on CPU and part on GPU (a necessity if the data doesn't
  fit in GPU memory).
- Arrow supports masks for null/missing values, stored as bit masks. NumPy
  doesn't have bit masks, and boolean masks are normally one byte per value.
  This is a smaller issue though, because it can be solved via a convention,
  like using (e.g.) a regular `int8` column with a certain name.

Compared to the similarities between the two protocols, the differences are
relatively minor. And a lot of work has already gone into the Arrow C Data
Interface, so we are interested in exploring whether we can contribute the
identified improvements back to Apache Arrow. That would potentially let us
support, for example, an `__arrow_column__` attribute at the column level in
Python, which would save dataframe libraries that already use Apache Arrow a
significant amount of implementation work.


### A standard dataframe creation function

In analogy with the array API standard, we are proposing a single new
function, `from_dataframe`, for dataframe libraries to add to their top-level
namespace. This function knows how to construct a library-native dataframe
instance from any other dataframe object. Here is an example for Modin:

```python
import modin.pandas as pd


def somefunc(df, *args, **kwargs):
    """
    Do something interesting with dataframe `df`.

    Parameters
    ----------
    df : dataframe instance
        Can be a Modin dataframe, or any other kind of dataframe
        supporting the `__dataframe__` protocol.
    """
    df_modin = pd.from_dataframe(df)
    # From now on, use the Modin dataframe internally


def somefunc2(df, col1, col2):
    """
    Do something interesting with two columns from dataframe `df`.

    Parameters
    ----------
    df : dataframe instance
        Can be a Modin dataframe, or any other kind of dataframe
        supporting the `__dataframe__` protocol.
    col1 : str
        Name of column 1.
    col2 : str
        Name of column 2.
    """
    # This extracts just the two columns we need from `df`, and puts them in
    # a Modin dataframe. This is much more efficient than converting the
    # (potentially very large) complete dataframe.
    df_modin = pd.from_dataframe(df, cols=[col1, col2])
```


## Next steps

This protocol is not completely done. We are releasing it now in order to get
feedback from a wider range of stakeholders. We are interested in hearing
about everything from potential use cases we missed or should describe
better, to whether the API feels natural, to low-level
performance/implementation concerns or ideas for improvement.

Today we are releasing one prototype implementation, for Pandas. Most of that
prototype can be reused for implementations in other libraries. What we'd
really like to see next is: can this be used in downstream libraries like
scikit-learn or Seaborn? Right now those accept Pandas dataframes; letting
them work with other types of dataframes is potentially quite valuable, and
it is something we want to see demonstrated before finalizing the API and
semantics of this protocol.

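For a downstream library, the required change can be small. The sketch below
shows one possible pattern, assuming a `from_dataframe` function like the one
in the Pandas prototype; the import path is hypothetical, since the prototype
is not part of Pandas itself yet.

```python
import pandas as pd

# Hypothetical import path: in the RFC repo, `from_dataframe` lives in
# the Pandas prototype implementation, not in Pandas itself.
from pandas_prototype import from_dataframe


def train(df):
    """Accept any dataframe object implementing `__dataframe__`."""
    if not isinstance(df, pd.DataFrame):
        if not hasattr(df, "__dataframe__"):
            raise TypeError("expected a dataframe supporting __dataframe__")
        # Convert via the interchange protocol (e.g., from Vaex, Modin,
        # or cuDF) to the native dataframe type of this library.
        df = from_dataframe(df)
    # ... continue with a Pandas dataframe internally
```
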

## What about a full dataframe API?

At the end of last year we released a full
[array API standard](https://data-apis.github.io/array-api/latest/).
So what about a full dataframe API?

Our initial intent was to apply the methodology we used for constructing the
array API standard, and the lessons we learned doing so, to dataframes. We
found that to be quite challenging, however, for two reasons:

1. It turns out that dataframe library authors and end users have quite
   different API design needs - much more so than for arrays. Library authors
   need clear semantics, no surprises or performance cliffs, and explicit
   APIs. End users seem to want more "magic", where API calls can be chained
   and basically "do the right thing".
2. For array libraries we used _API synthesis_, and based design decisions
   partly on data about how often current APIs are used. This worked because
   maintainers and end users are largely happy with the state of APIs for
   n-dimensional arrays. Those have an almost 25-year-long history, so that's
   not surprising. Dataframes are much younger - Pandas was created in 2009
   and reached version 1.0 only last year - and much more is still in flux.
   Hence freezing the current state of dataframe APIs via standardization did
   not seem like a good idea.

So, what's next for a larger dataframe API? Our strategy will be to focus on
library authors as an audience and, based on the introduction of the
interchange protocol, see if we can identify the next pieces that would be
useful. We can then organically grow the size of the API, while being careful
not to standardize APIs that dataframe library maintainers are not completely
satisfied with.