**Pandas string dtype needs from NumPy - prototyping & plan of attack #47884**
A potentially tricky aspect might be missing values. Currently, all the variants of string data types in pandas support them. There will of course be workarounds possible if the numpy array itself doesn't support missing values, like pairing it with a boolean mask (just as we do for the numeric nullable arrays; right now StringDtype / StringArray don't use a mask because the object dtype array can already hold the missing value). In that sense it would be consistent with how missing values are handled for numeric data types (it was only because of using object dtype for strings that things were more flexible up to now).
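The mask-based workaround described above can be sketched in plain Python. This mirrors how pandas' numeric nullable arrays pair a values buffer with a boolean mask; all names here are illustrative, not actual pandas or NumPy API:

```python
# Minimal sketch of a masked string array: a values buffer paired
# with a boolean mask, where True marks a missing entry.
# Illustrative only -- not pandas internals.
class MaskedStringArray:
    def __init__(self, values, mask):
        assert len(values) == len(mask)
        self.values = values  # list of str
        self.mask = mask      # list of bool; True means "missing"

    def __getitem__(self, i):
        # Return a sentinel (None) for masked-out elements
        return None if self.mask[i] else self.values[i]

    def isna(self):
        return list(self.mask)

arr = MaskedStringArray(["a", "", "c"], [False, True, False])
print(arr[0], arr[1], arr.isna())
```

The point of the sketch is that the values buffer never needs to be able to represent "missing" itself; the mask carries that information, exactly as with the numeric nullable arrays.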
We can encode NA into the dtype, that is no problem. But I agree, there are a couple of open questions around NA. NumPy could just support the NA part. Since we probably want to do …
Hey all, here's a little bit of background information on this effort as well.

**Background**

Originally, … One of the main motivations of the dtypes work has been improved string dtypes. … The new dtypes will be built in a repo external to NumPy itself. …

**Goals & Discussion**
As it stands, … On a related note, …

**Initial feedback from folks during the 2022-07-28 Data API call**

The main implementation choice will use two arrays, one to hold the characters and one to hold the offsets. …

**Alternative implementations**

The …
See these discussions for more information. This format is already popular, and is used in a number of databases: …
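For concreteness, the two-array (characters + offsets) layout discussed on the call can be sketched like this. It is a simplification of the Arrow-style variable-size binary layout, not the actual prototype code:

```python
# Sketch of an Arrow-style variable-size string layout: one
# contiguous UTF-8 character buffer plus an offsets array with
# n+1 entries; element i spans data[offsets[i]:offsets[i+1]].
# Illustrative only -- not the actual prototype implementation.
def build_buffers(strings):
    offsets = [0]
    chunks = []
    for s in strings:
        b = s.encode("utf-8")
        chunks.append(b)
        offsets.append(offsets[-1] + len(b))
    return b"".join(chunks), offsets

def get(data, offsets, i):
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

data, offsets = build_buffers(["pandas", "", "numpy"])
print(offsets)                 # [0, 6, 6, 11]
print(get(data, offsets, 2))   # numpy
```

Note that empty strings cost nothing extra in the character buffer (consecutive equal offsets), which is one reason this layout is popular in columnar databases.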
Presumably, this is per-array storage. NumPy has no clear concept of that, although maybe it could be added (the problem is mainly about views, I suspect). The alternative may be per-"dtype" storage.
In principle we could have …
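The view problem mentioned above can be seen with plain NumPy today: a slice shares both the underlying buffer and the dtype instance with its base array, so any side storage hung off the dtype would be shared by every view rather than tracking each view's elements:

```python
import numpy as np

# Illustration of why per-dtype side storage is awkward: a view
# shares its base array's buffer *and* its dtype instance, so an
# offsets table stored on the dtype could not tell views apart.
a = np.arange(10, dtype=np.int64)
v = a[2:7]                 # basic slicing creates a view, not a copy

print(v.base is a)         # True: the view shares the buffer
print(v.dtype is a.dtype)  # True: the view shares the dtype instance
```

This is why per-array storage (or per-element pointers, as discussed below in the thread) keeps coming up as the alternative.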
I have some updates on this effort. tl;dr: we're thinking of initially implementing the string dtype as a strongly-typed object array, storing pointers to string buffers instead of storing the string data as variable-width elements in the ndarray storage buffer. I'd like feedback on this idea from stakeholders before we go further.

**Repository for user dtypes in the Numpy github org**

Since @peytondmurray's last update, @seberg created a new repository for the dtype code to live in: https://github.com/numpy/numpy-user-dtypes It's likely that eventually some of these dtypes will be upstreamed to Numpy, but keeping them separate for now allows easier iteration and experimentation.

**Initial work on string dtypes**

For the past month or so I've been working on … I'm now feeling more confident about moving on to the real variable-width string dtype Pandas needs, but I think we need to implement this using a different approach than what Peyton described in August.

**Problems with storing variable-width data in Numpy arrays**

Our initial plan involved storing the variable-length strings in the numpy array itself, with an auxiliary array holding an index into the array buffer for the locations of the array elements. This additional array would be stored on the dtype, or would make use of an as-yet undeveloped facility in Numpy for per-array storage. Both approaches would require modifications to Numpy: we would either need to add a facility for per-array storage, or, if we store the offsets on the dtype, we would likely need to modify how dtypes are handled when array copies or views are created to ensure a new offset array is created as appropriate.

Even if we solve those issues, I realized recently that any kind of variable-length dtype is going to run into other, more fundamental issues with Numpy's assumption that array data are fixed-width. One example in the new dtype API is the current signature of …
That is, it takes only a reference to the array's dtype instance and a pointer to the array buffer. For variable-width strings, we would have no way of knowing the length of the string …

**New plan: store pointers to strings**

We are now leaning in the direction of storing pointers to string arrays in the array storage. This avoids the issues with variable-width data storage in ndarray, since internally we'd be storing one pointer-width integer for each string. I believe it should also be possible to implement this approach using the experimental dtype API in Numpy as it currently exists.

The main downside is that ufuncs, casts, and other operations that loop over all the data will need to go through a pointer for each array element and, without some care around the storage strategy, will not use CPU caches efficiently. That said, performance will likely be improved compared with the object dtype: the dtype will know that the pointers are to string arrays, so there will be no need to go through the Python C API and no need to acquire the GIL in ufuncs or casting loops to unwrap PyObject instances and access the string data.

My take is that we won't know whether the performance of this approach is acceptable until we try it, and we can always go back and apply optimizations afterwards if needed. A simpler implementation will also give us a prototype we can use to explore integration with pandas.
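The pointer-based layout can be mimicked in pure Python: each array slot holds a fixed-width handle to a separately allocated string buffer, so element access stays fixed-stride even though the strings themselves vary in length. This is an illustrative sketch, not the dtype's actual C implementation (which stores raw pointers):

```python
# Sketch of pointer-style storage: the "array" holds fixed-width
# handles (here, indices into a heap of separately allocated
# buffers) instead of the variable-width string bytes themselves.
# Illustrative only; the real dtype stores raw pointers in C.
class PointerStringArray:
    def __init__(self, strings):
        # One separate allocation per string, like malloc'd buffers
        self.heap = [s.encode("utf-8") for s in strings]
        # Fixed-width "pointers" stored in the array slots
        self.slots = list(range(len(self.heap)))

    def __getitem__(self, i):
        # One indirection per access: follow the handle to its buffer
        return self.heap[self.slots[i]].decode("utf-8")

arr = PointerStringArray(["short", "a much longer string"])
print(arr[0], "|", arr[1])
```

The indirection per element is the cache-efficiency cost mentioned above, but it is what makes each array slot fixed-width and thus compatible with NumPy's assumptions.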
Since my last update @peytondmurray and I have made a bunch of progress on the string dtype. It now mostly works. I'm sure there's still lots of stuff that needs to be added, but basic operations work fine. To get a feeling for where we stand in terms of explicitly supported things, take a look at the unit tests. @peytondmurray has been focusing on expanding functionality and adding support for ufuncs where that makes sense.

I have a development branch of pandas in my github fork that supports creating pandas data types from … I also added support for missing data and added a hook so that the missing data value used by the instances of the dtype used by …

The development branch of pandas I'm working on can't be upstreamed until Numpy 1.25 is released at the earliest. At that point it will become possible to run tests on pandas' CI that use … I'm trying to regularly rebase my changes on the pandas …

Speaking of benchmarks, right now performance of most operations is roughly comparable between object string arrays and the …
Your call, but it isn't obvious to me this is the way to go (I'm generally ornery about pd.NA). My preference would be for pandas to treat this like any other numpy array/dtype. Ideally we wouldn't even need to stuff it into an ExtensionArray. Maybe that's what you're referring to in your last paragraph about using StringDtype directly?
Yes, I am using an ExtensionArray. I want both the ExtensionArray and using the dtype directly to work correctly. I initially wanted to avoid the ExtensionArray, but doing it this way lets me use the ExtensionArray tests "for free" to find places where we need to fix things on the numpy side. It also allows users to switch from …
I proposed a NEP to upstream the dtype prototype to NumPy, targeting NumPy 2.0. If the NEP is accepted, I will start working on moving the code into NumPy itself. Once the DType implementation is merged in numpy's development branch, I will rework my Pandas fork to use the built-in StringDType instead of the implementation outside of NumPy. This is the version I will propose to get upstreamed to Pandas, hopefully in only a few pull requests. This might happen as early as this fall or winter if the NEP process and upstreaming go smoothly, but could also slip to 2024. I hope it won't be a controversial feature given the improvements over object arrays, but I also understand there's some desire to move away from numpy and towards pyarrow, so I'm not assuming support will be merged in Pandas. If anyone is interested in playing with or giving feedback on the NumPy …
Hi all, it's now looking like NEP 55 will be accepted and stringdtype will ship in NumPy 2.0. I already have patches to support stringdtype in Pandas, so hopefully we'll be able to simultaneously have versions of pandas and numpy that support UTF-8 strings very soon and have no need for string object arrays unless a user explicitly passes one in. If for some reason that timing slips and we don't ship it, I still expect stringdtype to be available in numpy dev within the next few weeks. As soon as stringdtype is available in numpy dev, my plan is to update my pandas patches to account for stringdtype being available in numpy and propose a pandas PR. It's not a trivial amount of code, but it's also not a huge amount (currently +484 -150 lines, spread across 20 or so files). A lot of that code will be simplified once I don't need to depend on an external stringdtype package outside numpy, which caused a number of circular imports in my prototype. I went over all of this a bit with @MarcoGorelli and @jreback in an internal call for the NASA ROSES grant late last year, and I'm happy to chat about this on a video call and present my slides about it if anyone is interested before I propose the PR to get early feedback. |
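For anyone who wants to try it, the shipped API looks roughly like this. The sketch is guarded so it is a no-op on NumPy versions before 2.0, where `np.dtypes.StringDType` does not exist; `na_object` is the hook mentioned earlier for configuring the missing-data sentinel:

```python
import numpy as np

# StringDType shipped with NumPy 2.0 (NEP 55); guard so this
# sketch does nothing on older NumPy versions.
if hasattr(getattr(np, "dtypes", None), "StringDType"):
    # na_object configures the sentinel used for missing values;
    # with a NaN sentinel, np.isnan works on the string array.
    dt = np.dtypes.StringDType(na_object=np.nan)
    arr = np.array(["pandas", "numpy", np.nan], dtype=dt)
    print(arr.dtype)
    print(np.isnan(arr))
```

Indexing such an array returns ordinary Python `str` objects, with no object-dtype boxing in the array storage itself.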
The purpose of this issue is to discuss a plan of attack for improving string dtypes in NumPy to better suit Pandas.
**Context**

String data in pandas can currently be stored via object dtype (no longer recommended) and via StringDtype (which can have multiple implementations it looks like): https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-types. There's a ton of relevant threads and issues for both NumPy and Pandas; I'm not going to try to link them all here.
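For reference, the two existing options look roughly like this in today's pandas:

```python
import pandas as pd

# The two existing ways pandas stores strings: object dtype
# (the historical default) and the dedicated StringDtype.
s_obj = pd.Series(["a", "b", None])                  # object dtype
s_str = pd.Series(["a", "b", None], dtype="string")  # StringDtype, uses pd.NA
print(s_obj.dtype, s_str.dtype)
print(s_str.isna().tolist())
```

The object-dtype series boxes each element as a Python object, which is exactly the overhead the proposed NumPy string dtype aims to remove.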
**Proposed way of approaching this**
There are folks from Pandas (I think at least @jreback, @jbrockmendel and @jorisvandenbossche), NumPy (@seberg, @mattip), and the NASA grant (@peytondmurray, who will do some of the heavy lifting here on the prototype; Cc @dharhas as PI) with an interest in this. It's probably also relevant for other dataframe libraries; what Arrow provides is relevant; the dataframe interchange protocol probably too. In short: many potentially interested people and projects. So I'd suggest we add comments, new ideas, and concerns on this issue, and then also have a call next week with whoever is interested, to have a bit higher-bandwidth conversation on how to get started.
**A few thoughts on what to do**

…