Question: Guaranteed zero-copy round-trip from numpy? #27211
Today, this is True:

```python
In [23]: X = np.random.randint(0, 10, size=(10, 2))

In [24]: pd.DataFrame(X)._data.blocks[0].values.base is X
Out[24]: True
```

but for better or worse it's possible that a future refactor will change that. We have a long-standing desire to simplify pandas internals, part of which may require storing a DataFrame as a collection of 1D arrays. Those 1D arrays would be a view on
I know it's true right now ;) The question might be how likely the change is and on what timeframe. But it sounds like it's a bad idea for us to bank on this staying the same, right? xarray would be another option, but it seems a bit weird to produce xarrays if the user inputs pandas dataframes. We could also output pandas dataframes until pandas changes its internal structure, and then switch to xarray? But that all introduces a bunch of uncertainties.
This makes me a bit sad, because it means my dream of a "pandas in, pandas out" scikit-learn seems unrealistic unless we accept numerous avoidable data copies.
Can't add much to what Tom already said, but: if the use case is to add metadata to numpy arrays going in, it might indeed not be very future-proof to use pandas DataFrames for that if you want to avoid copying the data (twice, in fact: again when converting back to numpy).
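The copy on the way back out can already be observed today: as soon as the frame holds more than one dtype (the situation a column-wise store generalises), converting back to a 2D array has to allocate fresh memory. A small sketch:

```python
import numpy as np
import pandas as pd

X = np.arange(10.0).reshape(5, 2)
df = pd.DataFrame(X, columns=["a", "b"])

# Once dtypes differ, the columns live in separate internal blocks,
# so materialising a single 2D array again must allocate new memory.
df["b"] = df["b"].astype("int64")
back = df.to_numpy()

print(np.shares_memory(X, back))  # False: the round-trip copied
print(back.dtype)                 # float64: common dtype of float64 and int64
```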
That's always difficult to answer in open source ;)
Hope y'all are applying to Chan Zuckerberg? Ok, but it sounds like this might not be a good solution for us. There could be a "feature names until you do something multivariate" mode, but that's also a bit weird? We should probably discuss this in our SLEP, and not here. But I think my original question is answered, in that doing wrapping and unwrapping with zero copy is not realistic long-term.
We once had a discussion about having two different data structures to meet such needs (like a DataFrame and a DataMatrix; there was one long ago) if we were to move towards a column-wise store. A DataMatrix would be limited to a single dtype and stored as a 2D array (and maybe also a fixed number of columns?). It's only an idea that was floated once, so it was never really worked out, and given the additional complexity it's potentially not a good idea for a project with limited resources. But it might be interesting to think about if there are specific needs.
Well, there's DataArray in xarray that we could use. Or we could add our own, because that'll be fun, right? ;)
This is for informing a scikit-learn design decision; I briefly talked with @jorisvandenbossche about this a while ago.
The question is whether we can rely on zero-copy wrapping and unwrapping of numpy arrays into pandas dataframes, i.e. is it future-proof to assume that wrapping an array `X` in a DataFrame doesn't result in a copy of the data, and that the array `X_again` obtained by unwrapping it shares the memory of `X`?

Context: we want to attach some metadata to our numpy arrays; in particular, I'm interested in column names. Pandas is an obvious candidate for doing that, but core sklearn works on numpy arrays.
So if we want to use pandas, we need to make sure that there's no overhead in wrapping and unwrapping.
And this is a design decision that's very hard to undo, so I want to make sure that it's reasonably future-proof.
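The round-trip under discussion can be sketched as follows (`wrap` and `unwrap` are hypothetical helper names, not sklearn or pandas API); whether the final `shares_memory` check holds depends on the pandas version and its copy-on-write behaviour:

```python
import numpy as np
import pandas as pd

def wrap(X, feature_names):
    # Hypothetical helper: attach column names by wrapping in a DataFrame.
    # Whether this is zero-copy is version-dependent (copy-on-write
    # in recent pandas may copy at construction).
    return pd.DataFrame(X, columns=feature_names)

def unwrap(df):
    # Hypothetical helper: recover a 2D array. Zero-copy is only even
    # possible when every column has the same dtype.
    return df.to_numpy()

X = np.random.rand(4, 3)
df = wrap(X, ["f0", "f1", "f2"])
X_again = unwrap(df)

print(X_again.shape, X_again.dtype)
print(np.shares_memory(X, X_again))  # version-dependent
```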
@jorisvandenbossche had mentioned that there were thoughts about making pandas a column store, which sounds like it would break the zero copy requirement.
Thanks!