-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG/API: concat with empty DataFrames or all-NA columns #43507
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems reasonable, for sure needs a whatsnew note. i think its actually needs a small subsection to highlite when this change takes effect as its quite subtle.
whatsnew added + green |
cool, can you check the asv's also on this. IIRC a long time ago, this was an expensive path (e.g. concatting with an empty frame), i am sure improved a lot since then, but want to make this change survives it. |
|
Sorry for the slow reply, but I am personally -1 on this change. It's a breaking change, that we don't have to do now. I seem to remember we discussed this before, and those special cases were intentionally kept in previous PRs that touched the concat code. I certainly agree that eventually we want the value-independent behaviour of this PR. But:
I don't think it is "fully strange". It comes from the fact that all-NaN float/object is the default dtype you get for "emtpy" data (eg when reindexing a DataFrame). IMO it's usability regression that you don't preserve dtypes in this case if you concat reindexed dataframes where a column is missing in one of the dataframes. Example:
That's on master, while before this PR that preserved the datetime64 dtype. |
|
There can be good reasons for doing a reindex. For example, to determine the exact output columns: in the past, concat did have a
Yes, but that's the unfortunate consequence of using float64 as the empty dtype (which we are actually deprecating, so once that is gone, I think we can also restrict the special case to object dtype)
You mean that I did not implement the special case for ArrayManager? First, I never said that I like this behaviour. I think I already said regularly that long term we should try to get rid of this value-dependent behaviour. I only said above that I disagree with doing this breaking change now. Personally I think it could be fine for ArrayManager to have different behaviour (but I know we have a different opinion here). But also, the AM implementation was not necessarily intended as final. Note the "# TODO(ArrayManager) decide on exact casting rules in concat" that was inside the code. Eg if we don't get a solution for the empty dtype, IMO we might need to special case at least object dtype also in the AM code path. |
…ns (pandas-dev#43507)" This reverts commit 084c543.
…ns (pandas-dev#43507)" This reverts commit 084c543.
This is in the grey area between a bugfix and an API change. Towards the bugfix side, it fixes some truly strange behavior:
The change in indexing code/tests is the least-invasive I could make it.
cc @jorisvandenbossche, this moves us towards having less values-dependent behavior, also makes the BM behavior more like the AM behavior in several affected cases.
Lots of cleanup becomes available after this, leaving for follow-up.