BUG/API: concat with empty DataFrames or all-NA columns #43507

Merged
6 commits merged into pandas-dev:master on Sep 15, 2021

Conversation

jbrockmendel
Member

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

This is in the grey area between a bugfix and an API change. Towards the bugfix side, it fixes some truly strange behavior:

  1. Concatenating ignores the dtypes of all-NaN columns (see test_append_dtypes)
import numpy as np
import pandas as pd
import pandas._testing as tm

df1 = pd.DataFrame({"bar": pd.Timestamp("20130101")}, index=range(1))
df2 = pd.DataFrame({"bar": np.nan}, index=range(1, 2))
df3 = df2.astype(object)

result = df1.append(df2)
expected = pd.DataFrame(
    {"bar": pd.Series([pd.Timestamp("20130101"), np.nan], dtype="M8[ns]")}
)
tm.assert_frame_equal(result, expected)

result2 = df1.append(df3)
tm.assert_frame_equal(result2, expected)
  2. Concatenating sometimes ignores the dtype of an empty frame (see test_join_append_timedeltas)
df = pd.DataFrame(columns=["d", "t"])  # <- object dtype
df2 = pd.DataFrame({"d": [pd.Timestamp(2013, 11, 5, 5, 56)], "t": [pd.Timedelta(seconds=22500)]})

result = df.append(df2, ignore_index=True)  # <- *not* object dtype
assert (result.dtypes == df2.dtypes).all()

The change in indexing code/tests is the least-invasive I could make it.

cc @jorisvandenbossche, this moves us towards having less values-dependent behavior, also makes the BM behavior more like the AM behavior in several affected cases.

Lots of cleanup becomes available after this, leaving for follow-up.

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 12, 2021
Contributor

@jreback jreback left a comment


seems reasonable, for sure needs a whatsnew note. I think it actually needs a small subsection to highlight when this change takes effect, as it's quite subtle.

@jbrockmendel
Member Author

whatsnew added + green

@jreback
Contributor

jreback commented Sep 14, 2021

cool, can you check the ASVs on this as well. IIRC a long time ago this was an expensive path (e.g. concatting with an empty frame); I am sure it has improved a lot since then, but I want to make sure this change survives it.

@jreback jreback added this to the 1.4 milestone Sep 14, 2021
@jbrockmendel
Member Author

asv continuous -E virtualenv master ref-concat-void-3 -b join_merge
[...]
       before           after         ratio
     [82a102b3]       [0b4cde05]
     <collect-apply>       <ref-concat-void-3>
-      9.36±0.1ms      7.71±0.08ms     0.82  join_merge.Join.time_join_dataframe_index_single_key_small(True)
-      37.4±0.6ms         29.5±1ms     0.79  join_merge.Merge.time_merge_2intkey(True)
-      10.9±0.4ms       8.55±0.1ms     0.79  join_merge.Join.time_join_dataframe_index_single_key_small(False)
-        419±20ms          328±6ms     0.78  join_merge.Merge.time_merge_dataframes_cross(True)
-        434±20ms          333±6ms     0.77  join_merge.Merge.time_merge_dataframes_cross(False)
-        152±10μs          117±9μs     0.77  join_merge.JoinNonUnique.time_join_non_unique_equal
-      5.85±0.2ms       4.31±0.3ms     0.74  join_merge.Merge.time_merge_dataframe_integer_2key(False)
-        18.3±1ms       13.4±0.5ms     0.73  join_merge.Merge.time_merge_2intkey(False)
-      3.90±0.3ms      2.83±0.06ms     0.73  join_merge.Join.time_join_dataframes_cross(True)
-      2.13±0.1ms      1.52±0.09ms     0.71  join_merge.Merge.time_merge_dataframe_integer_key(False)

@jreback jreback merged commit 084c543 into pandas-dev:master Sep 15, 2021
@jbrockmendel jbrockmendel deleted the ref-concat-void-3 branch September 15, 2021 01:41
@jorisvandenbossche
Member

Sorry for the slow reply, but I am personally -1 on this change. It's a breaking change that we don't have to do now. I seem to remember we discussed this before, and those special cases were intentionally kept in previous PRs that touched the concat code.

I certainly agree that eventually we want the value-independent behaviour of this PR. But:

it fixes some truly strange behavior:

I don't think it is "truly strange". It comes from the fact that all-NaN float/object is the default dtype you get for "empty" data (e.g. when reindexing a DataFrame). IMO it's a usability regression that you don't preserve dtypes in this case if you concat reindexed dataframes where a column is missing in one of the dataframes.
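For concreteness, here is a minimal sketch of the "empty data" case being described (the exact dtype of the filled column, typically float64, may vary by pandas version):

```python
import pandas as pd

df = pd.DataFrame({"a": [1]})

# Reindexing to add a missing column fills it with NaN; the new column gets
# the default "empty" dtype (typically float64), not any intentionally
# chosen dtype.
reindexed = df.reindex(columns=["a", "b"])
assert reindexed["b"].isna().all()
```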

Example:

In [1]: df1 = pd.DataFrame({'a': [1], 'b': [pd.Timestamp("2012-01-01")]})

In [2]: df2 = pd.DataFrame({'a': [2]})

In [3]: pd.concat([df1, df2.reindex(columns=df1.columns)]).dtypes
Out[3]: 
a     int64
b    object
dtype: object

That's on master; before this PR, the datetime64 dtype was preserved.
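One possible workaround under the new behavior (a sketch, not from the PR itself) is to restore the intended dtype explicitly after the concat, e.g. with pd.to_datetime, which works whether or not concat kept the datetime64 dtype:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1], "b": [pd.Timestamp("2012-01-01")]})
df2 = pd.DataFrame({"a": [2]})

result = pd.concat([df1, df2.reindex(columns=df1.columns)], ignore_index=True)
# Recover the datetime dtype regardless of how concat treated the all-NA column
result["b"] = pd.to_datetime(result["b"])
assert str(result["b"].dtype).startswith("datetime64")
```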

@jbrockmendel
Member Author

That's on master; before this PR, the datetime64 dtype was preserved.

  1. If you don't do the reindex and just do pd.concat([df1, df2]) then you still get dt64 on master.

  2. If instead of reindexing you did df2["b"] = pd.Series([np.nan], dtype=np.float64, name="really specifically float64"), wouldn't you want pd.concat([df1, df2]) to not preserve dt64?

  3. If you don't like this behavior, why the frak did you implement it for ArrayManager?

@jorisvandenbossche
Member

  1. If you don't do the reindex and just do pd.concat([df1, df2]) then you still get dt64 on master.

There can be good reasons for doing a reindex. For example, to determine the exact output columns: in the past, concat did have a join_axes keyword for this, but this was deprecated, pointing the user to use reindex instead.
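The reindex-based replacement for the deprecated join_axes keyword looks roughly like this (a sketch of the pattern the deprecation message pointed users to):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1], "b": [2]})
df2 = pd.DataFrame({"a": [3], "c": [4]})

# Equivalent of the old pd.concat(..., join_axes=[df1.columns]):
# force the output to have exactly df1's columns.
result = pd.concat([df1, df2.reindex(columns=df1.columns)], ignore_index=True)
assert list(result.columns) == ["a", "b"]
```

Under this pattern, columns missing from one frame (here "b" in df2) come back as all-NA, which is exactly the case whose dtype handling this PR changes.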

2. if instead of reindexing you did df2["b"] = pd.Series([np.nan], dtype=np.float64, name="really specifically float64"), wouldn't you want pd.concat([df1, df2]) to not preserve dt64?

Yes, but that's the unfortunate consequence of using float64 as the empty dtype (which we are actually deprecating; once that is gone, I think we can restrict the special case to object dtype).

3. If you don't like this behavior, why the frak did you implement it for ArrayManager?

You mean that I did not implement the special case for ArrayManager?

First, I never said that I like this behaviour. I have said repeatedly that, long term, we should try to get rid of this value-dependent behaviour. I only said above that I disagree with making this breaking change now.

Personally I think it could be fine for ArrayManager to have different behaviour (but I know we have a different opinion here). But also, the AM implementation was not necessarily intended as final; note the "# TODO(ArrayManager) decide on exact casting rules in concat" comment that was in the code. E.g. if we don't get a solution for the empty dtype, IMO we might need to special-case at least object dtype in the AM code path as well.
