BUG/API: concat with empty DataFrames or all-NA columns #43507

Merged
6 commits merged into pandas-dev:master on Sep 15, 2021

Conversation

jbrockmendel
Member

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

This is in the grey area between a bugfix and an API change. Towards the bugfix side, it fixes some truly strange behavior:

  1. Concatenating ignores the dtypes of all-NaN columns (see test_append_dtypes)
import numpy as np
import pandas as pd
import pandas._testing as tm

df1 = pd.DataFrame({"bar": pd.Timestamp("20130101")}, index=range(1))
df2 = pd.DataFrame({"bar": np.nan}, index=range(1, 2))
df3 = df2.astype(object)

result = df1.append(df2)
expected = pd.DataFrame(
    {"bar": pd.Series([pd.Timestamp("20130101"), np.nan], dtype="M8[ns]")}
)
tm.assert_frame_equal(result, expected)

result2 = df1.append(df3)
tm.assert_frame_equal(result2, expected)
  2. Concatenating sometimes ignores the dtype of an empty frame (see test_join_append_timedeltas)
df = pd.DataFrame(columns=["d", "t"])  # <- object dtype
df2 = pd.DataFrame({"d": [pd.Timestamp(2013, 11, 5, 5, 56)], "t": [pd.Timedelta(seconds=22500)]})

result = df.append(df2, ignore_index=True)  # <- *not* object dtype
assert (result.dtypes == df2.dtypes).all()

The change in indexing code/tests is the least-invasive I could make it.

cc @jorisvandenbossche, this moves us towards having less values-dependent behavior, also makes the BM behavior more like the AM behavior in several affected cases.

Lots of cleanup becomes available after this, leaving for follow-up.

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 12, 2021
Contributor

@jreback jreback left a comment


seems reasonable, for sure needs a whatsnew note. I think it actually needs a small subsection to highlight when this change takes effect, as it's quite subtle.

@jbrockmendel
Member Author

whatsnew added + green

@jreback
Contributor

jreback commented Sep 14, 2021

cool, can you check the ASVs on this as well. IIRC a long time ago this was an expensive path (e.g. concatting with an empty frame); I am sure it has improved a lot since then, but I want to make sure this change survives it.

@jreback jreback added this to the 1.4 milestone Sep 14, 2021
@jbrockmendel
Member Author

asv continuous -E virtualenv master ref-concat-void-3 -b join_merge
[...]
       before           after         ratio
     [82a102b3]       [0b4cde05]
     <collect-apply>       <ref-concat-void-3>
-      9.36±0.1ms      7.71±0.08ms     0.82  join_merge.Join.time_join_dataframe_index_single_key_small(True)
-      37.4±0.6ms         29.5±1ms     0.79  join_merge.Merge.time_merge_2intkey(True)
-      10.9±0.4ms       8.55±0.1ms     0.79  join_merge.Join.time_join_dataframe_index_single_key_small(False)
-        419±20ms          328±6ms     0.78  join_merge.Merge.time_merge_dataframes_cross(True)
-        434±20ms          333±6ms     0.77  join_merge.Merge.time_merge_dataframes_cross(False)
-        152±10μs          117±9μs     0.77  join_merge.JoinNonUnique.time_join_non_unique_equal
-      5.85±0.2ms       4.31±0.3ms     0.74  join_merge.Merge.time_merge_dataframe_integer_2key(False)
-        18.3±1ms       13.4±0.5ms     0.73  join_merge.Merge.time_merge_2intkey(False)
-      3.90±0.3ms      2.83±0.06ms     0.73  join_merge.Join.time_join_dataframes_cross(True)
-      2.13±0.1ms      1.52±0.09ms     0.71  join_merge.Merge.time_merge_dataframe_integer_key(False)

@jreback jreback merged commit 084c543 into pandas-dev:master Sep 15, 2021
@jbrockmendel jbrockmendel deleted the ref-concat-void-3 branch September 15, 2021 01:41
@jorisvandenbossche
Member

Sorry for the slow reply, but I am personally -1 on this change. It's a breaking change that we don't have to do now. I seem to remember we discussed this before, and those special cases were intentionally kept in previous PRs that touched the concat code.

I certainly agree that eventually we want the value-independent behaviour of this PR. But:

it fixes some truly strange behavior:

I don't think it is "truly strange". It comes from the fact that all-NaN float/object is the default dtype you get for "empty" data (e.g. when reindexing a DataFrame). IMO it's a usability regression that you don't preserve dtypes in this case if you concat reindexed dataframes where a column is missing in one of the dataframes.
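For concreteness, here is a minimal sketch of the "empty data" case being described (the exact dtype of the filled column, typically float64, may vary by pandas version):

```python
import pandas as pd

df = pd.DataFrame({"a": [1]})

# Reindexing to add a missing column fills it with NaN; the new column gets
# the default "empty" dtype (typically float64), not any intentionally
# chosen dtype.
reindexed = df.reindex(columns=["a", "b"])
assert reindexed["b"].isna().all()
```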

Example:

In [1]: df1 = pd.DataFrame({'a': [1], 'b': [pd.Timestamp("2012-01-01")]})

In [2]: df2 = pd.DataFrame({'a': [2]})

In [3]: pd.concat([df1, df2.reindex(columns=df1.columns)]).dtypes
Out[3]: 
a     int64
b    object
dtype: object

That's on master; before this PR, the datetime64 dtype was preserved.
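One possible workaround under the new behavior (a sketch, not from the PR itself) is to restore the intended dtype explicitly after the concat, e.g. with pd.to_datetime, which works whether or not concat kept the datetime64 dtype:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1], "b": [pd.Timestamp("2012-01-01")]})
df2 = pd.DataFrame({"a": [2]})

result = pd.concat([df1, df2.reindex(columns=df1.columns)], ignore_index=True)
# Recover the datetime dtype regardless of how concat treated the all-NA column
result["b"] = pd.to_datetime(result["b"])
assert str(result["b"].dtype).startswith("datetime64")
```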

@jbrockmendel
Member Author

That's on master; before this PR, the datetime64 dtype was preserved.

  1. If you don't do the reindex and just do pd.concat([df1, df2]) then you still get dt64 on master.

  2. If instead of reindexing you did df2["b"] = pd.Series([np.nan], dtype=np.float64, name="really specifically float64"), wouldn't you want pd.concat([df1, df2]) to not preserve dt64?

  3. If you don't like this behavior, why the frak did you implement it for ArrayManager?

@jorisvandenbossche
Member

  1. If you don't do the reindex and just do pd.concat([df1, df2]) then you still get dt64 on master.

There can be good reasons for doing a reindex. For example, to determine the exact output columns: in the past, concat did have a join_axes keyword for this, but this was deprecated, pointing the user to use reindex instead.
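The reindex-based replacement for the deprecated join_axes keyword looks roughly like this (a sketch of the pattern the deprecation message pointed users to):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1], "b": [2]})
df2 = pd.DataFrame({"a": [3], "c": [4]})

# Equivalent of the old pd.concat(..., join_axes=[df1.columns]):
# force the output to have exactly df1's columns.
result = pd.concat([df1, df2.reindex(columns=df1.columns)], ignore_index=True)
assert list(result.columns) == ["a", "b"]
```

Under this pattern, columns missing from one frame (here "b" in df2) come back as all-NA, which is exactly the case whose dtype handling this PR changes.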

2. if instead of reindexing you did df2["b"] = pd.Series([np.nan], dtype=np.float64, name="really specifically float64"), wouldn't you want pd.concat([df1, df2]) to not preserve dt64?

Yes, but that's the unfortunate consequence of using float64 as the empty dtype (which we are actually deprecating; once that is gone, I think we can restrict the special case to object dtype).

3. If you don't like this behavior, why the frak did you implement it for ArrayManager?

You mean that I did not implement the special case for ArrayManager?

First, I never said that I like this behaviour. I have said repeatedly that, long term, we should try to get rid of this value-dependent behaviour. I only said above that I disagree with making this breaking change now.

Personally I think it could be fine for ArrayManager to have different behaviour (but I know we have a different opinion here). But also, the AM implementation was not necessarily intended as final; note the "# TODO(ArrayManager) decide on exact casting rules in concat" comment that was in the code. E.g. if we don't get a solution for the empty dtype, IMO we might need to special-case at least object dtype in the AM code path as well.
