Skip to content

API: disallow duplicate level names #18882

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 29, 2017

Conversation

toobaz
Copy link
Member

@toobaz toobaz commented Dec 20, 2017

@jbrockmendel
Copy link
Member

+1 on the idea. Some cases to think about: pd.NaT, np.nan, anything else where x != x

@toobaz
Copy link
Member Author

toobaz commented Dec 21, 2017

anything else where x != x

I see only three options:

  1. we don't do anything: the user just knows that passing np.nan as level name is stupid because np.nan != np.nan
  2. we build (and keep updated) an Index out of level names, and use it to access levels with the same rules with which you access items in a Series
  3. just like None, np.nan and friends can be used as level names and can be duplicated but cannot be used to refer to the levels (e.g. get_level_values(np.nan) will immediately raise). We describe this behavior in the docstring of get_level_values() and in all other methods which accept a level name, we state "see get_level_values()"

The current PR does 1., I would be OK with 3., while I'm skeptical that 2. is worth the effort.

(If we go for 3, I would leave it to a separate PR)

@@ -560,16 +560,6 @@ def test_unstack_dtypes(self):
assert left.shape == (3, 2)
tm.assert_frame_equal(left, right)

def test_unstack_non_unique_index_names(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this do now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It raises when you try to create the MultiIndex with duplicated name.

[[0, 1]] * 3, names=names)

# With .rename()
mi = pd.MultiIndex.from_product([[0, 1]] * 3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you check the error messages here, use tm.assert_raises_regex

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(done)

@@ -214,17 +214,6 @@ def test_reorder_levels(self):
expected = Series(np.arange(6), index=e_idx)
assert_series_equal(result, expected)

result = s.reorder_levels([0, 0, 0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test that raises

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the new test_set_index_duplicate_names (the constructor itself raises).

@@ -179,6 +179,7 @@ Other API Changes
- A :class:`Series` of ``dtype=category`` constructed from an empty ``dict`` will now have categories of ``dtype=object`` rather than ``dtype=float64``, consistently with the case in which an empty list is passed (:issue:`18515`)
- ``NaT`` division with :class:`datetime.timedelta` will now return ``NaN`` instead of raising (:issue:`17876`)
- All-NaN levels in a ``MultiIndex`` are now assigned ``float`` rather than ``object`` dtype, promoting consistency with ``Index`` (:issue:`17929`).
- Levels names of a ``MultiIndex`` (when not None) are now required to be unique: trying to create a ``MultiIndex`` with repeated names will raise a ``ValueError`` (:issue:`18872`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rebase on master. can you move to 0.23 (docs were renamed), prob easiest to just check this file from master and past in new one

@toobaz toobaz force-pushed the dup_lev_names_18872 branch 3 times, most recently from 5503b34 to 0044bbe Compare December 22, 2017 11:50
@codecov
Copy link

codecov bot commented Dec 22, 2017

Codecov Report

Merging #18882 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #18882      +/-   ##
==========================================
+ Coverage   91.58%   91.58%   +<.01%     
==========================================
  Files         150      150              
  Lines       48967    48962       -5     
==========================================
- Hits        44846    44842       -4     
+ Misses       4121     4120       -1
Flag Coverage Δ
#multiple 89.94% <100%> (ø) ⬆️
#single 41.73% <60%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/reshape/reshape.py 100% <ø> (ø) ⬆️
pandas/core/indexes/multi.py 96.22% <100%> (-0.08%) ⬇️
pandas/util/testing.py 84.9% <0%> (+0.21%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4883a43...53cb26c. Read the comment docs.

@toobaz
Copy link
Member Author

toobaz commented Dec 22, 2017

rebased, ping

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. minor request on the error message. ping on green.


# set the name
for l, name in zip(level, names):
if name is not None and name in used:
raise ValueError("Duplicated level name: {}.".format(name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you enumerate on this and report the level number where the error is as well
e.g. level 2, duplicate name 'foo' or somesuch

@toobaz toobaz force-pushed the dup_lev_names_18872 branch from 0044bbe to a1db0e0 Compare December 23, 2017 23:23
@pep8speaks
Copy link

pep8speaks commented Dec 23, 2017

Hello @toobaz! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 29, 2017 at 09:15 Hours UTC

@toobaz toobaz force-pushed the dup_lev_names_18872 branch from a1db0e0 to 33aa1f3 Compare December 23, 2017 23:24
@toobaz toobaz force-pushed the dup_lev_names_18872 branch from 33aa1f3 to 53cb26c Compare December 29, 2017 09:15
@toobaz
Copy link
Member Author

toobaz commented Dec 29, 2017

@jreback ping

@jreback jreback added this to the 0.23.0 milestone Dec 29, 2017
@jreback jreback merged commit 7958018 into pandas-dev:master Dec 29, 2017
@jreback
Copy link
Contributor

jreback commented Dec 29, 2017

thanks!

@toobaz toobaz deleted the dup_lev_names_18872 branch December 29, 2017 14:45
hexgnu pushed a commit to hexgnu/pandas that referenced this pull request Jan 1, 2018
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 1, 2018

FYI, this is causing failures on dask due to operations like:

In [1]: import dask.dataframe as dd

In [2]: import pandas as pd

In [3]:     pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8, 9],
   ...:                         'b': [4, 5, 6, 3, 2, 1, 0, 0, 0]},
   ...:                        index=[0, 1, 3, 5, 6, 8, 9, 9, 9]).set_index("a")
   ...:
   ...:

In [4]: pdf.groupby(pdf.index).apply(lambda x: x.b)
Out[4]:
a  a
1  1    4
2  2    5
3  3    6
4  4    3
5  5    2
6  6    1
7  7    0
8  8    0
9  9    0
Name: b, dtype: int64

essentially, grouping by an index and doing a .apply that returns a Series whose index is the same as the original (at least the same name). This seems like a fairly common use-case.

I think we should make this a warning for now and and raise later.

TomAugspurger added a commit to TomAugspurger/dask that referenced this pull request Jan 1, 2018
Pandas 0.23 is disallowing duplicate names in MultiIndexes. This
adjusts a test that relied on that behavior, and `groupby().nunique` which
produced it as a by-product.

Closes dask#3039

xref pandas-dev/pandas#18882
TomAugspurger added a commit to dask/dask that referenced this pull request Jan 2, 2018
* COMPAT: Pandas 0.23 duplicate names in MI

Pandas 0.23 is disallowing duplicate names in MultiIndexes. This
adjusts a test that relied on that behavior, and `groupby().nunique` which
produced it as a by-product.

Closes #3039

xref pandas-dev/pandas#18882

* Cleanup renaming
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problems when accessing MultiIndex level with duplicated level names
5 participants