ENH: added nunique function to Index #6734

sinhrks · 2014-03-29T09:40:22Z

Added nunique function to Index same as Series.
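For illustration, a minimal sketch of the behavior this PR proposes. It assumes a recent pandas release where `Index.nunique` is available; the sample values are made up and do not come from the PR's tests.

```python
import numpy as np
import pandas as pd

# Made-up sample data: two distinct labels plus one missing value.
values = ["a", "b", "a", np.nan]
idx = pd.Index(values)
ser = pd.Series(values)

# By default, missing values are not counted as a distinct value.
index_count = idx.nunique()
series_count = ser.nunique()

# dropna=False counts the missing value as its own entry.
index_count_with_nan = idx.nunique(dropna=False)

print(index_count, series_count, index_count_with_nan)
```

The point is only that the Index call mirrors the Series call.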

jreback · 2014-03-29T11:54:49Z

I like all the tests!

I would rather see this added in core/base.py (and then removed from Series).
You can leave some of the tests, but add a couple in test_base.

did you cover datetime/period index for tests?

sinhrks · 2014-03-29T12:29:33Z

Sure. I'll try it.

Actually tests are derived from test_series.py. I'll prepare more tests for time-related indexes.

jreback · 2014-03-29T13:17:28Z

pandas/tests/test_index.py

+        # NAs in object arrays #714
+        i = np.array(['foo'] * 100)
+        i[::2] = np.nan
+        idx = Index(i, dtype='O')


don't specify dtype when constructing an Index, let it infer it properly (unless the test actually is trying to test that).
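A hedged sketch of what letting the constructor infer the dtype can look like. One assumption differs from the diff above: the array is created with object dtype up front, because assigning `np.nan` into a NumPy string array stores the text `'nan'` rather than a missing value. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Object array so that np.nan stays a real missing value.
values = np.array(["foo"] * 100, dtype=object)
values[::2] = np.nan

# No dtype argument: the Index constructor infers it from the data.
idx = pd.Index(values)

n_without_nan = idx.nunique()
n_with_nan = idx.nunique(dropna=False)
print(idx.dtype, n_without_nan, n_with_nan)
```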

sinhrks · 2014-03-31T15:25:09Z

I've moved value_counts, unique and nunique to the base class, and added more tests. Could you review this?
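The idea of hosting these methods on a shared base class can be sketched as follows. This is a toy illustration, not pandas' actual implementation: `CountingOpsSketch`, `ToyIndex`, `ToySeries`, and `_items` are hypothetical names.

```python
from collections import Counter


class CountingOpsSketch:
    """Hypothetical shared base class; not pandas' real code."""

    def _items(self):
        # Subclasses return the values the shared methods operate on.
        raise NotImplementedError

    def unique(self):
        # dict.fromkeys keeps the first occurrence of each value, in order.
        return list(dict.fromkeys(self._items()))

    def nunique(self):
        return len(self.unique())

    def value_counts(self):
        # (value, count) pairs, most frequent first.
        return Counter(self._items()).most_common()


class ToyIndex(CountingOpsSketch):
    def __init__(self, labels):
        self._labels = list(labels)

    def _items(self):
        return self._labels


class ToySeries(CountingOpsSketch):
    def __init__(self, data, index):
        self._data = list(data)
        self.index = ToyIndex(index)

    def _items(self):
        return self._data


toy = ToySeries([1, 1, 2], index=["a", "b", "a"])
print(toy.nunique(), toy.index.nunique(), toy.index.value_counts())
```

Both toy classes get the three methods from one place, which is the motivation for moving them out of Series.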

jreback · 2014-03-31T15:39:32Z

FYI, I find it helpful to pull master and rebase when I push a PR (you don't have to, but it's always nice to get an auto merge). Merge conflicts are usually caused by the release notes.

jreback · 2014-03-31T16:56:36Z

also for some reason travis is not picking this up...can you rebase and force push?

sinhrks · 2014-03-31T21:28:40Z

Rebased, and I'll take care.

Test has passed.

jreback · 2014-03-31T22:08:39Z

can you move the imports to the top of the test script?
better to just have them centralized.

was a little leery of a .value_counts() on an index returning a Series, but it DOES make sense.

can you add a timezone aware series & index in the test_base/Ops/objs.py (all should work, but not sure that case is there).

nehalecky · 2014-04-01T13:11:23Z

This is great. I use such functionality all the time and indeed wrap like pd.Series(s.index.tolist()).value_counts().

+1 for .value_counts() added to base Index class.
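A small sketch comparing the wrapper mentioned above with a direct call, assuming a recent pandas release where `Index.value_counts` exists. The Series `s` and its labels are made up for the example.

```python
import pandas as pd

# Made-up Series with a repeated index label.
s = pd.Series(range(5), index=["x", "y", "x", "z", "x"])

# The workaround: wrap the index labels in a new Series first.
wrapped = pd.Series(s.index.tolist()).value_counts()

# The direct call on the Index.
direct = s.index.value_counts()

print(direct)
```

Both results are Series keyed by label, so they can be compared through `to_dict()`.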

jreback · 2014-04-01T13:13:14Z

@nehalecky thanks for the feedback!

see #6382 for methods/properties that I think should be moved/changed to the OpsMixin (some more tricky than others). Anything else, pls comment.

sinhrks · 2014-04-01T13:47:44Z

@jreback When I include tz series and index, I get ValueError: Cannot compare tz-naive and tz-aware timestamps during the min and max tests. Maybe the Timestamp comparison checks the tzinfo property, which datetime64 doesn't have?

jreback · 2014-04-01T13:52:26Z

can you post the comparison? (e.g. you need to have the test use a Timestamp with a tz, otherwise you SHOULD get that error).

sinhrks · 2014-04-01T14:18:44Z

Sorry, I'm referring to the existing TestIndexOps.test_ops (L189). The tests in test_value_counts_unique_nunique pass with tz included. I pushed my current code.

Is it OK for me to do the following?

Include tz-related Series and Index in self.obj.
Modify 'TestIndexOps.test_ops' to exclude tz-related Series and Index from the arguments.

jreback · 2014-04-01T14:26:50Z

These should all work (which is what the structure tests)

I think you need to use keep_tz when you create the series (so that the tz are kept!)

In [14]: i = tm.makeDateIndex(10).tz_localize(tz='US/Eastern')

In [15]: i
Out[15]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-03 00:00:00-05:00, ..., 2000-01-14 00:00:00-05:00]
Length: 10, Freq: B, Timezone: US/Eastern

In [16]: i.to_series(keep_tz=True).value_counts()
Out[16]: 
2000-01-05 00:00:00-05:00    1
2000-01-11 00:00:00-05:00    1
2000-01-14 00:00:00-05:00    1
2000-01-04 00:00:00-05:00    1
2000-01-07 00:00:00-05:00    1
2000-01-10 00:00:00-05:00    1
2000-01-13 00:00:00-05:00    1
2000-01-03 00:00:00-05:00    1
2000-01-06 00:00:00-05:00    1
2000-01-12 00:00:00-05:00    1
dtype: int64

In [17]: i.to_series(keep_tz=True).max()
Out[17]: Timestamp('2000-01-14 00:00:00-0500', tz='US/Eastern', offset='B')
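The session above uses the pandas version current at the time of the thread. A roughly equivalent sketch for a recent pandas release follows. It assumes that `to_series()` keeps the timezone without a `keep_tz` argument and uses `pd.date_range` in place of the `tm.makeDateIndex` test helper.

```python
import pandas as pd

# Ten business days starting 2000-01-03, localized to US/Eastern.
idx = pd.date_range("2000-01-03", periods=10, freq="B", tz="US/Eastern")

# Assumption: the timezone survives the conversion to a Series.
ser = idx.to_series()

counts = ser.value_counts()
latest = ser.max()
print(latest)
```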

sinhrks · 2014-04-01T15:06:41Z

Thanks. I think I understand the problem now: the error isn't caused by min or max themselves, but by the comparison done in assertEqual.

>>> i = tm.makeDateIndex(10).tz_localize(tz='US/Eastern')
# What TestIndexOps.test_ops does:
>>> i.max() == i.values.max()
ValueError: Cannot compare tz-naive and tz-aware timestamps
>>> i.values.max() == i.max()
False
>>> i.values.max().tolist() == i.max().value
True

Changing the test logic seems to be a workaround. Should the Timestamp class be changed to allow comparison with datetime64 instead? If I have still misunderstood, please let me know.

jreback · 2014-04-01T15:47:25Z

you can't really compare a Timestamp WITH a tz to a np.datetime64, which does not have a tz (I know it shows that it does, but that's just a local representation). There is a proposal to 'fix' this (see the http://mail.scipy.org/pipermail/numpy-discussion/2014-March/069282.html thread).

These SHOULD compare equal though (but because of the tz are not).

In [26]: np.datetime64('2001-01-01 00:00:00').astype('M8[ns]').astype('int64')
Out[26]: 978325200000000000

In [29]: x = Timestamp('20010101').tz_localize('EST')
In [34]: x.asm8
Out[34]: numpy.datetime64('2001-01-01T00:00:00.000000000-0500')

In [35]: x.value
Out[35]: 978325200000000000

for now, just test those cases explicitly
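One way to test the tz-aware case explicitly is to compare the underlying integer nanoseconds instead of the Timestamp and datetime64 objects themselves. This is a sketch under the assumption of a recent pandas and NumPy; it is not the code that was merged.

```python
import pandas as pd

idx = pd.date_range("2000-01-03", periods=10, freq="B", tz="US/Eastern")

# tz-aware Timestamp returned by the index itself.
ts_max = idx.max()

# Raw NumPy datetime64 taken from the underlying values; it carries no tz.
raw_max = idx.values.max()

# Compare nanoseconds since the epoch on both sides.
ts_ns = ts_max.value
raw_ns = int(raw_max.astype("M8[ns]").astype("int64"))
print(ts_ns == raw_ns)
```

Comparing integers sidesteps the tz-naive versus tz-aware check entirely.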

sinhrks · 2014-04-02T14:24:59Z

Thanks! I've added error handling, and rebased onto latest master.

jreback · 2014-04-02T14:38:23Z

pandas/tests/test_base.py

+            expected = np.array(pd.to_datetime(['2010-01-01 00:00:00',
+                                                '2009-01-01 00:00:00',
+                                                '2008-09-09 00:00:00']))
+            self.assertEqual(result.index.dtype, 'datetime64[ns]')


how many of these tests did you change? I don't like ANY tests where you have to use equalContents/assertEqual

I get that for a numpy array output that is ok (e.g. unique), but if the output is a Series (e.g. with value_counts), I don't think that should be used at all. I know this is what the original test did, but value_counts here is a Series. Can you go through the tests and change as many as possible to assert_series_equal?

In the cases where a numpy array is returned, use self.assert_numpy_array_equal. The problem with checking contents and equality is that it can mask issues with the ordering and/or return types.
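A sketch of the stricter style being asked for, written against the public helpers in a recent pandas and NumPy (`pandas.testing.assert_series_equal` and `numpy.testing.assert_array_equal`) rather than the test-class methods named above. The data is made up.

```python
import numpy as np
import pandas as pd
import pandas.testing as pdt

idx = pd.Index(["a", "b", "b", "c", "c", "c"])

# value_counts returns a Series: compare values, labels, and order together.
counts = idx.value_counts()
expected_counts = pd.Series([3, 2, 1], index=pd.Index(["c", "b", "a"]))
# check_names=False because the result's name differs across pandas versions.
pdt.assert_series_equal(counts, expected_counts, check_names=False)

# unique preserves first-seen order: compare it as an ordered array.
uniques = np.asarray(idx.unique())
np.testing.assert_array_equal(uniques, np.array(["a", "b", "c"], dtype=object))
```

Unlike a contents-only check, both assertions fail if the order of the result changes.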

sinhrks · 2014-04-06T14:06:10Z

Understood. I've modified tests to use series_equal for value_counts and numpy_array_equal for unique.

Also, is it OK that the current value_counts behavior includes NaT?

jreback · 2014-04-06T14:16:19Z

You mean like this? yes NaT should be included

In [64]: x = pd.DatetimeIndex(date_range('20130101',periods=10).take(np.random.randint(0,10,size=100)).tolist() + [pd.NaT,pd.NaT])

In [65]: x
Out[65]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-10, ..., NaT]
Length: 102, Freq: None, Timezone: None

In [66]: x.to_series().value_counts()
Out[66]: 
2013-01-06    16
2013-01-05    13
2013-01-08    11
2013-01-07    11
2013-01-10    11
2013-01-01     9
2013-01-03     9
2013-01-09     7
2013-01-02     7
2013-01-04     6
NaT            2
dtype: int64
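The output above reflects the behavior at the time of this thread. In recent pandas releases, `value_counts` takes a `dropna` argument that defaults to `True`, so NaT only shows up when it is requested. A small sketch with made-up dates, assuming such a release:

```python
import pandas as pd

idx = pd.DatetimeIndex(
    ["2013-01-01", "2013-01-01", "2013-01-02", pd.NaT, pd.NaT]
)

# Default: missing timestamps are left out of the counts.
default_counts = idx.value_counts()

# dropna=False keeps NaT as its own row.
counts_with_nat = idx.value_counts(dropna=False)

print(counts_with_nat)
```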

jreback · 2014-04-06T14:18:04Z

pandas/tests/test_base.py

+            self.assertEqual(unique.dtype, 'datetime64[ns]')
+            # numpy_array_equal cannot compare pd.NaT
+            self.assert_numpy_array_equal(unique[:3], expected)
+            self.assertTrue(unique[3] is pd.NaT or unique[3].astype('int64') == pd.tslib.iNaT)


this is fine, though array_equivalent will match this (as it uses pd.isnull() under the hood, which handles NaT properly). even np.array_equal works here. The actual pd.NaT values are translated to pd.tslib.iNaT, which is actually an integer; this type of testing is even easier than floats FYI.
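The integer representation mentioned above can be sketched as follows. It assumes a recent pandas and NumPy, and it uses the smallest int64 value as the NaT sentinel in place of the `pd.tslib.iNaT` name from the thread.

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(
    ["2010-01-01", "2009-01-01", "2008-09-09", pd.NaT, "2010-01-01"]
)
unique = idx.unique()
expected = pd.DatetimeIndex(["2010-01-01", "2009-01-01", "2008-09-09", pd.NaT])

# View the datetime64 data as int64; NaT becomes one fixed integer.
unique_i8 = np.asarray(unique).view("i8")
expected_i8 = np.asarray(expected).view("i8")

# Integer comparison treats NaT like any other value, unlike float NaN.
same = np.array_equal(unique_i8, expected_i8)
nat_sentinel = np.iinfo(np.int64).min
print(same, unique_i8[3] == nat_sentinel)
```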

jreback · 2014-04-06T14:21:48Z

thanks @sinhrks

lots of nice PR's from you!

keep em coming

FYI my comments about the NaT testing were really just for your information...this PR is good!

sinhrks · 2014-04-07T12:02:59Z

Thank you for your review & info. I'll remember it.

I owe much to pandas and hope to assist a little!

jreback · 2014-04-07T12:36:59Z

no thank you for the contributions!

keep em coming

jreback mentioned this pull request Mar 29, 2014

API/CLN: more common ops to integrate with Series/index OpsMixin #6382

Closed

17 tasks

jreback added API Design labels Mar 29, 2014

jreback added this to the 0.14.0 milestone Mar 29, 2014

jreback reviewed Mar 29, 2014

jreback reviewed Apr 2, 2014

ENH: added nunique and value_counts functions to Index

91befdd

jreback reviewed Apr 6, 2014

jreback added a commit that referenced this pull request Apr 6, 2014

Merge pull request #6734 from sinhrks/ind_nunique

657d255

ENH: added nunique function to Index

jreback merged commit 657d255 into pandas-dev:master Apr 6, 2014

sinhrks deleted the ind_nunique branch April 7, 2014 11:59

sinhrks mentioned this pull request Jun 11, 2014

FIX value_counts should skip NaT #7424

Merged

sinhrks mentioned this pull request Jul 17, 2014

nunique is slower than len(set(x.dropna())) for smaller Series. #7771

Closed
