[MRG] EHN refactoring of the ratio argument. #413

glemaitre · 2018-03-20T14:33:39Z

Reference Issue

closes #411
closes #406

What does this implement/fix? Explain your changes.

TODO

Any other comments?

codecov · 2018-03-20T16:02:21Z

Codecov Report

Merging #413 into master will decrease coverage by 0.06%.
The diff coverage is 99.43%.

@@            Coverage Diff             @@
##           master     #413      +/-   ##
==========================================
- Coverage   98.77%   98.71%   -0.07%     
==========================================
  Files          68       70       +2     
  Lines        4014     4188     +174     
==========================================
+ Hits         3965     4134     +169     
- Misses         49       54       +5

| Impacted Files | Coverage Δ |
|---|---|
| imblearn/ensemble/tests/test_balance_cascade.py | `100% <100%> (ø)` |
| imblearn/ensemble/tests/test_easy_ensemble.py | `100% <100%> (ø)` |
| ...rn/under_sampling/prototype_generation/__init__.py | `100% <100%> (ø)` |
| imblearn/tests/test_common.py | `95.45% <100%> (ø)` |
| imblearn/over_sampling/adasyn.py | `98.57% <100%> (+0.06%)` |
| ...ampling/prototype_selection/tests/test_nearmiss.py | `100% <100%> (ø)` |
| ..._sampling/prototype_selection/tests/test_allknn.py | `100% <100%> (ø)` |
| imblearn/combine/tests/test_smote_tomek.py | `100% <100%> (ø)` |
| ...prototype_selection/neighbourhood_cleaning_rule.py | `100% <100%> (ø)` |
| ...sampling/prototype_generation/cluster_centroids.py | `100% <100%> (ø)` |

... and 53 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24f4973...09c5aaa. Read the comment docs.

pep8speaks · 2018-03-27T15:11:28Z

Hello @glemaitre! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 27, 2018 at 15:54 Hours UTC

glemaitre · 2018-03-27T15:54:22Z

@massich @chkoar This is ready to be reviewed. It is a big one.

glemaitre · 2018-03-27T16:03:11Z

@jorisvandenbossche I would like to have your feedback as well.

You can check the documentation of those three classes which is representative of the full PR:

massich · 2018-03-27T16:10:24Z

doc/under_sampling.rst

-behaviour.
+The parameter ``sampling_target`` control which sample of the link will be
+removed. For instance, the default (i.e., ``sampling_target='auto'``) will
+remove the sample from the majority class. Both samples from the majority and


I would change "remove the sample from" to "remove samples from". Also, I don't really understand what "control which sample of the link will be removed" means; "link" confuses me.

massich · 2018-03-27T16:12:15Z

examples/plot_sampling_target_usage.py

@@ -0,0 +1,241 @@
+"""
+======================================================================
+Usage of the ``sampling_target`` parameter for the different algorithm


How to use the sampling_target parameter (depending on the sampling strategy)

jorisvandenbossche

Big diff so didn't yet look at everything, but:

- given how many times you repeat the explanation in the docstring, might be worth looking at a way how to share this to avoid repetition
- I am not fully sure about "sampling_target" as keyword name. For the string options, this is an appropriate name, but for the float not really. Possible (although longer) alternatives: sampling_strategy, sampling_protocol

jorisvandenbossche · 2018-03-27T21:29:22Z

imblearn/over_sampling/random_over_sampler.py

+          minority class after resampling and the number of samples in the
+          majority class, respectively.
+
+        .. warning::


If you indent this by two spaces, it is included in the list (which is better, I think).

jorisvandenbossche · 2018-03-27T21:30:03Z

imblearn/over_sampling/random_over_sampler.py

+    sampling_target : float, str, dict or callable, (default='auto')
+        Sampling information to resample the data set.
+
+        - When ``float``, it correspond to the ratio :math:`\\alpha_{os}`


correspond -> corresponds
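
To make the float case in this docstring concrete, here is a minimal sketch. It assumes a binary problem and assumes that the ratio is the number of minority samples after resampling divided by the number of majority samples, as the truncated docstring lines above suggest. `oversampling_target_from_ratio` is a hypothetical helper written for this illustration, not a function from imblearn.

```python
from collections import Counter


def oversampling_target_from_ratio(y, alpha_os):
    """Hypothetical helper: number of minority samples to generate.

    Assumes a binary problem and alpha_os = N_rm / N_M, where N_rm is the
    minority count after resampling and N_M is the majority count.
    """
    counts = Counter(y)
    if len(counts) != 2:
        raise ValueError("this sketch only handles binary targets")
    # most_common() orders labels by decreasing count: majority first.
    (maj_label, n_maj), (min_label, n_min) = counts.most_common(2)
    n_min_after = int(n_maj * alpha_os)
    if n_min_after < n_min:
        raise ValueError("alpha_os would require removing minority samples")
    return {min_label: n_min_after - n_min}


y = [0] * 100 + [1] * 20
to_generate = oversampling_target_from_ratio(y, 0.5)
print(to_generate)
```

With 100 majority and 20 minority samples, a ratio of 0.5 asks for 50 minority samples after resampling, so 30 new samples would be generated.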

jorisvandenbossche · 2018-03-27T21:31:37Z

imblearn/over_sampling/random_over_sampler.py

+
+            ``'minority'``: resample only the minority class;
+
+            ``'majority'``: resample only the majority class;


Since this is a RandomOverSampler, does 'majority' make any sense?

jorisvandenbossche · 2018-03-27T21:34:44Z

imblearn/under_sampling/prototype_selection/edited_nearest_neighbours.py

+
+            ``'auto'``: equivalent to ``'not minority'``.
+
+        - When ``list``, the list contains the targeted classes.


It is not clear to me what this does.

jorisvandenbossche · 2018-03-27T21:35:16Z

imblearn/under_sampling/prototype_selection/nearmiss.py

-        - If ``dict``, the keys correspond to the targeted classes. The values
-          correspond to the desired number of samples.
-        - If callable, function taking ``y`` and returns a ``dict``. The keys
+        sampling_target : float, str, dict, callable, (default='auto')


jorisvandenbossche · 2018-03-27T21:45:07Z

examples/plot_sampling_target_usage.py

+plot_pie(y)
+
+###############################################################################
+# Using ``sampling_target`` in resampling algorithm


algorithm -> algorithms

jorisvandenbossche · 2018-03-27T21:46:53Z

examples/plot_sampling_target_usage.py

+
+print('Information of the iris data set after making it'
+      ' imbalanced using a callable: \n sampling_target={} \n y: {}'
+      .format(sampling_target, Counter(y)))


sampling_target is from the previous example

jorisvandenbossche · 2018-03-27T21:48:43Z

examples/plot_sampling_target_usage.py

+binary_mask = np.bitwise_or(y == 0, y == 2)
+binary_y = y[binary_mask]
+binary_X = X[binary_mask]
+


Can you show the Counter of the data here, so that you can compare the numbers after resampling?
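
A minimal sketch of the before/after comparison being requested. It uses toy lists in place of the iris subset and a hand-written `random_under_sample` function, which is a hypothetical stand-in for a real sampler and not imblearn code.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Illustrative only: keep as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_keep = min(counts.values())
    kept = []
    for label in counts:
        idx = [i for i, target in enumerate(y) if target == label]
        kept.extend(rng.sample(idx, n_keep))
    kept.sort()  # preserve the original sample order
    return [X[i] for i in kept], [y[i] for i in kept]


# Toy binary data with classes 0 and 2, mirroring the mask in the example.
binary_X = [[float(i)] for i in range(60)]
binary_y = [0] * 10 + [2] * 50

print("before:", Counter(binary_y))
X_res, y_res = random_under_sample(binary_X, binary_y)
print("after: ", Counter(y_res))
```

Printing the Counter on both sides of the resampling call lets the reader see directly which class changed and by how much.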

jorisvandenbossche · 2018-03-27T21:49:18Z

examples/plot_sampling_target_usage.py

+#
+# ``sampling_target`` can be given as a string which specify the class targeted
+# by the resampling. With under- and over-sampling, the number of samples will
+# be equalized.


emphasize you are no longer using the binary data

jorisvandenbossche · 2018-03-27T21:50:37Z

examples/plot_sampling_target_usage.py

+
+    fig, ax = plt.subplots()
+    ax.pie(sizes, explode=explode, labels=labels, shadow=True,
+           autopct='%1.1f%%')


would it be possible to add both absolute number and percentages?
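
One possible way to do this is to pass a callable as `autopct`, since matplotlib calls it with each wedge's percentage. The sketch below only builds and exercises the formatter; `make_autopct` is a hypothetical helper name, and the `ax.pie` call is left as a comment so the snippet runs without matplotlib.

```python
def make_autopct(sizes):
    """Return a formatter usable as ``autopct`` that shows percentage and count."""
    total = sum(sizes)

    def autopct(pct):
        # matplotlib passes the wedge's percentage; recover the absolute count.
        count = int(round(pct * total / 100.0))
        return "{:.1f}% ({:d})".format(pct, count)

    return autopct


sizes = [10, 20, 47]
formatter = make_autopct(sizes)
print(formatter(100.0 * sizes[0] / sum(sizes)))

# In the example script this would be used as:
# ax.pie(sizes, explode=explode, labels=labels, shadow=True,
#        autopct=make_autopct(sizes))
```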

glemaitre · 2018-03-28T05:23:11Z

Thanks @jorisvandenbossche
I don't like protocol but strategy seems generic enough.

Regarding the repetition, do you have something in mind? If it comes down to injecting the proper docstring, I am not sure how to do that. If you are thinking of a glossary, it could be cool, but the issue is that we would end up with a single generic docstring for all over-, under-, and cleaning samplers.

What I mean is:

class MySampler(...):
    """....

    sampling_target : float, str, dict, callable
        Sampling strategy ....
        - If float, represent the balancing ratio, check `glossary <float_ratio>` for more details

    """

---
Glossary:

sampling_target as float : for over-sampling ...; for under-sampling

So the user somehow needs to know which kind of sampler they are using and pick the matching explanation, which is exactly what I wanted to avoid compared with what we had before.

chkoar · 2018-03-28T16:01:17Z

Lack of time here to review this PR but I have two comments to make.

> given how many times you repeat the explanation in the docstring, might be worth looking at a way how to share this to avoid repetition

I had the same idea in #241

> I am not fully sure about "sampling_target" as keyword name. For the string options, this is an appropriate name, but for the float not really. Possible (although longer) alternatives: sampling_strategy, sampling_protocol

I believe that words like strategy and protocol are very nice, even on their own.

glemaitre · 2018-03-28T16:08:23Z

> I believe that words like strategy and protocol are very nice, even on their own.

I think that prefixing it with `sampling` does no harm. I can imagine a meta-estimator wrapping a scikit-learn estimator which uses the `strategy` keyword, and then we are doomed.

> I had the same idea in #241

I still have the concern that it makes it a bit more difficult to contribute at first, but in the end we ensure documentation quality. So I am inclined to admit that I was wrong :)

glemaitre · 2018-03-30T16:03:35Z

OK, so `sampling_strategy` kicked in and the docstrings are factorized using the base class.
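
For readers following the thread, here is a rough sketch of what sharing a parameter description across class docstrings can look like. It is an assumption-laden illustration, not the code merged in this PR: `substitute_docstring`, `SAMPLING_STRATEGY_DOC`, and `MySampler` are hypothetical names, and the mechanism shown (a decorator calling `str.format` on the docstring) is only one possible approach.

```python
# Shared description, written once. The continuation line is indented to
# match the indentation used inside the class docstring below.
SAMPLING_STRATEGY_DOC = """sampling_strategy : float, str, dict or callable, (default='auto')
        Sampling information to resample the data set."""


def substitute_docstring(**fragments):
    """Hypothetical decorator: fill ``{name}`` placeholders in a docstring."""
    def decorator(obj):
        # Note: str.format would choke on any other literal braces in the
        # docstring; a real implementation would need to handle that.
        if obj.__doc__ is not None:
            obj.__doc__ = obj.__doc__.format(**fragments)
        return obj
    return decorator


@substitute_docstring(sampling_strategy=SAMPLING_STRATEGY_DOC)
class MySampler(object):
    """Toy sampler used only to demonstrate the substitution.

    Parameters
    ----------
    {sampling_strategy}
    """

    def __init__(self, sampling_strategy='auto'):
        self.sampling_strategy = sampling_strategy


print(MySampler.__doc__)
```

This sketch relies on class `__doc__` being writable, which holds on Python 3 but not for new-style classes on Python 2.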

@massich @jorisvandenbossche @chkoar if you have any other remarks regarding the API, it would be nice.

Regarding the examples, I want to make them better in a next PR.

chkoar · 2018-04-02T00:01:51Z

> I had the same idea in #241

> I still have the concern that it makes it a bit more difficult to contribute at first, but in the end we ensure documentation quality. So I am inclined to admit that I was wrong :)

Sorry @glemaitre. My bad. I was never referring to the class docstrings. Maybe I didn't say that explicitly. I actually meant the docstrings of the `fit` and `_sample` methods of the derived classes. Seeing the class docstrings you committed, I understand why you had concerns. :D

glemaitre · 2018-04-02T08:51:40Z

Actually, good point. We can do that in another PR once the substitution class is merged.

EHN add ratio as a float and refactor tests

166f183

glemaitre added 12 commits March 20, 2018 22:46

FIX add new way of sampling

452926e

Check deprecation warning

dff36e7

TST fix test

8ef9416

FIX depreacte ratio and ratio_

f3fef5c

FIX rename the different functions

c12805c

TST udpate test and deprecating

311b269

FIX use check_sampling_target internally instead of check_ratio

596a475

DOC add check_sampling_target into the API

f5a1254

DOC update all docstring

00fc0bb

DOC update ratio example

fcf8b29

DOC remove ratio occurences

b11e154

EXA add example sampling target

1fe87df

glemaitre added 3 commits March 27, 2018 17:35

FIX test remove ratio from the test

8157bd7

EXA fix underline

78f91a0

DOC add whatsnew entry

b5c91c4

glemaitre changed the title ~~[WIP] EHN refactoring of the ratio argument.~~ [MRG] EHN refactoring of the ratio argument. Mar 27, 2018

DOC udpate whats new

db3f550

massich reviewed Mar 27, 2018

jorisvandenbossche reviewed Mar 27, 2018

glemaitre mentioned this pull request Mar 28, 2018

[WIP] ENH: Class Senstive Scaling #416

Open

add docstring substitution

199f320

glemaitre added 6 commits March 29, 2018 13:02

iter

89e27d9

DOC factorize docstring

7ebfb68

joris comments

e5a4dd3

TST add tests for injection in docstring

4ba215d

go back to old type class for python 2

0bf8f85

Rename and PEP8

09c5aaa

glemaitre merged commit 71ff0f6 into scikit-learn-contrib:master May 8, 2018
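
Several commits above deprecate `ratio` in favour of the new keyword. A minimal sketch of how such a deprecation path is commonly handled follows; `resolve_sampling_strategy` and the warning text are hypothetical and are not taken from the merged code.

```python
import warnings


def resolve_sampling_strategy(sampling_strategy='auto', ratio=None):
    """Hypothetical helper: prefer ``sampling_strategy`` but still accept ``ratio``."""
    if ratio is not None:
        # The old keyword still works, but the caller is told to migrate.
        warnings.warn(
            "'ratio' is deprecated; use 'sampling_strategy' instead.",
            DeprecationWarning,
        )
        return ratio
    return sampling_strategy


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = resolve_sampling_strategy(ratio={0: 10})

print(value, [str(w.message) for w in caught])
```

A test for the deprecation can then assert that exactly one `DeprecationWarning` is recorded when the old keyword is passed.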

