Skip to content

[MRG] EHN refactoring of the ratio argument. #413

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
May 8, 2018

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Mar 20, 2018

Reference Issue

closes #411
closes #406

What does this implement/fix? Explain your changes.

TODO

  • Implement ratio as a float for binary class + test
  • remove the dict for cleaning-method + test
  • deprecate the previous behaviour.
  • enable to pass a list for cleaning method + test
  • change 'auto' from 'all' to 'not minority'
  • add check_sampling_target
  • with float, check that the ratio will not over-sample or under-sample when it should not.
  • add documentation check_sampling_target
  • add an alias to ratio called sampling_target
  • deprecate check_ratio for check_sampling_target
  • add documentation for sampling_target
  • update all docstring
  • update the example ratio
  • update API documentation (add check_sampling_target)
  • update docstring of the datasets
  • check coverage report
  • entry in what's new
  • Rename everything to sampling_strategy

Any other comments?

@codecov
Copy link

codecov bot commented Mar 20, 2018

Codecov Report

Merging #413 into master will decrease coverage by 0.06%.
The diff coverage is 99.43%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #413      +/-   ##
==========================================
- Coverage   98.77%   98.71%   -0.07%     
==========================================
  Files          68       70       +2     
  Lines        4014     4188     +174     
==========================================
+ Hits         3965     4134     +169     
- Misses         49       54       +5
Impacted Files Coverage Δ
imblearn/ensemble/tests/test_balance_cascade.py 100% <100%> (ø) ⬆️
imblearn/ensemble/tests/test_easy_ensemble.py 100% <100%> (ø) ⬆️
...rn/under_sampling/prototype_generation/__init__.py 100% <100%> (ø) ⬆️
imblearn/tests/test_common.py 95.45% <100%> (ø) ⬆️
imblearn/over_sampling/adasyn.py 98.57% <100%> (+0.06%) ⬆️
...ampling/prototype_selection/tests/test_nearmiss.py 100% <100%> (ø) ⬆️
..._sampling/prototype_selection/tests/test_allknn.py 100% <100%> (ø) ⬆️
imblearn/combine/tests/test_smote_tomek.py 100% <100%> (ø) ⬆️
...prototype_selection/neighbourhood_cleaning_rule.py 100% <100%> (ø) ⬆️
...sampling/prototype_generation/cluster_centroids.py 100% <100%> (ø) ⬆️
... and 53 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24f4973...09c5aaa. Read the comment docs.

@pep8speaks
Copy link

pep8speaks commented Mar 27, 2018

Hello @glemaitre! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 27, 2018 at 15:54 Hours UTC

@glemaitre
Copy link
Member Author

glemaitre commented Mar 27, 2018

@massich @chkoar This is ready to be reviewed. It is a big one.

@glemaitre glemaitre changed the title [WIP] EHN refactoring of the ratio argument. [MRG] EHN refactoring of the ratio argument. Mar 27, 2018
behaviour.
The parameter ``sampling_target`` control which sample of the link will be
removed. For instance, the default (i.e., ``sampling_target='auto'``) will
remove the sample from the majority class. Both samples from the majority and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change remove the sample from for remove samples from. + I don't really understand what control which sample of the link will be removed means. link confuses me

@@ -0,0 +1,241 @@
"""
======================================================================
Usage of the ``sampling_target`` parameter for the different algorithm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Howto use the sampling_target parameter (depending on the sampling strategy)

Copy link

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big diff so didn't yet look at everything, but:

  • given how many times you repeat the explanation in the docstring, might be worth looking at a way how to share this to avoid repetition

  • I am not fully sure about "sampling_target" as keyword name. For the string options, this is an appropriate name, but for the float not really. Possible (although longer) alternatives: sampling_strategy, sampling_protocol

minority class after resampling and the number of samples in the
majority class, respectively.

.. warning::

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you indent this two spaced, then it is included in the list (which is better I think)

sampling_target : float, str, dict or callable, (default='auto')
Sampling information to resample the data set.

- When ``float``, it correspond to the ratio :math:`\\alpha_{os}`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correspond -> corresponds


``'minority'``: resample only the minority class;

``'majority'``: resample only the majority class;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a RandomOversampler, does 'majority' make any sense?


``'auto'``: equivalent to ``'not minority'``.

- When ``list``, the list contains the targeted classes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not clear to me what it does.

- If ``dict``, the keys correspond to the targeted classes. The values
correspond to the desired number of samples.
- If callable, function taking ``y`` and returns a ``dict``. The keys
sampling_target : float, str, dict, callable, (default='auto')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

identation

plot_pie(y)

###############################################################################
# Using ``sampling_target`` in resampling algorithm

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

algorithm -> algorithms


print('Information of the iris data set after making it'
' imbalanced using a callable: \n sampling_target={} \n y: {}'
.format(sampling_target, Counter(y)))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sampling_target is from the previous example

binary_mask = np.bitwise_or(y == 0, y == 2)
binary_y = y[binary_mask]
binary_X = X[binary_mask]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you show the counter of the data? So you can afterwards compare the number after resampling

#
# ``sampling_target`` can be given as a string which specify the class targeted
# by the resampling. With under- and over-sampling, the number of samples will
# be equalized.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emphasize you are no longer using the binary data


fig, ax = plt.subplots()
ax.pie(sizes, explode=explode, labels=labels, shadow=True,
autopct='%1.1f%%')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be possible to add both absolute number and percentages?

@glemaitre
Copy link
Member Author

Thanks @jorisvandenbossche
I don't like protocol but strategy seems generic enough.

Regarding the repetition, do you have something in mind? If it comes back to inject the proper docstring, I am not sure how to do that. If you think about a glossary, it could be cool but the issue is that we will have a docstring which will be generic for all over-, under-, cleaning-samplers.

What I mean is:

class MySampler(...):
"""....

    sampling_target: float, str, dict, callable
         Sampling strategy ....
         - If float, represent the balancing ratio, check `glossary <float_ratio>` for more details

""""

---
Glossary:

sampling_target as float : for over-sampling ...; for under-sampling

So somehow the user needs to know what he is using and select the proper explanation which was something that I wanted to avoid from the previous things that we had.

@chkoar
Copy link
Member

chkoar commented Mar 28, 2018

Lack of time here to review this PR but I have two comments to make.

given how many times you repeat the explanation in the docstring, might be worth looking at a way how to share this to avoid repetition

I had the same idea in #241

I am not fully sure about "sampling_target" as keyword name. For the string options, this is an appropriate name, but for the float not really. Possible (although longer) alternatives: sampling_strategy, sampling_protocol

I believe that words like strategy and protocol are very nice, even on their own.

@glemaitre
Copy link
Member Author

I believe that words like strategy and protocol are very nice, even on their own.

I think that pre-adding sampling is not harming. I can imagine the case of a meta-estimator using an estimator from scikit-learn which use the strategy keyword` and then we are doomed.

I had the same idea in #241

I still have the concern that it makes it a bit more difficult to contribute at first but at the end we ensure documentation quality. So I am incline to admit that I was wrong :)

@glemaitre
Copy link
Member Author

Ok so sampling_strategy kicked in and the docstring are factorize using the base class.

@massich @jorisvandenbossche @chkoar if you have any other remarks regarding the API, it would be nice.

Regarding the examples, I want to make them better in a next PR.

@chkoar
Copy link
Member

chkoar commented Apr 2, 2018

I had the same idea in #241

I still have the concern that it makes it a bit more difficult to contribute at first but at the end we ensure documentation quality. So I am incline to admit that I was wrong :)

I still have the concern that it makes it a bit more difficult to contribute at first but at the end we ensure documentation quality. So I am incline to admit that I was wrong :)

Sorry @glemaitre. My bad. I was never referred to the class docstrings. Maybe I haven't said that explicitly. I actually said that for the fit and the _sample methods for the derived classes. As I am seeing the class docstrings you committed I understood why you had concerns. :D

@glemaitre
Copy link
Member Author

glemaitre commented Apr 2, 2018 via email

@glemaitre glemaitre merged commit 71ff0f6 into scikit-learn-contrib:master May 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RFC] Refactor ratio Improve the ratio explanation in the docstring of the samplers
5 participants