-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[MRG] EHN refactoring of the ratio argument. #413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #413 +/- ##
==========================================
- Coverage 98.77% 98.71% -0.07%
==========================================
Files 68 70 +2
Lines 4014 4188 +174
==========================================
+ Hits 3965 4134 +169
- Misses 49 54 +5
Continue to review full report at Codecov.
|
Hello @glemaitre! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on March 27, 2018 at 15:54 Hours UTC |
@jorisvandenbossche I would like to have your feedback as well. You can check the documentation of those three classes which is representative of the full PR:
|
doc/under_sampling.rst
Outdated
behaviour. | ||
The parameter ``sampling_target`` control which sample of the link will be | ||
removed. For instance, the default (i.e., ``sampling_target='auto'``) will | ||
remove the sample from the majority class. Both samples from the majority and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change remove the sample from
for remove samples from
. + I don't really understand what control which sample of the link will be removed
means. link
confuses me
@@ -0,0 +1,241 @@ | |||
""" | |||
====================================================================== | |||
Usage of the ``sampling_target`` parameter for the different algorithm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Howto use the sampling_target
parameter (depending on the sampling strategy)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big diff so didn't yet look at everything, but:
-
given how many times you repeat the explanation in the docstring, might be worth looking at a way how to share this to avoid repetition
-
I am not fully sure about "sampling_target" as keyword name. For the string options, this is an appropriate name, but for the float not really. Possible (although longer) alternatives:
sampling_strategy
,sampling_protocol
minority class after resampling and the number of samples in the | ||
majority class, respectively. | ||
|
||
.. warning:: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you indent this two spaced, then it is included in the list (which is better I think)
sampling_target : float, str, dict or callable, (default='auto') | ||
Sampling information to resample the data set. | ||
|
||
- When ``float``, it correspond to the ratio :math:`\\alpha_{os}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correspond -> corresponds
|
||
``'minority'``: resample only the minority class; | ||
|
||
``'majority'``: resample only the majority class; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a RandomOversampler, does 'majority' make any sense?
|
||
``'auto'``: equivalent to ``'not minority'``. | ||
|
||
- When ``list``, the list contains the targeted classes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not clear to me what it does.
- If ``dict``, the keys correspond to the targeted classes. The values | ||
correspond to the desired number of samples. | ||
- If callable, function taking ``y`` and returns a ``dict``. The keys | ||
sampling_target : float, str, dict, callable, (default='auto') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
identation
plot_pie(y) | ||
|
||
############################################################################### | ||
# Using ``sampling_target`` in resampling algorithm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
algorithm -> algorithms
|
||
print('Information of the iris data set after making it' | ||
' imbalanced using a callable: \n sampling_target={} \n y: {}' | ||
.format(sampling_target, Counter(y))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sampling_target
is from the previous example
binary_mask = np.bitwise_or(y == 0, y == 2) | ||
binary_y = y[binary_mask] | ||
binary_X = X[binary_mask] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you show the counter of the data? So you can afterwards compare the number after resampling
# | ||
# ``sampling_target`` can be given as a string which specify the class targeted | ||
# by the resampling. With under- and over-sampling, the number of samples will | ||
# be equalized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emphasize you are no longer using the binary data
|
||
fig, ax = plt.subplots() | ||
ax.pie(sizes, explode=explode, labels=labels, shadow=True, | ||
autopct='%1.1f%%') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be possible to add both absolute number and percentages?
Thanks @jorisvandenbossche Regarding the repetition, do you have something in mind? If it comes back to inject the proper docstring, I am not sure how to do that. If you think about a glossary, it could be cool but the issue is that we will have a docstring which will be generic for all over-, under-, cleaning-samplers. What I mean is:
So somehow the user needs to know what he is using and select the proper explanation which was something that I wanted to avoid from the previous things that we had. |
Lack of time here to review this PR but I have two comments to make.
I had the same idea in #241
I believe that words like strategy and protocol are very nice, even on their own. |
I think that pre-adding sampling is not harming. I can imagine the case of a meta-estimator using an estimator from scikit-learn which use the
I still have the concern that it makes it a bit more difficult to contribute at first but at the end we ensure documentation quality. So I am incline to admit that I was wrong :) |
Ok so @massich @jorisvandenbossche @chkoar if you have any other remarks regarding the API, it would be nice. Regarding the examples, I want to make them better in a next PR. |
Sorry @glemaitre. My bad. I was never referred to the class docstrings. Maybe I haven't said that explicitly. I actually said that for the |
Actually good point. We can do that in another PR when subsitution class will be merged.
|
Reference Issue
closes #411
closes #406
What does this implement/fix? Explain your changes.
TODO
float
, check that the ratio will not over-sample or under-sample when it should not.sampling_strategy
Any other comments?