
[MRG] EHN refactoring of the ratio argument. #413

Merged
merged 24 commits into from
May 8, 2018
1 change: 1 addition & 0 deletions doc/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -205,4 +205,5 @@ Imbalance-learn provides some fast-prototyping tools.
utils.estimator_checks.check_estimator
utils.check_neighbors_object
utils.check_ratio
utils.check_sampling_strategy
utils.hash_X_y
24 changes: 13 additions & 11 deletions doc/datasets/index.rst
Expand Up @@ -94,29 +94,31 @@ Imbalanced generator
====================

:func:`make_imbalance` turns an original dataset into an imbalanced
dataset. This behaviour is driven by the parameter ``ratio`` which behave
similarly to other resampling algorithm. ``ratio`` can be given as a dictionary
where the key corresponds to the class and the value is the the number of
samples in the class::
dataset. This behaviour is driven by the parameter ``sampling_strategy`` which
behave similarly to other resampling algorithm. ``sampling_strategy`` can be
given as a dictionary where the key corresponds to the class and the value is
the number of samples in the class::

>>> from sklearn.datasets import load_iris
>>> from imblearn.datasets import make_imbalance
>>> iris = load_iris()
>>> ratio = {0: 20, 1: 30, 2: 40}
>>> X_imb, y_imb = make_imbalance(iris.data, iris.target, ratio=ratio)
>>> sampling_strategy = {0: 20, 1: 30, 2: 40}
>>> X_imb, y_imb = make_imbalance(iris.data, iris.target,
... sampling_strategy=sampling_strategy)
>>> sorted(Counter(y_imb).items())
[(0, 20), (1, 30), (2, 40)]

Note that all samples of a class are passed-through if the class is not mentioned
in the dictionary::

>>> ratio = {0: 10}
>>> X_imb, y_imb = make_imbalance(iris.data, iris.target, ratio=ratio)
>>> sampling_strategy = {0: 10}
>>> X_imb, y_imb = make_imbalance(iris.data, iris.target,
... sampling_strategy=sampling_strategy)
>>> sorted(Counter(y_imb).items())
[(0, 10), (1, 50), (2, 50)]

Instead of a dictionary, a function can be defined and directly pass to
``ratio``::
``sampling_strategy``::

>>> def ratio_multiplier(y):
... multiplier = {0: 0.5, 1: 0.7, 2: 0.95}
Expand All @@ -125,9 +127,9 @@ Instead of a dictionary, a function can be defined and directly pass to
... target_stats[key] = int(value * multiplier[key])
... return target_stats
>>> X_imb, y_imb = make_imbalance(iris.data, iris.target,
... ratio=ratio_multiplier)
... sampling_strategy=ratio_multiplier)
>>> sorted(Counter(y_imb).items())
[(0, 25), (1, 35), (2, 47)]

See :ref:`sphx_glr_auto_examples_datasets_plot_make_imbalance.py` and
:ref:`sphx_glr_auto_examples_plot_ratio_usage.py`.
:ref:`sphx_glr_auto_examples_plot_sampling_strategy_usage.py`.
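
As a rough illustration of the callable form described in the hunk above, the sketch below uses only the standard library. It assumes the callable receives the label vector ``y`` and returns a dictionary mapping each class to a target number of samples. The function name ``scaled_class_counts`` and the synthetic labels are hypothetical and are not part of this PR.

```python
from collections import Counter


def scaled_class_counts(y):
    """Return {class: target count} by scaling each class's current count."""
    # Hypothetical per-class multipliers, mirroring the values in the doc example.
    multiplier = {0: 0.5, 1: 0.7, 2: 0.95}
    target_stats = Counter(y)
    return {key: int(count * multiplier[key])
            for key, count in target_stats.items()}


# Synthetic labels with 50 samples per class (the same class sizes as iris).
y = [0] * 50 + [1] * 50 + [2] * 50
targets = scaled_class_counts(y)
print(sorted(targets.items()))
```

A function of this shape could then be passed as ``sampling_strategy`` in place of the dictionary, as the documentation hunk shows with ``ratio_multiplier``.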
7 changes: 4 additions & 3 deletions doc/developers_utils.rst
Expand Up @@ -26,9 +26,10 @@ which accepts arrays, matrices, or sparse matrices as arguments, the following
should be used when applicable.

- :func:`check_neighbors_object`: Check the objects is consistent to be a NN.
- :func:`check_target_type`: Check the target types to be conform to the current samplers.
- :func:`check_ratio`: Checks ratio for consistent type and return a dictionary
containing each targeted class with its corresponding number of pixel.
- :func:`check_target_type`: Check the target types to be conform to the current samplers.
- :func:`check_sampling_strategy`: Checks that sampling target is consistent with
the type and return a dictionary containing each targeted class with its
corresponding number of pixel.


Deprecation
Expand Down
5 changes: 3 additions & 2 deletions doc/ensemble.rst
Expand Up @@ -92,12 +92,13 @@ output of an :class:`EasyEnsemble` sampler with an ensemble of classifiers
(i.e. ``BaggingClassifier``). Therefore, :class:`BalancedBaggingClassifier`
takes the same parameters than the scikit-learn
``BaggingClassifier``. Additionally, there is two additional parameters,
``ratio`` and ``replacement``, as in the :class:`EasyEnsemble` sampler::
``sampling_strategy`` and ``replacement``, as in the :class:`EasyEnsemble`
sampler::


>>> from imblearn.ensemble import BalancedBaggingClassifier
>>> bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
... ratio='auto',
... sampling_strategy='auto',
... replacement=False,
... random_state=0)
>>> bbc.fit(X_train, y_train) # doctest: +ELLIPSIS
Expand Down
12 changes: 6 additions & 6 deletions doc/under_sampling.rst
Expand Up @@ -103,7 +103,7 @@ by considering independently each targeted class::
>>> print(np.vstack({tuple(row) for row in X_resampled}).shape)
(181, 2)

See :ref:`sphx_glr_auto_examples_plot_ratio_usage.py`,
See :ref:`sphx_glr_auto_examples_plot_sampling_strategy_usage.py`.,
:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`,
and :ref:`sphx_glr_auto_examples_under-sampling_plot_random_under_sampler.py`.

Expand Down Expand Up @@ -214,11 +214,11 @@ the samples of interest in green.
:scale: 60
:align: center

The parameter ``ratio`` control which sample of the link will be removed. For
instance, the default (i.e., ``ratio='auto'``) will remove the sample from the
majority class. Both samples from the majority and minority class can be
removed by setting ``ratio`` to ``'all'``. The figure illustrates this
behaviour.
The parameter ``sampling_strategy`` control which sample of the link will be
removed. For instance, the default (i.e., ``sampling_strategy='auto'``) will
remove the sample from the majority class. Both samples from the majority and
minority class can be removed by setting ``sampling_strategy`` to ``'all'``. The
figure illustrates this behaviour.

.. image:: ./auto_examples/under-sampling/images/sphx_glr_plot_illustration_tomek_links_002.png
:target: ./auto_examples/under-sampling/plot_illustration_tomek_links.html
Expand Down
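
The Tomek-links paragraph in the hunk above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes a binary problem, treats a Tomek link as a pair of mutual nearest neighbours with different labels, and uses the hypothetical helpers ``tomek_links`` and ``indices_to_remove`` to show how ``'auto'`` and ``'all'`` differ.

```python
def tomek_links(X, y):
    """Return pairs (i, j), i < j, of mutual nearest neighbours with different labels."""
    def nearest(i):
        # Brute-force nearest neighbour using squared Euclidean distance.
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def indices_to_remove(links, y, majority_class, strategy="auto"):
    """'auto' drops only the majority sample of each link; 'all' drops both."""
    removed = set()
    for pair in links:
        for idx in pair:
            if strategy == "all" or y[idx] == majority_class:
                removed.add(idx)
    return sorted(removed)


# Toy 1-D data: class 0 is the majority, and sample 2 is the only minority sample.
X = [(0.0,), (1.0,), (1.2,), (3.0,), (5.0,)]
y = [0, 0, 1, 0, 0]
links = tomek_links(X, y)
print(links)
print(indices_to_remove(links, y, majority_class=0, strategy="auto"))
print(indices_to_remove(links, y, majority_class=0, strategy="all"))
```

With ``'auto'`` only the majority-class member of the link is dropped, while ``'all'`` drops both members, which matches the behaviour the paragraph describes.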
29 changes: 29 additions & 0 deletions doc/whats_new/v0.0.4.rst
Expand Up @@ -6,6 +6,18 @@ Version 0.4 (under development)
Changelog
---------

API
...

- Replace the parameter ``ratio`` by ``sampling_strategy``. :issue:`411` by
:user:`Guillaume Lemaitre <glemaitre>`.

- Enable to use a ``float`` with binary classification for
``sampling_strategy``. :issue:`411` by :user:`Guillaume Lemaitre <glemaitre>`.

- Enable to use a ``list`` for the cleaning methods to specify the class to
sample. :issue:`411` by :user:`Guillaume Lemaitre <glemaitre>`.

Enhancement
...........

Expand Down Expand Up @@ -34,3 +46,20 @@ Maintenance

- Remove deprecated parameters in 0.2 - :issue:`331` by :user:`Guillaume
Lemaitre <glemaitre>`.

Deprecation
...........

- Deprecate ``ratio`` in favor of ``sampling_strategy``. :issue:`411` by
:user:`Guillaume Lemaitre <glemaitre>`.

- Deprecate the use of a ``dict`` for cleaning methods. a ``list`` should be
used. :issue:`411` by :user:`Guillaume Lemaitre <glemaitre>`.

- Deprecate ``random_state`` in :class:`imblearn.under_sampling.NearMiss`,
:class:`imblearn.under_sampling.EditedNearestNeighbors`,
:class:`imblearn.under_sampling.RepeatedEditedNearestNeighbors`,
:class:`imblearn.under_sampling.AllKNN`,
:class:`imblearn.under_sampling.NeighbourhoodCleaningRule`,
:class:`imblearn.under_sampling.InstanceHardnessThreshold`,
:class:`imblearn.under_sampling.CondensedNearestNeighbours`.
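
The changelog entry about using a ``float`` for binary classification can be made concrete with a small sketch. It assumes the float is the desired ratio of minority to majority sample counts after resampling, and it covers under-sampling only. The changelog text above does not spell out this semantics, so treat it as an assumption. The helper ``undersampling_targets`` is hypothetical.

```python
from collections import Counter


def undersampling_targets(y, ratio):
    """Turn a float ratio into {class: target count} for a binary target."""
    counts = Counter(y)
    if len(counts) != 2:
        raise ValueError("a float strategy is only defined here for binary targets")
    # most_common() orders classes from most to least frequent.
    (majority, n_majority), (minority, n_minority) = counts.most_common()
    # Assumption: ratio = n_minority / n_majority after under-sampling.
    target_majority = int(n_minority / ratio)
    if target_majority > n_majority:
        raise ValueError("ratio would require adding majority samples")
    return {majority: target_majority, minority: n_minority}


# 90 majority samples and 10 minority samples.
y = [0] * 90 + [1] * 10
targets = undersampling_targets(y, 0.5)
print(targets)
```

Under this assumption, a ratio of ``0.5`` keeps all 10 minority samples and reduces the majority class to 20 samples.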
7 changes: 4 additions & 3 deletions examples/applications/plot_multi_class_under_sampling.py
Expand Up @@ -29,8 +29,9 @@

# Create a folder to fetch the dataset
iris = load_iris()
X, y = make_imbalance(iris.data, iris.target, ratio={0: 25, 1: 50, 2: 50},
random_state=0)
X, y = make_imbalance(iris.data, iris.target,
sampling_strategy={0: 25, 1: 50, 2: 50},
random_state=RANDOM_STATE)

X_train, X_test, y_train, y_test = train_test_split(
X, y, random_state=RANDOM_STATE)
Expand All @@ -39,7 +40,7 @@
print('Testing target statistics: {}'.format(Counter(y_test)))

# Create a pipeline
pipeline = make_pipeline(NearMiss(version=2, random_state=RANDOM_STATE),
pipeline = make_pipeline(NearMiss(version=2),
LinearSVC(random_state=RANDOM_STATE))
pipeline.fit(X_train, y_train)

Expand Down
4 changes: 2 additions & 2 deletions examples/datasets/plot_make_imbalance.py
Expand Up @@ -55,12 +55,12 @@ def ratio_func(y, multiplier, minority_class):
for i, multiplier in enumerate(multipliers, start=1):
ax = axs[i]

X_, y_ = make_imbalance(X, y, ratio=ratio_func,
X_, y_ = make_imbalance(X, y, sampling_strategy=ratio_func,
**{"multiplier": multiplier,
"minority_class": 1})
ax.scatter(X_[y_ == 0, 0], X_[y_ == 0, 1], label="Class #0", alpha=0.5)
ax.scatter(X_[y_ == 1, 0], X_[y_ == 1, 1], label="Class #1", alpha=0.5)
ax.set_title('ratio = {}'.format(multiplier))
ax.set_title('sampling_strategy = {}'.format(multiplier))
plot_decoration(ax)

plt.tight_layout()
Expand Down
134 changes: 0 additions & 134 deletions examples/plot_ratio_usage.py

This file was deleted.
