Sampler should also sample sample_weight and return it #457
Comments
I really do not know. What would the sample_weight of a new instance be in the case of oversampling? |
A constant I would think but I don't know what would be meaningful. |
It would still be useful to allow users to pass arbitrary arrays (as long as they have the correct number of rows). For undersampling the way to proceed is straightforward, but for oversampling I think the user would need to redefine sample_weights manually, since there is no one-size-fits-all answer. Outside the specific case of sample_weights, arrays to be resampled could simply contain NA for artificial data points created by oversampling. Short of that, a viable workaround may be to retain the original row indices after over- or under-sampling a pandas DataFrame, so users can re-join matching pandas Series to the DataFrame after resampling. I can imagine a scenario where a user wants to oversample a dataset but ignore certain columns for
|
So one type of fairness-related method is reweighing, which would change the |
Is there anything new on this topic? I am facing a similar issue, where passing sample_weight to the pipeline takes priority over using a sampler. I agree with @jimbudarz: artificial (oversampled) data points could just get a default sample weight of NaN. While I am not too deep in the architecture, could another option be to give the user the ability to pass a lambda function (to the oversampler's constructor) that tells the pipeline how to build the sample weights? E.g.,

```python
from sklearn import datasets
from imblearn.over_sampling import SMOTE

df = datasets.load_iris(as_frame=True)["data"]
build_weight = lambda x: 1 / x["sepal length (cm)"]

# initial construction of the sample weights
sample_weights = df.apply(build_weight, axis=1)

# idea: let the sampler rebuild weights for synthetic rows
sampler = SMOTE(sample_weight_lambda=build_weight)
``` |
Some scikit-learn estimators rely on sample_weight, but the current samplers do not accept it. We should at least be able to resample the sample_weight as well. However, it should remain compatible with the Pipeline API.
@chkoar, do you have a clue how to handle it?