Add a Filter for text-based columns #1763

Closed
CESARDELATORRE opened this issue Nov 28, 2018 · 1 comment
Labels
API (Issues pertaining the friendly API), enhancement (New feature or request)

Comments

@CESARDELATORRE
Contributor

AFAIK, the new filter APIs can target only numeric columns, not text-based columns.
As of v0.8 we have:

1) FilterByColumn()

IDataView trainingDataView = mlContext.Data.FilterByColumn(baseTrainingDataView, "FareAmount", lowerBound: 1, upperBound: 150);

This is a very convenient filter, but only for NUMERIC values. I'm currently using this for the sample code snippet.

2) FilterByKeyColumnFraction()

Good for hashed values.

But I think we cannot filter by text-based columns yet. Two examples: removing the rows where a text-based column has no value (this might be doable by transforming to numeric values, so that a missing value becomes NaN and gets filtered, but that needs an additional, not straightforward step), or removing specific rows where a categorical value is equal to "some text".

@CESARDELATORRE CESARDELATORRE added the enhancement New feature or request label Nov 28, 2018
@TomFinley
Contributor

TomFinley commented Nov 29, 2018

This is fine, but I wonder if we can go even a step further and make something generally applicable to many things rather than just something very, very specific to text.

So: data pipelines must be serializable, and before ML.NET was open sourced, filters were part of data pipelines. Pursuant to #933, our position is that filters should no longer be part of data pipelines, which is why the old filter functionality is not exposed as IEstimator/ITransformer but as plain functions on IDataView, e.g.:

public IDataView FilterByColumn(IDataView input, string columnName, double lowerBound = double.NegativeInfinity, double upperBound = double.PositiveInfinity)

The implication is that filters no longer need to be serializable, which means we could probably simplify a lot of code by deleting practically all of the existing filters and replacing them with some sort of bool-evaluating delegate akin to a LINQ .Where. (We can't serialize delegates, but since we no longer care about that, that's fine.)

This means this code here:

var data1 = ML.Data.FilterByColumn(data, "Floats", upperBound: 2.8);

Could potentially just be:

var data1 = ML.Data.KeepWhere(data, "Floats", (float v) => v < 2.8);

Your specific example (assuming the values are in some text column called MyAwesomeText) might be something akin to:

var data1 = ML.Data.KeepWhere(data, "MyAwesomeText", (ReadOnlyMemory<char> v) => v.Length > 0);

Whether we want to have these existing methods as a convenience, or add the convenience you suggest, I don't really know.
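To make the delegate idea concrete, here is a minimal sketch in plain C# that does not touch ML.NET at all. Everything in it is an assumption for illustration: rows are modeled as dictionaries keyed by column name instead of an IDataView, text values are plain string rather than ReadOnlyMemory<char>, and the KeepWhere helper, FilterSketch and FilterSketchDemo are hypothetical names. It only shows that one predicate-based filter could cover both the numeric and the text cases.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FilterSketch
{
    // Hypothetical helper, NOT an ML.NET API: keeps the rows whose value in
    // the named column has type T and satisfies the predicate. Rows that lack
    // the column, or hold a value of another type, are dropped.
    public static IEnumerable<IReadOnlyDictionary<string, object>> KeepWhere<T>(
        this IEnumerable<IReadOnlyDictionary<string, object>> rows,
        string columnName,
        Func<T, bool> predicate)
    {
        if (rows == null) throw new ArgumentNullException(nameof(rows));
        if (predicate == null) throw new ArgumentNullException(nameof(predicate));

        return rows.Where(row =>
            row.TryGetValue(columnName, out var value)
            && value is T typed
            && predicate(typed));
    }
}

public static class FilterSketchDemo
{
    // Made-up sample data; the column names mirror the ones used in this thread.
    public static List<IReadOnlyDictionary<string, object>> SampleRows() =>
        new List<IReadOnlyDictionary<string, object>>
        {
            new Dictionary<string, object> { ["MyAwesomeText"] = "hello",     ["Floats"] = 1.5f },
            new Dictionary<string, object> { ["MyAwesomeText"] = "",          ["Floats"] = 3.0f },
            new Dictionary<string, object> { ["MyAwesomeText"] = "some text", ["Floats"] = 2.0f },
        };
}
```

With this sketch, rows.KeepWhere("MyAwesomeText", (string v) => v.Length > 0) drops the row with the empty text, rows.KeepWhere("MyAwesomeText", (string v) => v != "some text") removes rows matching a specific categorical value, and rows.KeepWhere("Floats", (float v) => v < 2.8) plays the role of the upperBound filter. Because the predicate is an ordinary delegate, none of this is serializable, which is the trade-off described above.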

@glebuk glebuk added the API Issues pertaining the friendly API label Jan 18, 2019
@codemzs codemzs closed this as completed Jun 30, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 26, 2022

4 participants