
[ML] Add audit message when categorization detects too many categories #50319

Closed
droberts195 opened this issue Dec 18, 2019 · 2 comments

@droberts195
Contributor

If data that is not suitable for categorization is categorized, an excessive number of categories can be created, each containing only a very small number of messages.

The resulting categories are not very useful and are also resource hungry, both in terms of the results documents that have to be processed and because they increase the cardinality of the chained anomaly detection.

To make it clearer that such a situation has occurred, and to encourage the user to stop the affected job, we should write an audit message when a job has an excessive number of categories.

The condition for doing this could be as simple as "number of categories > 1000".

Or we could go for something more advanced that scales with the amount of input, like "number of input documents > 1000 and number of categories > number of input documents / 10" or "number of categories > 3 * √number of input documents".
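
For comparison, here is a minimal sketch of the three candidate conditions, written in Python for brevity. The function and parameter names are hypothetical and chosen only for illustration. This is not the check that was actually implemented, just the conditions quoted above expressed as predicates.

```python
import math


def exceeds_fixed_limit(num_categories: int, limit: int = 1000) -> bool:
    """Simple condition: number of categories > 1000."""
    return num_categories > limit


def exceeds_ratio(num_categories: int, num_docs: int,
                  min_docs: int = 1000, divisor: int = 10) -> bool:
    """Ratio condition: number of input documents > 1000 and
    number of categories > number of input documents / 10."""
    return num_docs > min_docs and num_categories > num_docs / divisor


def exceeds_sqrt(num_categories: int, num_docs: int,
                 factor: float = 3.0) -> bool:
    """Square-root condition: number of categories > 3 * sqrt(number of input documents)."""
    return num_categories > factor * math.sqrt(num_docs)
```

The fixed limit ignores how much data has been seen, whereas the other two scale with the input. For example, with 10,000 input documents the ratio condition triggers above 1,000 categories and the square-root condition above 300.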

@droberts195 droberts195 added >enhancement :ml Machine learning labels Dec 18, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@droberts195 droberts195 self-assigned this Feb 11, 2020
@droberts195
Contributor Author

A rudimentary check was added in 7.6 in #51146.

This was replaced with a better check for 7.7 and above in #52195.
