[ML] Frequent Items: use a bitset for deduplication #88943

Merged: hendrikmuhs merged 5 commits into elastic:main from frequent-items-bitset3 on Aug 1, 2022

Conversation

@hendrikmuhs hendrikmuhs commented Jul 29, 2022

Using bitsets instead of lists of longs makes de-duplicating item sets faster. Each item is assigned a bit according to its position in the order of top items (by count).

[Screenshot: Screenshot_20220729_153822]

Notes:

  • the bitset might also be useful for transactions: it could speed up the lookup that checks whether a candidate set matches a transaction
  • bitsets reduce memory requirements (the memory needed to remember collected sets)
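
To illustrate the idea, here is a minimal sketch of bitset-based de-duplication. It is not the code from this PR: the class `BitSetDedupSketch` and its methods are hypothetical names, and it assumes the top items are already available as a list ordered by descending count. Each item gets the bit at its rank, so two item sets with the same members produce equal `java.util.BitSet` values no matter the order in which the items were listed.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the implementation merged in this PR.
final class BitSetDedupSketch {
    // Maps an item id to its rank in the list of top items (0 = most frequent).
    private final Map<Long, Integer> rankByItem = new HashMap<>();
    // Item sets already collected, stored as bitsets.
    private final Set<BitSet> seen = new HashSet<>();

    // Assumption: the caller passes the top items ordered by descending count.
    BitSetDedupSketch(List<Long> itemsOrderedByCountDesc) {
        for (int rank = 0; rank < itemsOrderedByCountDesc.size(); rank++) {
            rankByItem.put(itemsOrderedByCountDesc.get(rank), rank);
        }
    }

    // Encodes an item set by setting the bit at each item's rank.
    BitSet encode(List<Long> itemSet) {
        BitSet bits = new BitSet(rankByItem.size());
        for (Long item : itemSet) {
            Integer rank = rankByItem.get(item);
            if (rank == null) {
                throw new IllegalArgumentException("unknown item: " + item);
            }
            bits.set(rank);
        }
        return bits;
    }

    // Returns true if the item set was not seen before, false for a duplicate.
    boolean addIfNew(List<Long> itemSet) {
        return seen.add(encode(itemSet));
    }

    // True if every bit of the candidate is also set in the transaction.
    static boolean isSubset(BitSet candidate, BitSet transaction) {
        BitSet rest = (BitSet) candidate.clone();
        rest.andNot(transaction);
        return rest.isEmpty();
    }
}
```

With top items `[10, 20, 30]`, the sets `[10, 30]` and `[30, 10]` both encode to bits 0 and 2, so the second is rejected as a duplicate. The `isSubset` helper hints at the transaction lookup mentioned in the notes: if transactions were encoded the same way, matching a candidate would reduce to a bitwise operation instead of a scan over lists of longs.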

@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Jul 29, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Collaborator

Hi @hendrikmuhs, I've created a changelog YAML for you.

Member

@benwtrent benwtrent left a comment

Those numbers look great! I like some of the refactors as well. Just some minor comments.

@hendrikmuhs hendrikmuhs merged commit e64eb8c into elastic:main Aug 1, 2022
@hendrikmuhs hendrikmuhs deleted the frequent-items-bitset3 branch August 1, 2022 13:16
Labels
>enhancement :ml Machine learning Team:ML Meta label for the ML team v8.5.0
3 participants