Skip to content

Simplify BucketedSort/Teach BitArray a useful trick #53199

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 6, 2020

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Mar 5, 2020

Our lovely BitArray compactly stores "flags", lazilly growing its
underlying storage. It is super useful when you need to store one bit of
data for a zillion buckets or a documents or something. Usefully, it
defaults to false. But there is a wrinkle! If you ask it whether or
not a bit is set but it hasn't grown its underlying storage array
"around" that index then it'll throw an ArrayIndexOutOfBoundsException.
The per-document use cases tend to show up in order and don't tend to
mind this too much. But the use case in aggregations, the per-bucket use
case, does. Because buckets are collected out of order all the time.

This changes BitArray so it'll return false if the index is too big
for the underlying storage. After all, that index can't have been set
or else we would have grown the underlying array. Logically, I believe
this makes sense. And it makes my life easy. At the cost of three lines.

but this adds an extra test to every call to get. I think this is
likely ok because it is "very close" to an array index lookup that
already runs the same test. So I think it'll end up merged with the
array bounds check.

Our lovely `BitArray` compactly stores "flags", lazilly growing its
underlying storage. It is super useful when you need to store one bit of
data for a zillion buckets or a documents or something. Usefully, it
defaults to `false`. But there is a wrinkle! If you ask it whether or
not a bit is set but it hasn't grown its underlying storage array
"around" that index then it'll throw an `ArrayIndexOutOfBoundsException`.
The per-document use cases tend to show up in order and don't tend to
mind this too much. But the use case in aggregations, the per-bucket use
case, does. Because buckets are collected out of order all the time.

This changes `BitArray` so it'll return `false` if the index is too big
for the underlying storage. After all, that index *can't* have been set
or else we would have grown the underlying array. Logically, I believe
this makes sense. And it makes my life easy. At the cost of three lines.

*but* this adds an extra test to every call to `get`. I think this is
likely ok because it is "very close" to an array index lookup that
already runs the same test. So I *think* it'll end up merged with the
array bounds check.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

Copy link
Contributor

@csoulios csoulios left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines +59 to +65
if (wordNum >= bits.size()) {
/*
* If the word is bigger than the array then it could *never* have
* been set.
*/
return false;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good catch. It feels more natural and robust to test BitArray size within the BitArray implementation than add it to the calling methods (such as in BucketSort).

When it comes to performance, we can't really say if this test affects some compiler optimization unless we test/benchmark it. Also, I wonder if this is faster or just catching the ArrayIndexOutOfBoundsException thrown by bits.get(wordNum) within the method and simply returning false. I know try/catch blocks can have performance impacts too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can check that, but I expect it'll depend on the frequency of throw. I already have some benchmarks i can run it against though so it is pretty quick to test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would assume that most of the times an exception is not thrown. Yet, a try-catch block has its performance implications even no exceptions are thrown.

@@ -37,16 +37,32 @@ public BitArray(int initialSize, BigArrays bigArrays) {
this.bits = bigArrays.newLongArray(initialSize, true);
}

/**
Copy link
Contributor

@csoulios csoulios Mar 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always appreciate it when I see some javadoc added 👍

@nik9000
Copy link
Member Author

nik9000 commented Mar 6, 2020

I ran my rally benchmarks and didn't really so a huge difference, oddly. So I'm going to stick with the explicit check because I think it is more clear.

@nik9000 nik9000 merged commit d26face into elastic:master Mar 6, 2020
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Mar 6, 2020
Our lovely `BitArray` compactly stores "flags", lazilly growing its
underlying storage. It is super useful when you need to store one bit of
data for a zillion buckets or a documents or something. Usefully, it
defaults to `false`. But there is a wrinkle! If you ask it whether or
not a bit is set but it hasn't grown its underlying storage array
"around" that index then it'll throw an `ArrayIndexOutOfBoundsException`.
The per-document use cases tend to show up in order and don't tend to
mind this too much. But the use case in aggregations, the per-bucket use
case, does. Because buckets are collected out of order all the time.

This changes `BitArray` so it'll return `false` if the index is too big
for the underlying storage. After all, that index *can't* have been set
or else we would have grown the underlying array. Logically, I believe
this makes sense. And it makes my life easy. At the cost of three lines.

*but* this adds an extra test to every call to `get`. I think this is
likely ok because it is "very close" to an array index lookup that
already runs the same test. So I *think* it'll end up merged with the
array bounds check.
nik9000 added a commit that referenced this pull request Mar 6, 2020
Our lovely `BitArray` compactly stores "flags", lazilly growing its
underlying storage. It is super useful when you need to store one bit of
data for a zillion buckets or a documents or something. Usefully, it
defaults to `false`. But there is a wrinkle! If you ask it whether or
not a bit is set but it hasn't grown its underlying storage array
"around" that index then it'll throw an `ArrayIndexOutOfBoundsException`.
The per-document use cases tend to show up in order and don't tend to
mind this too much. But the use case in aggregations, the per-bucket use
case, does. Because buckets are collected out of order all the time.

This changes `BitArray` so it'll return `false` if the index is too big
for the underlying storage. After all, that index *can't* have been set
or else we would have grown the underlying array. Logically, I believe
this makes sense. And it makes my life easy. At the cost of three lines.

*but* this adds an extra test to every call to `get`. I think this is
likely ok because it is "very close" to an array index lookup that
already runs the same test. So I *think* it'll end up merged with the
array bounds check.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants