Simplify BucketedSort/Teach BitArray a useful trick #53199
Our lovely `BitArray` compactly stores "flags", lazily growing its underlying storage. It is super useful when you need to store one bit of data for a zillion buckets or documents or something. Usefully, it defaults to `false`. But there is a wrinkle! If you ask it whether or not a bit is set but it hasn't grown its underlying storage array "around" that index then it'll throw an `ArrayIndexOutOfBoundsException`.

The per-document use cases tend to show up in order and don't tend to mind this too much. But the per-bucket use case in aggregations does, because buckets are collected out of order all the time.

This changes `BitArray` so it'll return `false` if the index is too big for the underlying storage. After all, that index *can't* have been set or else we would have grown the underlying array. Logically, I believe this makes sense. And it makes my life easy, at the cost of three lines.

*But* this adds an extra test to every call to `get`. I think this is likely ok because it is "very close" to an array index lookup that already runs the same test. So I *think* it'll end up merged with the array bounds check.
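To make the behavior described above concrete, here is a minimal sketch of a lazily growing bit array whose `get` returns `false` for indices past the storage. It is not the real class: `SimpleBitArray` is a hypothetical name, and it uses a plain `long[]` where `BitArray` uses storage allocated through `BigArrays`. It also assumes non-negative indices.

```java
import java.util.Arrays;

/**
 * Hypothetical stand-in for BitArray, backed by a plain long[] rather than
 * the BigArrays-allocated storage the real class uses.
 */
final class SimpleBitArray {
    private long[] bits;

    SimpleBitArray(int initialWords) {
        // Keep at least one word so the doubling below always makes progress.
        this.bits = new long[Math.max(1, initialWords)];
    }

    /** Sets the bit at index, growing the storage if it is too small. */
    void set(int index) {
        int wordNum = index >> 6; // 64 bits per long
        if (wordNum >= bits.length) {
            int newLength = Math.max(bits.length * 2, wordNum + 1);
            bits = Arrays.copyOf(bits, newLength);
        }
        // For long shifts Java only uses the low 6 bits of the shift distance.
        bits[wordNum] |= 1L << index;
    }

    /** Returns whether the bit is set; indices past the storage read as false. */
    boolean get(int index) {
        int wordNum = index >> 6;
        if (wordNum >= bits.length) {
            // The storage never grew this far, so the bit cannot have been set.
            return false;
        }
        return (bits[wordNum] & (1L << index)) != 0;
    }
}
```

With this shape, a caller that collects buckets out of order can ask about any bucket ordinal without first checking how far the storage has grown.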
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)
LGTM
if (wordNum >= bits.size()) {
    /*
     * If the word is bigger than the array then it could *never* have
     * been set.
     */
    return false;
}
That's a good catch. It feels more natural and robust to test the `BitArray` size within the `BitArray` implementation than to add the check to the calling methods (such as in `BucketedSort`).
When it comes to performance, we can't really say whether this test affects some compiler optimization unless we benchmark it. Also, I wonder if this is faster than just catching the `ArrayIndexOutOfBoundsException` thrown by `bits.get(wordNum)` within the method and simply returning false. I know try/catch blocks can have performance impacts too.
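For reference, the two alternatives being compared can be sketched side by side. This is an illustration only: `GetVariants` and both method names are hypothetical, and a plain `long[]` stands in for the `bits` field of the real class. It makes no claim about which variant is faster.

```java
/** Hypothetical side-by-side sketch of the two ways to handle unread words. */
final class GetVariants {
    private GetVariants() {}

    /** Explicit size check before the lookup, as in this change. */
    static boolean getWithCheck(long[] bits, int index) {
        int wordNum = index >> 6; // 64 bits per long
        if (wordNum >= bits.length) {
            return false; // storage never grew this far
        }
        return (bits[wordNum] & (1L << index)) != 0;
    }

    /** Let the lookup throw and translate the exception into false. */
    static boolean getWithCatch(long[] bits, int index) {
        try {
            return (bits[index >> 6] & (1L << index)) != 0;
        } catch (ArrayIndexOutOfBoundsException e) {
            return false;
        }
    }
}
```

One behavioral difference worth noting in this sketch: the catching variant also returns `false` for a negative index, while the explicit check still throws for one, so the two are only equivalent for non-negative indices.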
I can check that, but I expect it'll depend on the frequency of the `throw`. I already have some benchmarks I can run it against though, so it is pretty quick to test.
I would assume that most of the time an exception is not thrown. Yet a try-catch block has performance implications even when no exceptions are thrown.
@@ -37,16 +37,32 @@ public BitArray(int initialSize, BigArrays bigArrays) {
        this.bits = bigArrays.newLongArray(initialSize, true);
    }

    /**
I always appreciate it when I see some javadoc added 👍
I ran my rally benchmarks and didn't really see a huge difference, oddly. So I'm going to stick with the explicit check because I think it is clearer.