_uid should be indexed in Lucene in binary form, not base64 #18154
Comments
Would this also make the choice for which doc values implementation (#11887) to use more obvious?
Well, both sorted and binary doc values in Lucene accept full binary terms, so this change shouldn't favor one over another.
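For illustration, a minimal Lucene snippet (the field name and bytes are made up for the example) showing that both doc values flavors take arbitrary BytesRef content:

```java
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.util.BytesRef;

public class BinaryUidDocValues {
    public static void main(String[] args) {
        // Arbitrary bytes, including 0x00 and 0xff, are fine as a value.
        byte[] rawUid = new byte[] {0x00, 0x7f, (byte) 0xff, 0x42};
        Document doc = new Document();
        doc.add(new SortedDocValuesField("_uid", new BytesRef(rawUid)));
        // A field has to stick to a single doc values type; the binary
        // variant would look like this instead:
        // doc.add(new BinaryDocValuesField("_uid", new BytesRef(rawUid)));
        System.out.println(doc);
    }
}
```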
@mikemccand I'm curious why you added the discuss label. Can you foresee any potential problem if we do that?
Well, just because this seems like a biggish change (for me!)... there are so many places in ES where we pass around ids as strings; I would change them all to a binary representation? Uid values would also need to be represented as binary rather than strings. Should we allow users to also pass binary id values in indexing/bulk/get requests? I think yes, but then we need to accept either incoming string or binary ids.
As a first step, couldn't it just be an internal encoding thing? E.g. just make sure we base64-decode before putting it in the terms dict and re-encode before doing anything with it. This seems like it would require fewer API changes but would give us the smaller index.
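A minimal sketch of that internal re-encoding, assuming auto-generated ids are unpadded URL-safe base64 (the class and method names are illustrative, not actual Elasticsearch code):

```java
import java.util.Base64;

// Illustrative only: shrink an auto-generated base64 id back to raw bytes
// before it reaches the terms dictionary, and re-encode it on the way out.
final class UidCodec {
    private static final Base64.Decoder DECODER = Base64.getUrlDecoder();
    private static final Base64.Encoder ENCODER = Base64.getUrlEncoder().withoutPadding();

    /** String id as seen in the API -> bytes stored in the index. */
    static byte[] toIndexedBytes(String id) {
        return DECODER.decode(id); // 20 base64 chars -> 15 raw bytes
    }

    /** Bytes read back from the index -> string id returned to the user. */
    static String toApiId(byte[] indexedBytes) {
        return ENCODER.encodeToString(indexedBytes);
    }
}
```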
Ooh that's a great idea! It would require that the decoding happens down low, when we turn the id into a term. Hmm, but then what would happen if a user uses auto-generated IDs at first, then adds some documents with their own IDs? This is allowed today, right? At search time we wouldn't know whether we're supposed to base64-decode the id or not? Maybe we could disallow this?
Currently the uids have the following format: `{type}#{id}`.
I think we could keep everything working while saving space for auto-generated ids and without modifying the API. Actually, users could even use binary ids themselves by providing base64 strings as ids. A side effect would be that numeric ids (e.g. auto-increment ids generated by a database) would also require less space since Elasticsearch would base64-decode them.
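The details of the proposed format are not preserved above, but a hypothetical sketch of the general idea, using a marker byte to distinguish base64, numeric and arbitrary string ids (all names and constants here are illustrative, not the actual proposal), could look like this:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical id encoder: the first byte says how to interpret the rest,
// so arbitrary user-supplied ids keep working while base64 and numeric ids
// get a compact binary representation.
final class IdEncoding {
    static final byte NUMERIC = 0; // remaining bytes: id as a big-endian integer
    static final byte BASE64 = 1;  // remaining bytes: base64-decoded id
    static final byte UTF8 = 2;    // remaining bytes: raw UTF-8 id (fallback)

    static byte[] encode(String id) {
        // Auto-increment style ids (digits only, no leading zero).
        if (id.isEmpty() == false && id.chars().allMatch(Character::isDigit)
                && (id.length() == 1 || id.charAt(0) != '0')) {
            return prepend(NUMERIC, new BigInteger(id).toByteArray());
        }
        // Ids that round-trip through base64, e.g. the auto-generated ones.
        try {
            byte[] decoded = Base64.getUrlDecoder().decode(id);
            if (Base64.getUrlEncoder().withoutPadding().encodeToString(decoded).equals(id)) {
                return prepend(BASE64, decoded);
            }
        } catch (IllegalArgumentException notBase64) {
            // fall through to the generic case
        }
        return prepend(UTF8, id.getBytes(StandardCharsets.UTF_8));
    }

    private static byte[] prepend(byte marker, byte[] payload) {
        byte[] out = new byte[payload.length + 1];
        out[0] = marker;
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
    }
}
```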
That's a clever solution @jpountz! I think the mapping is safe, because \0 is not a character that is allowed in a type or an id. This would also mean we cannot represent uid values in the code as strings anymore. Hmm, you can also run e.g. range or prefix queries against _uid today?
Right: range, prefix and fuzzy queries would not work anymore. But maybe it is not something that should be supported on a uid field as it reduces our freedom to improve its efficiency for indexing and get operations (there was a similar discussion on #17994 about ip fields)? I also suspect that performance of ranges and prefixes (maybe fuzzy too) is terrible due to the cardinality of the field and the length of the terms (in the auto-generated case), so it is not something you can really rely on anyway?
OK, I think it makes sense to restrict the queries you can run on _uid.
I think the main reason one would use a range query on _uid would be to split a scan of the whole index into chunks, and that could probably be served in other ways.
I ran some simulations to check whether this would actually save space. For an app that would index 1M docs at a rate of about 100 docs per second per shard (the rate is important, since faster rates mean that ids have longer common prefixes and vice versa), disk space used by the _uid field went down significantly. This is substantial, but the reduction drops to 7% if we put the MAC address before the timestamp as discussed in #18209 (17794115 bytes to 16540472), which gets quickly drowned out if you have other fields. I am unsure whether the gain is worth the complexity, so I will probably wait to see what we do on #18209 and whether this change would have a positive impact on indexing speed in addition to disk space.
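(For reference, 17794115 − 16540472 = 1253643 bytes, i.e. 1253643 / 17794115 ≈ 7%.)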
I did the same changes as Mike did in #18209 to simulate 100 docs per second. The classic benchmark reported the same indexing speed and the index was about 0.5% smaller. I am not sure this is worth the complexity of this change.
I think we should revisit this once we somehow remove types and enforce ids to be unique per index (rather than per type).
@jpountz I'm not sure exactly why types are in the way, but if it's because of their stringy nature, we can also assign each type a numerical id?
@bleskes They are not exactly in the way, but prepending types to the ids makes lookups slower since the FST still needs to walk the bytes of the type before reaching bytes that make a difference. It is tempting to put them at the end instead, but then the compression of the terms dictionary degrades considerably, given that the terms dict only performs prefix-based compression (https://issues.apache.org/jira/browse/LUCENE-4702 might help a bit from that perspective but the compression ratio would still be worse than if prepending _type to the _uid). We could make things a bit better by assigning numeric ids to types as you suggest, but I have the feeling that we have reached a point where getting rid of types, or at least decoupling them from mappings and treating them as index partitions, has become an acceptable path. Then we could directly use the _id as a primary key, and at that point I think we should reevaluate this, since the space/speed benefit of not base64-ing the ids might become more interesting and worth the complexity of the change.
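To illustrate the prefix-compression point, here is a small self-contained example (the sample terms are made up) showing how much consecutive sorted terms share depending on whether the type comes first or last:

```java
import java.util.Arrays;

// Lucene's terms dictionary compresses the shared prefix of consecutive
// (sorted) terms. With the low-cardinality type first, consecutive _uid
// terms share long prefixes; with the type last, they barely share anything.
public class PrefixSharing {
    static int sharedPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        String[] typeFirst = {"doc#0001", "doc#0002", "doc#0003"};
        String[] typeLast = {"0001#doc", "0002#doc", "0003#doc"};
        Arrays.sort(typeFirst);
        Arrays.sort(typeLast);
        System.out.println(sharedPrefix(typeFirst[0], typeFirst[1])); // 7: "doc#000"
        System.out.println(sharedPrefix(typeLast[0], typeLast[1]));   // 3: "000"
    }
}
```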
I have been working on a super hacky patch (tests don't even compile) to try to estimate how much we could win by doing this: https://github.com/jpountz/elasticsearch/tree/enhancement/optional_type. This branch has an index setting that makes the type optional, so that the _id can be used directly as the primary key.
@mikemccand helped me benchmark the change and this resulted in a 13% improvement in indexing throughput with the NYC taxi rides dataset.
Very cool!
Indexing ids in binary form should help with indexing speed since we would have to compare fewer bytes upon sorting, should help with memory usage of the live version map since keys will be shorter, and might help with disk usage depending on how efficient the terms dictionary is at compressing terms. Since we can only expect base64 ids in the auto-generated case, this PR tries to use an encoding that makes the binary id equal to the base64-decoded id in the majority of cases (253 out of 256). It also specializes numeric ids, since this seems to be common when content that is stored in Elasticsearch comes from another database that uses e.g. auto-increment ids. Another option could be to require base64 ids all the time. It would make things simpler but I'm not sure users would welcome this requirement. This PR should bring some benefits, but I expect it to be mostly useful when coupled with something like #24615. Closes #18154
@rmuir had this idea:
Today, when ES auto-generates an ID (TimeBasedUUIDGenerator.getBase64UUID), it uses 15 bytes, but then we immediately Base64-encode that to 20 bytes, a 33% "waste". This is really a holdover from the past when Lucene could not index fully binary terms.
I think we should explore passing the raw binary form to Lucene instead? We could implement back-compat based on the version as of when the index was created.
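A tiny demonstration of the overhead being described, using random bytes in place of the actual time-based UUID:

```java
import java.security.SecureRandom;
import java.util.Base64;

// 15 raw bytes become 20 characters once base64-encoded: a 33% size increase.
public class IdSizeDemo {
    public static void main(String[] args) {
        byte[] rawId = new byte[15];
        new SecureRandom().nextBytes(rawId);
        String base64Id = Base64.getUrlEncoder().withoutPadding().encodeToString(rawId);
        System.out.println(rawId.length + " raw bytes -> " + base64Id.length() + " base64 chars");
        // prints: 15 raw bytes -> 20 base64 chars
    }
}
```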