_uid should be indexed in Lucene in binary form, not base64 #18154


Closed
mikemccand opened this issue May 5, 2016 · 18 comments
Labels: :Core/Infra/Core, :Search Foundations/Mapping, stalled, Team:Search Foundations, v6.0.0-beta1

Comments

@mikemccand
Contributor

@rmuir had this idea:

Today, when ES auto-generates an ID (TimeBasedUUIDGenerator.getBase64UUID), it uses 15 bytes, but then we immediately Base64 encode that to 20 bytes, a 33% "waste".
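To see the overhead concretely, here is a standalone sketch (not Elasticsearch code; `os.urandom` merely stands in for the time-based UUID bytes):

```python
import base64
import os

raw = os.urandom(15)  # 15 random bytes, standing in for the time-based UUID
encoded = base64.urlsafe_b64encode(raw)  # URL-safe base64; 15 bytes need no '=' padding

# 15 raw bytes become 20 ASCII bytes: a 33% size increase for every indexed term
print(len(raw), len(encoded))  # 15 20
```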

This is really a holdover from the past when Lucene could not index fully binary terms.

I think we should explore passing the raw binary form to Lucene instead. We could implement back-compat based on the version the index was created with.

@clintongormley added the :Core/Infra/Core, v5.0.0-alpha3, and :Search Foundations/Mapping labels May 5, 2016
@clintongormley
Contributor

Would this also make the choice of which doc values implementation (#11887) to use more obvious?

@mikemccand
Contributor Author

Would this also make the choice of which doc values implementation (#11887) to use more obvious?

Well, both sorted and binary doc values in Lucene accept full binary terms, so this change shouldn't favor one over another.

@jpountz
Contributor

jpountz commented May 5, 2016

@mikemccand I'm curious why you added the discuss label. Can you foresee any potential problem if we do that?

@mikemccand
Contributor Author

@mikemccand I'm curious why you added the discuss label

Well, just because this seems like a biggish change (to me!)... there are so many places in ES where we pass around a String id now.

I would change them all to byte[] (I think BytesRef is overkill?). But byte[] is mutable (vs. String today) ... is that dangerous?

Uid values would also need to be represented as byte[] everywhere.

Should we allow users to also pass binary id values in indexing/bulk/get requests? I think yes, but then we need to accept either an incoming String or byte[] (encoded via base64?) via the REST and Java client APIs. Actually, it seems like we must allow this, since with auto-ids the binary ids will come back in search results, and the user can then e.g. do a get or update from there.

@rmuir
Contributor

rmuir commented May 5, 2016

As a first step, couldn't it just be an internal encoding thing? E.g. just make sure we base64-decode ids before putting them in the terms dict and re-encode them before doing anything else with them. This seems like it would require fewer API changes but would still give us the smaller index.

@mikemccand
Contributor Author

Ooh that's a great idea!

It would require that down in InternalEngine we know whether an incoming String id was in fact base64-encoded (because it was auto-generated) or not (because the user passed in their own id). We used to record this as an autoGeneratedId boolean in Engine.Create, but it looks like we removed it.

Hmm, but then what would happen if a user relies on auto-generated IDs at first and then adds some documents with their own IDs? This is allowed today, right? At search time we wouldn't know whether we're supposed to base64-decode the id or not. Maybe we could disallow this?

@jpountz jpountz self-assigned this May 6, 2016
@jpountz
Contributor

jpountz commented May 7, 2016

Currently the uids have the following format: ${utf8-encoded type}#${utf8-encoded id}. Maybe we could do the following: if the id looks like a URL-safe base64 string, then we would use a different separator (maybe \0?) so that:

  • either the id looks like a base64 string (like our auto-generated ids) and the uid would be ${utf8-encoded type}\0${base64-decoded id}
  • otherwise, the uid would remain ${utf8-encoded type}#${utf8-encoded id}

I think we could keep everything working while saving space for auto-generated ids and not modifying the API. Actually, users could even use binary ids themselves by providing base64 strings as ids. A side effect would be that numeric ids (e.g. auto-increment ids generated by a database) would also require less space, since Elasticsearch would base64-decode them.
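The proposed dispatch could be sketched roughly as follows (a hypothetical illustration of the idea, not the actual Elasticsearch implementation; `encode_uid`, the regex, and the length check are all assumptions):

```python
import base64
import re

# URL-safe base64 alphabet with no padding; auto-generated ids are 20 chars (15 bytes)
B64_RE = re.compile(rb'^[A-Za-z0-9_-]+$')

def encode_uid(doc_type: bytes, doc_id: bytes) -> bytes:
    """Hypothetical sketch of the proposed _uid encoding."""
    if B64_RE.match(doc_id) and len(doc_id) % 4 == 0:
        # id looks like URL-safe base64: store its decoded bytes behind a \0 separator
        return doc_type + b'\x00' + base64.urlsafe_b64decode(doc_id)
    # otherwise keep the existing format: type '#' id
    return doc_type + b'#' + doc_id

# an auto-generated-style id: 20 base64 chars shrink to 15 binary bytes
print(len(encode_uid(b'doc', b'A' * 20)))  # 3 (type) + 1 (separator) + 15 (id) = 19
```

Because \0 sorts before every other byte, the two id families also stay cleanly separated per type in the terms dictionary.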

@mikemccand
Contributor Author

That's a clever solution @jpountz! I think the mapping is safe, because \0 is not a character that is allowed in a type name, right? (The UTF-8 encoding of a valid type name should never contain a 0 byte.) So to users it would just appear that we are particularly compact at storing ID values that happen to be valid base64 strings.

This would also mean we can no longer represent uid values as String in the code.

Hmm, you can also query _uid, right? So we would have to munge the incoming query to match the binary form? For e.g. TermsQuery that's easy, but what about e.g. PrefixQuery or others ... hmm?

@jpountz
Contributor

jpountz commented May 8, 2016

Right: range, prefix and fuzzy queries would not work anymore. But maybe that is not something that should be supported on a uid field anyway, as it reduces our freedom to improve its efficiency for indexing and get operations (there was a similar discussion on #17994 about ip fields)? I also suspect that the performance of range and prefix queries (maybe fuzzy too) is terrible due to the cardinality of the field and the length of the terms (in the auto-generated case), so it is not something you can really rely on today either.

@mikemccand
Contributor Author

OK, I think it makes sense to restrict the queries you can run on _uid. This will be a breaking change, but it's unlikely that users rely on this heavily.

@clintongormley
Contributor

I think the main reason one would use a range query on _uid is to partition an index for reindexing purposes. That requirement is obviated by the changes proposed in #13494 (comment), so +1 to restricting the queries that you can run on _uid.

@jpountz
Contributor

jpountz commented May 9, 2016

I ran some simulations to check whether this would actually save space. For an app that indexes 1M docs at a rate of about 100 docs per second per shard (the rate is important, since faster rates mean that ids have longer common prefixes, and vice versa), disk space used by the _uid field goes from 25429976 bytes to 21608441 bytes (-15%).

This is substantial, but the reduction drops to 7% if we put the MAC address before the timestamp as discussed in #18209 (17794115 bytes to 16540472 bytes), a saving that is quickly drowned out once you have other fields. I am unsure whether the gain is worth the complexity, so I will probably wait to see what we do on #18209 and whether this change would have a positive impact on indexing speed in addition to disk space.

@jpountz
Contributor

jpountz commented May 10, 2016

I made the same changes as Mike did in #18209 to simulate 100 docs per second. The classic benchmark reported the same indexing speed, and the index was about 0.5% smaller. I am not sure this is worth the complexity of the change.

@jpountz
Contributor

jpountz commented May 27, 2016

I think we should revisit this once we somehow remove types and enforce ids to be unique per index (rather than per type).

@bleskes
Contributor

bleskes commented May 29, 2016

@jpountz I'm not sure exactly why types are in the way, but if it's because of their stringy nature, could we also assign each type a numeric id?

@jpountz
Contributor

jpountz commented May 30, 2016

@bleskes They are not exactly in the way, but prepending types to the ids makes lookups slower, since the FST still needs to walk the bytes of the type before reaching bytes that make a difference. It is tempting to put the type at the end instead, but then compression of the terms dictionary degrades considerably, given that the terms dict only performs prefix-based compression (https://issues.apache.org/jira/browse/LUCENE-4702 might help a bit from that perspective, but the compression ratio would still be worse than when prepending _type to the _uid). We could make things a bit better by assigning ids to types as you suggest, but I have the feeling that we have reached a point where getting rid of types, or at least decoupling them from mappings and treating them as index partitions, has become an acceptable path. Then we could use the _id directly as a primary key, and I think we should reevaluate this issue at that point, since the space/speed benefit of not base64-ing the ids might become more interesting and worth the complexity of the change.
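The trade-off jpountz describes can be demonstrated with a toy calculation (an illustrative sketch; the function below is an assumption, not Lucene's actual block encoding, which also stores lengths and block metadata):

```python
def prefix_compressed_size(terms):
    """Bytes needed to store sorted terms when each term records only the
    suffix after its longest common prefix with the previous term."""
    total, prev = 0, b""
    for term in sorted(terms):
        lcp = 0
        while lcp < min(len(term), len(prev)) and term[lcp] == prev[lcp]:
            lcp += 1
        total += len(term) - lcp
        prev = term
    return total

ids = [b"0001", b"0002", b"0003"]
type_first = [b"mytype#" + i for i in ids]  # shared type prefix compresses away
id_first = [i + b"#mytype" for i in ids]    # type at the end is repeated per term

print(prefix_compressed_size(type_first), prefix_compressed_size(id_first))  # 13 27
```

With the type first, all terms share one long prefix; with the type last, the repeated type bytes land in every suffix, which is exactly why prefix-only compression degrades.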

@jpountz
Contributor

jpountz commented Aug 5, 2016

I have been working on a super hacky patch (tests don't even compile) to try to estimate how much we could win by doing this: https://github.com/jpountz/elasticsearch/tree/enhancement/optional_type. This branch has an index setting called index.mapping.single_type, which when set does several things:

  • it forbids configuring any type other than the value of this setting
  • it does not prepend <typename># to the _uid
  • it expects ids to be base64 strings and encodes them in binary form (so that e.g. auto-generated ids use 15 bytes rather than 20)

@mikemccand helped me benchmark the change and this resulted in a 13% improvement in indexing throughput with the NYC taxi rides dataset.

@nik9000
Member

nik9000 commented Aug 5, 2016

@mikemccand helped me benchmark the change and this resulted in a 13% improvement in indexing throughput with the NYC taxi rides dataset.

Very cool!

jpountz added a commit to jpountz/elasticsearch that referenced this issue Jul 6, 2017
Indexing ids in binary form should help with indexing speed since we would
have to compare fewer bytes upon sorting, should help with memory usage of
the live version map since keys will be shorter, and might help with disk
usage depending on how efficient the terms dictionary is at compressing
terms.

Since we can only expect base64 ids in the auto-generated case, this PR tries
to use an encoding that makes the binary id equal to the base64-decoded id in
the majority of cases (253 out of 256). It also specializes numeric ids, since
this seems to be common when content stored in Elasticsearch comes from
another database that uses e.g. auto-increment ids.

Another option could be to require base64 ids all the time. It would make things
simpler but I'm not sure users would welcome this requirement.

This PR should bring some benefits, but I expect it to be mostly useful when
coupled with something like elastic#24615.

Closes elastic#18154
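The numeric-id specialization mentioned in the commit message can be illustrated with a minimal sketch (hypothetical; the PR's actual encoding differs and also has to preserve leading zeros and distinguish id kinds on decode):

```python
def pack_numeric_id(doc_id: str) -> bytes:
    """Pack a decimal id string into big-endian bytes; illustrative only.
    Note: int() drops leading zeros, so a real scheme needs extra bookkeeping."""
    n = int(doc_id)
    return n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")

# a 9-character auto-increment id fits in 4 bytes instead of 9 UTF-8 bytes
print(len(pack_numeric_id("123456789")))  # 4
```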
jpountz added a commit that referenced this issue Jul 7, 2017
@javanna added the Team:Search Foundations label Jul 16, 2024