_uid should be indexed in Lucene in binary form, not base64 #18154
Comments
Would this also make the choice for which doc values implementation (#11887) to use more obvious?
Well, both sorted and binary doc values in Lucene accept full binary terms, so this change shouldn't favor one over another.
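For illustration, a minimal Lucene snippet (the field name and bytes are made up for the example) showing that both doc values flavors take arbitrary BytesRef content:

```java
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.util.BytesRef;

public class BinaryUidDocValues {
    public static void main(String[] args) {
        // Arbitrary bytes, including 0x00 and 0xff, are fine as a value.
        byte[] rawUid = new byte[] {0x00, 0x7f, (byte) 0xff, 0x42};
        Document doc = new Document();
        doc.add(new SortedDocValuesField("_uid", new BytesRef(rawUid)));
        // A field has to stick to a single doc values type; the binary
        // variant would look like this instead:
        // doc.add(new BinaryDocValuesField("_uid", new BytesRef(rawUid)));
        System.out.println(doc);
    }
}
```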
@mikemccand I'm curious why you added the discuss label. Can you foresee any potential problem if we do that?
Well, just because this seems like a biggish change (for me!)... there are so many places in ES where we pass around ids as strings; I would change them all to a binary representation? Uid values would also need to be represented as binary rather than strings. Should we allow users to also pass binary id values in indexing/bulk/get requests? I think yes, but then we need to accept either incoming string or binary ids.
As a first step, couldn't it just be an internal encoding thing? E.g. just make sure we base64-decode before putting it in the terms dict and re-encode before doing anything with it. This seems like it would require fewer API changes but would give us the smaller index.
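A minimal sketch of that internal re-encoding, assuming auto-generated ids are unpadded URL-safe base64 (the class and method names are illustrative, not actual Elasticsearch code):

```java
import java.util.Base64;

// Illustrative only: shrink an auto-generated base64 id back to raw bytes
// before it reaches the terms dictionary, and re-encode it on the way out.
final class UidCodec {
    private static final Base64.Decoder DECODER = Base64.getUrlDecoder();
    private static final Base64.Encoder ENCODER = Base64.getUrlEncoder().withoutPadding();

    /** String id as seen in the API -> bytes stored in the index. */
    static byte[] toIndexedBytes(String id) {
        return DECODER.decode(id); // 20 base64 chars -> 15 raw bytes
    }

    /** Bytes read back from the index -> string id returned to the user. */
    static String toApiId(byte[] indexedBytes) {
        return ENCODER.encodeToString(indexedBytes);
    }
}
```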
Ooh that's a great idea! It would require that the decoding happens down low, when we turn the id into a term. Hmm, but then what would happen if a user uses auto-generated IDs at first, then adds some documents with their own IDs? This is allowed today, right? At search time we wouldn't know whether we're supposed to base64-decode the id or not? Maybe we could disallow this?
Currently the uids have the following format: `{type}#{id}`.
I think we could keep everything working while saving space for auto-generated ids and without modifying the API. Actually, users could even use binary ids themselves by providing base64 strings as ids. A side effect would be that numeric ids (e.g. auto-increment ids generated by a database) would also require less space since Elasticsearch would base64-decode them.
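The details of the proposed format are not preserved above, but a hypothetical sketch of the general idea, using a marker byte to distinguish base64, numeric and arbitrary string ids (all names and constants here are illustrative, not the actual proposal), could look like this:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical id encoder: the first byte says how to interpret the rest,
// so arbitrary user-supplied ids keep working while base64 and numeric ids
// get a compact binary representation.
final class IdEncoding {
    static final byte NUMERIC = 0; // remaining bytes: id as a big-endian integer
    static final byte BASE64 = 1;  // remaining bytes: base64-decoded id
    static final byte UTF8 = 2;    // remaining bytes: raw UTF-8 id (fallback)

    static byte[] encode(String id) {
        // Auto-increment style ids (digits only, no leading zero).
        if (id.isEmpty() == false && id.chars().allMatch(Character::isDigit)
                && (id.length() == 1 || id.charAt(0) != '0')) {
            return prepend(NUMERIC, new BigInteger(id).toByteArray());
        }
        // Ids that round-trip through base64, e.g. the auto-generated ones.
        try {
            byte[] decoded = Base64.getUrlDecoder().decode(id);
            if (Base64.getUrlEncoder().withoutPadding().encodeToString(decoded).equals(id)) {
                return prepend(BASE64, decoded);
            }
        } catch (IllegalArgumentException notBase64) {
            // fall through to the generic case
        }
        return prepend(UTF8, id.getBytes(StandardCharsets.UTF_8));
    }

    private static byte[] prepend(byte marker, byte[] payload) {
        byte[] out = new byte[payload.length + 1];
        out[0] = marker;
        System.arraycopy(payload, 0, out, 1, payload.length);
        return out;
    }
}
```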
That's a clever solution @jpountz! I think the mapping is safe, because \0 is not a character that is allowed in a type or an id. This would also mean we cannot represent uid values in the code as strings anymore. Hmm, you can also run e.g. range or prefix queries against _uid today?
Right: range, prefix and fuzzy queries would not work anymore. But maybe it is not something that should be supported on a uid field as it reduces our freedom to improve its efficiency for indexing and get operations (there was a similar discussion on #17994 about ip fields)? I also suspect that performance of ranges and prefixes (maybe fuzzy too) is terrible due to the cardinality of the field and the length of the terms (in the auto-generated case), so it is not something you can really rely on anyway?
OK, I think it makes sense to restrict the queries you can run on _uid.
I think the main reason one would use a range query on _uid would be to split a scan of the whole index into chunks, and that could probably be served in other ways.
I ran some simulations to check whether this would actually save space. For an app that would index 1M docs at a rate of about 100 docs per second per shard (the rate is important, since faster rates mean that ids have longer common prefixes and vice versa), disk space used by the _uid field went down significantly. This is substantial, but the reduction drops to 7% if we put the MAC address before the timestamp as discussed in #18209 (17794115 bytes to 16540472), which gets quickly drowned out if you have other fields. I am unsure whether the gain is worth the complexity, so I will probably wait to see what we do on #18209 and whether this change would have a positive impact on indexing speed in addition to disk space.
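(For reference, 17794115 − 16540472 = 1253643 bytes, i.e. 1253643 / 17794115 ≈ 7%.)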
I did the same changes as Mike did in #18209 to simulate 100 docs per second. The classic benchmark reported the same indexing speed and the index was about 0.5% smaller. I am not sure this is worth the complexity of this change.
I think we should revisit this once we somehow remove types and enforce ids to be unique per index (rather than per type).
@jpountz I'm not sure exactly why types are in the way, but if it's because of their stringy nature, we can also assign each type a numerical id?
@bleskes They are not exactly in the way, but prepending types to the ids makes lookups slower since the FST still needs to walk the bytes of the type before reaching bytes that make a difference. It is tempting to put them at the end instead, but then the compression of the terms dictionary degrades considerably, given that the terms dict only performs prefix-based compression (https://issues.apache.org/jira/browse/LUCENE-4702 might help a bit from that perspective but the compression ratio would still be worse than if prepending _type to the _uid). We could make things a bit better by assigning numeric ids to types as you suggest, but I have the feeling that we have reached a point where getting rid of types, or at least decoupling them from mappings and treating them as index partitions, has become an acceptable path. Then we could directly use the _id as a primary key, and at that point I think we should reevaluate this, since the space/speed benefit of not base64-ing the ids might become more interesting and worth the complexity of the change.
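To illustrate the prefix-compression point, here is a small self-contained example (the sample terms are made up) showing how much consecutive sorted terms share depending on whether the type comes first or last:

```java
import java.util.Arrays;

// Lucene's terms dictionary compresses the shared prefix of consecutive
// (sorted) terms. With the low-cardinality type first, consecutive _uid
// terms share long prefixes; with the type last, they barely share anything.
public class PrefixSharing {
    static int sharedPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        String[] typeFirst = {"doc#0001", "doc#0002", "doc#0003"};
        String[] typeLast = {"0001#doc", "0002#doc", "0003#doc"};
        Arrays.sort(typeFirst);
        Arrays.sort(typeLast);
        System.out.println(sharedPrefix(typeFirst[0], typeFirst[1])); // 7: "doc#000"
        System.out.println(sharedPrefix(typeLast[0], typeLast[1]));   // 3: "000"
    }
}
```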
I have been working on a super hacky patch (tests don't even compile) to try to estimate how much we could win by doing this: https://github.com/jpountz/elasticsearch/tree/enhancement/optional_type. This branch has an index setting that makes the type optional, so that the _id can be used directly as the primary key.
@mikemccand helped me benchmark the change and this resulted in a 13% improvement in indexing throughput with the NYC taxi rides dataset.
Very cool!
Indexing ids in binary form should help with indexing speed since we would have to compare fewer bytes upon sorting, should help with memory usage of the live version map since keys will be shorter, and might help with disk usage depending on how efficient the terms dictionary is at compressing terms. Since we can only expect base64 ids in the auto-generated case, this PR tries to use an encoding that makes the binary id equal to the base64-decoded id in the majority of cases (253 out of 256). It also specializes numeric ids, since this seems to be common when content that is stored in Elasticsearch comes from another database that uses e.g. auto-increment ids. Another option could be to require base64 ids all the time. It would make things simpler but I'm not sure users would welcome this requirement. This PR should bring some benefits, but I expect it to be mostly useful when coupled with something like #24615. Closes #18154
@rmuir had this idea:
Today, when ES auto-generates an ID (TimeBasedUUIDGenerator.getBase64UUID), it uses 15 bytes, but then we immediately Base64-encode that to 20 bytes, a 33% "waste". This is really a holdover from the past when Lucene could not index fully binary terms.
I think we should explore passing the raw binary form to Lucene instead? We could implement back-compat based on the version as of when the index was created.
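A tiny demonstration of the overhead being described, using random bytes in place of the actual time-based UUID:

```java
import java.security.SecureRandom;
import java.util.Base64;

// 15 raw bytes become 20 characters once base64-encoded: a 33% size increase.
public class IdSizeDemo {
    public static void main(String[] args) {
        byte[] rawId = new byte[15];
        new SecureRandom().nextBytes(rawId);
        String base64Id = Base64.getUrlEncoder().withoutPadding().encodeToString(rawId);
        System.out.println(rawId.length + " raw bytes -> " + base64Id.length() + " base64 chars");
        // prints: 15 raw bytes -> 20 base64 chars
    }
}
```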