Skip to content

Terms enum for version fields #93839

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Feb 21, 2023
Merged

Conversation

cbuescher
Copy link
Member

The _terms_enum API currently supports keyword, constant_keyword and
flattened fields.
This change adds this support for the version field type as well, as it was
originally intended to be a specialization of the keyword field for handling
software version values.

Closes #83403

@cbuescher cbuescher added >enhancement :Search/Search Search-related issues that do not fall into other categories v8.8.0 labels Feb 15, 2023
@cla-checker-service
Copy link

cla-checker-service bot commented Feb 15, 2023

💚 CLA has been signed

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Feb 15, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Copy link
Collaborator

Hi @cbuescher, I've created a changelog YAML for you.

The _terms_enum API currently only supports keyword, constant_keyword and
flattened field types. This change adds support for the `version` field type
that sorts according to the semantic versioning definition.

Closes elastic#83403
@cbuescher cbuescher force-pushed the terms-enum-versionField branch from c34c637 to 261545a Compare February 15, 2023 17:32
@cbuescher cbuescher changed the title Terms enum version field Terms enum for version fields Feb 16, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @cbuescher, I've created a changelog YAML for you.

@romseygeek
Copy link
Contributor

I think this can be a lot simpler if we retrieve terms from the doc values terms dictionary via SortedSetDocValuesTerms.getTerms(), as KeywordFieldType does? The version string field always has doc values enabled.
Then there's no need to pass an encoder round.

@cbuescher
Copy link
Member Author

cbuescher commented Feb 20, 2023

@romseygeek thanks for looking into this. I'm afraid we can't get away without the encoding step, that is what I initially tried and failed. We store both the indexed and the doc-value termes in their encoded form specific to the version field which preserves semantic version type ordering. We somehow have to convert that back to a string representation, and if we do that too early we loose the ordering e.g. when merging shard level terms enum results in MultiShardTermsEnum. Happy to take another look though.

@romseygeek
Copy link
Contributor

Ah ok, yes I see. I wonder if it's possible to re-use something like MappedFieldType#valueForDisplay() here then? That could also simplify the implementation in KeyedFlattenedFieldType.

@cbuescher
Copy link
Member Author

I wonder if it's possible to re-use something like MappedFieldType#valueForDisplay() here then?

Great idea, will look into that.

@@ -359,15 +361,15 @@ protected NodeTermsEnumResponse dataNodeOperation(NodeTermsEnumRequest request,
request.searchAfter()
);
if (terms != null) {
shardTermsEnums.add(terms);
shardTermsEnums.add(new ShardTermsEnum(terms, mappedFieldType::valueForDisplay));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need to maintain the connection from a shard level enumeration to the field type that generated it since we can ask for terms enumerations across indices that have mixed type mappings for the same field name (i.e. keyword for older indices, version for newer ones).

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's much nicer. I left a couple more suggestions.

return current;
}

public long docFreq() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to adjust the javadoc at the top of the class to not refer to this anymore?

* @param enums TermsEnums from shards which we should merge
* @throws IOException Errors accessing data
**/
public MultiShardTermsEnum(TermsEnum[] enums) throws IOException {
public MultiShardTermsEnum(TransportTermsEnumAction.ShardTermsEnum[] enums) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be neater as a builder object, and then ShardTermsEnum can become private to MultiShardTermsEnum. So the caller is something like:

MultiShardTermsEnum.Builder termsBuilder = new MultiShardTermsEnum.Builder();
for (ShardId shardId : request.shardIds()) {
  ...
  terms.add(fieldType.getTerms( ... ), fieldType::valueForDisplay)
...
}
MultiShardTermsEnum terms = termsBuilder.build();

@cbuescher
Copy link
Member Author

@romseygeek thanks, great suggestions, I pushed an update.

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @cbuescher

@cbuescher cbuescher merged commit edc7a61 into elastic:main Feb 21, 2023
cbuescher added a commit to cbuescher/elasticsearch that referenced this pull request Feb 23, 2023
A recent change (elastic#93839) introduced _terms_enum support for fields that need to
convert their internal bytes representation to thr proper string representation
for display purposes. The constant_keyword and flattened field types didn't
implement a 'valueForDisplay' method yet, so the underlying BytesRef was printed
directly in the response.

Closes elastic#94041
cbuescher added a commit that referenced this pull request Feb 23, 2023
A recent change (#93839) introduced `_terms_enum` support for fields that need to
convert their internal bytes representation to thr proper string representation
for display purposes. The `constant_keyword` and `flattened` field types didn't
implement a 'valueForDisplay' method yet, so the underlying `BytesRef` was printed
directly in the response. This change fixes that and adds tests to ensure human
readable response values for those field types.

Closes #94041
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team v8.8.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add _terms_enum support for the version field type
3 participants