Skip to content

Text fields are stored by default in TSDB indices #106338

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Mar 26, 2024
Merged

Conversation

lkts
Copy link
Contributor

@lkts lkts commented Mar 14, 2024

Synthetic source requires text fields to be stored or have keyword
sub-field that supports synthetic source. If there are no keyword fields
users currently have to explicitly set 'store' to 'true' or get a
validation exception. This is not the best experience. It is quite
likely that setting store to true is the correct thing to do but
users still get an error and need to investigate it. With this change if
store setting is not specified in such context it will be set to
true by default. Setting it explicitly to false results in the
exception.

Closes #97039

Synthetic source requires text fields to be stored or have keyword
sub-field that supports synthetic source. If there are no keyword fields
 users currently have to explicitly set 'store' to 'true' or get a
validation exception. This is not the best experience. It is quite
likely that setting `store` to `true` is  the correct thing to do but
users still get an error and need to investigate it. With this change if
 `store` setting is not specified in such context it  will be set to
 `true` by default. Setting it explicitly to `false` results in the
 exception.

Closes elastic#97039
@lkts lkts requested a review from martijnvg March 14, 2024 00:10
Copy link
Contributor

Documentation preview:

@elasticsearchmachine elasticsearchmachine added v8.14.0 needs:triage Requires assignment of a team area label labels Mar 14, 2024
@lkts lkts added >enhancement :StorageEngine/TSDB You know, for Metrics Team:StorageEngine and removed needs:triage Requires assignment of a team area label labels Mar 14, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Copy link
Collaborator

Hi @lkts, I've created a changelog YAML for you.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I left a comment about a potential different direction we may want to take here.

//
// If 'store' parameter was explicitly provided we'll let it pass and
// fail in TextFieldMapper#syntheticFieldLoader later if needed.
if (store.isSet() == false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would do default values and validation via the Parameter class in the Builder constructor in the file. Something like this:

this.store = Parameter.storeParam(m -> ((TextFieldMapper) m).store, () -> syntheticSource).addValidator(value ->    {
      if (syntheticSource && value == false) {
         throw new IllegalArgumentException("If synthetic source is enabled, then store cannot be set to false");
      }
   });

The issue is that now we can't see the details we care about here in the multi fields (MultiFields.Builder only keep a map with a Function per field name around).

Note that this way does require more plumbing and additional changes.

If we like to do this approach we may need to make changes to MultiFields.Builder, so that we know whether there is a keyword field (either stored or doesn't have normalizer and ignore above) as multi field.

Copy link
Contributor Author

@lkts lkts Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this approach and there is a problem. It is not possible to have syntheticSource flag with proposed impl from gist in MappingParserContext because it will return if synthetic source is enabled for currently existing mapping or empty mapping in case of index creation. mappingLookup will only be updated as a last step after parsing and building new mapping.

Value that we need is here https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/index/mapper/MappingParser.java#L147 and it is not available to field mapper builders since builders are created in RootObjectMapper#parse which happens before that.

We could try:

  1. Creating metadata field mappers before RootObjectMapper. I assume that there is a good reason for that ordering so this is unlikely a solution.
  2. Support storing text fields by default only for TSDB indices (and any future index modes that will have synthethic source always enabled) instead of for any index with synthetic source enabled. This works since we know source is synthetic even without parsing the mapping. I think we are mostly interested in this use case anyway so this may be a good solution.
  3. Keep initial implementation from this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing down this analysis. It is kind of a pain that we know synthetic source is used after the builders are created. I don't think it is worth changing the construction of mapping builders, since it most likely is a tricky change.

Regarding option 2, we know if index mode has been set to time series that synthetic source is enabled, and this will also be the case for the new logging index mode. Given that index mode comes from index setting, we know at the time mapping builders are created what the index mode has been set to. So we can use that as well for determining the default value of the store attribute. So let's do option 2?

@martijnvg martijnvg added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Mar 14, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Mar 14, 2024
@lkts
Copy link
Contributor Author

lkts commented Mar 21, 2024

I have pushed change to use option 2 from the discussion above.

One issue i discovered is that it is not possible to implement validation using Parameter#addValidator API. The problem is that we need to provide full name of the field in the error message e.g. k8s.pod.agent.name (as we currently do). It is not possible to get this information in constructor of TextFieldMapper.Builder, this information is only exposed during build() via MapperBuilderContext#buildFullName. We can only report field name as name in this case which is obviously confusing. Additionally, performing validation in constructor changes the error returned to user from illegal_argument_exception to illegal_argument_exception nested inside mapper_parsing_exception. I don't think this is a problem by itself but it would mean that errors are different for TSDB indices and non TSDB indices and that seems undesirable.

As a result i removed validation logic and i am relying on existing code for that.

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue i discovered is that it is not possible to implement validation using Parameter#addValidator API. The problem is that we need to provide full name of the field in the error message e.g. k8s.pod.agent.name (as we currently do). It is not possible to get this information in constructor of TextFieldMapper.Builder, this information is only exposed during build() via MapperBuilderContext#buildFullName. We can only report field name as name in this case which is obviously confusing.

I agree this part is confusion, so +1 to rely on the currently validation if store is specifically disabled.

This change is looking good, a left a few more (minor) comments.

the <<mapping-source-field,`_source`>> field. Accepts `true` or `false`
(default).
the <<mapping-source-field,`_source`>> field. Accepts `true` or `false` (default).
To support <<synthetic-source,synthetic `_source`>> this parameter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename To support <<synthetic-source,synthetic _source>> this parameter to If <<synthetic-source,synthetic _source>> is enabled, then this parameter?

@@ -283,12 +283,23 @@ public static class Builder extends FieldMapper.Builder {

final TextParams.Analyzers analyzers;

public Builder(String name, IndexAnalyzers indexAnalyzers) {
this(name, IndexVersion.current(), indexAnalyzers);
public Builder(String name, IndexAnalyzers indexAnalyzers, boolean isSyntheticSourceAlwaysUsed) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename to synthetcSourceEnabledViaIndexMode? To indicate that synthetic source is enabled because the index mode enables it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to express that and your suggestion is so much more clear!

@@ -1258,7 +1265,7 @@ public Map<String, NamedAnalyzer> indexAnalyzers() {

@Override
public FieldMapper.Builder getMergeBuilder() {
return new Builder(simpleName(), indexCreatedVersion, indexAnalyzers).init(this);
return new Builder(simpleName(), indexCreatedVersion, indexAnalyzers, false).init(this);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always providing false here is incorrect. It could disable synthetic source at some point. I think we need record in TextFieldMapper class, that synthetic source was enabled, so the right value can be provided here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it can actually disable it since the default value should be already set correctly before that and init will pick it up. I do agree it is more clear to store it though.

@lkts lkts changed the title Text fields are stored by default with synthetic source Text fields are stored by default in TSDB indices Mar 21, 2024
@lkts
Copy link
Contributor Author

lkts commented Mar 21, 2024

CI uncovered another interesting problem. store parameter was not serialized here

b.store.toXContent(builder, includeDefaults);
because includeDefaults seems to be always false. During downsampling new index is created which is not a TSDB index and since store value was not serialized it defaults to false. We successfully transfer synthetic source setting from original index and as a result we now have synthetic source and text field that is not stored.

I have fixed serialization but i honestly don't know if it is a proper solution. I don't really like the state of "defaults can be interpreted differently in different contexts".

// we want to serialize 'true' value even if it is default
// since defaults differ for different index modes
// and can be interpreted differently is different context
.setSerializerCheck((includeDefaults, isConfigured, value) -> includeDefaults || isConfigured || value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A downsampled index should have index.mode=time_series. What exact failure occurs if this check is missing? Maybe we need to change something in TransportDownsampleAction instead.

lkts added a commit to lkts/elasticsearch that referenced this pull request Mar 25, 2024
Downsample index is created using temporary index service using a
static predefined subset of index settings. All other settings are later
 copied over from source index. As discovered in elastic#106338 this causes
 context like index mode to be missing during initial index creation
 process. This PR adds index mode and related required settings to
 initial set of index settings in order to have access to this
 information during initial create index operation.
lkts added a commit that referenced this pull request Mar 26, 2024
* Set index mode earlier for new downsample index

Downsample index is created using temporary index service using a
static predefined subset of index settings. All other settings are later
 copied over from source index. As discovered in #106338 this causes
 context like index mode to be missing during initial index creation
 process. This PR adds index mode and related required settings to
 initial set of index settings in order to have access to this
 information during initial create index operation.
@lkts
Copy link
Contributor Author

lkts commented Mar 26, 2024

@elasticmachine update branch

Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left two comments. LGTM otherwise.

@@ -249,6 +251,61 @@ public void testDefaults() throws IOException {
assertEquals(DocValuesType.NONE, fieldType.docValuesType());
}

public void testStoreParameterDefaults() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test needs to be adjusted with the serialization logic removed to address the CI failure?

the <<mapping-source-field,`_source`>> field. Accepts `true` or `false`
(default).
the <<mapping-source-field,`_source`>> field. Accepts `true` or `false` (default).
If <<synthetic-source,synthetic _source>> is enabled, then this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove If <<synthetic-source,synthetic _source>> is enabled part? The store param defaults to true for tsdb indices, when there is no suitable multi field of type keyword .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woops, thanks. This is leftover.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types :StorageEngine/TSDB You know, for Metrics Team:Search Meta label for search team Team:StorageEngine v8.14.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Synthetic _source automatically sets store to true on text fields
4 participants