-
Notifications
You must be signed in to change notification settings - Fork 25.2k
[Rollup] Use composite's missing_bucket #31402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Rollup] Use composite's missing_bucket #31402
Conversation
We can leverage the composite agg's new `missin_bucket` feature on terms groupings. This means the aggregation criteria used in the indexer will now return null buckets for missing keys. We then index these and rely on a default `null_value` on the Rollup's mapping to inject a placeholder. On the search side of the house, we can remove the placeholder when unrolling the response. By indexing null values, we can guarantee correct doc counts with "combined" jobs (where a job rolls up multiple schemas). This was previously impossible since composite would ignore documents that didn't have _all_ the keys, meaning non-overlapping schemas would cause composite to return no buckets. The docs have been adjusted to recommend a single, combined job. It also makes reference to the previous issue to help users that are upgrading (rather than just deleting the sections). Because the mapping change is incompatible with prior versions, this PR forbids 6.4.0+ jobs from being created in a pre-6.4.0 index.
Pinging @elastic/es-search-aggs |
_all_ of the keys `A, B` and `C`. | ||
There was previously a limitation in how Rollup could handle indices that had heterogeneous mappings (multiple, unrelated/non-overlapping | ||
mappings). The recommendation at the time was to configure a separate job per data "type". For example, you might configure a separate | ||
job for each Beat module that you had enabled (one for `process`, another for `filesystem`, etc). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think it should be Beats module
?
@@ -85,6 +85,8 @@ | |||
public static final String ROLLUP_TEMPLATE_VERSION_FIELD = "rollup-version"; | |||
public static final String ROLLUP_TEMPLATE_VERSION_PATTERN = | |||
Pattern.quote("${rollup.dynamic_template.version}"); | |||
public static final String ROLLUP_NULL_VALUE_PLACEHOLDER = "ROLLUP_NULL_VALUE_PLACEHOLDER"; | |||
public static final String ROLLUP_NULL_VALUE = "__ROLLUP_NULL_VALUE_PLACEHOLDER__"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These look like they might be for the same purpose, maybe we should add comments to them to explain why they are both needed and what for?
// Hide our `null_value` placeholder so it doesn't show up in the terms list | ||
if (bucket.getKeyAsString().equals(Rollup.ROLLUP_NULL_VALUE)) { | ||
return null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this not needed for LongTerms and the histograms? do they not have the concept of a null bucket as well?
|
||
String stringVersion = (String)((Map<String, Object>) m).get(Rollup.ROLLUP_TEMPLATE_VERSION_FIELD); | ||
if (stringVersion == null) { | ||
logger.warn("Could not determine version of existing rollup metadata for index [" + indexName + "]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is a warning enough here? I am wondering if when the stringVersion is null: a) will we end up throwing a NPE below such as in the Version.fromString()
method and b) if we will be able to "do the right thing"?
if (parsedVersion.before(Version.V_6_4_0)) { | ||
String msg = "Cannot create rollup job [" + job.getConfig().getId() + "] because the rollup index contains " + | ||
"jobs from pre-6.4.0. The mappings for these jobs are not compatible with 6.4.0+. Please specify a new rollup " + | ||
"index."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that users are going to need to recreate their existing rollup jobs and rollup indexes when they upgrade from 6.3 to 6.4+?
if (v instanceof String) { | ||
if (v == null) { | ||
// Arbitrary value to update the doc ID with for nulls | ||
docID.update(19); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
disappointed you didn't make this value 42
😃
@colings86 sorry for the delay on this, I think it's ready for the next round of reviews. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why you need to translate null
values to concrete values in the rollup
index. IMO this is not needed, the issue without the missing_bucket
option is that rollup documents with a missing value are omitted. With missing_bucket
these rollup documents are taken into account and will be present in the rollup index. Now why can't you just index them with their null
values explicitly ? The count should be correct if you aggregate on another field since documents will null
values are indexed in the rollup ?
Hmm hmm... I think you're right. I was assuming we had to represent those nulls for accurate counts, but we don't care if the field is null so long as the rest of the fields are present (whereas before the entire document was omitted if one of the keys was gone). Let me think about it, but I think you're right. That also means we don't have the compatibility issue anymore if we don't have to set a |
Which also means no need for compatibility check or restart test. Non-keyword keys needed a null check for doc ID generation however, plus a test for null keys.
Ok, so I just pushed a commit which:
I left in the check for rollup version in general (since that's a useful validation) as well as the tweaks to the full cluster restart helper methods, since that will make future tests easier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @polyfractal . I left some comments regarding null buckets and date histograms.
@@ -159,7 +159,7 @@ public Rounding createRounding() { | |||
vsBuilder.dateHistogramInterval(interval); | |||
vsBuilder.field(field); | |||
vsBuilder.timeZone(timeZone); | |||
|
|||
vsBuilder.missingBucket(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't allow missing bucket here. We use this config for the timestamp field which a required field for rollup documents. Moreover since it's the first field in the composite, enabling missing buckets would disable the composite optimization when the field is indexed. IMO it's ok to require the timestamp field to be present since otherwise we have no way to know if the document should be part of the rollup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah, silly me. Absolutely shouldn't allow missing on the timestamp field :)
Good catch thanks
@@ -83,14 +83,27 @@ private static CRC32 processKeys(Map<String, Object> keys, Map<String, Object> d | |||
doc.put(k + "." + RollupField.TIMESTAMP, v); | |||
doc.put(k + "." + RollupField.INTERVAL, groupConfig.getDateHisto().getInterval()); | |||
doc.put(k + "." + DateHistoGroupConfig.TIME_ZONE, groupConfig.getDateHisto().getTimeZone().toString()); | |||
docID.update(Numbers.longToBytes((Long)v), 0, 8); | |||
if (v == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should not allow null here
when this particular scenario is fixed. | ||
The rollup job will automatically use a placeholder term (`__ROLLUP_NULL_VALUE_PLACEHOLDER__`) as the `null_value` for keyword fields, | ||
which allows it to handle documents that may be missing some of the grouping fields. This placeholder is then removed from search | ||
results, resulting in correct doc counts in a manner that is invisible to the user. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part should be updated since we don't use the placeholder term anymore.
Review comments addressed, tests back to passing too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Woo, thanks @jimczi! And good catch on the |
We can leverage the composite agg's new `missing_bucket` feature on terms groupings. This means the aggregation criteria used in the indexer will now return null buckets for missing keys. Because all buckets are now returned (even if a key is null), we can guarantee correct doc counts with "combined" jobs (where a job rolls up multiple schemas). This was previously impossible since composite would ignore documents that didn't have _all_ the keys, meaning non-overlapping schemas would cause composite to return no buckets. Note: date_histo does not use `missing_bucket`, since a timestamp is always required. The docs have been adjusted to recommend a single, combined job. It also makes reference to the previous issue to help users that are upgrading (rather than just deleting the sections).
* 6.x: Watcher: Make settings reloadable (#31746) [Rollup] Histo group config should support scaled_floats (#32048) lazy snapshot repository initialization (#31606) Add secure setting for watcher email password (#31620) Watcher: cleanup ensureWatchExists use (#31926) Add second level of field collapsing (#31808) Added lenient flag for synonym token filter (#31484) (#31970) Test: Fix a second case of bad watch creation [Rollup] Use composite's missing_bucket (#31402) Docs: Restyled cloud link in getting started Docs: Change formatting of Cloud options Re-instate link in StringFunctionUtils javadocs Correct spelling of AnalysisPlugin#requriesAnalysisSettings (#32025) Fix problematic chars in javadoc [ML] Move open job failure explanation out of root cause (#31925) [ML] Switch ML native QA tests to use a 3 node cluster (#32011)
We can leverage the composite agg's new
missing_bucket
feature on terms groupings. This means the aggregation criteria used in the indexer will now return null buckets for missing keys. We then index these and rely on a defaultnull_value
on the Rollup's mapping to inject a placeholder.On the search side of the house, we can remove the placeholder when unrolling the response.
By indexing null values, we can guarantee correct doc counts with "combined" jobs (where a job rolls up multiple schemas). This was previously impossible since composite would ignore documents that didn't have all the keys, meaning non-overlapping schemas would cause composite to return no buckets.
The documentation has been adjusted to recommend a single, combined job. It also makes reference to the previous issue to help users that are upgrading (rather than just deleting the sections).
BWC
This change is somewhat incompatible with older rollup jobs, because all jobs in a shared index share the same mapping templates. So if the template was updated, new jobs would be "fixed" but old jobs would start getting
null_values
and end up in an inconsistent state. But not updating the index meant the new jobs would also be "broken".I opted to just disallow new jobs from being created rollup indices that were pre-6.4.0. So the user will have to create a new rollup index for new jobs.
Important Note:
This PR enables
missing_bucket
on all groups (terms, histo, date_histo), but we only enable anull_value
on string terms. There's not a reasonable default missing value for numerics as all numerics are technically a real value. I want to follow this PR up with another that allows the user to configure a missing value for histo/date_histo if they desire.Otherwise, the null values will just be indexed as null and dropped as usual, so this only "fixes" strings. But that's the major problem at the moment.