
[Rollup] Use composite's missing_bucket #31402


Merged
merged 8 commits into elastic:master on Jul 13, 2018

Conversation

polyfractal
Contributor

We can leverage the composite agg's new missing_bucket feature on terms groupings. This means the aggregation criteria used in the indexer will now return null buckets for missing keys. We then index these and rely on a default null_value on the Rollup's mapping to inject a placeholder.

On the search side of the house, we can remove the placeholder when unrolling the response.

By indexing null values, we can guarantee correct doc counts with "combined" jobs (where a job rolls up multiple schemas). This was previously impossible since composite would ignore documents that didn't have all the keys, meaning non-overlapping schemas would cause composite to return no buckets.
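For background, a composite aggregation with `missing_bucket` enabled on a terms source might look like the following (illustrative only; the index and field names here are invented):

```json
POST /metrics-*/_search
{
  "size": 0,
  "aggs": {
    "rollup_groups": {
      "composite": {
        "sources": [
          { "ts": { "date_histogram": { "field": "@timestamp", "interval": "1h" } } },
          { "host": { "terms": { "field": "host.name", "missing_bucket": true } } }
        ]
      }
    }
  }
}
```

With `missing_bucket: true`, documents that lack `host.name` are returned in a bucket whose `host` key is `null`, instead of being skipped entirely.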

The documentation has been adjusted to recommend a single, combined job. It also references the previous issue to help users who are upgrading (rather than simply deleting those sections).

BWC
This change is somewhat incompatible with older rollup jobs, because all jobs in a shared index share the same mapping templates. So if the template was updated, new jobs would be "fixed" but old jobs would start getting null_values and end up in an inconsistent state. But not updating the index meant the new jobs would also be "broken".

I opted to just disallow new jobs from being created in rollup indices that are pre-6.4.0. So the user will have to create a new rollup index for new jobs.

Important Note:
This PR enables missing_bucket on all groups (terms, histo, date_histo), but we only enable a null_value on string terms. There's no reasonable default missing value for numerics, since every numeric is technically a real value. I want to follow this PR up with another that allows the user to configure a missing value for histo/date_histo if they desire.

Otherwise, the null values will just be indexed as null and dropped as usual, so this only "fixes" strings. But that's the major problem at the moment.
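As a sketch of the string-terms handling described above (the field name is hypothetical; the placeholder string matches the `__ROLLUP_NULL_VALUE_PLACEHOLDER__` constant added in this PR), the rollup mapping would carry a `null_value` like so:

```json
{
  "mappings": {
    "_doc": {
      "properties": {
        "host.name.terms.value": {
          "type": "keyword",
          "null_value": "__ROLLUP_NULL_VALUE_PLACEHOLDER__"
        }
      }
    }
  }
}
```

When the indexer writes an explicit `null` for a missing key, Elasticsearch indexes the placeholder instead, and the search side strips it back out when unrolling.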

@polyfractal polyfractal added >enhancement review v7.0.0 :StorageEngine/Rollup Turn fine-grained time-based data into coarser-grained data v6.4.0 labels Jun 18, 2018
@polyfractal polyfractal requested a review from colings86 June 18, 2018 17:03
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

_all_ of the keys `A, B` and `C`.
There was previously a limitation in how Rollup could handle indices that had heterogeneous mappings (multiple, unrelated/non-overlapping
mappings). The recommendation at the time was to configure a separate job per data "type". For example, you might configure a separate
job for each Beat module that you had enabled (one for `process`, another for `filesystem`, etc).
Contributor

nit: I think it should be Beats module?

@@ -85,6 +85,8 @@
public static final String ROLLUP_TEMPLATE_VERSION_FIELD = "rollup-version";
public static final String ROLLUP_TEMPLATE_VERSION_PATTERN =
Pattern.quote("${rollup.dynamic_template.version}");
public static final String ROLLUP_NULL_VALUE_PLACEHOLDER = "ROLLUP_NULL_VALUE_PLACEHOLDER";
public static final String ROLLUP_NULL_VALUE = "__ROLLUP_NULL_VALUE_PLACEHOLDER__";
Contributor

These look like they might be for the same purpose, maybe we should add comments to them to explain why they are both needed and what for?

// Hide our `null_value` placeholder so it doesn't show up in the terms list
if (bucket.getKeyAsString().equals(Rollup.ROLLUP_NULL_VALUE)) {
return null;
}
Contributor

Why is this not needed for LongTerms and the histograms? Do they not have the concept of a null bucket as well?


String stringVersion = (String)((Map<String, Object>) m).get(Rollup.ROLLUP_TEMPLATE_VERSION_FIELD);
if (stringVersion == null) {
logger.warn("Could not determine version of existing rollup metadata for index [" + indexName + "]");
Contributor

Is a warning enough here? I am wondering, when stringVersion is null: (a) will we end up throwing an NPE below, e.g. in the Version.fromString() method, and (b) will we be able to "do the right thing"?

if (parsedVersion.before(Version.V_6_4_0)) {
String msg = "Cannot create rollup job [" + job.getConfig().getId() + "] because the rollup index contains " +
"jobs from pre-6.4.0. The mappings for these jobs are not compatible with 6.4.0+. Please specify a new rollup " +
"index.";
Contributor

Does this mean that users are going to need to recreate their existing rollup jobs and rollup indexes when they upgrade from 6.3 to 6.4+?

if (v instanceof String) {
if (v == null) {
// Arbitrary value to update the doc ID with for nulls
docID.update(19);
Contributor

disappointed you didn't make this value 42 😃
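A minimal, hypothetical sketch (not the actual Rollup source) of the doc-ID hashing being discussed: each group key feeds bytes into a CRC32, and null keys contribute the arbitrary sentinel byte 19 from the diff above. Note that the null check has to come before any `instanceof` test, since `null instanceof String` evaluates to `false`:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical, simplified sketch of the doc-ID scheme: every group key feeds
// bytes into a CRC32, and null keys contribute an arbitrary sentinel byte (19)
// so documents with missing keys still get a stable, deterministic ID.
public class DocIdSketch {
    static long docId(Object[] keys) {
        CRC32 docID = new CRC32();
        for (Object v : keys) {
            if (v == null) {
                docID.update(19); // arbitrary sentinel byte for null keys
            } else if (v instanceof String) {
                docID.update(((String) v).getBytes(StandardCharsets.UTF_8));
            } else if (v instanceof Long) {
                docID.update(ByteBuffer.allocate(Long.BYTES).putLong((Long) v).array());
            } else {
                throw new IllegalArgumentException("unsupported key type: " + v.getClass());
            }
        }
        return docID.getValue();
    }

    public static void main(String[] args) {
        // Null keys no longer NPE, and the ID is deterministic across runs.
        long withNull = docId(new Object[]{"host-a", null});
        System.out.println("doc ID with null key: " + withNull);
        System.out.println("stable: " + (withNull == docId(new Object[]{"host-a", null})));
    }
}
```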

@polyfractal
Contributor Author

@colings86 sorry for the delay on this, I think it's ready for the next round of reviews.

Contributor

@jimczi jimczi left a comment

Can you explain why you need to translate null values to concrete values in the rollup index? IMO this is not needed: the issue without the missing_bucket option is that rollup documents with a missing value are omitted. With missing_bucket, those rollup documents are taken into account and will be present in the rollup index. So why can't you just index them with their null values explicitly? The count should be correct if you aggregate on another field, since documents with null values are indexed in the rollup.

@polyfractal
Contributor Author

Hmm hmm... I think you're right. I was assuming we had to represent those nulls for accurate counts, but we don't care if the field is null so long as the rest of the fields are present (whereas before the entire document was omitted if one of the keys was gone).

Let me think about it, but I think you're right. That also means we don't have the compatibility issue anymore if we don't have to set a null_value.

Which also means no need for compatibility check or restart test.
Non-keyword keys needed a null check for doc ID generation however,
plus a test for null keys.
@polyfractal
Contributor Author

Ok, so I just pushed a commit which:

  • Removed null_value
  • Removed the rollup version compatibility check (since it was the use of null_value which made it incompatible)
  • Removed the full cluster restart test, since there's no incompatibility to check anymore
  • Added some logic to handle null numeric/histo keys which I missed before... those would have NPEd. Added a test to verify that behavior

I left in the check for rollup version in general (since that's a useful validation) as well as the tweaks to the full cluster restart helper methods, since that will make future tests easier.

Contributor

@jimczi jimczi left a comment

Thanks @polyfractal . I left some comments regarding null buckets and date histograms.

@@ -159,7 +159,7 @@ public Rounding createRounding() {
vsBuilder.dateHistogramInterval(interval);
vsBuilder.field(field);
vsBuilder.timeZone(timeZone);

vsBuilder.missingBucket(true);
Contributor

We shouldn't allow missing_bucket here. We use this config for the timestamp field, which is a required field for rollup documents. Moreover, since it's the first field in the composite, enabling missing buckets would disable the composite optimization when the field is indexed. IMO it's ok to require the timestamp field to be present, since otherwise we have no way to know whether the document should be part of the rollup.

Contributor Author

Ah yeah, silly me. Absolutely shouldn't allow missing on the timestamp field :)

Good catch, thanks!

@@ -83,14 +83,27 @@ private static CRC32 processKeys(Map<String, Object> keys, Map<String, Object> d
doc.put(k + "." + RollupField.TIMESTAMP, v);
doc.put(k + "." + RollupField.INTERVAL, groupConfig.getDateHisto().getInterval());
doc.put(k + "." + DateHistoGroupConfig.TIME_ZONE, groupConfig.getDateHisto().getTimeZone().toString());
docID.update(Numbers.longToBytes((Long)v), 0, 8);
if (v == null) {
Contributor

we should not allow null here

when this particular scenario is fixed.
The rollup job will automatically use a placeholder term (`__ROLLUP_NULL_VALUE_PLACEHOLDER__`) as the `null_value` for keyword fields,
which allows it to handle documents that may be missing some of the grouping fields. This placeholder is then removed from search
results, resulting in correct doc counts in a manner that is invisible to the user.
Contributor

This part should be updated since we don't use the placeholder term anymore.

@polyfractal
Contributor Author

Review comments addressed, tests back to passing too.

Contributor

@jimczi jimczi left a comment

LGTM

@polyfractal
Contributor Author

Woo, thanks @jimczi! And good catch on the null_value, I'm happy we didn't need to do that. :)

@polyfractal polyfractal merged commit b7f07f0 into elastic:master Jul 13, 2018
polyfractal added a commit that referenced this pull request Jul 13, 2018
We can leverage the composite agg's new `missing_bucket` feature on
terms groupings.  This means the aggregation criteria used in the indexer
will now return null buckets for missing keys.  

Because all buckets are now returned (even if a key is null),
we can guarantee correct doc counts with
"combined" jobs (where a job rolls up multiple schemas).  This was
previously impossible since composite would ignore documents that
didn't have _all_ the keys, meaning non-overlapping schemas would
cause composite to return no buckets.

Note: date_histo does not use `missing_bucket`, since a timestamp is
always required.

The docs have been adjusted to recommend a single, combined job.  It
also makes reference to the previous issue to help users that are upgrading
(rather than just deleting the sections).
dnhatn added a commit that referenced this pull request Jul 13, 2018
* 6.x:
  Watcher: Make settings reloadable (#31746)
  [Rollup] Histo group config should support scaled_floats (#32048)
  lazy snapshot repository initialization (#31606)
  Add secure setting for watcher email password (#31620)
  Watcher: cleanup ensureWatchExists use (#31926)
  Add second level of field collapsing (#31808)
  Added lenient flag for synonym token filter (#31484) (#31970)
  Test: Fix a second case of bad watch creation
  [Rollup] Use composite's missing_bucket (#31402)
  Docs: Restyled cloud link in getting started
  Docs: Change formatting of Cloud options
  Re-instate link in StringFunctionUtils javadocs
  Correct spelling of AnalysisPlugin#requriesAnalysisSettings (#32025)
  Fix problematic chars in javadoc
  [ML] Move open job failure explanation out of root cause (#31925)
  [ML] Switch ML native QA tests to use a 3 node cluster (#32011)