-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Implement stats aggregation for string terms #47468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
StringStatsAggregatorTests#testSingleValuedFieldFormatter fails because of elastic#47469
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments, think it looks good!
Needs some documentation + doc tests. Let me know if you have questions about the doc tests, they are a bit funky :)
|
||
public class DataScienceAggregationBuilders { | ||
public class AnalyticsAggregationBuilders { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, good catch, thanks for the fix :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the embarrassing typo below heh :)
} | ||
|
||
static class Fields { | ||
public static final String COUNT = "count"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use ParseField
for these. ParseField
has some extra functionality to handle deprecations/renaming, in case we ever decide to change the string values.
case avg_length: return this.getAvgLength(); | ||
case entropy: return this.getEntropy(); | ||
default: | ||
throw new IllegalArgumentException("Unknown value [" + name + "] in common stats aggregation"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"common stats"
left over from a different agg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it should read string stats
. Fixed.
...ytics/src/main/java/org/elasticsearch/xpack/analytics/stringstats/StringStatsAggregator.java
Show resolved
Hide resolved
// Parse string chars and count occurrences | ||
for (Character c : valueStr.toCharArray()) { | ||
LongArray occ = charOccurrences.get(c); | ||
final long overSize = BigArrays.overSize(bucket + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a terribly expensive call, but we should be able to move this up and out of the valuesCount
loop I think? That way we only calculate the bigarrays size once instead of for each character in each value?
@elasticmachine run elasticsearch-ci/packaging-sample-matrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code LGTM, pending docs. /cc @not-napoleon who I volunteered to review the docs while I'm out 😁
@elasticmachine run elasticsearch-ci/default-distro |
@elasticmachine run elasticsearch-ci/default-distro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple of nits, but looks good to me overall.
* @return A map with the character as key and the probability of | ||
* this character to occur as value. The map is ordered by frequency descending. | ||
*/ | ||
public Map<String, Double> getDistribution() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Does this need to be public? Looked to me like it was only called within the package
|
||
public Object value(String name) { | ||
Metrics metrics = Metrics.valueOf(name); | ||
switch (metrics) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bit of a nit, but I don't love switching on an enum. If someone later adds a field to the enum, they need to remember to also update this switch. IMHO, a better solution would be to put a method on the enum getFieldValue(InternalStringStats stats)
which could then call the appropriate getter and return the value. That way, any new enum value would need to implement the method for it to compile.
|
||
@Override | ||
public ScoreMode scoreMode() { | ||
return valuesSource != null && valuesSource.needsScores() ? ScoreMode.COMPLETE : ScoreMode.COMPLETE_NO_SCORES; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some parenthesis around the predicate here. Having to remember that &&
is higher precedence than ?:
is unnecessary cognitive load, so parenthesis will make it more readable, even if they are technically redundant.
@elasticmachine run elasticsearch-ci/bwc |
Backport of #47468 to 7.x This PR adds a new metric aggregation called string_stats that operates on string terms of a document and returns the following: min_length: The length of the shortest term max_length: The length of the longest term avg_length: The average length of all terms distribution: The probability distribution of all characters appearing in all terms entropy: The total Shannon entropy value calculated for all terms This aggregation has been implemented as an analytics plugin.
This PR adds a new metric aggregation called
string_stats
that operates on string terms of a document and returns the following:min_length
: The length of the shortest termmax_length
: The length of the longest termavg_length
: The average length of all termsdistribution
: The probability distribution of all characters appearing in all termsentropy
: The total Shannon entropy value calculated for all termsThis aggregation has been implemented as an analytics plugin.