SQL: Add PIVOT support #46489

Merged
merged 10 commits into from
Sep 23, 2019
Conversation

costin
Member

@costin costin commented Sep 9, 2019

Add initial PIVOT support for transforming a regular table into a
statistics table around an arbitrary pivoting column:

SELECT * FROM
(SELECT languages, country, salary FROM mp)
PIVOT (AVG(salary) FOR country IN ('NL', 'DE', 'ES', 'RO', 'US'))

In the current implementation PIVOT allows only one aggregation; however,
this restriction is likely to be lifted in the future.
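
To make the transformation concrete, here is a minimal in-memory sketch in Java of what the query above computes. It is not the Elasticsearch SQL implementation: the `PivotSketch` class, the `Row` shape and the choice to skip rows whose pivot column is not listed are all assumptions made for illustration. Rows are grouped by the remaining column (`languages`) and the average is computed separately for each listed pivot value.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PivotSketch {

    // Hypothetical row shape matching the inner SELECT: languages, country, salary
    static final class Row {
        final Integer languages;
        final String country;
        final double salary;

        Row(Integer languages, String country, double salary) {
            this.languages = languages;
            this.country = country;
            this.salary = salary;
        }
    }

    // Rough equivalent of: PIVOT (AVG(salary) FOR country IN (values...))
    // Result: languages -> (pivot value -> average salary, or null when no row matched)
    static Map<Integer, Map<String, Double>> pivotAvg(List<Row> rows, List<String> values) {
        // accumulate [sum, count] per (languages, country)
        Map<Integer, Map<String, double[]>> acc = new LinkedHashMap<>();
        for (Row r : rows) {
            if (!values.contains(r.country)) {
                continue; // assumption of this sketch: unlisted pivot values are skipped
            }
            double[] sumCount = acc
                .computeIfAbsent(r.languages, k -> new LinkedHashMap<>())
                .computeIfAbsent(r.country, k -> new double[2]);
            sumCount[0] += r.salary;
            sumCount[1]++;
        }

        // turn each listed pivot value into a column of the output row
        Map<Integer, Map<String, Double>> out = new LinkedHashMap<>();
        for (Map.Entry<Integer, Map<String, double[]>> e : acc.entrySet()) {
            Map<String, Double> columns = new LinkedHashMap<>();
            for (String v : values) {
                double[] sumCount = e.getValue().get(v);
                columns.put(v, sumCount == null ? null : sumCount[0] / sumCount[1]);
            }
            out.put(e.getKey(), columns);
        }
        return out;
    }
}
```

Each entry of the returned map corresponds to one output row, with one column per value in the `IN` list.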

@elasticmachine
Collaborator

Pinging @elastic/es-search

@costin
Member Author

costin commented Sep 9, 2019

@elasticmachine run elasticsearch-ci/2

@@ -3,7 +3,8 @@
//
Member Author

This file shouldn't be here - I'll remove it on the next commit once the feedback lands to avoid an extra build.

Contributor

@matriv matriv left a comment

Great work overall!

I left some comments and here are a couple of more general ones:

@@ -294,7 +294,7 @@ public void testConstantFolding() {
// check now with an alias
result = new ConstantFolding().rule(new Alias(EMPTY, "a", exp));
assertEquals("a", Expressions.name(result));
assertEquals(5, ((Literal) result).value());
//assertEquals(5, ((Literal) result).value());
Contributor

Remove comment please.

@@ -76,6 +76,9 @@ public static SearchSourceBuilder sourceBuilder(QueryContainer container, QueryB
// set page size
if (size != null) {
int sz = container.limit() > 0 ? Math.min(container.limit(), size) : size;
// now take into account the the minimum page (if set)
int minSize = container.minPageSize();
sz = minSize > 0 ? (Math.max(sz / minSize, 1) * minSize) : sz;
Contributor

Could you please add some comment which explains this calculation?
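
For readers following along, the expression in this hunk aligns the page size to a multiple of the minimum page size. A minimal sketch of the arithmetic follows; the helper name `alignToMinPage` and the sample inputs are hypothetical, and only the one-line formula comes from the diff.

```java
final class PageSizeSketch {

    // Hypothetical helper mirroring the expression in the diff:
    //   sz = minSize > 0 ? (Math.max(sz / minSize, 1) * minSize) : sz;
    static int alignToMinPage(int sz, int minSize) {
        if (minSize <= 0) {
            return sz; // no minimum page configured: leave the size untouched
        }
        // integer division rounds down to the number of whole minimum pages that fit,
        // and Math.max guarantees at least one such page
        int wholePages = Math.max(sz / minSize, 1);
        return wholePages * minSize;
    }

    public static void main(String[] args) {
        System.out.println(alignToMinPage(25, 10)); // rounded down to a multiple of 10
        System.out.println(alignToMinPage(7, 10));  // bumped up to one full minimum page
        System.out.println(alignToMinPage(25, 0));  // unchanged, no minimum set
    }
}
```

So a requested size of 25 with a minimum page of 10 becomes 20, while a size smaller than the minimum becomes exactly one minimum page.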

;


averageWithOneValueAndAlias
Contributor


averageWithScalarOverAggregateAndFoldedValue
schema::status:s|client_ip:s
SELECT status, client_ip FROM logs
Contributor

Why don't we include the pivoted value here?

Alias a = (Alias) e;
return a.child().foldable() ? Literal.of(a.name(), a.child()) : a;
}
// if (e instanceof Alias) {
Contributor

Should be removed?

if (groupingSet == null) {
AttributeSet columnSet = Expressions.references(singletonList(column));
// grouping can happen only on "primitive" fields, thus exclude multi-fields or nested docs
// the verifier enforces this rule so it does not catches folks by surprise
Contributor

catches -> catch

UnresolvedAttribute column = new UnresolvedAttribute(source(pivotClause.column), visitQualifiedName(pivotClause.column));
List<NamedExpression> values = namedValues(pivotClause.aggs);
if (values.size() > 1) {
throw new ParsingException(source(pivotClause.aggs), "PIVOT currently supports only one aggregation, found [{}]",
Contributor

I think ParsingException at this point is "late", but on the other hand SqlIllegalArgumentException doesn't seem appropriate. Should we introduce a new one, something with `InvalidorRestricted` in the name?

Not necessary to do as part of this PR though.

Member Author

Considering there's no typed consumer of these exceptions, whether it's a parsing exception or a restricted one (or whatever we end up calling it) is largely a matter of irrelevant semantics.
Furthermore, if the client sees the message wrapped (inside the UI, for example), the exception type matters even less.
My point is that the message sent to the user is what's important, not the exception in which it is wrapped.

out.add(value.toAttribute().withDataType(agg.dataType()).withId(id));
}
}
// for multiple args, concat the function and the value
Contributor

Shouldn't we do that also for one aggregate function?

Member Author

No - this follows the Oracle convention which is the only one that supports multiple aggs inside the same pivot. Currently this branch is disabled due to a corner-case optimization that results in missing IDs.

@costin
Member Author

costin commented Sep 18, 2019

@astefan @matriv I've updated the PR, please take another look at it.

Thanks to a number of queries suggested by @astefan, I concluded that supporting foldable expressions inside values, or functions inside columns, leads to invalid queries: the optimizer kicks in and creates different named expressions which, because their IDs differ, stop matching.

As such, I've beefed up the Verifier to prevent such cases; once the NamedExpression/ExpressionId refactoring takes place, I'll revisit this.
Currently, folding expressions leads to bugs: most expressions are named, and although the name does not change, folding creates new ids and therefore different identities.
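
The identity problem described here can be illustrated with a small, self-contained Java sketch. Everything in it is hypothetical (the `Named` class, the id counter and the `fold` method are stand-ins, not the actual Elasticsearch SQL classes); it only shows how re-creating an expression under the same name still breaks id-based matching.

```java
import java.util.concurrent.atomic.AtomicLong;

final class NamedIdentitySketch {

    private static final AtomicLong NEXT_ID = new AtomicLong();

    // Hypothetical stand-in for a named expression whose identity is its id, not its name
    static final class Named {
        final String name;
        final long id;

        Named(String name) {
            this.name = name;
            this.id = NEXT_ID.incrementAndGet(); // every new instance gets a fresh id
        }

        boolean sameIdentity(Named other) {
            return id == other.id;
        }

        boolean sameName(Named other) {
            return name.equals(other.name);
        }
    }

    // Hypothetical optimizer step: replaces the expression with a new one of the same name
    static Named fold(Named original) {
        return new Named(original.name);
    }

    public static void main(String[] args) {
        Named before = new Named("1 + 1");
        Named after = fold(before);
        System.out.println(before.sameName(after));     // names still agree
        System.out.println(before.sameIdentity(after)); // ids no longer agree
    }
}
```

Any lookup keyed on the id fails after the rewrite even though nothing visible to the user has changed, which is why the cases are rejected up front in the Verifier for now.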

}

public void testPivotWithNull() {
assertEquals("1:85: Null not allowed as a PIVOT value",
Member Author

@astefan Added test for forbidding null - it looks like it doesn't work since In doesn't allow it and it just removes it. Which kinda makes sense (null means the thing is missing).
This could be improved by handling nulls either as a separate filter or, preferably, inside In directly.
@matriv thoughts?

Contributor

When null is inside the value list of IN, it will always yield NULL as a result, which, when IN is used as a filter in a normal WHERE clause, means false, and therefore the row is eliminated. E.g.:

postgres=# select null in (null);
 ?column?
----------

(1 row)

If this case should yield true for pivot, it should be treated differently; in my opinion, though, not inside IN but somewhere else (e.g. QueryFolder, where we have the context information that IN is used in a PIVOT query).
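
The behaviour discussed here is standard SQL three-valued logic. Below is a minimal Java sketch of it, using a `Boolean` where `null` stands for SQL NULL (unknown). The class and method names are hypothetical and this is not how Elasticsearch SQL implements `In`; it only shows why a NULL result is dropped by a WHERE-style filter.

```java
import java.util.Arrays;
import java.util.List;

final class ThreeValuedIn {

    // SQL-style IN: returns TRUE, FALSE, or null (standing for SQL NULL / unknown)
    static Boolean in(Object value, List<?> candidates) {
        if (value == null) {
            return null; // NULL compared with anything is unknown
        }
        boolean sawNull = false;
        for (Object candidate : candidates) {
            if (candidate == null) {
                sawNull = true; // comparison with NULL is unknown, keep looking for a match
            } else if (candidate.equals(value)) {
                return Boolean.TRUE;
            }
        }
        // no match: unknown if a NULL was involved, otherwise definitely FALSE
        return sawNull ? null : Boolean.FALSE;
    }

    // A WHERE clause keeps a row only when the predicate is TRUE; unknown is dropped like FALSE
    static boolean keptByWhere(Boolean predicate) {
        return Boolean.TRUE.equals(predicate);
    }

    public static void main(String[] args) {
        Boolean result = in(null, Arrays.asList((Object) null)); // like: select null in (null)
        System.out.println(result);              // unknown
        System.out.println(keptByWhere(result)); // so the row is filtered out
    }
}
```

Treating NULL as a matchable pivot value would therefore need a separate check outside this predicate, which is in line with handling it where the PIVOT context is known.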

}

public void testPivotWithFunctionInput() {
assertEquals("1:37: No functions allowed (yet); encountered [YEAR(date)]",
Member Author

No functions allowed due to namedExpression issues.

@costin
Member Author

costin commented Sep 18, 2019

Added a minor update, essentially preventing InnerAggs from being used. While the simple case can be addressed - SELECT .. PIVOT (SUM_OF_SQUARES(x)...) - the issue appears when the InnerAgg is hidden:
SELECT .. PIVOT (ROUND(SUM_OF_SQUARES(x))...).
Again, I'm expecting the NamedExpression refactor to address this by using function equality instead of ids.

Until then, I've prevented the use of such aggs inside the verifier.

Contributor

@matriv matriv left a comment

LGTM. Thank you for addressing the feedback and, once again, great work!

@@ -152,6 +152,30 @@ SELECT * FROM (SELECT languages, gender, salary FROM test_emp) PIVOT (AVG(salary
null |48396.28571428572|62140.666666666664
;

averageWithTwoValuesAndOrderDesc
Contributor

AndLimit?

else if (namedExpression.foldable()) {
rawValues.add(Literal.of(namedExpression.name(), namedExpression));
}
// TOOD: same as above
Contributor

typo -> TODO

@@ -203,18 +203,4 @@ null |31070.0
3 |26830.0
4 |24646.0
5 |23353.0
;

innerAggPivot
Contributor

Why did you remove this? Isn't this the case that is supported?

Member Author

Contributor

Thx, I got confused and thought that we'd support this more trivial case.

@costin costin merged commit d912637 into elastic:master Sep 23, 2019
costin added a commit that referenced this pull request Sep 23, 2019
Add initial PIVOT support for transforming a regular table into a
statistics table around an arbitrary pivoting column:

SELECT * FROM
 (SELECT languages, country, salary FROM mp)
 PIVOT (AVG(salary) FOR country IN ('NL', 'DE', 'ES', 'RO', 'US'))

In the current implementation PIVOT allows only one aggregation; however,
this restriction is likely to be lifted in the future.
Also not all aggregations are working, in particular MatrixStats are not yet supported.

(cherry picked from commit d912637)
@costin costin deleted the sql/piv branch September 23, 2019 18:04
palesz pushed a commit that referenced this pull request Dec 7, 2020
* Remove the limitation of not being able to use `InnerAggregate`
inside PIVOTs (aggregations using extended and matrix stats)
* The limitation was introduced as part of the original `PIVOT` 
implementation in #46489, but after #49693 it could be lifted.
* Test that the `PIVOT` results in the same query as the 
`GROUP BY`. This should hold across all the 
`AggregateFunction`s we have.
palesz pushed a commit to palesz/elasticsearch that referenced this pull request Dec 7, 2020
(cherry-pick 67704b0)
palesz pushed a commit that referenced this pull request Dec 7, 2020
(cherry-picked from  67704b0)