Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect Handling of DATE_ADD functions while transpiling to spark #4939

Closed
sidhant-sundial opened this issue Apr 4, 2025 · 4 comments
Closed

Comments

@sidhant-sundial
Copy link

sidhant-sundial commented Apr 4, 2025

Read Dialect: presto
Write Dialect: Spark

For the query: `SELECT date_add('day', -1, TIMESTAMP '2020-03-01 00:00:00 UTC')`



sqlglot.transpile(query, read="presto", write="spark")[0]

outputs: `SELECT DATE_ADD(CAST('2020-03-01 00:00:00 UTC' AS TIMESTAMP), -1)`


This shall return a result of the type DATE however, when running the above date_add with three arguments, spark uses the timestampadd function internally and outputs the result with a timestamp clause.

I checked the sqlglot code and seems like an explicit condition has been added to handle the DAY unit only.

isinstance(expression, exp.TsOrDsAdd) and expression.text("unit").upper() == "DAY"
from the PR: #3609

This looks to be incorrect. Can you please explain why was this check added? Could be that we are missing something. Otherwise, we should remove this condition from sqlglot.

Apache Spark uses timestamp_add internally to compute date_add with three arguments.

The change has been in since at least 2022, by looking at the base grammar: apache/spark@6df10ce

@georgesittas
Copy link
Collaborator

Hi @sidhant-sundial, the section you linked is irrelevant to this issue. The _dateadd_sql function in spark.py is used to generate exp.TsOrDsAdd and exp.TimestampAdd.

This transpilation issue happens because the parser does not fully capture Presto's DATE_ADD semantics. As a result, Spark unfortunately can't see that the return type is a function of the date/time value's type, in order to cast the result appropriately.

@georgesittas
Copy link
Collaborator

@sidhant-sundial since this requires (a) type information to fix completely and (b) quite a bit of work to implement & ensure we don't break stuff, combined with the fact that it's low priority for the core team, I'm going to close as not planned for now. We welcome well-tested PRs if you want to take a stab at it.

@georgesittas georgesittas closed this as not planned Won't fix, can't repro, duplicate, stale Apr 7, 2025
@sidhant-sundial
Copy link
Author

@sidhant-sundial since this requires (a) type information to fix completely and (b) quite a bit of work to implement & ensure we don't break stuff, combined with the fact that it's low priority for the core team, I'm going to close as not planned for now. We welcome well-tested PRs if you want to take a stab at it.

Can you point me to a code entry point where I can start from? I can take a stab post that. Just the file name/function name will help. Thanks!

@georgesittas
Copy link
Collaborator

I would start at presto.py:L317:

            "DATE_ADD": lambda args: exp.DateAdd(
                this=seq_get(args, 2), expression=seq_get(args, 1), unit=seq_get(args, 0)
            ),

The issue is that DateAdd expression does not contain sufficient information to transpile Presto's DATE_ADD, as explained here. You can't solve the problem without type information, so even if you address this correctly it will only work for cases where the argument's type is statically inferrable (e.g. a CAST expression), otherwise you'd need to run annotate_types with full schema information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants