-
Notifications
You must be signed in to change notification settings - Fork 28.6k
fix(type): add bool and List[bool] for join's on input #38168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Can one of the admins verify this patch? |
@@ -2044,7 +2044,7 @@ def crossJoin(self, other: "DataFrame") -> "DataFrame": | |||
def join( | |||
self, | |||
other: "DataFrame", | |||
on: Optional[Union[str, List[str], Column, List[Column]]] = None, | |||
on: Optional[Union[str, List[str], bool, List[bool], Column, List[Column]]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's actually an error from IDE (which assumes that the built-in functions always return bool
). The expected types here are correct in fact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @HyukjinKwon thanks for following up, just to clarify, do you mean df.name == df2.name
will return type Column
? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I found something interesting.
Intellij IDEA has no issue with pandas join
's on
.
But seems only has issue for pyspark.
Here is the way how pandas implement:
- https://github.com/pandas-dev/pandas/blob/main/pandas/core/frame.py#L9837-L9840
- https://github.com/pandas-dev/pandas/blob/main/pandas/_typing.py#L121
However, if it is still an IDE issue, the IDE definitely fix it for sure. 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that's interesting. cc @zero323 fyi.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HyukjinKwon As far as I can tell, there is no real difference here. The difference between Pandas and Spark check, is most likely related to the missing stubs for the former one. If I use environment without pandas-stubs
things type check in PyCharm
If I choose one with pandas-stubs
installed, I get
which is the same category of failure as for PySpark code.
Given mypy as a reference, this is an expected false positive (see python/mypy#2783).
On a side note ‒ Pandas and PySpark joins shown on the screenshots are not even remotely equivalent.
What changes were proposed in this pull request?
I think
join
'son
input can bebool
orList[bool]
type. For example, the demo in the comment is a valid demo:spark/python/pyspark/sql/dataframe.py
Lines 2107 to 2108 in 309638e
Why are the changes needed?
I originally got this typing error in my IDE:
The command joins two table successfully on different columns, however, the typing definition is wrong I think.
Does this PR introduce any user-facing change?
Yes, I am using
pyspark==3.3.0
.How was this patch tested?
After adding
bool
andList[bool]
, the typing error is gone.