
fix(type): add bool and List[bool] for join's on input #38168


Closed
wants to merge 1 commit into from

@hongbo-miao hongbo-miao commented Oct 9, 2022

What changes were proposed in this pull request?

I think the on parameter of join can also be of type bool or List[bool]. For example, the demo in the code comment is valid:

>>> df.join(df2, df.name == df2.name, 'outer').select(
... df.name, df2.height).sort(desc("name")).show()

Why are the changes needed?

I originally got this typing error in my IDE:

[screenshot of the IDE typing error]

The command joins the two tables successfully on different columns; however, I think the type annotation is wrong.

Does this PR introduce any user-facing change?

Yes, I am using pyspark==3.3.0.

How was this patch tested?

After adding bool and List[bool], the typing error is gone.

@hongbo-miao hongbo-miao changed the title fix(type): add bool for join's on input fix(type): add bool and List[bool] for join's on input Oct 9, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@@ -2044,7 +2044,7 @@ def crossJoin(self, other: "DataFrame") -> "DataFrame":
     def join(
         self,
         other: "DataFrame",
-        on: Optional[Union[str, List[str], Column, List[Column]]] = None,
+        on: Optional[Union[str, List[str], bool, List[bool], Column, List[Column]]] = None,
Member

That's actually an error from the IDE (which assumes that the built-in functions always return bool). The expected types here are in fact correct.

Author

Hi @HyukjinKwon, thanks for following up. Just to clarify, do you mean that df.name == df2.name returns type Column? Thanks!

Member

Yes.
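
For readers following along, here is a minimal sketch of why an expression such as df.name == df2.name can be a Column rather than a bool. The Column class below is a hypothetical stand-in written only for illustration, not PySpark's implementation, and describe_on and OnType are made-up names. It assumes nothing beyond the standard library and only shows that Python lets __eq__ return any object, which is what the existing annotation on join relies on.

```python
from typing import List, Optional, Union


class Column:
    """Hypothetical stand-in for pyspark.sql.Column (illustration only)."""

    def __init__(self, expr: str) -> None:
        self.expr = expr

    # The declared return type is Column, not bool. A type checker that reads
    # this annotation should infer Column for `a == b` on two Column operands.
    def __eq__(self, other: object) -> "Column":  # type: ignore[override]
        other_expr = other.expr if isinstance(other, Column) else repr(other)
        return Column(f"({self.expr} = {other_expr})")

    # Defining __eq__ drops the default __hash__, so restore identity hashing.
    __hash__ = object.__hash__


# Same shape as the existing annotation of `on` in the diff above.
OnType = Optional[Union[str, List[str], Column, List[Column]]]


def describe_on(on: OnType = None) -> str:
    """Hypothetical helper: report the runtime type name of a join condition."""
    return type(on).__name__


condition = Column("df.name") == Column("df2.name")
print(describe_on(condition))  # Column
print(condition.expr)          # (df.name = df2.name)
```

Under these assumptions the comparison never produces a bool at runtime, so the unmodified annotation already covers it; a tool that reports bool here is assuming the default object.__eq__ signature.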

Author

@hongbo-miao hongbo-miao Oct 11, 2022

I see. I found something interesting.

IntelliJ IDEA has no issue with pandas join's on.
[screenshot]
But it seems to have an issue only with pyspark.

Here is how pandas implements it:

However, if it is still an IDE issue, the IDE should definitely fix it. 😃

Member

Okay, that's interesting. cc @zero323 fyi.

Member

@HyukjinKwon As far as I can tell, there is no real difference here. The difference between the pandas and Spark checks is most likely due to the missing stubs for the former. If I use an environment without pandas-stubs, things type check in PyCharm:

[screenshot: without pandas stubs]

If I choose one with pandas-stubs installed, I get:

[screenshot: with pandas stubs]

which is the same category of failure as for the PySpark code.

Given mypy as a reference, this is an expected false positive (see python/mypy#2783).

On a side note, the pandas and PySpark joins shown in the screenshots are not even remotely equivalent.

@srowen srowen closed this Nov 29, 2022
@hongbo-miao hongbo-miao deleted the patch-1 branch November 29, 2022 18:51