ENH: Implement cross method for Merge Operations #37864

Merged: 30 commits, Nov 26, 2020
Changes from 3 commits
Commits
71edcce
First cross merge draft for merge operation
phofl Nov 15, 2020
cc5d779
Merge branch 'master' of https://github.com/pandas-dev/pandas
phofl Nov 15, 2020
f573ca4
Fix variable assignment
phofl Nov 15, 2020
0acdd00
Adress review comments
phofl Nov 18, 2020
949185e
Change function signature
phofl Nov 18, 2020
2d5ccaa
Add cross functionality for join
phofl Nov 18, 2020
9098852
Change docs
phofl Nov 18, 2020
60f6b25
Add asvs
phofl Nov 18, 2020
0601243
Move import
phofl Nov 18, 2020
c46eab3
Assign value
phofl Nov 18, 2020
6274120
Reduce asvs
phofl Nov 18, 2020
891785d
Merge branch 'master' of https://github.com/pandas-dev/pandas into 5401
phofl Nov 18, 2020
1f0a1c8
Remove whitespaces
phofl Nov 18, 2020
d7c1156
Move import
phofl Nov 18, 2020
7c8d37a
Adress review
phofl Nov 19, 2020
651540e
Fix doc checks
phofl Nov 19, 2020
f7cdd4d
Add docstring
phofl Nov 19, 2020
a4d24a9
Change example
phofl Nov 19, 2020
741b4b7
Fix typos and rename variables
phofl Nov 21, 2020
0ff78fc
Check unmodified inputs
phofl Nov 21, 2020
94316f3
Add examples
phofl Nov 22, 2020
94b1367
Add tests
phofl Nov 22, 2020
77a9e23
Fix doc
phofl Nov 22, 2020
4597642
Change signature
phofl Nov 23, 2020
67a67a6
Move test
phofl Nov 23, 2020
f731081
Delete import
phofl Nov 23, 2020
4fcde78
Create new file
phofl Nov 23, 2020
4589651
Raise if duplicate on column
phofl Nov 24, 2020
d964ef1
Revert "Raise if duplicate on column"
phofl Nov 25, 2020
a1eeaa4
Merge branch 'master' of https://github.com/pandas-dev/pandas into 5401
phofl Nov 25, 2020
6 changes: 5 additions & 1 deletion pandas/core/frame.py
@@ -205,12 +205,14 @@
The join is done on columns or indexes. If joining columns on
columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes
on indexes or indexes on a column or columns, the index will be passed on.
When performing a cross merge, no column specifications to merge on are
allowed.

Parameters
----------%s
right : DataFrame or named Series
Object to merge with.
- how : {'left', 'right', 'outer', 'inner'}, default 'inner'
+ how : {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
Type of merge to be performed.

* left: use only keys from left frame, similar to a SQL left outer join;
@@ -221,6 +223,8 @@
join; sort keys lexicographically.
* inner: use intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys.
* cross: creates the karthesian product from both frames, preserves the order
Contributor

cartesian

Member Author

German influence, sorry :)

of the left keys.
on : label or list
Column or index level names to join on. These must be found in both
DataFrames. If `on` is None and not merging on indexes then this defaults
36 changes: 36 additions & 0 deletions pandas/core/reshape/merge.py
@@ -591,6 +591,8 @@ def __init__(
    ):
        _left = _validate_operand(left)
        _right = _validate_operand(right)
        if how == "cross":
            _left, _right, how, on = self._create_cross_configuration(_left, _right, on)
Contributor

return the new column here (in addition to the other values)

Member Author

Done

        self.left = self.orig_left = _left
        self.right = self.orig_right = _right
        self.how = how
@@ -690,8 +692,15 @@ def get_result(self):

        self._maybe_restore_index_levels(result)

        self._maybe_drop_cross_column(result)

        return result.__finalize__(self, method="merge")

    def _maybe_drop_cross_column(self, result: "DataFrame"):
Contributor

pass the col in here (type the output)

        cross_col = getattr(self, "_cross", None)
        if cross_col is not None:
            result.drop(columns=cross_col, inplace=True)
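One way to read the reviewer's suggestion ("pass the col in here, type the output") is sketched below as a free function. This is an illustration only, not the code that was merged: the name `drop_cross_column`, its signature, and the column label `tmp_cross` are hypothetical.

```python
from typing import Optional

import pandas as pd


def drop_cross_column(result: pd.DataFrame, cross_col: Optional[str]) -> None:
    """Drop the temporary key column in place, if one was created."""
    # The column name is passed in explicitly instead of being looked up
    # with getattr on the merge object.
    if cross_col is not None:
        result.drop(columns=cross_col, inplace=True)


frame = pd.DataFrame({"a": [1, 3], "tmp_cross": [1, 1]})
drop_cross_column(frame, "tmp_cross")  # removes the helper column
drop_cross_column(frame, None)         # no-op for non-cross merges
```

Passing the column name as an argument makes the dependency visible in the signature, and the `-> None` annotation documents that the frame is modified in place.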

    def _indicator_pre_merge(
        self, left: "DataFrame", right: "DataFrame"
    ) -> Tuple["DataFrame", "DataFrame"]:
@@ -1200,7 +1209,34 @@ def _maybe_coerce_merge_keys(self):
                typ = rk.categories.dtype if rk_is_cat else object
                self.right = self.right.assign(**{name: self.right[name].astype(typ)})

    def _create_cross_configuration(
        self, _left, _right, on
    ) -> Tuple["DataFrame", "DataFrame", str, str]:
        if on is not None:
Contributor

this validation should be done first (IOW maybe move create_cross_configuration lower where you are calling)

Member Author

Just to be sure that I understand you correctly: should the call to _create_cross_configuration come after the call to _validate_specification? If so, I have moved the parts around accordingly.

            raise MergeError(
                "Can not pass any merge columns when using cross as merge method"
            )
        cross_col = f"{max([*_left.columns, *_right.columns])}_cross"
        _left = _left.copy()
        _right = _right.copy()
        _left.insert(loc=0, value=1, column=cross_col)
Contributor

you can just use .assign (IOW put it at the end)

Member Author

Done

        _right.insert(loc=0, value=1, column=cross_col)
        how = "inner"
        on = cross_col
        self._cross = cross_col
        return _left, _right, how, on
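The technique used by _create_cross_configuration and _maybe_drop_cross_column (add a constant key column to both frames, inner-merge on it, then drop it) can be reproduced with public DataFrame methods. The following is a minimal sketch, not the code from this PR: the helper name `cross_via_constant_key` is hypothetical, and building the temporary key with `uuid` is an assumption of this sketch (the diff above derives the key from the largest column label instead).

```python
import uuid

import pandas as pd


def cross_via_constant_key(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Cartesian product of two frames via a temporary constant key."""
    # Hypothetical key name; a random suffix makes a clash with an
    # existing column label very unlikely.
    key = f"_cross_{uuid.uuid4().hex}"
    # assign() returns new frames, so the inputs are left unmodified.
    keyed_left = left.assign(**{key: 1})
    keyed_right = right.assign(**{key: 1})
    # Every left row matches every right row because all keys are equal.
    merged = keyed_left.merge(keyed_right, on=key, how="inner")
    return merged.drop(columns=key)


left = pd.DataFrame({"a": [1, 3]})
right = pd.DataFrame({"b": [3, 4]})
result = cross_via_constant_key(left, right)
```

Using `assign` rather than `insert` avoids the explicit `copy()` calls, which is in line with the reviewer's suggestion in this thread.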

    def _validate_specification(self):
        if hasattr(self, "_cross"):
Contributor

Instead of checking the attribute, can you check whether how is 'cross'?

Member Author

Depending on answer above, this is done

            if (
                self.left_index
                or self.right_index
                or self.right_on is not None
                or self.left_on is not None
            ):
                raise MergeError(
                    "Can not pass any merge columns when using cross as merge method"
Contributor

Can you say that left_on, right_on and on must be None, and that left_index and right_index must be False?

Member Author

Thx, done

                )
        # Hm, any way to make this logic less complicated??
        if self.on is None and self.left_on is None and self.right_on is None:

31 changes: 31 additions & 0 deletions pandas/tests/reshape/merge/test_merge.py
@@ -2337,3 +2337,34 @@ def test_merge_join_cols_error_reporting_on_and_index(func, kwargs):
    )
    with pytest.raises(MergeError, match=msg):
        getattr(pd, func)(left, right, on="a", **kwargs)


@pytest.mark.parametrize(
    ("input_col", "output_cols"), [("b", ["a", "b"]), ("a", ["a_x", "a_y"])]
Contributor

Sorry, maybe I wasn't clear: can you make a new file called test_merge_cross.py and put these there?

Member Author

Aaah ok, created the file and moved the tests

)
def test_merge_cross(input_col, output_cols):
    # GH#5401
    left = DataFrame({"a": [1, 3]})
    right = DataFrame({input_col: [3, 4]})
    result = merge(left, right, how="cross")
    expected = DataFrame({output_cols[0]: [1, 1, 3, 3], output_cols[1]: [3, 4, 3, 4]})
    tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize(
    "kwargs",
    [
        {"left_index": True},
        {"right_index": True},
        {"on": "a"},
        {"left_on": "a"},
        {"right_on": "b"},
    ],
)
def test_merge_cross_error_reporting(kwargs):
    # GH#5401
    left = DataFrame({"a": [1, 3]})
    right = DataFrame({"b": [3, 4]})
    msg = "Can not pass any merge columns when using cross as merge method"
    with pytest.raises(MergeError, match=msg):
        merge(left, right, how="cross", **kwargs)
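
For reference, the user-facing behaviour exercised by these tests can be sketched as a standalone script. It assumes a pandas release that includes this change (1.2.0 or later). Only the exception type is checked, since the message text was still under discussion in the review above and may differ between versions.

```python
import pandas as pd
from pandas.errors import MergeError

left = pd.DataFrame({"a": [1, 3]})
right = pd.DataFrame({"b": [3, 4]})

# Every row of `left` is paired with every row of `right`,
# keeping the order of the left rows.
product = pd.merge(left, right, how="cross")

# Overlapping column names receive the default "_x" / "_y" suffixes.
overlapping = pd.merge(left, left, how="cross")

# Passing a merge key together with how="cross" is rejected.
try:
    pd.merge(left, right, how="cross", on="a")
except MergeError:
    rejected = True
else:
    rejected = False
```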