more flexible type of input arguments for px functions #1768

emmanuelle · 2019-09-13T13:28:11Z

This is a first draft for accepting more flexible types of input arguments for px functions, e.g. numpy arrays, dataframe columns, mixed types, etc. At the moment the transformation done by build_or_augment_dataframe is applied argument by argument, so having different arguments with different types is OK.

Todo:

- write more tests
- decide what to do with arguments which are lists, such as hover_data, custom_data, dimensions. At the moment they are just skipped, so they must be column names. I did this because I wanted to discuss what names they should get: should it be customdata_0, customdata_1, etc.?

For the cases which are taken into account, please see the test file.

Closes #1767 , follow-up on plotly/plotly_express#87 (thank you @malmaud !)

emmanuelle · 2019-09-13T13:32:18Z

One other thing to fix: px.scatter(tips, x="bla", y='wrong') no longer produces a meaningful error message.

emmanuelle · 2019-09-16T02:39:31Z

TODO
Additional cases to take into account: check that error messages make sense when:

- different keyword arguments have different lengths (or dimensions not compatible with the given DataFrame)
- 2D arrays are passed as parameters

At the moment the initial DataFrame is modified in place; we probably should not do that (unless the DataFrame is very big?).

emmanuelle · 2019-09-16T14:57:05Z

next step: start from empty dataframe and add columns
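
A minimal sketch of that idea, assuming a hypothetical helper named `build_output_frame` (this is an illustration, not the code in `_core.py`). Each argument is either a column name looked up in the input DataFrame or an array-like that gets copied into a fresh DataFrame, so the input is never modified:

```python
import numpy as np
import pandas as pd


def build_output_frame(data_frame, **fields):
    """Hypothetical helper: collect px-style arguments into a new DataFrame."""
    df_output = pd.DataFrame()
    for field, argument in fields.items():
        if argument is None:
            continue
        if isinstance(argument, str):
            # A string is interpreted as a column name of the input frame.
            df_output[argument] = data_frame[argument]
        else:
            # Anything else is treated as array-like and named after the field.
            df_output[field] = np.asarray(argument)
    return df_output


tips = pd.DataFrame({"total_bill": [10.0, 20.0], "tip": [1.5, 3.0]})
out = build_output_frame(tips, x="total_bill", y=[0, 1])
```

Only the columns that are actually used end up in `out`, and `tips` keeps its original columns.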

nicolaskruchten · 2019-09-16T15:32:41Z

We should also move the "non-existent column" check and error message into this new function, which I think will help with some of this logic.
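
As a rough illustration of such a check, here is a sketch with a hypothetical function `check_column`; the wording of the message is invented for the example and is not the message used in plotly:

```python
import pandas as pd


def check_column(data_frame, field, argument):
    """Hypothetical check: raise a readable error for an unknown column name."""
    if argument not in data_frame.columns:
        raise ValueError(
            "Value of '%s' is not the name of a column in 'data_frame'. "
            "Expected one of %s but received: %s"
            % (field, list(data_frame.columns), argument)
        )


tips = pd.DataFrame({"total_bill": [10.0], "tip": [1.5]})
try:
    check_column(tips, "x", "bla")
except ValueError as err:
    message = str(err)
```

Doing the check in the same place where the columns are collected means the error can name both the keyword argument and the offending value.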

emmanuelle · 2019-09-16T19:36:48Z

@nicolaskruchten this is ready for a round of review. In the meantime I'll write a better docstring for the helper function.

nicolaskruchten · 2019-09-16T20:16:01Z

packages/python/plotly/plotly/express/_core.py

+        df = pd.DataFrame()
+
+    # Retrieve labels (to change column names)
+    labels = args.get("labels")  # labels or None


is it important to move the (re-)labelling logic into this function? Right now it all mostly happens later with various helper functions. If we're moving this logic into this function then should we refactor it out of where it is right now...?

yes, it's better to leave the labelling logic later as it was before, this should be resolved now.

emmanuelle · 2019-09-17T14:16:17Z

For the docstrings of px functions I chose to use the name array_like which is commonly used in numpy to refer to both arrays and lists, and in the broad sense dataframes as well. This is hopefully ready for another round of review.

nicolaskruchten · 2019-09-24T20:10:41Z

packages/python/plotly/plotly/express/_core.py

+    (pandas series type).
+    """
+    df = args["data_frame"]
+    used_col_names = set()


here we call it used_col_names but elsewhere it's reserved_names

nicolaskruchten · 2019-09-24T20:10:45Z

packages/python/plotly/plotly/express/_core.py

@@ -754,6 +754,215 @@ def apply_default_cascade(args):
        args["marginal_x"] = None


+def _name_heuristic(argument, field_name, used_col_names):


here we call it used_col_names but elsewhere it's reserved_names

right, there were a few iterations, hence the different names. I will change this.

nicolaskruchten · 2019-09-24T20:13:57Z

packages/python/plotly/plotly/express/_core.py

+                col_name = argument
+                if isinstance(argument, int):
+                    col_name = _name_heuristic(argument, field, reserved_names)
+                    if field_name not in array_attrables:


could we remove this if/else block as well as the continue and let it "fall through" to the bottom of the loop where we have the same logic?

nicolaskruchten · 2019-09-24T21:50:27Z

packages/python/plotly/plotly/express/_core.py

@@ -813,6 +813,12 @@ def build_dataframe(args, attrables, array_attrables):
    array_attrables : list
        argument names corresponding to iterables, such as `hover_data`, ...
    """
+    args = dict(input_args)


I don't think this makes a deep copy...

    >>> x = [1, 2, 3]
    >>> y = dict(a=x)
    >>> z = dict(y)
    >>> z["a"] is x
    True
    >>> z["a"][0] = 100
    >>> z
    {'a': [100, 2, 3]}
    >>> x
    [100, 2, 3]

But also think this is only an issue for array_attrables where we write into args[something][something] so if we just copy those that's enough:

    >>> x = [1, 2, 3]
    >>> y = list(x)
    >>> y is x
    False
    >>> y[0] = 100
    >>> x
    [1, 2, 3]

oh no! ~~I knew it does a deep copy for a list so I had wrongly assumed I could transpose this pattern to a dict.~~

nicolaskruchten · 2019-09-24T21:51:15Z

packages/python/plotly/plotly/express/_core.py

+        if field in array_attrables and isinstance(
+            args[field], pd.core.indexes.base.Index
+        ):
+            args[field] = list(args[field])


following from comment above... if we just always do this, instead of in an if then I think we're safe (for now!)

good idea! Then we don't have to do a deepcopy of the whole args, it's better not to copy the dataframe in particular.
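
The approach agreed on here can be sketched as follows, assuming a hypothetical helper `copy_args` (the real function in `_core.py` is organized differently). The dict is copied shallowly, and only the list-like fields that are later written into get their own copy:

```python
import pandas as pd


def copy_args(input_args, array_attrables):
    """Hypothetical helper: shallow-copy args, copying only list-like fields."""
    args = dict(input_args)  # shallow copy: the values are still shared
    for field in array_attrables:
        if args.get(field) is not None:
            args[field] = list(args[field])  # new list, same elements
    return args


df = pd.DataFrame({"a": [1, 2]})
hover = ["a"]
original = {"data_frame": df, "hover_data": hover}

args = copy_args(original, ["hover_data"])
args["hover_data"][0] = "renamed"  # does not touch the caller's list
```

The DataFrame is shared rather than copied, which is the point of avoiding a deep copy. A dict-valued field would need separate handling, since `list()` on a dict keeps only its keys.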

jonmmease

This looks great @emmanuelle. I left some minor comments inline.

I did find the automatic naming behavior a little confusing. What is the simplest way to explain when the .name property of an input series is kept vs. when the arg name is used?

jonmmease · 2019-09-26T10:08:24Z

packages/python/plotly/plotly/express/_core.py

+                continue
+            if isinstance(arg, str):
+                reserved_names.add(arg)
+            if isinstance(arg, pd.core.series.Series):


pd.core.series.Series -> pd.Series. I believe pd.core is considered private by pandas

ok thanks, I got the names by asking type(df) but more readable names are better!

jonmmease · 2019-09-26T10:13:44Z

packages/python/plotly/plotly/express/_core.py

+                    )
+                col_name = argument
+                if isinstance(argument, int):
+                    col_name = _name_heuristic(argument, field, reserved_names)


Do we need the if here? Can we always wrap argument in _name_heuristic(...)?

turns out we could write simpler code here. Thanks !

jonmmease · 2019-09-26T10:15:20Z

packages/python/plotly/plotly/express/_core.py

+        for arg in names:
+            if arg is None:
+                continue
+            if isinstance(arg, str):


Small thing. Since it looks like these three if branches are mutually exclusive, I'd prefer if elif elif to make it clear that there isn't any multi-layer logic going on.

In general I agree except for cases where the first if results in a continue or errors out ;)

jonmmease · 2019-09-26T10:16:49Z

packages/python/plotly/plotly/express/_core.py

+        for arg in names:
+            if arg is None:
+                continue
+            if isinstance(arg, str):


It looks like this logic skips over int column labels. Is that intentional?

not really... but this list of reserved names is tested against the names of keyword arguments, which will never be ints (I think), so it should be ok. On the other hand it does no harm to add int names to the list...

jonmmease · 2019-09-26T11:26:12Z

packages/python/plotly/plotly/express/_core.py

+                df_output[col_name] = df_input[argument]
+            # ----------------- argument is a column / array / list.... -------
+            else:
+                is_index = isinstance(argument, pd.core.indexes.range.RangeIndex)


pd.core.indexes.range.RangeIndex -> pd.RangeIndex

jonmmease · 2019-09-26T11:40:01Z

packages/python/plotly/plotly/express/_core.py

+            if argument is None:
+                continue
+            # Case of multiindex
+            if isinstance(argument, pd.core.indexes.multi.MultiIndex):


pd.core.indexes.multi.MultiIndex -> pd.MultiIndex
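
A small check of the public names suggested in this review, using only top-level pandas classes; the sample values are made up for the illustration:

```python
import pandas as pd

series = pd.Series([1, 2, 3], name="x")
range_index = pd.RangeIndex(3)
multi_index = pd.MultiIndex.from_tuples([(0, "a"), (1, "b")])

# The top-level names work with isinstance just like the longer pd.core paths.
is_series = isinstance(series, pd.Series)
is_range = isinstance(range_index, pd.RangeIndex)
is_multi = isinstance(multi_index, pd.MultiIndex)

# Both index types are also instances of the common base class pd.Index.
both_are_index = isinstance(range_index, pd.Index) and isinstance(
    multi_index, pd.Index
)
```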

nicolaskruchten · 2019-09-26T13:14:29Z

What is the simplest way to explain when the .name property of an input series is kept vs. when the arg name is used?

I believe that with what we landed on, the simplest explanation is that .name is used only to check if the input is the same object as the correspondingly-named column in data_frame (if data_frame is provided); otherwise it's ignored. Doing more than that caused all sorts of weird(er) conflicts!

emmanuelle · 2019-09-26T14:17:25Z

Thanks a lot for your review @jonmmease, we have iterated a lot on this piece of code and having a fresh eye is wonderful. I think I addressed all your comments, which helped to improve the readability of the code.
Regarding when to use name, as explained by @nicolaskruchten we only used names of series equal to columns of the dataframe argument. With the possibility to pass arguments coming from different dataframes, handling conflicts was very tricky and hard to explain. Another possibility would have been to take the name whenever it exists and raise an error when a conflict arises, but we preferred to have a strict approach in the beginning and it is still possible in future versions to see if we want to change this behaviour.
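
The rule can be sketched roughly as below, assuming a hypothetical function `keep_series_name`; it illustrates the strict approach described in this comment and is not the code that was merged:

```python
import pandas as pd


def keep_series_name(data_frame, argument):
    """Hypothetical check: keep .name only for a column of data_frame itself."""
    if data_frame is None or not isinstance(argument, pd.Series):
        return False
    name = argument.name
    # The name is kept only if it labels a column of data_frame AND the
    # argument is that very column object (identity, not equality).
    return name in data_frame.columns and argument is data_frame[name]


df = pd.DataFrame({"x": [1, 2, 3]})
other = pd.Series([1, 2, 3], name="x")  # same name and values, different object

kept_for_other = keep_series_name(df, other)
kept_without_frame = keep_series_name(None, other)
```

Both calls return `False`: a series that merely shares a name with a column is not treated as that column. Whether two lookups of `df["x"]` return the same object is a pandas implementation detail, so the positive case is deliberately not shown.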

nicolaskruchten · 2019-09-28T00:35:22Z

Let's do a pass where we ensure all error messages have correct spacing between words:

ValueError: All arguments should have the same length.The length of argument color is 3, whereas thelength of previous arguments ['x', 'y', 'facet_row', 'facet_col'] is 4 ... "length.The length" and "thelength".

Similar: String or int arguments are only possible when aDataFrame or an array is provided in the data_frameargument. No DataFrame was provided, but argument 'hover_data_0'is of type str or int.
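
Messages like these usually lose their spaces when a long string is split across adjacent literals. A sketch of the fix, using a hypothetical formatter `length_mismatch_message` whose wording follows the first message quoted above:

```python
def length_mismatch_message(field, length, previous_fields, previous_length):
    """Hypothetical formatter for the length error quoted above."""
    # Adjacent string literals are concatenated with nothing in between,
    # so every fragment except the last ends with an explicit space.
    return (
        "All arguments should have the same length. "
        "The length of argument `%s` is %d, whereas the "
        "length of previous arguments %s is %d"
        % (field, length, previous_fields, previous_length)
    )


message = length_mismatch_message(
    "color", 3, ["x", "y", "facet_row", "facet_col"], 4
)
```

Asserting on the full message text in the tests would catch this kind of regression.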

nicolaskruchten · 2019-09-28T00:43:55Z

packages/python/plotly/plotly/express/_core.py

+                            keep_name = df_provided and argument is df_input.index
+                        else:
+                            # we use getattr/hasattr because of index
+                            keep_name = hasattr(


    df = px.data.tips()
    px.scatter(df, x=df["size"], y=df.tip)

This has a weird behaviour: it doesn't pick up the name "size" even though it is a valid column... This is almost certainly because df.size doesn't return df["size"]. Thanks Pandas, again :(

I think the solution is to look up columns via df_input["name"] and not getattr and just special-case the index thing.

actually it was already a special case, so the modification was easy
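
The pitfall is easy to reproduce with a small made-up frame: attribute access returns the `DataFrame.size` property (the number of elements), while item access returns the column labelled "size":

```python
import pandas as pd

df = pd.DataFrame({"size": [2, 3], "tip": [1.0, 1.5]})

attribute = getattr(df, "size")  # DataFrame.size: the number of elements
column = df["size"]  # the column labelled "size"
```

Here `attribute` is 4 (2 rows times 2 columns), not a Series, which is why looking columns up with `df_input[name]` is the safer choice.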

nicolaskruchten · 2019-09-28T00:57:17Z

OK, other than the spaces in the error messages and the df.size thing I think we're good! and we finally have the start of a decent PX test suite: yay!

nicolaskruchten · 2019-09-28T23:51:57Z

packages/python/plotly/plotly/express/_core.py

-                            keep_name = hasattr(
-                                df_input, col_name
-                            ) and argument is getattr(df_input, col_name)
+                            keep_name = col_name in df_input and argument.equals(


Is it intentional to move from is to .equals() here?

(also we should remove the comment above now ;)

yes it was intentional, I think it's enough that the values are all equal. What do you think?

ok I reverted to is
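
The difference between the two checks, shown on a made-up series: `.equals()` compares values, while `is` compares object identity, so a copy passes the first test but not the second:

```python
import pandas as pd

column = pd.Series([1, 2, 3], name="x")
lookalike = column.copy()  # same values and name, but a distinct object

same_values = column.equals(lookalike)
same_object = column is lookalike
```

With `is`, a series coming from another dataframe is not mistaken for a column of `data_frame` just because its values happen to match.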

nicolaskruchten · 2019-10-04T01:02:09Z

💃 💃 💃 thanks so much for getting this thing over the finish line!

emmanuelle · 2019-10-04T01:31:42Z

yeehaw ! Merging :-)

jason-curtis · 2019-10-13T01:03:34Z

Would love to consume this from PyPI! Is there a release planned?

nicolaskruchten · 2019-10-13T14:31:26Z

Version 4.2 should come out this week, yes!

more flexible type of input arguments for px functions

152f87f

wrong col name case

fe4eda3

nicolaskruchten reviewed Sep 13, 2019

emmanuelle added 7 commits September 13, 2019 11:45

corner case of functions grabbing all cols

86937ea

better behavior of index, more tests

1ad0f5c

comment code

7d0e985

black

915a5a1

debugging

5d5ab81

relax column ordering in tests

ea0fa6a

tests

e5f6953

array arguments

29d7e18

emmanuelle added 3 commits September 16, 2019 14:09

move column checks

4a028a2

case of dimensions

ab35b42

clean code + black

e4b8835

emmanuelle commented Sep 16, 2019

emmanuelle changed the title ~~[WIP] more flexible type of input arguments for px functions~~ more flexible type of input arguments for px functions Sep 16, 2019

nicolaskruchten reviewed Sep 16, 2019

emmanuelle added 5 commits September 16, 2019 17:03

deduplicated labels logics

3b0d21a

better handling of dimensions

2a6ff71

corner case when column was modified

72b5d1c

modified docs

19431dd

black

7297a7f

nicolaskruchten reviewed Sep 24, 2019

do not modify input arguments

ef29378

nicolaskruchten reviewed Sep 24, 2019

emmanuelle added 5 commits September 24, 2019 21:12

corrected bug

5e40653

name consistency

78564cc

name consistency

2534ca9

qa

c88660f

if args[field] is a dict it should stay a dict

b5adbcf

jonmmease reviewed Sep 26, 2019

addressed Jon's comments

b375f91

nicolaskruchten reviewed Sep 28, 2019

emmanuelle added 2 commits September 28, 2019 17:21

spaces in error messages

c59a4d5

case of size column

1774ade

nicolaskruchten reviewed Sep 28, 2019

emmanuelle added 2 commits September 30, 2019 12:44

qa

16d95af

revert to is

5c95786

emmanuelle merged commit 4a30dc6 into master Oct 4, 2019

emmanuelle deleted the px-input-arguments branch October 4, 2019 01:50

MarcoGorelli mentioned this pull request Oct 29, 2024

plotly with pandas>=3.0 fails to raise name conflict error #4837

Closed
