more flexible type of input arguments for px functions #1768
One other thing to fix:
TODO
At the moment the initial DataFrame is modified in place; we probably should not do that (unless the DataFrame is very big?).
Next step: start from an empty DataFrame and add columns.
We should also move the "non-existent column" check and error message into this new function, which I think will help with some of this logic.
@nicolaskruchten this is ready for a round of review. In the meantime I'll write a better docstring for the helper function.
df = pd.DataFrame()

# Retrieve labels (to change column names)
labels = args.get("labels")  # labels or None
is it important to move the (re-)labelling logic into this function? Right now it all mostly happens later with various helper functions. If we're moving this logic into this function then should we refactor it out of where it is right now...?
yes, it's better to leave the labelling logic later as it was before, this should be resolved now.
For the docstrings of px functions I chose to use the name array_like which is commonly used in numpy to refer to both arrays and lists, and in the broad sense dataframes as well. This is hopefully ready for another round of review.
(pandas series type).
"""
df = args["data_frame"]
used_col_names = set()
here we call it `used_col_names` but elsewhere it's `reserved_names`
@@ -754,6 +754,215 @@ def apply_default_cascade(args):
    args["marginal_x"] = None


def _name_heuristic(argument, field_name, used_col_names):
here we call it `used_col_names` but elsewhere it's `reserved_names`
right, there were a few iterations, hence the different names. I will change this.
col_name = argument
if isinstance(argument, int):
    col_name = _name_heuristic(argument, field, reserved_names)
if field_name not in array_attrables:
could we remove this `if/else` block as well as the `continue` and let it "fall through" to the bottom of the loop where we have the same logic?
@@ -813,6 +813,12 @@ def build_dataframe(args, attrables, array_attrables):
    array_attrables : list
        argument names corresponding to iterables, such as `hover_data`, ...
    """
    args = dict(input_args)
I don't think this makes a deep copy...
>>> x= [1,2,3]
>>> y = dict(a=x)
>>> z = dict(y)
>>> z["a"] is x
True
>>> z["a"][0]=100
>>> z
{'a': [100, 2, 3]}
>>> x
[100, 2, 3]
But I also think this is only an issue for `array_attrables`, where we write into `args[something][something]`, so if we just copy those that's enough:
>>> x = [1,2,3]
>>> y = list(x)
>>> y is x
False
>>> y[0] = 100
>>> x
[1, 2, 3]
oh no! I knew it does a deep copy for a list so I had wrongly assumed I could transpose this pattern to a `dict`.
if field in array_attrables and isinstance(
    args[field], pd.core.indexes.base.Index
):
    args[field] = list(args[field])
following from comment above... if we just always do this, instead of in an `if`, then I think we're safe (for now!)
good idea! Then we don't have to do a deepcopy of the whole `args`; it's better not to copy the dataframe in particular.
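A minimal sketch of the approach discussed in this thread, assuming a hypothetical helper named `copy_list_args` (not part of the PR): the argument mapping is copied shallowly and only the list-like entries are replaced by fresh lists, so the dataframe itself is never copied. Note that `list(x)` is itself a shallow copy; it is sufficient here only because the elements are replaced rather than mutated.

```python
def copy_list_args(input_args, array_attrables):
    """Hypothetical helper: copy the mapping, then copy only list-like entries."""
    args = dict(input_args)  # new dict, but the values are still shared
    for field in array_attrables:
        if args.get(field) is not None:
            args[field] = list(args[field])  # new list object, same elements
    return args


hover = ["tip", "size"]
big = object()  # stand-in for a large dataframe we do not want to copy
original = {"data_frame": big, "hover_data": hover}

args = copy_list_args(original, ["hover_data"])
args["hover_data"][0] = "renamed"  # writes into the copy only
```

After this runs, `hover` is unchanged and `args["data_frame"]` is still the very same object as `big`.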
This looks great @emmanuelle. I left some minor comments inline.
I did find the automatic naming behavior a little confusing. What is the simplest way to explain when the `.name` property of an input series is kept vs. when the arg name is used?
continue
if isinstance(arg, str):
    reserved_names.add(arg)
if isinstance(arg, pd.core.series.Series):
`pd.core.series.Series` -> `pd.Series`. I believe `pd.core` is considered private by pandas
ok thanks, I got the names by asking `type(df)` but more readable names are better!
)
col_name = argument
if isinstance(argument, int):
    col_name = _name_heuristic(argument, field, reserved_names)
Do we need the `if` here? Can we always wrap `argument` in `_name_heuristic(...)`?
turns out we could write simpler code here. Thanks!
for arg in names:
    if arg is None:
        continue
    if isinstance(arg, str):
Small thing. Since it looks like these three `if` branches are mutually exclusive, I'd prefer `if` / `elif` / `elif` to make it clear that there isn't any multi-layer logic going on.
In general I agree except for cases where the first `if` results in a `continue` or errors out ;)
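The two positions in this exchange can be combined, as in the following sketch. It is only an illustration: `NamedArray` and `collect_reserved_names` are hypothetical names standing in for a pandas Series and for the loop under review. The early-exit branch stays a plain `if` with `continue`, while the mutually exclusive type checks use `if`/`elif`.

```python
class NamedArray:
    """Hypothetical stand-in for an object with a ``name`` attribute (e.g. a Series)."""

    def __init__(self, name):
        self.name = name


def collect_reserved_names(values):
    reserved = set()
    for arg in values:
        if arg is None:
            continue  # early exit: nothing below applies
        if isinstance(arg, str):
            reserved.add(arg)
        elif isinstance(arg, NamedArray):
            if arg.name is not None:
                reserved.add(arg.name)
        # any other type (ints, plain lists, ...) is ignored in this sketch
    return reserved


names = collect_reserved_names(
    ["tip", None, NamedArray("size"), 3, NamedArray(None)]
)
```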
for arg in names:
    if arg is None:
        continue
    if isinstance(arg, str):
It looks like this logic skips over `int` column labels. Is that intentional?
not really... but this list of reserved names is tested against the names of keyword arguments, which will never be ints (I think), so it should be ok. On the other hand it does no harm to add int names to the list...
    df_output[col_name] = df_input[argument]
# ----------------- argument is a column / array / list.... -------
else:
    is_index = isinstance(argument, pd.core.indexes.range.RangeIndex)
`pd.core.indexes.range.RangeIndex` -> `pd.RangeIndex`
if argument is None:
    continue
# Case of multiindex
if isinstance(argument, pd.core.indexes.multi.MultiIndex):
`pd.core.indexes.multi.MultiIndex` -> `pd.MultiIndex`
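For reference, the public aliases suggested in these review comments can be used directly in `isinstance` checks. The snippet below is a small illustration with made-up data, not code from the PR:

```python
import pandas as pd

series = pd.Series([1, 2, 3], name="tip")
range_index = pd.RangeIndex(3)
multi_index = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)])
df = pd.DataFrame({"tip": [1, 2, 3]})  # gets a default RangeIndex

checks = {
    "series": isinstance(series, pd.Series),
    "range_index": isinstance(range_index, pd.RangeIndex),
    "default_index": isinstance(df.index, pd.RangeIndex),
    "multi_index": isinstance(multi_index, pd.MultiIndex),
    # both index types are subclasses of the generic pd.Index
    "all_are_index": all(
        isinstance(idx, pd.Index) for idx in (range_index, multi_index)
    ),
}
```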
I believe that with what we landed on, the simplest explanation is that
Thanks a lot for your review @jonmmease, we have iterated a lot on this piece of code and having a fresh eye is wonderful. I think I addressed all your comments, which helped to improve the readability of the code.
Let's do a pass where we ensure all error messages have correct spacing between words:
Similar:
    keep_name = df_provided and argument is df_input.index
else:
    # we use getattr/hasattr because of index
    keep_name = hasattr(
df = px.data.tips()
px.scatter(df, x=df["size"], y=df.tip)
This has a weird behaviour: it doesn't pick up the name "size" even though it is a valid column... This is almost certainly because `df.size` doesn't return `df["size"]`. Thanks Pandas, again :(
I think the solution is to look up columns via `df_input["name"]` and not `getattr`, and just special-case the `index` thing.
actually it was already a special case, so the modification was easy
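The pitfall described in this thread can be reproduced without plotly. In the sketch below (a made-up three-row frame rather than `px.data.tips()`), attribute access on a column called `size` returns the built-in `DataFrame.size` attribute, i.e. the number of elements, while bracket access returns the column:

```python
import pandas as pd

df = pd.DataFrame({"size": [2, 3, 4], "tip": [1.0, 1.5, 2.0]})

n_elements = df.size   # DataFrame.size attribute: rows * columns
column = df["size"]    # the actual "size" column
tip_column = df.tip    # attribute access works when there is no name clash

# Membership test on column labels, as used for the bracket-based lookup.
has_size_column = "size" in df
```

This is why looking columns up with `df_input[col_name]` is more robust than `getattr(df_input, col_name)` when column names may collide with DataFrame attributes.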
OK, other than the spaces in the error messages and the
keep_name = hasattr(
    df_input, col_name
) and argument is getattr(df_input, col_name)
keep_name = col_name in df_input and argument.equals(
Is it intentional to move from `is` to `.equals()` here?
(also we should remove the comment above now ;)
yes it was intentional, I think it's enough that the values are all equal. What do you think?
ok I reverted to `is`
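To illustrate the difference debated here, the sketch below (made-up data, not code from the PR) builds a separate Series that happens to carry the same name and values as a dataframe column. An identity check with `is` tells the two apart, whereas `.equals()` only compares contents:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# A separate Series that merely shares the name and values of df["a"].
other = pd.Series([1, 2, 3], name="a")

same_values = other.equals(df["a"])  # compares index and values
same_object = other is df["a"]       # compares object identity
```

Here `same_values` is true while `same_object` is false, so a check based on `.equals()` would treat `other` as if it were the dataframe's own column.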
💃 💃 💃 thanks so much for getting this thing over the finish line!
yeehaw! Merging :-)
Would love to consume this from PyPI! Is there a release planned?
Version 4.2 should come out this week, yes!
This is a first draft for accepting more flexible types of input arguments for `px` functions, e.g. numpy arrays, dataframe columns, mixed types, etc. At the moment the transformation done by `build_or_augment_dataframe` is done argument by argument, so having different arguments with different types is ok.
Todo:
`hover_data`, `custom_data`, `dimensions`. At the moment they are just skipped, so they should be column names. I did this because I wanted to discuss the names they should have: should it be `customdata_0`, `customdata_1`, etc.?
For the cases which are taken into account, please see the test file.
Closes #1767, follow-up on plotly/plotly_express#87 (thank you @malmaud!)