Variable labels as a dataframe field #11179
This would involve attaching additional meta-data to the
|
I see something similar was raised in #39, which was closed in favor of the general issue of allowing metadata for a DataFrame in #2485. Let me first be more explicit about the use case and then try to answer some of your questions. I'll take columns from different sources or create new ones. Exactly what they mean or how they were created doesn't fit into the name. In Stata I'd add a longer description to document this and a quick
df = pd.DataFrame({'x': [3, 1], 'y': [8, 2], 'z': [1.1, 2.0]})
df.set_variable_labels({'x': 'This is variable x', 'y': 'This is another variable'})  # No need to specify all columns
df.info()  # Gives the table without labels (the same info given in current pandas)
df.info(labels=True)  # Gives a table with the labels
df.variable_labels  # Gives the column dictionary
Like I said before, I could carry around this metadata in a separate dictionary, but I think it would be nice to have in the
I'm guessing the big issue here is persistence, but at this time I don't have enough knowledge of |
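For comparison, the "separate dictionary" workaround mentioned above can be sketched with existing pandas calls only. This is a rough illustration, not the proposed API: `describe_columns` is a hypothetical helper name, and its output only approximates what a labelled `df.info(labels=True)` might show.

```python
import pandas as pd


def describe_columns(df, labels=None):
    """Return one row per column: dtype, non-null count and, optionally, a label.

    `labels` is a plain dict kept outside the DataFrame (hypothetical helper,
    not part of pandas). Columns without an entry get an empty label.
    """
    summary = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "non_null": df.notna().sum(),
        }
    )
    if labels is not None:
        summary["label"] = [labels.get(col, "") for col in summary.index]
    return summary


df = pd.DataFrame({'x': [3, 1], 'y': [8, 2], 'z': [1.1, 2.0]})
variable_labels = {'x': 'This is variable x', 'y': 'This is another variable'}

print(describe_columns(df))                          # no labels
print(describe_columns(df, labels=variable_labels))  # with a label column
```

The drawback is the one raised in this thread: the dictionary lives apart from the DataFrame, so nothing keeps it in sync when columns are renamed, dropped, or written to disk.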
Having column names have an additional property of having a label name seems like an interesting feature from Stata. But as a pandas user I don't think I would use it. I like to keep my column names simple. The column names in a DataFrame are also Series objects, and having just one name for them works well for me, and how I use pandas. For example, I would take the variable names in the first example: make, price, and mpg, and would change them to make_model, price_dollars, mileage_mpg. I do lose track of my column names from time to time. But when that happens I just create a col_names variable with the DataFrame.columns attribute. |
Hei, I like this idea :) So I created a small code snippet. I'm new to open source, so please share some suggestions.
|
(I came across this issue from a comment on my Stack Exchange post on this issue.) I am new to Pandas, and am using it for the first time in a Jupyter Notebook. I love the way Pandas displays table data, and developers have clearly put effort into making the display nice (different formatting styles, shaded table rows, etc). So it seemed obvious to me that there must be a way to have column labels (for display only) that are different from the dictionary keys. I was surprised to find that this feature didn't exist. Here is why I think it would be a really nice feature.
This issue seems especially important in Jupyter Notebooks, which are designed for presentation purposes as well as actual computing. |
@donnaaboise as you can see from the comments above, I don't think we would object to having this, but practically it's quite a lot of work and there are lots of unanswered questions.
|
Perhaps this additional meta data doesn't need to be stored at all, but only recognized in a formatter. For example, it is nice that columns can be formatted independently, i.e.
Could this style also accept header labels? Something like:
If one simply types
at a command prompt, the variable names are printed instead (no labels). Only when a style is specified are labels used instead (if desired). |
you can already just rename things (and then chain with
|
yes - this seems like a very good approximation. The only minor drawback I can see is that the format dictionary passed to chained
|
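The rename-then-style approach discussed in the last few comments might look roughly like the sketch below. The helper names `relabel_formats` and `styled_with_labels` and the labels `Count` and `Ratio` are made up for illustration; only `DataFrame.rename` and `Styler.format` are actual pandas calls, and `DataFrame.style` needs the optional jinja2 dependency.

```python
import pandas as pd


def relabel_formats(formats, labels):
    """Re-key a per-column format dict so its keys match the renamed columns."""
    return {labels.get(col, col): fmt for col, fmt in formats.items()}


def styled_with_labels(df, labels, formats):
    """Rename columns for display only, then apply per-column formats.

    Returns a Styler; the original DataFrame keeps its short column names.
    """
    display_df = df.rename(columns=labels)
    return display_df.style.format(relabel_formats(formats, labels))


df = pd.DataFrame({'x': [3, 1], 'z': [1.1, 2.0]})
labels = {'x': 'Count', 'z': 'Ratio'}  # display labels (hypothetical)
formats = {'z': '{:.2f}'}              # keyed by the original column names

# After renaming, the format dict has to be keyed by the display labels.
display_formats = relabel_formats(formats, labels)
```

In a notebook, `styled_with_labels(df, labels, formats)` would render the table with the longer headers. The re-keying step shows the point that the format dictionary must follow the renamed columns rather than the original names.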
I believe the current |
I use both Stata and Pandas. Many Stata users save variable labels to describe the columns in a clearer way than the names. Running this in Stata
gives something like
For me (maybe for others too) it would be useful to have an optional field in a DataFrame with a column label dictionary. The keys would be the columns (not necessarily all of them) and the values the string labels.
This is used in the pandas.io.stata.StataReader field variable_labels (see the docs), which allows you to import these labels when one reads in a Stata .dta file.
I know I could just carry around a dictionary with this information, but I think it's cleaner and less error prone to set it and save it within a DataFrame.
Additionally, storing this would allow doing a cycle on Stata/Pandas without loss of information, since to_stata would check if this field exists. (to_stata might already have the option to pass the variable_labels dictionary, but I didn't see it documented at least.)
My coding prowess is quite limited, but I'd be happy to at least write test code and help out if somebody starts out.
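As a rough sketch of the Stata/pandas cycle described above: recent pandas versions do accept a `variable_labels` argument in `to_stata`, and `StataReader.variable_labels()` reads the labels back, so the round trip can be done today by carrying the dictionary by hand. The file name and the sample labels below are placeholders.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'x': [3, 1], 'y': [8, 2], 'z': [1.1, 2.0]})
variable_labels = {'x': 'This is variable x', 'y': 'This is another variable'}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "labels_demo.dta")  # placeholder file name

    # Write the data together with the labels; 'z' is left unlabelled.
    df.to_stata(path, variable_labels=variable_labels, write_index=False)

    # Read both the labels and the data back through a StataReader.
    with pd.read_stata(path, iterator=True) as reader:
        restored_labels = reader.variable_labels()
        restored = reader.read()

print(restored_labels)
```

The labels still travel as a separate dictionary rather than as part of the DataFrame, which is exactly the gap this issue asks to close.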