Variable labels as a dataframe field #11179

cdagnino · 2015-09-23T20:49:39Z

I use both Stata and Pandas. Many Stata users save variable labels to describe the columns in a clearer way than the names. Running this in Stata

sysuse auto.dta
describe

gives something like

variable name	storage type	variable label
make	str18	Make and Model
price	int	Price
mpg	int	Mileage (mpg)

For me (maybe for others too) it would be useful to have an optional field in a DataFrame with a column label dictionary. The keys would be the columns (not necessarily all of them) and the values the string labels.

This mirrors variable_labels in pandas.io.stata.StataReader (see the docs), which lets you import these labels when reading a Stata .dta file.

I know I could just carry around a dictionary with this information, but I think it's cleaner and less error prone to set it and save it within a DataFrame.

Additionally, storing this would allow a Stata/pandas round trip without loss of information, since to_stata could check whether this field exists. (to_stata might already accept the variable_labels dictionary as an option, but I didn't see it documented.)
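A minimal sketch of the round trip described above, assuming a recent pandas version in which to_stata accepts a variable_labels argument and StataReader exposes variable_labels(). The file name and sample values are made up for illustration; the labels still travel in a separate dictionary, which is exactly the inconvenience this issue is about.

```python
import os
import tempfile

import pandas as pd

# Illustrative data and labels (not taken from auto.dta itself)
df = pd.DataFrame({"price": [4099, 4749], "mpg": [22, 17]})
labels = {"price": "Price", "mpg": "Mileage (mpg)"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "auto_subset.dta")  # hypothetical file name
    # Write the labels into the .dta file alongside the data
    df.to_stata(path, write_index=False, variable_labels=labels)
    # Read the file back through a StataReader to recover the labels
    with pd.read_stata(path, iterator=True) as reader:
        restored_labels = reader.variable_labels()
        restored = reader.read()
```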

My coding prowess is quite limited, but I'd be happy to at least write test code and help out if somebody starts out.


jreback · 2015-09-24T10:16:11Z

This would involve attaching additional meta-data to the Index object, specifically a matching list / dict of value -> description. But this would then raise quite a few issues. Maybe you can provide some pseudo examples of what you think about the following.

What would the Index constructor look like? What would be a natural way to specify these? E.g.
i = Index([1,2,3],desc=[....])?
When/how/what would you repr these? E.g. you are showing basically df.info(). We already have a pretty complicated repr, e.g. (and this is not even a MultiIndex)

In [3]: df = DataFrame([[1,2]],index=Index([1],name='foo'),columns=Index(['A','B'],name='bar'))

In [4]: df
Out[4]: 
bar  A  B
foo      
1    1  2

Aside from 'desc' or 'labels' of the data, how is this useful? These are certainly not applicable to, say, 'quantities/units' (which is much more of a dtype specification).

cdagnino · 2015-09-24T21:54:56Z

I see something similar was raised in #39, which was closed in favor of the general issue of allowing metadata for a DataFrame in #2485.

Let me first be more explicit about the use case and then try to answer some of your questions.

I'll take columns from different sources or create new ones. Exactly what they mean or how they were created doesn't fit into the name. In Stata I'd add a longer description to document this and a quick describe is good for refreshing memory. In pandas I'm thinking something like:

df = pd.DataFrame({'x': [3, 1], 'y' : [8, 2], 'z' : [1.1, 2.0]})
df.set_variable_labels({'x': 'This is variable x', 'y': 'This is another variable'}) # No need to specify all columns
df.info()   # Gives the table without labels (the same info given in current pandas)
df.info(labels=True)   # Gives a table with the labels
df.variable_labels  # Gives the column dictionary

Like I said before, I could carry around this metadata in a separate dictionary, but I think it would be nice to have in the DataFrame, especially if it can persist after doing some changes.

I don't think it's worth getting it into the repr; rather, it could be an option in df.info().
It looks to me that adding this to the (column) Index object would be a lot of work. I was hoping there could be a way of assigning it with @property and then just appending it to the original df.info(). After modifying the DataFrame, the variable_labels dictionary could have some keys (columns) that don't exist anymore, but I don't think that would be a problem.

I'm guessing the big issue here is persistence, but at this time I don't have enough knowledge of Pandas internals to say anything more helpful.
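A rough sketch of the proposal above, assuming the labels are kept in DataFrame.attrs (an experimental pandas feature that postdates this comment). The helper names set_variable_labels and info_with_labels are hypothetical and not part of pandas; whether attrs survives a given operation depends on the pandas version, so persistence is not guaranteed here.

```python
import pandas as pd

def set_variable_labels(df, labels):
    """Hypothetical helper: store labels for some (not all) columns in df.attrs."""
    stored = dict(df.attrs.get("variable_labels", {}))
    stored.update(labels)
    df.attrs["variable_labels"] = stored
    return df

def info_with_labels(df):
    """Hypothetical helper: one row per column with its dtype and label."""
    labels = df.attrs.get("variable_labels", {})
    return pd.DataFrame(
        {
            "dtype": [str(dtype) for dtype in df.dtypes],
            # Columns without a label get an empty string
            "label": [labels.get(col, "") for col in df.columns],
        },
        index=df.columns,
    )

df = pd.DataFrame({"x": [3, 1], "y": [8, 2], "z": [1.1, 2.0]})
set_variable_labels(df, {"x": "This is variable x", "y": "This is another variable"})
summary = info_with_labels(df)
```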

mbirdi · 2015-10-25T19:31:15Z

Having column names have an additional property of having a label name seems like an interesting feature from Stata. But as a pandas user I don't think I would use it. I like to keep my column names simple. The column names in a DataFrame are also Series objects, and having just one name for them works well for me, and how I use pandas.

For example, I would take the variable names in the first example: make, price, and mpg, and would change them to make_model, price_dollars, mileage_mpg.

I do lose track of my column names from time to time. But when that happens I just create a col_names variable from the DataFrame.columns attribute.
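The renaming approach described here could look like the following sketch. The sample row is invented for illustration; only the column names come from the comment.

```python
import pandas as pd

# Illustrative single-row frame using the original Stata variable names
df = pd.DataFrame({"make": ["AMC Concord"], "price": [4099], "mpg": [22]})

# Fold the description into the column name itself
df = df.rename(columns={
    "make": "make_model",
    "price": "price_dollars",
    "mpg": "mileage_mpg",
})

# Keep the names handy for later reference
col_names = list(df.columns)
```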

msampathkumar · 2017-04-24T06:15:17Z

Hei, I like this idea :)

So I created a small code snippet. I'm new to open source, so please share some suggestions.

from pandas import DataFrame

class myDataFrame(DataFrame):
    # Register the attribute as metadata so pandas does not mistake it for a column
    _metadata = ['columns_labels']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.columns_labels = {}

    def columns_description(self):
        # Print one tab-separated row per column: name, dtype, label
        print('\t'.join(['Column', 'Type', 'Description']))
        for name, dtype in zip(self.dtypes.index, self.dtypes):
            label = self.columns_labels.get(name, '')
            print('\t'.join([str(name), str(dtype), label]))

    def update_columns_description(self, input_dict):
        # Only keep labels for columns that exist in this frame (self, not a global df)
        for key in input_dict:
            if key in self.columns:
                self.columns_labels[key] = input_dict[key]

df = myDataFrame({'x': [3, 1], 'y' : [8, 2], 'z' : [1.1, 2.0]})
df.columns_description()
df.update_columns_description({'x': "I'm not so sure", 'y':'Hi there!', 'z': 'want to grab some coffee with me :)'})
df.columns_description()

donnaaboise · 2018-01-13T20:46:51Z

(I came across this issue from a comment on my Stack Exchange post on this issue.)

I am new to Pandas, and am using it for the first time in a Jupyter Notebook. I love the way Pandas displays table data, and developers have clearly put effort into making the display nice (different formatting styles, shaded table rows, etc). So it seemed obvious to me that there must be a way to have column labels (for display only) that are different from the dictionary keys. I was surprised to find that this feature didn't exist.

Here is why I think it would be a really nice feature.

The ability to manipulate data using short (single variable?) dictionary keys makes mathematical expressions much cleaner. I would much rather use df["e"] in a mathematical expression than df["Efficiency (%)"].
On the other hand, "e" makes for a bad column header for tables used for presentation purposes, or even just to remember what each column is.

This issue seems especially important in Jupyter Notebooks, which are designed for presentation purposes as well as actual computing.

jreback · 2018-01-13T21:45:40Z

@donnaaboise as you can see from the comments above, I don't think we would object to having this, but practically it's quite a lot of work and there are lots of unanswered questions.

how would the 'labels' be specified (maybe through an alternative index)
these would naturally have to propagate, which would lead to quite a lot of complexity; just having name propagate properly is hard
how would conflicts between the index and the 'label' be handled? what if they had some overlapping values?

donnaaboise · 2018-01-13T22:23:30Z

Perhaps this additional meta data doesn't need to be stored at all, but only recognized in a formatter. For example, it is nice that columns can be formatted independently, i.e.

df.style.format({'e' : '{:8.2f}%', 't' : '{:12.3f}'})

Could this style also accept header labels? Something like :

df.style.format(formatstr={'e' : '{:8.2f}%', 't' : '{:12.3f}'}, labels={'e' : 'Efficiency (%)', 't' : 'Time'})

If one simply types

df

at a command prompt, the variable names are printed instead (no labels). Only when a style is specified are labels used instead (if desired).

jreback · 2018-01-13T22:34:24Z

you can already just rename things (and then chain with .style)

In [5]: df
Out[5]: 
   A  B
0  1  4
1  2  5
2  3  6

In [6]: df.rename(columns={'A':'A long version', 'B': 'B long version'})
Out[6]: 
   A long version  B long version
0               1               4
1               2               5
2               3               6

donnaaboise · 2018-01-13T23:12:18Z

yes - this seems like a very good approximation. The only minor drawback I can see is that the format dictionary passed to chained style.format now has to use the long names to format columns. But these can be accessed through a renaming dictionary. Something like :

di = {'e' : 'Efficiency', 't' : 'Time'}
fstr = {di["e"] : '{:8.2f}%', di["t"] : '{:12.3f}'}
df.rename(columns=di).style.format(fstr)
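The workaround above can be wrapped so the format dictionary keeps using the short names. This is only a sketch: with_display_labels is a hypothetical helper, not a pandas function, and the sample values are invented. It stops short of calling .style so that it runs without the optional jinja2 dependency.

```python
import pandas as pd

def with_display_labels(df, labels, formats):
    """Hypothetical helper: rename columns for display and re-key the formats."""
    renamed = df.rename(columns=labels)
    # Translate short column names in the format dict to their long labels
    long_formats = {labels.get(col, col): fmt for col, fmt in formats.items()}
    return renamed, long_formats

df = pd.DataFrame({"e": [91.5, 88.25], "t": [0.125, 0.25]})
labels = {"e": "Efficiency", "t": "Time"}
formats = {"e": "{:8.2f}%", "t": "{:12.3f}"}

display_df, display_formats = with_display_labels(df, labels, formats)
```

Calling display_df.style.format(display_formats) would then give the labelled, formatted table, while df itself keeps the short names for computation.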

mroeschke · 2021-04-19T00:36:00Z

I believe the current _metadata attribute might be able to solve this issue (https://pandas.pydata.org/pandas-docs/stable/development/extending.html#define-original-properties); therefore, I think this issue is solved by this feature. Happy to reopen this issue if _metadata doesn't completely address this use case.
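For reference, a minimal sketch of the _metadata approach, following the subclassing pattern in the linked pandas documentation. The class name LabeledFrame and the attribute name variable_labels are illustrative choices, not pandas API, and not every pandas operation is guaranteed to carry the attribute along.

```python
import pandas as pd

class LabeledFrame(pd.DataFrame):
    # Names listed in _metadata are copied to results of many pandas operations
    _metadata = ["variable_labels"]

    @property
    def _constructor(self):
        # Make pandas operations return LabeledFrame instead of plain DataFrame
        return LabeledFrame

df = LabeledFrame({"x": [3, 1], "y": [8, 2], "z": [1.1, 2.0]})
df.variable_labels = {"x": "This is variable x", "y": "This is another variable"}

# Column selection returns a LabeledFrame that still carries the labels
subset = df[["x", "y"]]
```

Note that the propagated attribute is the same dictionary object, not a copy, and labels for dropped columns are not pruned automatically.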

jreback added the API Design and Needs Discussion labels Sep 24, 2015

jreback added this to the Someday milestone Oct 26, 2015

pdeffebach mentioned this issue Aug 18, 2018

Continue adding Metadata to dataframes JuliaData/DataFrames.jl#1458

Closed

mario-bermonti mentioned this issue Jun 10, 2020

ENH: Make metadata from read_spss available #34682

Closed

mroeschke closed this as completed Apr 19, 2021
