Commit 34a82fd

Slep007 - feature names, their generation and the API (#17)
* initial 007
* fix typo
* fix code block
* rewrite the SLEP
* clarify the flexibility on metaestimators
* add motivation and clarifications
* apply more comments
* address andy's comments
* add examples
* add redundant prefix example, clarify O(1) issue
* put slep under review
* address Nicolas's suggestions
* Update slep007/proposal.rst

  Co-Authored-By: Andreas Mueller <[email protected]>
* change the title
* shorted example
* address Nicolas's comments, remove onetoone mapping
* address Nicolas's comments
* trying to address Guillaume's comments
* imagine -> include

Co-authored-by: Andreas Mueller <[email protected]>
1 parent 02ce4db commit 34a82fd

2 files changed

+292
-4
lines changed

slep007/proposal.rst

+288
@@ -0,0 +1,288 @@
.. _slep_007:

===========================================
Feature names, their generation and the API
===========================================

:Author: Adrin Jalali
:Status: Under Review
:Type: Standards Track
:Created: 2019-04

Abstract
########

This SLEP proposes the introduction of the ``feature_names_in_`` attribute for
all estimators, and the ``feature_names_out_`` attribute for all transformers.
We here discuss the generation of such attributes and their propagation through
pipelines. Since for most estimators there are multiple ways to generate
feature names, this SLEP does not intend to define how exactly feature names
are generated for all of them.

Motivation
##########

``scikit-learn`` has been making it easier to build complex workflows with the
``ColumnTransformer`` and it has been seeing widespread adoption. However,
using it results in pipelines where it's not clear what the input features to
the final predictor are, even more so than before. For example, after fitting
the following pipeline, users should ideally be able to inspect the features
going into the final predictor::

    from sklearn.compose import ColumnTransformer
    from sklearn.datasets import fetch_openml
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

    # We will train our classifier with the following features:
    # Numeric Features:
    # - age: float.
    # - fare: float.
    # Categorical Features:
    # - embarked: categories encoded as strings {'C', 'S', 'Q'}.
    # - sex: categories encoded as strings {'female', 'male'}.
    # - pclass: ordinal integers {1, 2, 3}.

    # We create the preprocessing pipelines for both numeric and categorical data.
    numeric_features = ['age', 'fare']
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    categorical_features = ['embarked', 'sex', 'pclass']
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)])

    # Append classifier to preprocessing pipeline.
    # Now we have a full prediction pipeline.
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression())])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    clf.fit(X_train, y_train)

However, it is hard to interpret or even sanity-check the
``LogisticRegression`` instance that's produced in the example, because the
correspondence of the coefficients to the input features is nearly
impossible to figure out.

This proposal suggests adding two attributes to fitted estimators:
``feature_names_in_`` and ``feature_names_out_``, such that in the
above-mentioned example ``clf[-1].feature_names_in_`` and
``clf[-2].feature_names_out_`` will be::

    ['num__age',
     'num__fare',
     'cat__embarked_C',
     'cat__embarked_Q',
     'cat__embarked_S',
     'cat__embarked_missing',
     'cat__sex_female',
     'cat__sex_male',
     'cat__pclass_1',
     'cat__pclass_2',
     'cat__pclass_3']

Ideally the generated feature names describe how a feature is generated at each
stage of a pipeline. For instance, ``cat__sex_female`` shows that the feature
has been through a categorical preprocessing pipeline, was originally the
column ``sex``, and has been one-hot encoded and is one if it was originally
``female``. However, this is not always possible or desirable, especially when
a generated column is based on many columns, since the generated feature names
will be too long, for example in ``PCA``. As a rule of thumb, the following
types of transformers may generate feature names which correspond to the
original features:

- Leave columns unchanged, *e.g.* ``StandardScaler``
- Select a subset of columns, *e.g.* ``SelectKBest``
- Create new columns where each column depends on at most one input column,
  *e.g.* ``OneHotEncoder``
- Create combinations of a fixed number of features, *e.g.*
  ``PolynomialFeatures``, as opposed to combinations of all of them, where
  there can be many. Note that verbosity considerations and
  ``verbose_feature_names`` as explained later can apply here.

This proposal talks about how feature names are generated and not how they are
propagated.

verbose_feature_names
*********************

``verbose_feature_names`` controls the verbosity of the generated feature names
and it can be ``True`` or ``False``. Alternative solutions could include:

- an integer: fine tuning the verbosity of the generated feature names.
- a ``callable`` which would give further flexibility to the user to generate
  user defined feature names.

These alternatives may be discussed and implemented in the future if deemed
necessary.

Scope
#####

The API for input and output feature names includes a ``feature_names_in_``
attribute for all estimators, and a ``feature_names_out_`` attribute for any
estimator with a ``transform`` method, *i.e.* they expose the generated feature
names via the ``feature_names_out_`` attribute.

Note that this SLEP also applies to `resamplers
<https://github.com/scikit-learn/enhancement_proposals/pull/15>`_ the same way
as transformers.

Input Feature Names
###################

The input feature names are stored in a fitted estimator in a
``feature_names_in_`` attribute, and are taken from the given input data, for
instance a ``pandas`` data frame. This attribute will be ``None`` if the input
provides no feature names.
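
As a rough illustration of the rule above, the sketch below reads names from
anything exposing a ``columns`` attribute and falls back to ``None``. The
helper ``get_feature_names_in`` and the ``FrameLike`` stand-in are hypothetical
names invented for this example; they are not part of ``scikit-learn`` or of
this proposal.

```python
def get_feature_names_in(X):
    """Return the column names carried by X, or None if it carries none.

    Hypothetical helper: it only duck-types on a ``columns`` attribute,
    which is what a pandas data frame exposes.
    """
    columns = getattr(X, "columns", None)
    if columns is None:
        return None
    return [str(name) for name in columns]


class FrameLike:
    """Tiny stand-in for a data frame, used only for this illustration."""

    def __init__(self, columns):
        self.columns = columns


names_from_frame = get_feature_names_in(FrameLike(["age", "fare"]))
names_from_list = get_feature_names_in([[22.0, 7.25], [38.0, 71.28]])

print(names_from_frame)  # ['age', 'fare']
print(names_from_list)   # None
```

In this sketch a fitted estimator would simply store the returned value as
``feature_names_in_``.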

Output Feature Names
####################

A fitted estimator exposes the output feature names through the
``feature_names_out_`` attribute. Here we discuss in more detail how these
feature names are generated. Since for most estimators there are multiple ways
to generate feature names, this SLEP does not intend to define how exactly
feature names are generated for all of them. It is instead a guideline on how
they could generally be generated. Furthermore, the specific behavior of a
given estimator may be tuned via the ``verbose_feature_names`` parameter, as
detailed below.

As detailed below, some generated output feature names are the same as, or
derived from, the input feature names. In such cases, if no input feature names
are provided, ``x0`` to ``xn`` are assumed to be their names.

Feature Selector Transformers
*****************************

This includes transformers which output a subset of the input features without
changing them. For example, if a ``SelectKBest`` transformer selects the first
and the third features, and no names are provided, the ``feature_names_out_``
will be ``[x0, x2]``.
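
The selection rule can be sketched in a few lines of plain Python. The
functions ``default_feature_names`` and ``selected_feature_names`` are
hypothetical helpers written for this illustration, and the boolean mask is
an assumed way of representing which input columns a selector keeps.

```python
def default_feature_names(n_features):
    """Fallback names used when the input carries no feature names."""
    return [f"x{i}" for i in range(n_features)]


def selected_feature_names(support_mask, feature_names_in=None):
    """Keep the names of the selected columns, unchanged and in order."""
    if feature_names_in is None:
        feature_names_in = default_feature_names(len(support_mask))
    return [name for name, keep in zip(feature_names_in, support_mask) if keep]


# The first and the third of four features are selected.
mask = [True, False, True, False]

print(selected_feature_names(mask))
# ['x0', 'x2']
print(selected_feature_names(mask, ["age", "fare", "sex", "pclass"]))
# ['age', 'sex']
```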

Feature Generating Transformers
*******************************

The simplest category of transformers in this section are the ones which
generate a column based on a single given column. The generated output column
name in this case is a sensible transformation of the input feature name. For
instance, a ``LogTransformer`` can do ``'age' -> 'log(age)'``, and a
``OneHotEncoder`` could do ``'gender' -> 'gender_female', 'gender_fluid',
...``. An alternative is to leave the feature names unchanged when each output
feature corresponds to exactly one input feature. Whether or not to modify the
feature name, *e.g.* ``log(x0)`` vs. ``x0``, may be controlled via the
``verbose_feature_names`` parameter of the constructor. The default value of
``verbose_feature_names`` can be different depending on the transformer. For
instance, ``StandardScaler`` can have it as ``False``, whereas
``LogTransformer`` could have it as ``True`` by default.

Transformers where each output feature depends on a fixed number of input
features may generate descriptive names as well. For instance, a
``PolynomialFeatures`` transformer on a small subset of features can generate
an output feature name such as ``x[0] * x[2] ** 3``.

And finally, the transformers where each output feature depends on many or all
input features generate feature names which have the form of ``name0`` to
``namen``, where ``name`` represents the transformer. For instance, a ``PCA``
transformer will output ``[pca0, ..., pcan]``, ``n`` being the index of the
last PCA component.
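
The three naming styles described in this section can be sketched as plain
functions. All three helpers below (``log_names``, ``one_hot_names`` and
``component_names``) are hypothetical and exist only for illustration; the
``verbose_feature_names`` argument mirrors the constructor parameter discussed
above but is an assumption about how an implementation might thread it
through.

```python
def log_names(feature_names_in, verbose_feature_names=True):
    """One output per input: wrap the name, or leave it unchanged."""
    if verbose_feature_names:
        return [f"log({name})" for name in feature_names_in]
    return list(feature_names_in)


def one_hot_names(feature_name, categories):
    """One input column fans out into one output column per category."""
    return [f"{feature_name}_{category}" for category in categories]


def component_names(transformer_name, n_components):
    """Outputs depend on many inputs: number them after the transformer."""
    return [f"{transformer_name}{i}" for i in range(n_components)]


print(log_names(["age"]))                           # ['log(age)']
print(log_names(["age"], verbose_feature_names=False))  # ['age']
print(one_hot_names("sex", ["female", "male"]))     # ['sex_female', 'sex_male']
print(component_names("pca", 3))                    # ['pca0', 'pca1', 'pca2']
```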

Meta-Estimators
***************

Meta-estimators can choose to prefix the output feature names given by the
estimators they are wrapping or not.

By default, ``Pipeline`` adds no prefix, *i.e.* its ``feature_names_out_`` is
the same as the ``feature_names_out_`` of the last step, and ``None`` if the
last step is not a transformer.

``ColumnTransformer`` by default adds a prefix to the output feature names,
indicating the name of the transformer applied to them. If a column is in the
output as a part of ``passthrough``, it won't be prefixed since no operation
has been applied on it.

This is the default behavior, and it can be tuned by constructor parameters if
the meta-estimator allows it. For instance, a ``verbose_feature_names=False``
may indicate that a ``ColumnTransformer`` should not prefix the generated
feature names with the name of the step.
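
A minimal sketch of these defaults follows. Both functions are hypothetical
helpers, not ``scikit-learn`` API. The ``sep`` argument is an assumption: the
motivating example in this document joins prefix and name with ``__``, so that
is used as the default here.

```python
def pipeline_names(step_feature_names_out):
    """A pipeline adds no prefix: it reports the names of its last step.

    Each entry is the ``feature_names_out_`` of one step, or None for a
    step that is not a transformer.
    """
    return step_feature_names_out[-1]


def column_transformer_names(steps, passthrough=(),
                             verbose_feature_names=True, sep="__"):
    """Prefix each step's output names with the step name, if verbose.

    ``steps`` is a list of ``(step_name, names)`` pairs; ``passthrough``
    columns are appended without a prefix.
    """
    names = []
    for step_name, step_names in steps:
        for name in step_names:
            if verbose_feature_names:
                names.append(f"{step_name}{sep}{name}")
            else:
                names.append(name)
    names.extend(passthrough)
    return names


steps = [("num", ["age", "fare"]), ("cat", ["sex_female", "sex_male"])]

print(column_transformer_names(steps, passthrough=["ticket"]))
# ['num__age', 'num__fare', 'cat__sex_female', 'cat__sex_male', 'ticket']
print(column_transformer_names(steps, verbose_feature_names=False))
# ['age', 'fare', 'sex_female', 'sex_male']
print(pipeline_names([["x0", "x2"], ["pca0", "pca1"]]))
# ['pca0', 'pca1']
```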

Examples
########

Here we include some examples to demonstrate the behavior of output feature
names::

    100 features (no names) -> PCA(n_components=3)
    feature_names_out_: [pca0, pca1, pca2]


    100 features (no names) -> SelectKBest(k=3)
    feature_names_out_: [x2, x17, x42]


    [f1, ..., f100] -> SelectKBest(k=3)
    feature_names_out_: [f2, f17, f42]


    [cat0] -> OneHotEncoder()
    feature_names_out_: [cat0_cat, cat0_dog, ...]


    [f1, ..., f100] -> Pipeline(
                           [SelectKBest(k=30),
                            PCA(n_components=3)]
                       )
    feature_names_out_: [pca0, pca1, pca2]


    [model, make, numeric0, ..., numeric100] ->
        ColumnTransformer(
            [('cat', Pipeline(SimpleImputer(), OneHotEncoder()),
              ['model', 'make']),
             ('num', Pipeline(SimpleImputer(), PCA(n_components=3)),
              ['numeric0', ..., 'numeric100'])]
        )
    feature_names_out_: ['cat_model_100', 'cat_model_200', ...,
                         'cat_make_ABC', 'cat_make_XYZ', ...,
                         'num_pca0', 'num_pca1', 'num_pca2']

However, the following example produces somewhat redundant feature names,
hence the relevance of ``verbose_feature_names=False``::

    [model, make, numeric0, ..., numeric100] ->
        ColumnTransformer([
            ('ohe', OneHotEncoder(), ['model', 'make']),
            ('pca', PCA(n_components=3), ['numeric0', ..., 'numeric100'])
        ])
    feature_names_out_: ['ohe_model_100', 'ohe_model_200', ...,
                         'ohe_make_ABC', 'ohe_make_XYZ', ...,
                         'pca_pca0', 'pca_pca1', 'pca_pca2']

If desired, the user can remove the prefixes::

    [model, make, numeric0, ..., numeric100] ->
        make_column_transformer(
            (OneHotEncoder(), ['model', 'make']),
            (PCA(n_components=3), ['numeric0', ..., 'numeric100']),
            verbose_feature_names=False
        )
    feature_names_out_: ['model_100', 'model_200', ...,
                         'make_ABC', 'make_XYZ', ...,
                         'pca0', 'pca1', 'pca2']

Backward Compatibility
######################

All estimators should implement the ``feature_names_in_`` and
``feature_names_out_`` API. This is checked in ``check_estimator``, and the
transition is done with a ``FutureWarning`` for at least two versions to give
time to third party developers to implement the API.
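
One way such a transitional check could look is sketched below. The function
``missing_feature_name_attributes`` is hypothetical and is not the actual
``check_estimator`` implementation. It only illustrates warning with a
``FutureWarning`` instead of failing while third-party estimators catch up.

```python
import warnings


def missing_feature_name_attributes(estimator):
    """Warn about, and return, the feature-name attributes an estimator lacks."""
    required = ["feature_names_in_"]
    # Only estimators with a ``transform`` method need output names.
    if hasattr(estimator, "transform"):
        required.append("feature_names_out_")
    missing = [name for name in required if not hasattr(estimator, name)]
    if missing:
        warnings.warn(
            f"{type(estimator).__name__} does not expose "
            f"{', '.join(missing)}; estimators are expected to implement "
            "the feature names API.",
            FutureWarning,
        )
    return missing


class LegacyTransformer:
    """A fitted third-party transformer that predates the API."""

    def transform(self, X):
        return X


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    missing = missing_feature_name_attributes(LegacyTransformer())

print(missing)
# ['feature_names_in_', 'feature_names_out_']
print([w.category.__name__ for w in caught])
# ['FutureWarning']
```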

under_review.rst

+4-4
@@ -1,11 +1,11 @@
 SLEPs under review
 ==================

-No SLEP is currently under review.
+.. No SLEP is currently under review.

 .. Uncomment below when a SLEP is under review

-.. .. toctree::
-..     :maxdepth: 1
+.. toctree::
+    :maxdepth: 1

-..     slepXXX/proposal
+    slep007/proposal