API: Add string extension type #27949
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,8 +6,61 @@ | |
Working with text data | ||
====================== | ||
|
||
.. _text.types: | ||
|
||
Text Data Types | ||
--------------- | ||
|
||
.. versionadded:: 1.0.0 | ||
|
||
There are two main ways to store text data | ||
|
||
1. ``object`` -dtype NumPy array. | ||
2. As an :class:`arrays.StringArray` extension type. | ||
|
||
We recommend using :class:`arrays.StringArray` to store text data. | ||
|
||
Prior to pandas 1.0, ``object`` dtype was the only option. This was unfortunate | ||
for many reasons: | ||
|
||
1. You can accidentally store a *mixture* of strings and non-strings in an | ||
``object`` dtype array. It's better to have a dedicated dtype. | ||
2. ``object`` dtype breaks dtype-specific operations like ``select_dtypes``. | ||
There isn't a clear way to select *just* text while excluding non-text | ||
but still object-dtype columns. | ||
3. When reading code, the contents of an ``object`` dtype array is less clear | ||
than ``string``. | ||
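As an illustration of the first point in the list above, here is a minimal sketch (not part of the diff under review) of how an ``object`` column can hold a mixture of strings and non-strings without complaint. The variable names are illustrative only.

```python
import pandas as pd

# An object-dtype Series accepts any Python object, so strings and
# non-strings can sit side by side without raising an error.
mixed = pd.Series(["a", 1, 2.5], dtype=object)

# Inspecting each element is the only way to learn what is stored.
is_text = mixed.map(lambda v: isinstance(v, str))

# Requesting the dedicated dtype records the intent in the dtype itself.
text = pd.Series(["a", "b", "c"], dtype="string")

print(mixed.dtype)       # object
print(is_text.tolist())  # [True, False, False]
print(text.dtype)        # string
```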
|
||
For backwards-compatibility, ``object`` dtype remains the default type we | ||
infer a list of strings to | ||
|
||
.. ipython:: python | ||
|
||
pd.Series(['a', 'b', 'c']) | ||
|
||
To explicitly request ``string`` dtype, specify the ``dtype`` | ||
|
||
.. ipython:: python | ||
|
||
pd.Series(['a', 'b', 'c'], dtype="string") | ||
pd.Series(['a', 'b', 'c'], dtype=pd.StringDtype()) | ||
|
||
Or ``astype`` after the ``Series`` or ``DataFrame`` is created | ||
Review comment: not sure of the convention, should Series and DataFrame be ":class:
Reply: Don't think we have a formal policy. I vaguely recall a discussion somewhere about doing it ~once per paragraph?
|
||
|
||
.. ipython:: python | ||
|
||
s = pd.Series(['a', 'b', 'c']) | ||
s | ||
s.astype("string") | ||
|
||
Everything that follows in the rest of this document applies equally to | ||
``string`` and ``object`` dtype. | ||
|
||
.. _text.string_methods: | ||
|
||
String Methods | ||
-------------- | ||
|
||
Series and Index are equipped with a set of string processing methods | ||
that make it easy to operate on each element of the array. Perhaps most | ||
importantly, these methods exclude missing/NA values automatically. These are | ||
|
@@ -16,7 +69,8 @@ the equivalent (scalar) built-in string methods: | |
|
||
.. ipython:: python | ||
|
||
- s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat']) | ||
+ s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'], | ||
+ dtype="string") | ||
s.str.lower() | ||
s.str.upper() | ||
s.str.len() | ||
|
@@ -90,7 +144,7 @@ Methods like ``split`` return a Series of lists: | |
|
||
.. ipython:: python | ||
|
||
- s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h']) | ||
+ s2 = pd.Series(['a_b_c', 'c_d_e', np.nan, 'f_g_h'], dtype="string") | ||
s2.str.split('_') | ||
|
||
Elements in the split lists can be accessed using ``get`` or ``[]`` notation: | ||
|
@@ -106,6 +160,9 @@ It is easy to expand this to return a DataFrame using ``expand``. | |
|
||
s2.str.split('_', expand=True) | ||
|
||
+ When original ``Series`` has :class:`StringDtype`, the output columns will all | ||
+ be :class:`StringDtype` as well. | ||
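A short sketch of the dtype-preservation claim above, assuming a pandas version that includes ``StringDtype`` (1.0 or later). It reuses the ``s2`` example from the diff, with ``pd.NA`` standing in for the missing value.

```python
import pandas as pd

# Same data as the documented s2 example, stored with the string dtype.
s2 = pd.Series(["a_b_c", "c_d_e", pd.NA, "f_g_h"], dtype="string")

# expand=True returns a DataFrame with one column per split piece.
expanded = s2.str.split("_", expand=True)

# Each resulting column is expected to carry the string dtype as well.
print(expanded.dtypes)
```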
|
||
It is also possible to limit the number of splits: | ||
|
||
.. ipython:: python | ||
|
@@ -125,7 +182,8 @@ i.e., from the end of the string to the beginning of the string: | |
.. ipython:: python | ||
|
||
s3 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', | ||
- '', np.nan, 'CABA', 'dog', 'cat']) | ||
+ '', np.nan, 'CABA', 'dog', 'cat'], | ||
+ dtype="string") | ||
s3 | ||
s3.str.replace('^.a|dog', 'XX-XX ', case=False) | ||
|
||
|
@@ -136,7 +194,7 @@ following code will cause trouble because of the regular expression meaning of | |
.. ipython:: python | ||
|
||
# Consider the following badly formatted financial data | ||
- dollars = pd.Series(['12', '-$10', '$10,000']) | ||
+ dollars = pd.Series(['12', '-$10', '$10,000'], dtype="string") | ||
|
||
# This does what you'd naively expect: | ||
dollars.str.replace('$', '') | ||
|
@@ -174,15 +232,17 @@ positional argument (a regex object) and return a string. | |
def repl(m): | ||
return m.group(0)[::-1] | ||
|
||
- pd.Series(['foo 123', 'bar baz', np.nan]).str.replace(pat, repl) | ||
+ pd.Series(['foo 123', 'bar baz', np.nan], | ||
+ dtype="string").str.replace(pat, repl) | ||
|
||
# Using regex groups | ||
pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)" | ||
|
||
def repl(m): | ||
return m.group('two').swapcase() | ||
|
||
- pd.Series(['Foo Bar Baz', np.nan]).str.replace(pat, repl) | ||
+ pd.Series(['Foo Bar Baz', np.nan], | ||
+ dtype="string").str.replace(pat, repl) | ||
|
||
.. versionadded:: 0.20.0 | ||
|
||
|
@@ -221,7 +281,7 @@ The content of a ``Series`` (or ``Index``) can be concatenated: | |
|
||
.. ipython:: python | ||
|
||
- s = pd.Series(['a', 'b', 'c', 'd']) | ||
+ s = pd.Series(['a', 'b', 'c', 'd'], dtype="string") | ||
s.str.cat(sep=',') | ||
|
||
If not specified, the keyword ``sep`` for the separator defaults to the empty string, ``sep=''``: | ||
|
@@ -234,7 +294,7 @@ By default, missing values are ignored. Using ``na_rep``, they can be given a re | |
|
||
.. ipython:: python | ||
|
||
- t = pd.Series(['a', 'b', np.nan, 'd']) | ||
+ t = pd.Series(['a', 'b', np.nan, 'd'], dtype="string") | ||
t.str.cat(sep=',') | ||
t.str.cat(sep=',', na_rep='-') | ||
|
||
|
@@ -279,7 +339,8 @@ the ``join``-keyword. | |
.. ipython:: python | ||
:okwarning: | ||
|
||
- u = pd.Series(['b', 'd', 'a', 'c'], index=[1, 3, 0, 2]) | ||
+ u = pd.Series(['b', 'd', 'a', 'c'], index=[1, 3, 0, 2], | ||
+ dtype="string") | ||
s | ||
u | ||
s.str.cat(u) | ||
|
@@ -295,7 +356,8 @@ In particular, alignment also means that the different lengths do not need to co | |
|
||
.. ipython:: python | ||
|
||
- v = pd.Series(['z', 'a', 'b', 'd', 'e'], index=[-1, 0, 1, 3, 4]) | ||
+ v = pd.Series(['z', 'a', 'b', 'd', 'e'], index=[-1, 0, 1, 3, 4], | ||
+ dtype="string") | ||
s | ||
v | ||
s.str.cat(v, join='left', na_rep='-') | ||
|
@@ -351,7 +413,8 @@ of the string, the result will be a ``NaN``. | |
.. ipython:: python | ||
|
||
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, | ||
- 'CABA', 'dog', 'cat']) | ||
+ 'CABA', 'dog', 'cat'], | ||
+ dtype="string") | ||
|
||
s.str[0] | ||
s.str[1] | ||
|
@@ -382,7 +445,8 @@ DataFrame with one column per group. | |
|
||
.. ipython:: python | ||
|
||
- pd.Series(['a1', 'b2', 'c3']).str.extract(r'([ab])(\d)', expand=False) | ||
+ pd.Series(['a1', 'b2', 'c3'], | ||
+ dtype="string").str.extract(r'([ab])(\d)', expand=False) | ||
|
||
Elements that do not match return a row filled with ``NaN``. Thus, a | ||
Series of messy strings can be "converted" into a like-indexed Series | ||
|
@@ -395,14 +459,16 @@ Named groups like | |
|
||
.. ipython:: python | ||
|
||
- pd.Series(['a1', 'b2', 'c3']).str.extract(r'(?P<letter>[ab])(?P<digit>\d)', | ||
- expand=False) | ||
+ pd.Series(['a1', 'b2', 'c3'], | ||
+ dtype="string").str.extract(r'(?P<letter>[ab])(?P<digit>\d)', | ||
+ expand=False) | ||
|
||
and optional groups like | ||
|
||
.. ipython:: python | ||
|
||
- pd.Series(['a1', 'b2', '3']).str.extract(r'([ab])?(\d)', expand=False) | ||
+ pd.Series(['a1', 'b2', '3'], | ||
+ dtype="string").str.extract(r'([ab])?(\d)', expand=False) | ||
|
||
can also be used. Note that any capture group names in the regular | ||
expression will be used for column names; otherwise capture group | ||
|
@@ -413,20 +479,23 @@ with one column if ``expand=True``. | |
|
||
.. ipython:: python | ||
|
||
- pd.Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)', expand=True) | ||
+ pd.Series(['a1', 'b2', 'c3'], | ||
+ dtype="string").str.extract(r'[ab](\d)', expand=True) | ||
|
||
It returns a Series if ``expand=False``. | ||
|
||
.. ipython:: python | ||
|
||
- pd.Series(['a1', 'b2', 'c3']).str.extract(r'[ab](\d)', expand=False) | ||
+ pd.Series(['a1', 'b2', 'c3'], | ||
+ dtype="string").str.extract(r'[ab](\d)', expand=False) | ||
|
||
Calling on an ``Index`` with a regex with exactly one capture group | ||
returns a ``DataFrame`` with one column if ``expand=True``. | ||
|
||
.. ipython:: python | ||
|
||
- s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"]) | ||
+ s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], | ||
+ dtype="string") | ||
s | ||
s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True) | ||
|
||
|
@@ -471,7 +540,8 @@ Unlike ``extract`` (which returns only the first match), | |
|
||
.. ipython:: python | ||
|
||
- s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"]) | ||
+ s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], | ||
+ dtype="string") | ||
s | ||
two_groups = '(?P<letter>[a-z])(?P<digit>[0-9])' | ||
s.str.extract(two_groups, expand=True) | ||
|
@@ -489,7 +559,7 @@ When each subject string in the Series has exactly one match, | |
|
||
.. ipython:: python | ||
|
||
- s = pd.Series(['a3', 'b3', 'c2']) | ||
+ s = pd.Series(['a3', 'b3', 'c2'], dtype="string") | ||
s | ||
|
||
then ``extractall(pat).xs(0, level='match')`` gives the same result as | ||
|
@@ -510,7 +580,7 @@ same result as a ``Series.str.extractall`` with a default index (starts from 0). | |
|
||
pd.Index(["a1a2", "b1", "c1"]).str.extractall(two_groups) | ||
|
||
- pd.Series(["a1a2", "b1", "c1"]).str.extractall(two_groups) | ||
+ pd.Series(["a1a2", "b1", "c1"], dtype="string").str.extractall(two_groups) | ||
|
||
|
||
Testing for Strings that match or contain a pattern | ||
|
@@ -521,13 +591,15 @@ You can check whether elements contain a pattern: | |
.. ipython:: python | ||
|
||
pattern = r'[0-9][a-z]' | ||
- pd.Series(['1', '2', '3a', '3b', '03c']).str.contains(pattern) | ||
+ pd.Series(['1', '2', '3a', '3b', '03c'], | ||
+ dtype="string").str.contains(pattern) | ||
|
||
Or whether elements match a pattern: | ||
|
||
.. ipython:: python | ||
|
||
- pd.Series(['1', '2', '3a', '3b', '03c']).str.match(pattern) | ||
+ pd.Series(['1', '2', '3a', '3b', '03c'], | ||
+ dtype="string").str.match(pattern) | ||
|
||
The distinction between ``match`` and ``contains`` is strictness: ``match`` | ||
relies on strict ``re.match``, while ``contains`` relies on ``re.search``. | ||
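The difference can be seen with the standard library alone. The sketch below applies ``re.match`` and ``re.search`` to the same values used in the examples above. It illustrates the two underlying functions and is not a reproduction of how pandas calls them internally.

```python
import re

pattern = r"[0-9][a-z]"
values = ["1", "2", "3a", "3b", "03c"]

# re.match succeeds only if the pattern matches at the start of the string.
matched = [bool(re.match(pattern, v)) for v in values]

# re.search succeeds if the pattern matches anywhere in the string.
contained = [bool(re.search(pattern, v)) for v in values]

print(matched)    # [False, False, True, True, False]
print(contained)  # [False, False, True, True, True]
```

Only ``'03c'`` differs: the pattern occurs inside the string (``3c``) but not at its start.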
|
@@ -537,7 +609,8 @@ an extra ``na`` argument so missing values can be considered True or False: | |
|
||
.. ipython:: python | ||
|
||
- s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat']) | ||
+ s4 = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'], | ||
+ dtype="string") | ||
s4.str.contains('A', na=False) | ||
|
||
.. _text.indicator: | ||
|
@@ -550,7 +623,7 @@ For example if they are separated by a ``'|'``: | |
|
||
.. ipython:: python | ||
|
||
- s = pd.Series(['a', 'a|b', np.nan, 'a|c']) | ||
+ s = pd.Series(['a', 'a|b', np.nan, 'a|c'], dtype="string") | ||
s.str.get_dummies(sep='|') | ||
|
||
String ``Index`` also supports ``get_dummies`` which returns a ``MultiIndex``. | ||
|